@crawlee/http 3.13.3-beta.9 → 3.13.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

@@ -29,15 +29,15 @@ JSONData extends Dictionary = any> = RequestHandler<FileDownloadCrawlingContext<
  *
  * Since `FileDownload` uses raw HTTP requests to download the files, it is very fast and bandwidth-efficient.
  * However, it doesn't parse the content - if you need to e.g. extract data from the downloaded files,
- * you might need to use {@
+ * you might need to use {@link CheerioCrawler}, {@link PuppeteerCrawler} or {@link PlaywrightCrawler} instead.
  *
- * `FileCrawler` downloads each URL using a plain HTTP request and then invokes the user-provided {@
+ * `FileCrawler` downloads each URL using a plain HTTP request and then invokes the user-provided {@link FileDownloadOptions.requestHandler} where the user can specify what to do with the downloaded data.
  *
- * The source URLs are represented using {@
+ * The source URLs are represented using {@link Request} objects that are fed from {@link RequestList} or {@link RequestQueue} instances provided by the {@link FileDownloadOptions.requestList} or {@link FileDownloadOptions.requestQueue} constructor options, respectively.
  *
- * If both {@
+ * If both {@link FileDownloadOptions.requestList} and {@link FileDownloadOptions.requestQueue} are used, the instance first processes URLs from the {@link RequestList} and automatically enqueues all of them to {@link RequestQueue} before it starts their processing. This ensures that a single URL is not crawled multiple times.
  *
- * The crawler finishes when there are no more {@
+ * The crawler finishes when there are no more {@link Request} objects to crawl.
  *
  * We can use the `preNavigationHooks` to adjust `gotOptions`:
  *
@@ -49,7 +49,7 @@ JSONData extends Dictionary = any> = RequestHandler<FileDownloadCrawlingContext<
  * ]
  * ```
  *
- * New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the {@
+ * New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the {@link AutoscaledPool} class. All {@link AutoscaledPool} configuration options can be passed to the `autoscaledPoolOptions` parameter of the `FileCrawler` constructor. For user convenience, the `minConcurrency` and `maxConcurrency` {@link AutoscaledPool} options are available directly in the `FileCrawler` constructor.
  *
  * ## Example usage
  *
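The comments in the two hunks above describe the `FileDownload` constructor options, the `preNavigationHooks` that receive `gotOptions`, and the AutoscaledPool shortcuts. A minimal sketch of that usage, assuming imports from the published `@crawlee/http` package and placeholder URLs (illustrative only, not part of the diff):

```ts
import { FileDownload } from '@crawlee/http';

const crawler = new FileDownload({
    // minConcurrency and maxConcurrency are AutoscaledPool options exposed on the constructor.
    minConcurrency: 1,
    maxConcurrency: 5,
    // Each hook receives the crawling context and the `gotOptions` used for the HTTP request.
    preNavigationHooks: [
        (_crawlingContext, gotOptions) => {
            gotOptions.timeout = { request: 30_000 };
        },
    ],
    // The requestHandler decides what to do with the downloaded data.
    async requestHandler({ request, body }) {
        console.log(`Downloaded ${request.url} (${body.length} bytes)`);
    },
});

await crawler.run(['https://example.com/file.pdf']);
```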
@@ -74,9 +74,9 @@ export declare class FileDownload extends HttpCrawler<FileDownloadCrawlingContex
     private streamRequestHandler;
 }
 /**
- * Creates new {@
- * This instance can then serve as a `requestHandler` of your {@
- * Defaults to the {@
+ * Creates new {@link Router} instance that works based on request labels.
+ * This instance can then serve as a `requestHandler` of your {@link FileDownload}.
+ * Defaults to the {@link FileDownloadCrawlingContext}.
  *
  * > Serves as a shortcut for using `Router.create<FileDownloadCrawlingContext>()`.
  *
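The hunk above documents a factory that is a shortcut for `Router.create<FileDownloadCrawlingContext>()`. A sketch of how such a router is typically wired in as the `requestHandler` (import paths, labels and URLs are assumptions for illustration):

```ts
import { FileDownload, Router } from '@crawlee/http';
import type { FileDownloadCrawlingContext } from '@crawlee/http';

// The documented shortcut is equivalent to creating the router explicitly:
const router = Router.create<FileDownloadCrawlingContext>();

// Handlers are selected by the label of the incoming request.
router.addHandler('IMAGE', async ({ request, body }) => {
    console.log(`Image ${request.url}: ${body.length} bytes`);
});

// Fallback for requests without a matching label.
router.addDefaultHandler(async ({ request }) => {
    console.log(`Unlabelled request: ${request.url}`);
});

// The router itself serves as the crawler's requestHandler.
const crawler = new FileDownload({ requestHandler: router });
await crawler.run([{ url: 'https://example.com/logo.png', label: 'IMAGE' }]);
```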
@@ -10,15 +10,15 @@ const index_1 = require("../index");
  *
  * Since `FileDownload` uses raw HTTP requests to download the files, it is very fast and bandwidth-efficient.
  * However, it doesn't parse the content - if you need to e.g. extract data from the downloaded files,
- * you might need to use {@
+ * you might need to use {@link CheerioCrawler}, {@link PuppeteerCrawler} or {@link PlaywrightCrawler} instead.
  *
- * `FileCrawler` downloads each URL using a plain HTTP request and then invokes the user-provided {@
+ * `FileCrawler` downloads each URL using a plain HTTP request and then invokes the user-provided {@link FileDownloadOptions.requestHandler} where the user can specify what to do with the downloaded data.
  *
- * The source URLs are represented using {@
+ * The source URLs are represented using {@link Request} objects that are fed from {@link RequestList} or {@link RequestQueue} instances provided by the {@link FileDownloadOptions.requestList} or {@link FileDownloadOptions.requestQueue} constructor options, respectively.
  *
- * If both {@
+ * If both {@link FileDownloadOptions.requestList} and {@link FileDownloadOptions.requestQueue} are used, the instance first processes URLs from the {@link RequestList} and automatically enqueues all of them to {@link RequestQueue} before it starts their processing. This ensures that a single URL is not crawled multiple times.
  *
- * The crawler finishes when there are no more {@
+ * The crawler finishes when there are no more {@link Request} objects to crawl.
  *
  * We can use the `preNavigationHooks` to adjust `gotOptions`:
  *
@@ -30,7 +30,7 @@ const index_1 = require("../index");
  * ]
  * ```
  *
- * New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the {@
+ * New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the {@link AutoscaledPool} class. All {@link AutoscaledPool} configuration options can be passed to the `autoscaledPoolOptions` parameter of the `FileCrawler` constructor. For user convenience, the `minConcurrency` and `maxConcurrency` {@link AutoscaledPool} options are available directly in the `FileCrawler` constructor.
  *
  * ## Example usage
  *
@@ -130,9 +130,9 @@ class FileDownload extends index_1.HttpCrawler {
 }
 exports.FileDownload = FileDownload;
 /**
- * Creates new {@
- * This instance can then serve as a `requestHandler` of your {@
- * Defaults to the {@
+ * Creates new {@link Router} instance that works based on request labels.
+ * This instance can then serve as a `requestHandler` of your {@link FileDownload}.
+ * Defaults to the {@link FileDownloadCrawlingContext}.
  *
  * > Serves as a shortcut for using `Router.create<FileDownloadCrawlingContext>()`.
  *
@@ -21,7 +21,7 @@ export type HttpErrorHandler<UserData extends Dictionary = any, // with default
 JSONData extends JsonValue = any> = ErrorHandler<HttpCrawlingContext<UserData, JSONData>>;
 export interface HttpCrawlerOptions<Context extends InternalHttpCrawlingContext = InternalHttpCrawlingContext> extends BasicCrawlerOptions<Context> {
     /**
-     * An alias for {@
+     * An alias for {@link HttpCrawlerOptions.requestHandler}
      * Soon to be removed, use `requestHandler` instead.
      * @deprecated
     */
@@ -54,7 +54,7 @@ export interface HttpCrawlerOptions<Context extends InternalHttpCrawlingContext
     * ```
     *
     * Modyfing `pageOptions` is supported only in Playwright incognito.
-     * See {@
+     * See {@link PrePageCreateHook}
     */
    preNavigationHooks?: InternalHttpHook<Context>[];
    /**
@@ -80,7 +80,7 @@ export interface HttpCrawlerOptions<Context extends InternalHttpCrawlingContext
     * Sadly, there are some websites which use invalid headers. Those are encoded using the UTF-8 encoding.
     * If those sites actually use a different encoding, the response will be corrupted. You can use
     * `suggestResponseEncoding` to fall back to a certain encoding, if you know that your target website uses it.
-     * To force a certain encoding, disregarding the response headers, use {@
+     * To force a certain encoding, disregarding the response headers, use {@link HttpCrawlerOptions.forceResponseEncoding}
     * ```
     * // Will fall back to windows-1250 encoding if none found
     * suggestResponseEncoding: 'windows-1250'
@@ -90,7 +90,7 @@ export interface HttpCrawlerOptions<Context extends InternalHttpCrawlingContext
    /**
     * By default this crawler will extract correct encoding from the HTTP response headers. Use `forceResponseEncoding`
     * to force a certain encoding, disregarding the response headers.
-     * To only provide a default for missing encodings, use {@
+     * To only provide a default for missing encodings, use {@link HttpCrawlerOptions.suggestResponseEncoding}
     * ```
     * // Will force windows-1250 encoding even if headers say otherwise
     * forceResponseEncoding: 'windows-1250'
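The two options above differ only in precedence: `suggestResponseEncoding` fills in a missing encoding, while `forceResponseEncoding` overrides whatever the headers say. A short illustrative sketch (the values are examples, not defaults):

```ts
import { HttpCrawler } from '@crawlee/http';

const crawler = new HttpCrawler({
    // Used only when the Content-Type header does not declare a usable encoding.
    suggestResponseEncoding: 'windows-1250',
    // To override the headers unconditionally, use forceResponseEncoding instead:
    // forceResponseEncoding: 'windows-1250',
    async requestHandler({ request, body }) {
        console.log(request.url, body.slice(0, 80));
    },
});
```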
@@ -160,7 +160,7 @@ Crawler = HttpCrawler<any>> extends CrawlingContext<Crawler, UserData> {
     */
    waitForSelector(selector: string, timeoutMs?: number): Promise<void>;
    /**
-     * Returns Cheerio handle for `page.content()`, allowing to work with the data same way as with {@
+     * Returns Cheerio handle for `page.content()`, allowing to work with the data same way as with {@link CheerioCrawler}.
     * When provided with the `selector` argument, it will throw if it's not available.
     *
     * **Example usage:**
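The context helpers documented above are intended to be used inside the `requestHandler`. A sketch of that usage, assuming an HTML response (the selector and URL are placeholders):

```ts
import { HttpCrawler } from '@crawlee/http';

const crawler = new HttpCrawler({
    async requestHandler({ request, waitForSelector, parseWithCheerio }) {
        // Throws if the selector cannot be found in the downloaded content.
        await waitForSelector('h1');
        // Cheerio handle over the response body, as with CheerioCrawler.
        const $ = await parseWithCheerio();
        console.log(request.url, $('h1').first().text());
    },
});

await crawler.run(['https://example.com']);
```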
@@ -183,20 +183,20 @@ JSONData extends JsonValue = any> = RequestHandler<HttpCrawlingContext<UserData,
  * or from a dynamic queue of URLs enabling recursive crawling of websites.
  *
  * It is very fast and efficient on data bandwidth. However, if the target website requires JavaScript
- * to display the content, you might need to use {@
+ * to display the content, you might need to use {@link PuppeteerCrawler} or {@link PlaywrightCrawler} instead,
  * because it loads the pages using full-featured headless Chrome browser.
  *
  * This crawler downloads each URL using a plain HTTP request and doesn't do any HTML parsing.
  *
- * The source URLs are represented using {@
- * {@
- * or {@
+ * The source URLs are represented using {@link Request} objects that are fed from
+ * {@link RequestList} or {@link RequestQueue} instances provided by the {@link HttpCrawlerOptions.requestList}
+ * or {@link HttpCrawlerOptions.requestQueue} constructor options, respectively.
  *
- * If both {@
- * the instance first processes URLs from the {@
- * to {@
+ * If both {@link HttpCrawlerOptions.requestList} and {@link HttpCrawlerOptions.requestQueue} are used,
+ * the instance first processes URLs from the {@link RequestList} and automatically enqueues all of them
+ * to {@link RequestQueue} before it starts their processing. This ensures that a single URL is not crawled multiple times.
  *
- * The crawler finishes when there are no more {@
+ * The crawler finishes when there are no more {@link Request} objects to crawl.
  *
  * We can use the `preNavigationHooks` to adjust `gotOptions`:
  *
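The description above covers feeding `Request` objects from a `RequestList` or `RequestQueue`. A sketch of the `RequestQueue` variant, with placeholder URLs and assumed import paths:

```ts
import { HttpCrawler, RequestQueue } from '@crawlee/http';

const requestQueue = await RequestQueue.open();
await requestQueue.addRequest({ url: 'https://example.com/start' });

const crawler = new HttpCrawler({
    requestQueue,
    async requestHandler({ request, body, crawler }) {
        console.log(`${request.url} returned ${body.length} bytes`);
        // Newly enqueued requests are deduplicated by their unique key,
        // so a single URL is not crawled multiple times.
        await crawler.addRequests([{ url: 'https://example.com/next' }]);
    },
});

await crawler.run();
```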
@@ -211,15 +211,15 @@ JSONData extends JsonValue = any> = RequestHandler<HttpCrawlingContext<UserData,
  * By default, this crawler only processes web pages with the `text/html`
  * and `application/xhtml+xml` MIME content types (as reported by the `Content-Type` HTTP header),
  * and skips pages with other content types. If you want the crawler to process other content types,
- * use the {@
+ * use the {@link HttpCrawlerOptions.additionalMimeTypes} constructor option.
  * Beware that the parsing behavior differs for HTML, XML, JSON and other types of content.
- * For details, see {@
+ * For details, see {@link HttpCrawlerOptions.requestHandler}.
  *
  * New requests are only dispatched when there is enough free CPU and memory available,
- * using the functionality provided by the {@
- * All {@
+ * using the functionality provided by the {@link AutoscaledPool} class.
+ * All {@link AutoscaledPool} configuration options can be passed to the `autoscaledPoolOptions`
  * parameter of the constructor. For user convenience, the `minConcurrency` and `maxConcurrency`
- * {@
+ * {@link AutoscaledPool} options are available directly in the constructor.
  *
  * **Example usage:**
  *
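A sketch combining the options mentioned in the hunk above: extra MIME types plus the AutoscaledPool shortcuts (the values are illustrative, not defaults):

```ts
import { HttpCrawler } from '@crawlee/http';

const crawler = new HttpCrawler({
    // Handle JSON responses in addition to text/html and application/xhtml+xml.
    additionalMimeTypes: ['application/json'],
    // Convenience shortcuts for the underlying AutoscaledPool...
    minConcurrency: 2,
    maxConcurrency: 20,
    // ...and the full options object for any other AutoscaledPool setting.
    autoscaledPoolOptions: { desiredConcurrency: 5 },
    async requestHandler({ request, contentType, json, body }) {
        console.log(request.url, contentType.type, json ?? body.length);
    },
});
```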
@@ -247,7 +247,7 @@ JSONData extends JsonValue = any> = RequestHandler<HttpCrawlingContext<UserData,
 export declare class HttpCrawler<Context extends InternalHttpCrawlingContext<any, any, HttpCrawler<Context>>> extends BasicCrawler<Context> {
     readonly config: Configuration;
     /**
-     * A reference to the underlying {@
+     * A reference to the underlying {@link ProxyConfiguration} class that manages the crawler's proxies.
     * Only available if used by the crawler.
     */
    proxyConfiguration?: ProxyConfiguration;
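The `proxyConfiguration` property above is only set when a configuration is passed in. A sketch of that wiring (the proxy URL is a placeholder and the import paths are assumed):

```ts
import { HttpCrawler, ProxyConfiguration } from '@crawlee/http';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://user:pass@proxy.example.com:8000'],
});

const crawler = new HttpCrawler({
    proxyConfiguration,
    async requestHandler({ request, proxyInfo }) {
        console.log(`${request.url} fetched via ${proxyInfo?.url}`);
    },
});

// The crawler exposes the configuration it was given; otherwise the property stays undefined.
console.log(crawler.proxyConfiguration === proxyConfiguration); // true
```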
@@ -496,9 +496,9 @@ interface RequestFunctionOptions {
     gotOptions: OptionsInit;
 }
 /**
- * Creates new {@
- * This instance can then serve as a `requestHandler` of your {@
- * Defaults to the {@
+ * Creates new {@link Router} instance that works based on request labels.
+ * This instance can then serve as a `requestHandler` of your {@link HttpCrawler}.
+ * Defaults to the {@link HttpCrawlingContext}.
  *
  * > Serves as a shortcut for using `Router.create<HttpCrawlingContext>()`.
  *
@@ -36,20 +36,20 @@ const HTTP_OPTIMIZED_AUTOSCALED_POOL_OPTIONS = {
  * or from a dynamic queue of URLs enabling recursive crawling of websites.
  *
  * It is very fast and efficient on data bandwidth. However, if the target website requires JavaScript
- * to display the content, you might need to use {@
+ * to display the content, you might need to use {@link PuppeteerCrawler} or {@link PlaywrightCrawler} instead,
  * because it loads the pages using full-featured headless Chrome browser.
  *
  * This crawler downloads each URL using a plain HTTP request and doesn't do any HTML parsing.
  *
- * The source URLs are represented using {@
- * {@
- * or {@
+ * The source URLs are represented using {@link Request} objects that are fed from
+ * {@link RequestList} or {@link RequestQueue} instances provided by the {@link HttpCrawlerOptions.requestList}
+ * or {@link HttpCrawlerOptions.requestQueue} constructor options, respectively.
  *
- * If both {@
- * the instance first processes URLs from the {@
- * to {@
+ * If both {@link HttpCrawlerOptions.requestList} and {@link HttpCrawlerOptions.requestQueue} are used,
+ * the instance first processes URLs from the {@link RequestList} and automatically enqueues all of them
+ * to {@link RequestQueue} before it starts their processing. This ensures that a single URL is not crawled multiple times.
  *
- * The crawler finishes when there are no more {@
+ * The crawler finishes when there are no more {@link Request} objects to crawl.
  *
  * We can use the `preNavigationHooks` to adjust `gotOptions`:
  *
@@ -64,15 +64,15 @@ const HTTP_OPTIMIZED_AUTOSCALED_POOL_OPTIONS = {
  * By default, this crawler only processes web pages with the `text/html`
  * and `application/xhtml+xml` MIME content types (as reported by the `Content-Type` HTTP header),
  * and skips pages with other content types. If you want the crawler to process other content types,
- * use the {@
+ * use the {@link HttpCrawlerOptions.additionalMimeTypes} constructor option.
  * Beware that the parsing behavior differs for HTML, XML, JSON and other types of content.
- * For details, see {@
+ * For details, see {@link HttpCrawlerOptions.requestHandler}.
  *
  * New requests are only dispatched when there is enough free CPU and memory available,
- * using the functionality provided by the {@
- * All {@
+ * using the functionality provided by the {@link AutoscaledPool} class.
+ * All {@link AutoscaledPool} configuration options can be passed to the `autoscaledPoolOptions`
  * parameter of the constructor. For user convenience, the `minConcurrency` and `maxConcurrency`
- * {@
+ * {@link AutoscaledPool} options are available directly in the constructor.
  *
  * **Example usage:**
  *
@@ -123,7 +123,7 @@ class HttpCrawler extends basic_1.BasicCrawler {
            value: config
        });
        /**
-         * A reference to the underlying {@
+         * A reference to the underlying {@link ProxyConfiguration} class that manages the crawler's proxies.
         * Only available if used by the crawler.
         */
        Object.defineProperty(this, "proxyConfiguration", {
@@ -699,9 +699,9 @@ function parseContentTypeFromResponse(response) {
     };
 }
 /**
- * Creates new {@
- * This instance can then serve as a `requestHandler` of your {@
- * Defaults to the {@
+ * Creates new {@link Router} instance that works based on request labels.
+ * This instance can then serve as a `requestHandler` of your {@link HttpCrawler}.
+ * Defaults to the {@link HttpCrawlingContext}.
  *
  * > Serves as a shortcut for using `Router.create<HttpCrawlingContext>()`.
  *

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
     "name": "@crawlee/http",
-    "version": "3.13.3-beta.9",
+    "version": "3.13.3",
     "description": "The scalable web crawling and scraping library for JavaScript/Node.js. Enables development of data extraction and web automation jobs (not only) with headless Chrome and Puppeteer.",
     "engines": {
         "node": ">=16.0.0"
@@ -55,9 +55,9 @@
     "dependencies": {
         "@apify/timeout": "^0.3.0",
         "@apify/utilities": "^2.7.10",
-        "@crawlee/basic": "3.13.3
-        "@crawlee/types": "3.13.3
-        "@crawlee/utils": "3.13.3
+        "@crawlee/basic": "3.13.3",
+        "@crawlee/types": "3.13.3",
+        "@crawlee/utils": "3.13.3",
         "@types/content-type": "^1.1.5",
         "cheerio": "1.0.0-rc.12",
         "content-type": "^1.0.4",
@@ -75,5 +75,5 @@
             }
         }
     },
-    "gitHead": "
+    "gitHead": "279cadbd3cd6342f36cc4d841e07b999e472420d"
 }