@ioodev/nodescraper 1.1.1
- package/CHANGELOG.md +106 -0
- package/LICENSE +21 -0
- package/README.md +449 -0
- package/index.js +3 -0
- package/package.json +57 -0
- package/src/NodeScraper.js +534 -0
- package/src/constants.js +46 -0
- package/src/utils.js +105 -0
- package/types/index.d.ts +152 -0
package/CHANGELOG.md
ADDED
@@ -0,0 +1,106 @@
# Changelog

All notable changes to this project are documented in this file.
The format follows [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project follows [Semantic Versioning](https://semver.org/).

## [1.1.1] — 2026-06-19

### Changed

- **Package renamed from `@riodevnet/nodescraper` to `@ioodev/nodescraper`.**
  npm scopes are tied 1:1 to an account/organization name and cannot be
  renamed in place, so this is a fresh publish under the `@ioodev` scope
  rather than an update to the old package. No code or API changes.
- Updated all `github.com/riodevnet/...` references (README, `package.json`
  `repository`/`homepage`/`bugs`, default User-Agent string in
  `src/constants.js`, TypeScript header comment) to `github.com/ioodev/...`.

### Compatibility

No functional changes. If you depend on `@riodevnet/nodescraper`, see the
README's "Migrating from `@riodevnet/nodescraper`" section — it's a
drop-in rename, just change the import/install path to `@ioodev/nodescraper`.

## [1.1.0] — 2026-06-19

A bug-fix-and-feature release. Every existing v1.0 method keeps its name
and return shape — code written against 1.0 keeps working — but several
returned values are now *correct* where they previously weren't, and a
number of long-missing capabilities (error visibility, raw-HTML loading,
JSON-LD, etc.) have been added.

### Fixed

- **`keywords()` / `viewport()` returned untrimmed strings.** Splitting
  `"example, domain, test"` on `,` used to yield `["example", " domain",
  " test"]` (note the leading spaces). Entries are now trimmed and empty
  entries are dropped.
- **`link_details()` returned `[""]` instead of `[]`** for `rel` on links
  with no `rel` attribute, which broke `.includes('nofollow')`-style
  checks on those links. `rel` is now reliably `[]` when absent.
- **Failed loads were completely silent.** `init()` swallowed every error
  (network failure, timeout, 404/403/500 responses, invalid URLs) and just
  left `soup` as `null`, with no way to find out *why*. `init()` now
  records the failure on `this.error` / `this.statusCode`, exposed via
  `getError()` and `getStatusCode()`, and an explicit `isLoaded()` check.
- **No default `User-Agent`.** Axios' default UA string causes many real
  sites to return 403 or an empty body. NodeScraper now sends a realistic
  browser-like `User-Agent` by default (overridable via `options.userAgent`
  or `options.headers`).
- **`filter()` could throw on a malformed selector** (e.g. a typo'd
  attribute/class selector) and crash the caller. It now fails soft and
  returns `null`, consistent with every other getter.
- **No protocol allow-list.** `_isValidUrl()` accepted any URL that the
  `URL` constructor could parse, including non-HTTP schemes. Targets are
  now restricted to `http:`/`https:` by default (configurable via
  `options.allowedProtocols`), failing fast with a clear error instead of
  relying on the underlying HTTP client to reject it.
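
The first two fixes in this list come down to how delimited attribute values are split. The sketch below illustrates that behaviour with two hypothetical helpers, `splitAndTrim` and `parseRel`; these names are assumptions for illustration, since the package's own helpers in `src/utils.js` are not shown in this diff. In JavaScript, `"".split(" ")` evaluates to `[""]`, which is one plausible way the old `['']` value could arise.

```js
// Hypothetical helpers illustrating the trimmed-split behaviour described above.
// They are not the package's own code.

// Split a delimited string, trim each entry, and drop empty entries.
function splitAndTrim(value, separator = ",") {
  if (typeof value !== "string") return [];
  return value
    .split(separator)
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0);
}

// Parse a whitespace-separated `rel` attribute; a missing attribute yields [].
function parseRel(relAttribute) {
  if (!relAttribute) return [];
  return relAttribute.split(/\s+/).filter(Boolean);
}

console.log(splitAndTrim("example, domain, test")); // [ 'example', 'domain', 'test' ]
console.log(parseRel(undefined).includes("nofollow")); // false
console.log(parseRel("nofollow noopener")); // [ 'nofollow', 'noopener' ]
```

With an empty array for a missing `rel`, `.includes('nofollow')`-style checks need no special case.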

### Added

- `loadHTML(html)` — parse a raw HTML string synchronously, with no HTTP
  request. Useful for tests or HTML obtained some other way.
- `meta(name, attr)` — generic meta tag reader for any `name`/`property`.
- `lang()`, `robots()`, `favicon()` — new metadata getters. `favicon()`
  resolves to an absolute URL using the page URL as the base.
- `jsonLd()` — extracts and parses every `<script type="application/ld+json">`
  block on the page, skipping malformed ones.
- `text()` — whitespace-normalized visible body text.
- `html()` — the raw HTML of the last successful load.
- `viewport_object()` — viewport directives parsed into a key/value object.
- `toJSON()` — a ready-to-serialize snapshot of the most commonly used fields.
- `link_details()` / `image_details()` now include an `absolute_url` field
  resolved against the page URL, alongside the original (possibly relative)
  `url`.
- Constructor `options`: `timeout`, `userAgent`, `headers`, `maxRedirects`,
  `allowedProtocols`, `throwOnError`.
- `NodeScraper.scrape(url, options)` and `NodeScraper.scrapeAll(urls, options)`
  static convenience methods for one-line and concurrent scraping.
- TypeScript declarations (`types/index.d.ts`), referenced via the
  package's `types` field.
- Test suite (`node --test`) covering metadata extraction, link/image
  details, `filter()`, and `init()` against a local HTTP server (404,
  redirects, UA-blocking, timeouts, unsupported protocols).
- Runnable examples under `examples/`.
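
The `favicon()` and `absolute_url` entries above both describe resolving a possibly relative URL against the page URL. One way to do that in Node.js is the built-in WHATWG `URL` constructor, which accepts a base as its second argument. The helper name `toAbsoluteUrl` is hypothetical, and whether the package resolves URLs exactly this way is an assumption.

```js
// Hypothetical helper: resolve a possibly relative URL against the page URL.
// Not the package's own code.
function toAbsoluteUrl(url, pageUrl) {
  try {
    return new URL(url, pageUrl).href;
  } catch (err) {
    // Unparseable input or base: fail soft with null.
    return null;
  }
}

console.log(toAbsoluteUrl("/favicon.ico", "https://example.com/blog/post"));
// https://example.com/favicon.ico
console.log(toAbsoluteUrl("img/logo.png", "https://example.com/blog/post"));
// https://example.com/blog/img/logo.png
console.log(toAbsoluteUrl("https://cdn.example.org/a.js", "https://example.com/"));
// https://cdn.example.org/a.js
```

A URL that is already absolute passes through unchanged, so the same helper can serve both relative and absolute `href`/`src` values.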

### Changed

- Project reorganized into `src/` (implementation), `test/`, `examples/`,
  and `types/`, with `index.js` at the root re-exporting `src/NodeScraper.js`
  for a stable import path. See the README's "Project Structure" section.
- `package.json` gained `engines`, `repository`, `homepage`, `bugs`,
  `files`, `exports`, and `types` fields.

### Compatibility

No breaking changes. All v1.0 method signatures and return *types* are
unchanged; only the contents of a few previously-incorrect return values
were fixed (see above). If your code relied on the untrimmed keyword/viewport
strings or on `rel: ['']`, double-check those spots.

## [1.0.0]

Initial release: metadata extraction (title, description, Open Graph,
Twitter Card, canonical, CSRF, etc.), heading/list/image/link extraction,
and the `filter()` custom DOM query helper.
package/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 SnakyScraper

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
package/README.md
ADDED
@@ -0,0 +1,449 @@
# 🕸️ NodeScraper

**NodeScraper** is a fast and flexible Node.js web scraping toolkit built on [Axios](https://www.npmjs.com/package/axios) and [Cheerio](https://www.npmjs.com/package/cheerio). It gives you a small, predictable API for pulling structured metadata and HTML out of a page — titles, Open Graph/Twitter Card tags, JSON-LD, headings, lists, images, links, and arbitrary DOM fragments — with clean, consistent return values.

> Fast. Clean. JavaScript-style scraping. 🕸️⚡

![Node.js](https://img.shields.io/badge/Node.js-%3E%3D16-green)
![License](https://img.shields.io/badge/license-MIT-blue)
![Tests](https://img.shields.io/badge/tests-node%3Atest-brightgreen)

---

## Table of Contents

- [What's new in 1.1.0](#-whats-new-in-110)
- [Features](#-features)
- [Installation](#-installation)
- [Quick start](#-quick-start)
- [Error handling](#-error-handling)
- [API reference](#-api-reference)
- [Custom DOM filtering](#-custom-dom-filtering)
- [TypeScript](#-typescript)
- [Project structure](#-project-structure)
- [Testing](#-testing)
- [Examples](#-examples)
- [Migrating from 1.0.x](#-migrating-from-10x)
- [Migrating from @riodevnet/nodescraper](#-migrating-from-riodevnetnodescraper)
- [Contributing](#-contributing)
- [License](#-license)

---

## 📦 Package renamed: `@riodevnet/nodescraper` → `@ioodev/nodescraper`

As of **v1.1.1**, this package is published under the `@ioodev` scope. npm
scopes are tied to the account/organization name and can't be renamed
in-place, so this is a fresh publish under the new scope rather than an
update to the old one. If you're on `@riodevnet/nodescraper`, switch to
`@ioodev/nodescraper` — the API is unchanged. See
[Migrating from `@riodevnet`](#-migrating-from-riodevnetnodescraper) below.

## 🆕 What's new in 1.1.0

This release fixes several real bugs and adds capabilities that were
missing from 1.0 — full details in [`CHANGELOG.md`](./CHANGELOG.md).

**Fixed**
- `keywords()` / `viewport()` no longer return untrimmed strings (`" domain"` → `"domain"`).
- `link_details().rel` is `[]` instead of `['']` when a link has no `rel` attribute.
- Failed loads are no longer silent — `getError()` / `getStatusCode()` tell you *why* a scrape failed (network error, timeout, 404/403/500, invalid URL).
- A realistic default `User-Agent` is now sent, so sites that block the bare Axios UA no longer fail with no explanation.
- `filter()` fails soft (`null`) instead of throwing on a malformed selector.
- URLs are restricted to `http:`/`https:` by default, failing fast with a clear error.

**Added**
- `loadHTML(html)` — parse a raw HTML string with no network request.
- `meta()`, `lang()`, `robots()`, `favicon()`, `jsonLd()`, `text()`, `html()`, `viewport_object()`, `toJSON()`.
- `absolute_url` field on `link_details()` / `image_details()`.
- Constructor options: `timeout`, `userAgent`, `headers`, `maxRedirects`, `allowedProtocols`, `throwOnError`.
- `NodeScraper.scrape()` / `NodeScraper.scrapeAll()` static convenience methods.
- TypeScript declarations, a real test suite, and runnable examples.

Nothing here is a breaking change to method names or return *shapes* — see [Migrating from 1.0.x](#-migrating-from-10x) if you depended on the buggy behavior.

---

## 🚀 Features

- ✅ Page metadata: title, description, keywords, author, charset, lang, robots, favicon, and more
- ✅ Open Graph, Twitter Card, canonical, CSRF token, and JSON-LD structured data
- ✅ HTML extraction: `h1`–`h6`, `p`, `ul`, `ol`, images, links — with absolute URLs resolved for you
- ✅ Powerful `filter()` method with class/ID/tag selectors for arbitrary DOM fragments
- ✅ Clear error reporting (`getError()`, `getStatusCode()`, `isLoaded()`) instead of silent failures
- ✅ Load from a live URL **or** from a raw HTML string (`loadHTML()`) — easy to test and reuse
- ✅ Configurable timeout, headers, User-Agent, redirects, and allowed protocols
- ✅ One-line single/batch scraping via `NodeScraper.scrape()` / `scrapeAll()`
- ✅ Ships with TypeScript declarations
- ✅ Zero-dependency test suite using Node's built-in test runner

---

## 📦 Installation

```bash
npm install @ioodev/nodescraper
```

> Requires Node.js 16 or later.

---

## 🛠️ Quick start

```js
const NodeScraper = require("@ioodev/nodescraper");

(async () => {
  const scraper = new NodeScraper("https://example.com");
  await scraper.init();

  if (!scraper.isLoaded()) {
    console.error("Scrape failed:", scraper.getError().message);
    return;
  }

  console.log(scraper.title()); // "Welcome to Example.com"
  console.log(scraper.description()); // "This is the example meta description."
  console.log(scraper.h1()); // ["Welcome", "Latest News"]
  console.log(scraper.open_graph()); // { "og:title": "...", "og:description": "...", ... }

  // One call, every common field:
  console.log(scraper.toJSON());
})();
```

Or with the one-line convenience wrapper:

```js
const scraper = await NodeScraper.scrape("https://example.com");
```

---

## ⚠️ Error handling

Unlike 1.0, failures are no longer silent. After `init()`, always check
`isLoaded()` (or `getError()`) before calling the getters:

```js
const scraper = await NodeScraper.scrape("https://example.com/maybe-missing");

if (!scraper.isLoaded()) {
  console.error(scraper.getError().message); // e.g. "Request failed with status code 404"
  console.error(scraper.getStatusCode()); // 404, or null for network-level failures
} else {
  console.log(scraper.title());
}
```

If you'd rather handle failures with try/catch, pass `throwOnError: true`:

```js
try {
  const scraper = await NodeScraper.scrape(url, { throwOnError: true });
  console.log(scraper.title());
} catch (err) {
  console.error("Scrape failed:", err.message);
}
```

When no document is loaded (before `init()`/`loadHTML()`, or after a failed
load), every getter returns `null` rather than throwing — it's always safe
to call them, you just won't get data back.
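
The behaviour described in this section (record the failure, keep getters safe, optionally rethrow) can be sketched without any network access. Everything below is an illustrative assumption: the class name `LoadState`, the injected `fetcher` function standing in for the HTTP client, and the regex-based `title()` are not taken from the package's source, which this diff does not show.

```js
// Hypothetical sketch of the "record instead of throw" pattern; not the package's code.
class LoadState {
  constructor({ throwOnError = false } = {}) {
    this.throwOnError = throwOnError;
    this.error = null;
    this.statusCode = null;
    this.document = null;
  }

  // `fetcher` is an injected async function returning { status, body };
  // it stands in for the real HTTP client.
  async load(fetcher) {
    this.error = null;
    this.statusCode = null;
    this.document = null;
    try {
      const { status, body } = await fetcher();
      this.statusCode = status;
      if (status < 200 || status >= 300) {
        throw new Error(`Request failed with status code ${status}`);
      }
      this.document = body;
    } catch (err) {
      this.error = err; // remember why the load failed
      if (this.throwOnError) throw err;
    }
    return this;
  }

  isLoaded() {
    return this.document !== null;
  }

  // Getters return null when nothing is loaded instead of throwing.
  title() {
    if (!this.isLoaded()) return null;
    const match = /<title>([^<]*)<\/title>/i.exec(this.document);
    return match ? match[1].trim() : null;
  }
}

(async () => {
  const state = await new LoadState().load(async () => ({ status: 404, body: "" }));
  console.log(state.isLoaded(), state.statusCode, state.error.message);
  // false 404 Request failed with status code 404
  console.log(state.title()); // null
})();
```

If the injected `fetcher` itself throws (the analogue of a network-level failure), `statusCode` stays `null` while `error` is still recorded.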

---

## 🧪 API reference

### Constructor

```js
new NodeScraper(url, options);
```

| Option             | Type       | Default                     | Description                                                 |
|--------------------|------------|-----------------------------|-------------------------------------------------------------|
| `timeout`          | `number`   | `10000`                     | Request timeout, in ms.                                     |
| `userAgent`        | `string`   | a realistic browser-like UA | Sent as the `User-Agent` header.                            |
| `headers`          | `object`   | `{}`                        | Extra headers merged into the request.                      |
| `maxRedirects`     | `number`   | `5`                         | Maximum redirects to follow.                                |
| `allowedProtocols` | `string[]` | `['http:', 'https:']`       | Protocols accepted by the URL validator.                    |
| `throwOnError`     | `boolean`  | `false`                     | If `true`, `init()` rejects instead of recording the error. |
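
As a rough illustration of how the options in this table could fit together, the sketch below merges caller options over the documented defaults and checks a target URL against the protocol allow-list. The helper names (`resolveOptions`, `buildHeaders`, `isAllowedUrl`) and the placeholder User-Agent string are assumptions; the package's real default UA and internal helpers are not shown here.

```js
// Defaults taken from the options table. DEFAULT_USER_AGENT is a placeholder:
// the package's real default string is not reproduced here.
const DEFAULT_USER_AGENT = "Mozilla/5.0 (placeholder browser-like UA)";

const DEFAULT_OPTIONS = {
  timeout: 10000,
  userAgent: DEFAULT_USER_AGENT,
  headers: {},
  maxRedirects: 5,
  allowedProtocols: ["http:", "https:"],
  throwOnError: false,
};

// Merge caller options over the defaults (hypothetical helper).
function resolveOptions(options = {}) {
  return { ...DEFAULT_OPTIONS, ...options };
}

// Build request headers. Letting explicit headers win over `userAgent`
// is an assumption about precedence, not documented behaviour.
function buildHeaders({ userAgent, headers }) {
  return { "User-Agent": userAgent, ...headers };
}

// Validate a target URL against the protocol allow-list.
function isAllowedUrl(target, allowedProtocols) {
  try {
    return allowedProtocols.includes(new URL(target).protocol);
  } catch (err) {
    return false; // not parseable as a URL at all
  }
}

const options = resolveOptions({ timeout: 3000 });
console.log(options.timeout, options.maxRedirects); // 3000 5
console.log(isAllowedUrl("https://example.com", options.allowedProtocols)); // true
console.log(isAllowedUrl("ftp://example.com/file", options.allowedProtocols)); // false
```

Checking the parsed `protocol` rather than a string prefix keeps the validation aligned with how the `URL` constructor interprets the input.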

### Loading

```js
await scraper.init(); // fetch `url` and parse the response
scraper.loadHTML(htmlString); // parse a raw HTML string, no network request
scraper.isLoaded(); // boolean
scraper.getError(); // Error | null
scraper.getStatusCode(); // number | null
```

### Page metadata

```js
scraper.title();
scraper.description();
scraper.keywords(); // string[] | null, trimmed
scraper.keyword_string(); // raw "keywords" content attribute
scraper.charset();
scraper.lang(); // <html lang="...">
scraper.canonical();
scraper.content_type();
scraper.author();
scraper.csrf_token();
scraper.image(); // shorthand for og:image
scraper.favicon(); // absolute URL
scraper.robots();
scraper.viewport(); // string[] | null, e.g. ["width=device-width", "initial-scale=1"]
scraper.viewport_string(); // raw content attribute
scraper.viewport_object(); // { width: "device-width", "initial-scale": "1" }
scraper.meta("theme-color"); // any meta[name=...] (pass attr: 'property' for meta[property=...])
```
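
To make the relationship between `viewport_string()` and `viewport_object()` concrete, here is a minimal sketch of turning a raw viewport `content` value into a key/value object. The function name `parseViewport` is hypothetical, and the handling of directives without a value (stored as an empty string) is an assumption rather than documented behaviour.

```js
// Hypothetical parser for a viewport `content` attribute; not the package's code.
function parseViewport(content) {
  if (typeof content !== "string") return null;
  const result = {};
  for (const part of content.split(",")) {
    const [key, ...rest] = part.split("=");
    const name = key.trim();
    if (!name) continue; // skip empty directives such as a trailing comma
    result[name] = rest.join("=").trim();
  }
  return result;
}

console.log(parseViewport("width=device-width, initial-scale=1"));
// { width: 'device-width', 'initial-scale': '1' }
```

Values stay as strings, which matches the `"initial-scale": "1"` shape shown in the listing above.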

### Open Graph, Twitter Card & JSON-LD

```js
scraper.open_graph(); // all known og:* properties
scraper.open_graph("og:title"); // a single property

scraper.twitter_card();
scraper.twitter_card("twitter:title");

scraper.jsonLd(); // parsed array of every <script type="application/ld+json"> block
```
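
The changelog notes that `jsonLd()` skips malformed blocks. A minimal sketch of that parse-and-skip step follows; the function name `parseJsonLdBlocks` is hypothetical, and its `rawBlocks` argument stands in for the text of each `<script type="application/ld+json">` element, however those strings were collected.

```js
// Hypothetical sketch of tolerant JSON-LD parsing; not the package's code.
function parseJsonLdBlocks(rawBlocks) {
  const parsed = [];
  for (const raw of rawBlocks) {
    try {
      parsed.push(JSON.parse(raw));
    } catch (err) {
      // Malformed block: skip it and keep going.
    }
  }
  return parsed;
}

const blocks = [
  '{"@context":"https://schema.org","@type":"Article","headline":"Hello"}',
  "{ this is not valid JSON",
];
console.log(parseJsonLdBlocks(blocks).length); // 1
```

One bad block therefore never prevents the valid ones from being returned.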

### Headings, text & lists

```js
scraper.h1(); scraper.h2(); scraper.h3();
scraper.h4(); scraper.h5(); scraper.h6();
scraper.p();

scraper.text(); // normalized, whitespace-collapsed visible body text
scraper.html(); // raw HTML of the last successful load

scraper.ul(); // flattened <li> text from every <ul>
scraper.ol(); // flattened <li> text from every <ol>
```

### Images & links

```js
scraper.images(); // string[] of img src
scraper.image_details(); // [{ url, absolute_url, alt_text, title }]

scraper.links(); // string[] of href
scraper.link_details();
// [{ url, absolute_url, protocol, text, title, target, rel,
//    is_nofollow, is_ugc, is_noopener, is_noreferrer }]
```

### Convenience

```js
scraper.toJSON();
// { url, statusCode, title, description, canonical, lang, charset, robots,
//   keywords, author, image, favicon, openGraph, twitterCard,
//   headings: { h1, h2, h3 }, linkCount, imageCount }

NodeScraper.scrape(url, options); // Promise<NodeScraper>
NodeScraper.scrapeAll(urls, options); // Promise<NodeScraper[]>, concurrent
```
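
For readers wondering what "concurrent" can mean for a helper like `scrapeAll()`, one common approach is to start every request at once and wait for all of them with `Promise.all`. The sketch below assumes that approach; `scrapeAllSketch` and the injected `scrapeOne` function are hypothetical stand-ins, and the package may implement batching differently.

```js
// Hypothetical sketch of concurrent batch scraping; not the package's code.
// `scrapeOne` is an injected async function standing in for a single scrape.
function scrapeAllSketch(urls, scrapeOne) {
  // All calls start immediately; results keep the order of `urls`.
  return Promise.all(urls.map((url) => scrapeOne(url)));
}

(async () => {
  // A fake scraper that "fails soft": it resolves with a flag instead of rejecting.
  const fakeScrape = async (url) => ({ url, loaded: !url.includes("missing") });
  const results = await scrapeAllSketch(
    ["https://example.com/a", "https://example.com/missing"],
    fakeScrape
  );
  console.log(results.map((r) => r.loaded)); // [ true, false ]
})();
```

`Promise.all` rejects as soon as any input rejects, so this shape works best when each individual scrape records its error rather than throwing, as is the default described under Error handling.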

---

## 🔍 Custom DOM filtering

Use `filter()` to target specific elements and pull nested content out of them.

```js
// Single element
scraper.filter({
  element: "div",
  attributes: { id: "main" },
  extract: [".title", "#description", "p"],
});

// Multiple elements
scraper.filter({
  element: "div",
  attributes: { class: "card" },
  multiple: true,
  extract: ["h1", ".subtitle", "#meta"],
});

// Plain text instead of HTML
scraper.filter({
  element: "p",
  attributes: { class: "dark-text" },
  multiple: true,
  returnHtml: false,
});
```

- `extract` accepts tag names, class selectors (`.title`), or ID selectors (`#meta`).
- Output keys are normalized: `.title` → `class__title`, `#meta` → `id__meta`.
- With no `extract`, you get the matched element's inner HTML (`returnHtml: true`, the default) or trimmed text (`returnHtml: false`).
- An invalid selector or no match returns `null` (or `[]` for `multiple: true`) — it never throws.
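
The key normalization mentioned in the list above can be expressed in a few lines. This is a sketch only: the function name `normalizeExtractKey` is hypothetical, and leaving plain tag names unchanged is an assumption, since the list documents only the class and ID cases.

```js
// Hypothetical sketch of mapping `extract` selectors to output keys.
function normalizeExtractKey(selector) {
  if (selector.startsWith(".")) return `class__${selector.slice(1)}`;
  if (selector.startsWith("#")) return `id__${selector.slice(1)}`;
  return selector; // assumption: plain tag names are used as-is
}

console.log([".title", "#meta", "p"].map(normalizeExtractKey));
// [ 'class__title', 'id__meta', 'p' ]
```

Prefixing the key with the selector type keeps `.meta` and `#meta` from colliding in the same result object.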

---

## 📘 TypeScript

Type declarations ship with the package (`types/index.d.ts`, wired up via
`package.json#types`) — no `@types/` package needed:

```ts
import NodeScraper, { ScraperSnapshot, LinkDetails } from "@ioodev/nodescraper";

const scraper = new NodeScraper("https://example.com");
await scraper.init();

const snapshot: ScraperSnapshot | null = scraper.toJSON();
const links: LinkDetails[] | null = scraper.link_details();
```

---

## 📁 Project structure

```
nodescraper/
├── .github/
│   └── workflows/
│       └── test.yml           # CI: runs the test suite on push/PR across Node 16–22
├── examples/
│   ├── 01-basic-usage.js
│   ├── 02-custom-filter.js
│   ├── 03-batch-scraping.js
│   └── 04-json-ld-and-extras.js
├── src/
│   ├── NodeScraper.js         # main class — all implementation lives here
│   ├── constants.js           # default UA, timeout, OG/Twitter property lists
│   └── utils.js               # small pure helpers (URL validation, trimming, etc.)
├── test/
│   ├── fixtures/
│   │   └── sample.html        # HTML fixture used by the test suite
│   ├── helpers/
│   │   └── test-server.js     # local HTTP server (200/404/redirect/403/slow routes)
│   └── nodescraper.test.js    # the test suite itself
├── types/
│   └── index.d.ts             # TypeScript declarations
├── index.js                   # entry point — re-exports src/NodeScraper.js
├── package.json
├── CHANGELOG.md
├── README.md
└── LICENSE
```

`index.js` stays a thin re-export so `require("@ioodev/nodescraper")`
keeps working exactly as before; all real logic lives under `src/`, which
keeps the public entry point stable while leaving room to split the
implementation further (e.g. a `src/extractors/` folder) without touching
how consumers import the package.

---

## 🧪 Testing

The test suite uses Node's built-in test runner — no extra dev dependency
required.

```bash
npm test            # run the suite once
npm run test:watch  # re-run on file changes
```

It covers:
- Metadata/OG/Twitter/JSON-LD extraction against a fixture page
- The bug fixes above (trimmed keywords/viewport, empty `rel`)
- `filter()`, including the malformed-selector fail-soft path
- `init()` against a local HTTP server: 200, 404, redirects, UA-blocking, timeouts, and rejected protocols
- `loadHTML()`, `toJSON()`, and the `scrape()` / `scrapeAll()` static helpers

---

## 💡 Examples

Runnable scripts live in [`examples/`](./examples):

```bash
npm run example:basic   # metadata + toJSON()
npm run example:filter  # filter() single/multiple/text-vs-html
npm run example:batch   # scrapeAll() with custom headers/timeout
npm run example:extras  # loadHTML(), jsonLd(), favicon(), meta()
```

---

## 🔁 Migrating from 1.0.x

No method was renamed or removed, so existing calls keep working as-is.
Two return values changed because they were bugs, not intentional API:

| Method                      | 1.0.x                             | 1.1.0                        |
|-----------------------------|-----------------------------------|------------------------------|
| `keywords()` / `viewport()` | entries could have leading spaces | entries are trimmed          |
| `link_details()[i].rel`     | `['']` when no `rel` attribute    | `[]` when no `rel` attribute |

If your code special-cased either of those (e.g. `.map(k => k.trim())` on
`keywords()`, or checked `rel.length === 1 && rel[0] === ''`), you can drop
that workaround.

Everything else — `loadHTML()`, `meta()`, `lang()`, `robots()`, `favicon()`,
`jsonLd()`, `text()`, `html()`, `viewport_object()`, `toJSON()`,
`absolute_url` fields, constructor `options`, and the static `scrape()` /
`scrapeAll()` helpers — is purely additive.

---

## 📦 Migrating from `@riodevnet/nodescraper`

This package used to be published as `@riodevnet/nodescraper`. The code,
API, and version history are the same — only the npm scope changed.

```diff
- npm install @riodevnet/nodescraper
+ npm install @ioodev/nodescraper
```

```diff
- const NodeScraper = require("@riodevnet/nodescraper");
+ const NodeScraper = require("@ioodev/nodescraper");
```

Update any `package.json` dependency entries the same way, then reinstall.
`@riodevnet/nodescraper` is not getting further updates — please move to
`@ioodev/nodescraper` for new fixes and features.

---

## 🤝 Contributing

Contributions are welcome! Found a bug or want to request a feature?
Please open an [issue](https://github.com/ioodev/nodescraper/issues) or
submit a pull request. Run `npm test` before submitting — CI runs the same
suite across Node 16, 18, 20, and 22.

---

## 📄 License

MIT License © 2025–2026 — NodeScraper

---

## 🔗 Related Projects

- [Axios](https://axios-http.com/)
- [Cheerio](https://cheerio.js.org/)
- [Node.js](https://nodejs.org/)

---

## 💡 Why NodeScraper?

> Think of it as your JavaScript web detective — fast, efficient, and precise.
package/index.js
ADDED
package/package.json
ADDED
@@ -0,0 +1,57 @@
{
  "name": "@ioodev/nodescraper",
  "version": "1.1.1",
  "description": "NodeScraper is a fast and flexible Node.js web scraping toolkit built using Axios and Cheerio. It provides an intuitive interface for extracting structured HTML and metadata from websites — with clean and consistent outputs.",
  "main": "index.js",
  "types": "types/index.d.ts",
  "files": [
    "index.js",
    "src",
    "types",
    "README.md",
    "CHANGELOG.md",
    "LICENSE"
  ],
  "exports": {
    ".": "./index.js",
    "./package.json": "./package.json"
  },
  "engines": {
    "node": ">=16"
  },
  "scripts": {
    "test": "node --test",
    "test:watch": "node --test --watch",
    "example:basic": "node examples/01-basic-usage.js",
    "example:filter": "node examples/02-custom-filter.js",
    "example:batch": "node examples/03-batch-scraping.js",
    "example:extras": "node examples/04-json-ld-and-extras.js"
  },
  "keywords": [
    "nodescraper",
    "scraper",
    "scraping",
    "web-scraping",
    "html-parser",
    "cheerio",
    "axios",
    "metadata",
    "open-graph",
    "seo"
  ],
  "author": "Rio Agung Purnomo",
  "license": "MIT",
  "type": "commonjs",
  "repository": {
    "type": "git",
    "url": "git+https://github.com/ioodev/nodescraper.git"
  },
  "homepage": "https://github.com/ioodev/nodescraper#readme",
  "bugs": {
    "url": "https://github.com/ioodev/nodescraper/issues"
  },
  "dependencies": {
    "axios": "^1.10.0",
    "cheerio": "^1.1.0"
  }
}