sitemap-xml-parser 1.0.0 → 1.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -8,15 +8,63 @@ Parses sitemap XML files and returns all listed URLs. Supports sitemap index fil
8
8
  npm install sitemap-xml-parser
9
9
  ```
10
10
 
11
+ ## CLI
12
+
13
+ Run without installing via `npx`:
14
+
15
+ ```sh
16
+ npx sitemap-xml-parser <url> [options]
17
+ ```
18
+
19
+ Or, after installing globally (`npm install -g sitemap-xml-parser`):
20
+
21
+ ```sh
22
+ sitemap-xml-parser <url> [options]
23
+ ```
24
+
25
+ Fetched URLs are printed to stdout, one per line. Errors are printed to stderr. See [Options](#options) for available flags.
26
+
27
+ ## Examples
28
+
29
+ ```sh
30
+ # Print all URLs
31
+ npx sitemap-xml-parser https://example.com/sitemap.xml
32
+
33
+ # Count URLs
34
+ npx sitemap-xml-parser https://example.com/sitemap.xml --count
35
+
36
+ # Filter by substring
37
+ npx sitemap-xml-parser https://example.com/sitemap.xml --filter "blog"
38
+
39
+ # Filter and count
40
+ npx sitemap-xml-parser https://example.com/sitemap.xml --filter "blog" --count
41
+
42
+ # Output as TSV
43
+ npx sitemap-xml-parser https://example.com/sitemap.xml --tsv > urls.tsv
44
+
45
+ # Save URLs to a file, errors to a log
46
+ npx sitemap-xml-parser https://example.com/sitemap.xml > urls.txt 2> errors.log
47
+ ```
48
+
49
+ ## Options
50
+
51
+ | Option | Type | Default | Description |
52
+ |-------------|------------|---------|-----------------------------------------------------------------------------|
53
+ | `delay` | `number` | `1000` | Milliseconds to wait between batches when following a sitemap index. Default is 1000 to avoid overloading the target server; set to `0` to disable. CLI: `--delay` |
54
+ | `limit` | `number` | `10` | Number of child sitemaps to fetch concurrently per batch. CLI: `--limit` |
55
+ | `timeout` | `number` | `30000` | Milliseconds before a request is aborted. CLI: `--timeout` |
56
+ | `onError` | `function` | — | Called as `onError(url, error)` when a URL fails. The failing URL is skipped whether or not `onError` is provided. **Library only.** |
57
+ | `onEntry` | `function` | — | Called as `onEntry(entry)` each time a URL entry is parsed. `entry` has the same shape as the objects returned by `fetch()`. **Library only.** |
58
+ | `filter` | `string` | — | Only output URLs whose `loc` contains the given string (substring match). Can be combined with `--count` or `--tsv`. **CLI only.** |
59
+ | `tsv` | — | — | Output results as tab-separated values. Prints a header row (`loc`, `lastmod`, `changefreq`, `priority`) followed by one row per entry. Missing fields are output as empty strings. Ignored when combined with `--count`. **CLI only.** |
60
+ | `count` | — | — | Print only the total number of URLs instead of listing them. **CLI only.** |
61
+
11
62
  ## Usage
12
63
 
13
64
  ```js
14
65
  const SitemapXMLParser = require('sitemap-xml-parser');
15
66
 
16
- const parser = new SitemapXMLParser('https://example.com/sitemap.xml', {
17
- delay: 3000,
18
- limit: 5,
19
- });
67
+ const parser = new SitemapXMLParser('https://example.com/sitemap.xml');
20
68
 
21
69
  (async () => {
22
70
  const urls = await parser.fetch();
@@ -38,17 +86,6 @@ const parser = new SitemapXMLParser('https://example.com/sitemap.xml', {
38
86
  });
39
87
  ```
40
88
 
41
- ## Options
42
-
43
- | Option | Type | Default | Description |
44
- |-------------|------------|---------|-----------------------------------------------------------------------------|
45
- | `delay` | `number` | `3000` | Milliseconds to wait between batches when following a sitemap index. CLI: `--delay` |
46
- | `limit` | `number` | `5` | Number of child sitemaps to fetch concurrently per batch. CLI: `--limit` |
47
- | `timeout` | `number` | `30000` | Milliseconds before a request is aborted. CLI: `--timeout` |
48
- | `onError` | `function` | — | Called as `onError(url, error)` when a URL fails. The URL is skipped regardless. **API only.** |
49
- | `--help` | — | — | Prints usage information and exits. **CLI only.** |
50
- | `--timeout` | — | — | Same as the `timeout` option above, in milliseconds. **CLI only.** |
51
-
52
89
  ## Return value
53
90
 
54
91
  `fetch()` resolves to an array of URL entry objects. Each object reflects the fields present in the sitemap:
@@ -65,40 +102,10 @@ const parser = new SitemapXMLParser('https://example.com/sitemap.xml', {
65
102
  ]
66
103
  ```
67
104
 
68
- Fields other than `loc` (`lastmod`, `changefreq`, `priority`, etc.) are included only when present in the source XML.
69
-
70
- ## CLI
71
-
72
- Run without installing via `npx`:
73
-
74
- ```sh
75
- npx sitemap-xml-parser <url> [options]
76
- ```
77
-
78
- Or, after installing globally (`npm install -g sitemap-xml-parser`):
105
+ All field values are arrays (xml2js convention). Use `entry.loc[0]` to get the URL string, `entry.lastmod?.[0]` for optional fields, and so on.
79
106
 
80
- ```sh
81
- sitemap-xml-parser <url> [options]
82
- ```
83
-
84
- Fetched URLs are printed to stdout, one per line. Errors are printed to stderr. See [Options](#options) for available flags.
85
-
86
- ### Examples
87
-
88
- ```sh
89
- # Print all URLs
90
- npx sitemap-xml-parser https://example.com/sitemap.xml
91
-
92
- # No delay, higher concurrency
93
- npx sitemap-xml-parser https://example.com/sitemap.xml --delay 0 --limit 10
94
-
95
- # Save URLs to a file, errors to a log
96
- npx sitemap-xml-parser https://example.com/sitemap.xml > urls.txt 2> errors.log
97
-
98
- # Custom timeout
99
- npx sitemap-xml-parser https://example.com/sitemap.xml --timeout 10000
100
- ```
107
+ Fields other than `loc` (`lastmod`, `changefreq`, `priority`, etc.) are included only when present in the source XML.
101
108
 
102
109
  ## Limitations
103
110
 
104
- - **HTTP redirects are not followed.** Responses with status codes 301, 302, or other 3xx are treated as errors. If your sitemap URL redirects, use the final destination URL directly.
111
+ - **HTTP redirects are followed up to 5 times.** Status codes 301, 302, 303, 307, and 308 are handled automatically by following the `Location` header (relative URLs are resolved against the current URL). If the redirect chain exceeds 5 hops, an error is raised via `onError`.
package/bin/cli.js CHANGED
@@ -8,9 +8,12 @@ function printUsage() {
8
8
  'Usage: sitemap-xml-parser <url> [options]',
9
9
  '',
10
10
  'Options:',
11
- ' --delay <ms> Delay between batches in milliseconds (default: 3000)',
12
- ' --limit <n> Concurrent fetches per batch (default: 5)',
11
+ ' --delay <ms> Delay between batches in milliseconds (default: 1000)',
12
+ ' --limit <n> Concurrent fetches per batch (default: 10)',
13
13
  ' --timeout <ms> Request timeout in milliseconds (default: 30000)',
14
+ ' --filter <str> Only output URLs that contain <str>',
15
+ ' --tsv Output as tab-separated values with a header row',
16
+ ' --count Print only the total number of URLs',
14
17
  ' --help Show this help message',
15
18
  '',
16
19
  ].join('\n'));
@@ -18,14 +21,27 @@ function printUsage() {
18
21
 
19
22
  function parseArgs(argv) {
20
23
  const args = argv.slice(2);
21
- const opts = { delay: 3000, limit: 5, timeout: 30000 };
24
+ const opts = { delay: 1000, limit: 10, timeout: 30000 };
22
25
  let url = null;
26
+ let tsv = false;
27
+ let count = false;
28
+ let filter = null;
23
29
 
24
30
  for (let i = 0; i < args.length; i++) {
25
31
  const arg = args[i];
26
32
  if (arg === '--help' || arg === '-h') {
27
33
  printUsage();
28
34
  process.exit(0);
35
+ } else if (arg === '--tsv') {
36
+ tsv = true;
37
+ } else if (arg === '--count') {
38
+ count = true;
39
+ } else if (arg === '--filter') {
40
+ if (++i >= args.length) {
41
+ process.stderr.write(`Error: --filter requires a value\n`);
42
+ process.exit(1);
43
+ }
44
+ filter = args[i];
29
45
  } else if (arg === '--delay') {
30
46
  if (++i >= args.length) {
31
47
  process.stderr.write(`Error: --delay requires a value\n`);
@@ -76,24 +92,53 @@ function parseArgs(argv) {
76
92
  process.exit(1);
77
93
  }
78
94
 
79
- return { url, opts };
95
+ return { url, opts, tsv, count, filter };
80
96
  }
81
97
 
82
98
  (async () => {
83
- const { url, opts } = parseArgs(process.argv);
99
+ const { url, opts, tsv, count, filter } = parseArgs(process.argv);
100
+
101
+ const red = process.stderr.isTTY ? '\x1b[31m' : '';
102
+ const reset = process.stderr.isTTY ? '\x1b[0m' : '';
103
+
104
+ if (tsv && !count) {
105
+ process.stdout.write('loc\tlastmod\tchangefreq\tpriority\n');
106
+ }
84
107
 
85
108
  let hasError = false;
109
+ let filteredCount = 0;
110
+
111
+ // onEntry is only skipped when count mode has no filter (result.length is sufficient).
112
+ const needsOnEntry = !count || filter !== null;
113
+
86
114
  const parser = new SitemapXMLParser(url, {
87
115
  ...opts,
116
+ onEntry: needsOnEntry ? (entry) => {
117
+ const loc = entry.loc?.[0] ?? '';
118
+ if (filter !== null && !loc.includes(filter)) return;
119
+
120
+ if (count) {
121
+ filteredCount++;
122
+ return;
123
+ }
124
+
125
+ if (tsv) {
126
+ const lastmod = entry.lastmod?.[0] ?? '';
127
+ const changefreq = entry.changefreq?.[0] ?? '';
128
+ const priority = entry.priority?.[0] ?? '';
129
+ process.stdout.write(`${loc}\t${lastmod}\t${changefreq}\t${priority}\n`);
130
+ } else {
131
+ process.stdout.write(loc + '\n');
132
+ }
133
+ } : null,
88
134
  onError: (failedUrl, err) => {
89
135
  hasError = true;
90
- process.stderr.write(`Error: ${failedUrl} ${err.message}\n`);
136
+ const msg = err.message.replace(/\r?\n/g, ' ').trim();
137
+ process.stderr.write(`${red}Error: ${failedUrl} — ${msg}${reset}\n`);
91
138
  },
92
139
  });
93
140
 
94
- const entries = await parser.fetch();
95
- for (const entry of entries) {
96
- process.stdout.write(entry.loc[0] + '\n');
97
- }
141
+ const result = await parser.fetch();
142
+ if (count) process.stdout.write((filter !== null ? filteredCount : result.length) + '\n');
98
143
  if (hasError) process.exit(1);
99
144
  })();
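The per-entry logic in the `onEntry` callback above can be factored into a pure function, which makes the filter and TSV behavior easier to see. `formatEntry` is a hypothetical refactoring for illustration; the shipped CLI inlines this logic.

```javascript
// Returns the line the CLI would print for one entry, or null when the
// entry is filtered out. Fields follow the xml2js convention: arrays of
// strings, absent when not present in the source XML.
function formatEntry(entry, { filter = null, tsv = false } = {}) {
  const loc = entry.loc?.[0] ?? '';
  if (filter !== null && !loc.includes(filter)) return null; // filtered out

  if (!tsv) return loc;
  const lastmod = entry.lastmod?.[0] ?? '';
  const changefreq = entry.changefreq?.[0] ?? '';
  const priority = entry.priority?.[0] ?? '';
  return `${loc}\t${lastmod}\t${changefreq}\t${priority}`;
}

const entry = { loc: ['https://example.com/blog/post'], lastmod: ['2024-01-01'] };
console.log(formatEntry(entry, { tsv: true }));
// "https://example.com/blog/post\t2024-01-01\t\t" (missing fields become empty)
```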
package/index.d.ts ADDED
@@ -0,0 +1,23 @@
1
+ declare class SitemapXMLParser {
2
+ constructor(url: string, options?: SitemapXMLParser.SitemapOptions);
3
+ fetch(): Promise<SitemapXMLParser.SitemapEntry[]>;
4
+ }
5
+
6
+ declare namespace SitemapXMLParser {
7
+ interface SitemapEntry {
8
+ loc: string[];
9
+ lastmod?: string[];
10
+ changefreq?: string[];
11
+ priority?: string[];
12
+ }
13
+
14
+ interface SitemapOptions {
15
+ delay?: number;
16
+ limit?: number;
17
+ timeout?: number;
18
+ onError?: (url: string, error: Error) => void;
19
+ onEntry?: (entry: SitemapEntry) => void;
20
+ }
21
+ }
22
+
23
+ export = SitemapXMLParser;
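The `SitemapEntry` shape mirrors xml2js output: every field is an array of strings, and optional fields may be absent entirely. A minimal sketch of unwrapping an entry into plain strings with optional chaining; `toPlainEntry` is illustrative, not part of the package.

```javascript
// Unwrap an xml2js-style entry (arrays of strings) into plain values.
function toPlainEntry(entry) {
  return {
    loc: entry.loc[0],                 // loc is always present
    lastmod: entry.lastmod?.[0],       // undefined when absent
    changefreq: entry.changefreq?.[0],
    priority: entry.priority?.[0],
  };
}

const plain = toPlainEntry({ loc: ['https://example.com/'], priority: ['0.8'] });
console.log(plain.loc, plain.priority); // https://example.com/ 0.8
```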
package/lib/sitemap.js CHANGED
@@ -10,10 +10,11 @@ const { URL } = require('url');
10
10
  class SitemapXMLParser {
11
11
  constructor(url, options = {}) {
12
12
  this.siteMapUrl = url;
13
- this.delayTime = options.delay ?? 3000;
14
- this.limit = options.limit ?? 5;
13
+ this.delayTime = options.delay ?? 1000;
14
+ this.limit = options.limit ?? 10;
15
15
  this.timeout = options.timeout ?? 30000;
16
16
  this.onError = options.onError || null;
17
+ this.onEntry = options.onEntry || null;
17
18
  this.urlArray = [];
18
19
  this.parser = new xml2js.Parser();
19
20
  }
@@ -57,6 +58,7 @@ class SitemapXMLParser {
57
58
  for (const entry of xml.urlset.url) {
58
59
  if (entry && entry.loc?.[0]) {
59
60
  this.urlArray.push(entry);
61
+ if (this.onEntry) this.onEntry(entry);
60
62
  }
61
63
  }
62
64
  }
@@ -64,50 +66,80 @@ class SitemapXMLParser {
64
66
 
65
67
  /**
66
68
  * Fetch body from URL using http/https.
69
+ * Follows redirects (301/302/303/307/308) up to 5 times.
67
70
  * Decompresses gzip automatically when the URL ends with .gz.
68
71
  * Returns null and calls onError on failure.
69
72
  */
70
73
  getBodyFromURL(url) {
74
+ return this._fetchWithRedirect(url, url, 0);
75
+ }
76
+
77
+ _fetchWithRedirect(originalUrl, currentUrl, redirectCount) {
71
78
  return new Promise((resolve) => {
79
+ let settled = false;
80
+ const failOnce = (url, err) => {
81
+ if (settled) return;
82
+ settled = true;
83
+ this._handleError(url, err);
84
+ resolve(null);
85
+ };
86
+
72
87
  let parsedUrl;
73
88
  try {
74
- parsedUrl = new URL(url);
89
+ parsedUrl = new URL(currentUrl);
75
90
  } catch (err) {
76
- this._handleError(url, err);
77
- resolve(null);
91
+ failOnce(originalUrl, err);
78
92
  return;
79
93
  }
80
94
 
81
95
  const ext = path.extname(parsedUrl.pathname);
82
96
  const transport = parsedUrl.protocol === 'https:' ? https : http;
83
97
 
84
- const req = transport.get(url, (res) => {
98
+ const req = transport.get(currentUrl, (res) => {
99
+ const REDIRECT_CODES = [301, 302, 303, 307, 308];
100
+ if (REDIRECT_CODES.includes(res.statusCode)) {
101
+ res.resume();
102
+ const location = res.headers['location'];
103
+ if (!location) {
104
+ failOnce(originalUrl, new Error(`HTTP ${res.statusCode} with no Location header`));
105
+ return;
106
+ }
107
+ if (redirectCount >= 5) {
108
+ failOnce(originalUrl, new Error('Too many redirects (max 5)'));
109
+ return;
110
+ }
111
+ settled = true;
112
+ const nextUrl = new URL(location, currentUrl).href;
113
+ resolve(this._fetchWithRedirect(originalUrl, nextUrl, redirectCount + 1));
114
+ return;
115
+ }
116
+
85
117
  if (res.statusCode < 200 || res.statusCode >= 300) {
86
118
  res.resume();
87
- this._handleError(url, new Error(`HTTP ${res.statusCode}`));
88
- resolve(null);
119
+ failOnce(originalUrl, new Error(`HTTP ${res.statusCode}`));
89
120
  return;
90
121
  }
91
122
  const chunks = [];
123
+ const contentEncoding = res.headers['content-encoding'];
92
124
  res.on('data', chunk => chunks.push(chunk));
93
125
  res.on('end', () => {
94
126
  const buf = Buffer.concat(chunks);
95
- if (ext === '.gz') {
127
+ if (ext === '.gz' || contentEncoding === 'gzip') {
96
128
  zlib.gunzip(buf, (err, result) => {
97
129
  if (err) {
98
- this._handleError(url, err);
99
- resolve(null);
130
+ failOnce(originalUrl, err);
100
131
  } else {
132
+ settled = true;
101
133
  resolve(result.toString());
102
134
  }
103
135
  });
104
136
  } else {
137
+ settled = true;
105
138
  resolve(buf.toString());
106
139
  }
107
140
  });
108
141
  res.on('error', (err) => {
109
- this._handleError(url, err);
110
- resolve(null);
142
+ failOnce(originalUrl, err);
111
143
  });
112
144
  });
113
145
 
@@ -116,8 +148,7 @@ class SitemapXMLParser {
116
148
  });
117
149
 
118
150
  req.on('error', (err) => {
119
- this._handleError(url, err);
120
- resolve(null);
151
+ failOnce(originalUrl, err);
121
152
  });
122
153
  });
123
154
  }
package/package.json CHANGED
@@ -1,23 +1,25 @@
1
1
  {
2
2
  "name": "sitemap-xml-parser",
3
- "version": "1.0.0",
3
+ "version": "1.2.0",
4
4
  "description": "Parses sitemap XML files and returns all listed URLs. Supports sitemap index files and gzip (.gz) compression.",
5
5
  "main": "index.js",
6
+ "types": "index.d.ts",
6
7
  "bin": {
7
8
  "sitemap-xml-parser": "bin/cli.js"
8
9
  },
9
10
  "files": [
10
11
  "index.js",
12
+ "index.d.ts",
11
13
  "lib",
12
14
  "bin"
13
15
  ],
14
16
  "engines": {
15
- "node": ">=18"
17
+ "node": ">=20"
16
18
  },
17
19
  "scripts": {
18
20
  "test": "node test/test.js"
19
21
  },
20
- "keywords": ["sitemap", "xml", "parse", "gzip", "sitemap-index", "cli"],
22
+ "keywords": ["sitemap", "xml", "parse", "gzip", "sitemap-index", "cli", "tsv"],
21
23
  "author": "shinkawax",
22
24
  "license": "MIT",
23
25
  "repository": {