sitemap-xml-parser 1.1.0 → 1.2.0

package/README.md CHANGED
@@ -8,6 +8,57 @@ Parses sitemap XML files and returns all listed URLs. Supports sitemap index fil
  npm install sitemap-xml-parser
  ```
 
+ ## CLI
+
+ Run without installing via `npx`:
+
+ ```sh
+ npx sitemap-xml-parser <url> [options]
+ ```
+
+ Or, after installing globally (`npm install -g sitemap-xml-parser`):
+
+ ```sh
+ sitemap-xml-parser <url> [options]
+ ```
+
+ Fetched URLs are printed to stdout, one per line. Errors are printed to stderr. See [Options](#options) for available flags.
+
+ ## Examples
+
+ ```sh
+ # Print all URLs
+ npx sitemap-xml-parser https://example.com/sitemap.xml
+
+ # Count URLs
+ npx sitemap-xml-parser https://example.com/sitemap.xml --count
+
+ # Filter by substring
+ npx sitemap-xml-parser https://example.com/sitemap.xml --filter "blog"
+
+ # Filter and count
+ npx sitemap-xml-parser https://example.com/sitemap.xml --filter "blog" --count
+
+ # Output as TSV
+ npx sitemap-xml-parser https://example.com/sitemap.xml --tsv > urls.tsv
+
+ # Save URLs to a file, errors to a log
+ npx sitemap-xml-parser https://example.com/sitemap.xml > urls.txt 2> errors.log
+ ```
+
+ ## Options
+
+ | Option | Type | Default | Description |
+ |-------------|------------|---------|-----------------------------------------------------------------------------|
+ | `delay` | `number` | `1000` | Milliseconds to wait between batches when following a sitemap index. Default is 1000 to avoid overloading the target server; set to `0` to disable. CLI: `--delay` |
+ | `limit` | `number` | `10` | Number of child sitemaps to fetch concurrently per batch. CLI: `--limit` |
+ | `timeout` | `number` | `30000` | Milliseconds before a request is aborted. CLI: `--timeout` |
+ | `onError` | `function` | — | Called as `onError(url, error)` when a URL fails. The URL is skipped regardless. **Library only.** |
+ | `onEntry` | `function` | — | Called as `onEntry(entry)` each time a URL entry is parsed. `entry` has the same shape as the objects returned by `fetch()`. **Library only.** |
+ | `filter` | `string` | — | Only output URLs whose `loc` contains the given string (substring match). Can be combined with `--count` or `--tsv`. **CLI only.** |
+ | `tsv` | — | — | Output results as tab-separated values. Prints a header row (`loc`, `lastmod`, `changefreq`, `priority`) followed by one row per entry. Missing fields are output as empty strings. **CLI only.** |
+ | `count` | — | — | Print only the total number of URLs instead of listing them. **CLI only.** |
+
  ## Usage
 
  ```js
@@ -35,17 +86,6 @@ const parser = new SitemapXMLParser('https://example.com/sitemap.xml', {
  });
  ```
 
- ## Options
-
- | Option | Type | Default | Description |
- |-------------|------------|---------|-----------------------------------------------------------------------------|
- | `delay` | `number` | `3000` | Milliseconds to wait between batches when following a sitemap index. Default is 3000 to avoid overloading the target server; set to `0` to disable. CLI: `--delay` |
- | `limit` | `number` | `5` | Number of child sitemaps to fetch concurrently per batch. CLI: `--limit` |
- | `timeout` | `number` | `30000` | Milliseconds before a request is aborted. CLI: `--timeout` |
- | `onError` | `function` | — | Called as `onError(url, error)` when a URL fails. The URL is skipped regardless. **Library only.** |
- | `onEntry` | `function` | — | Called as `onEntry(entry)` each time a URL entry is parsed. `entry` has the same shape as the objects returned by `fetch()`. **Library only.** |
- | `tsv` | — | — | Output results as tab-separated values. Prints a header row (`loc`, `lastmod`, `changefreq`, `priority`) followed by one row per entry. Missing fields are output as empty strings. **CLI only.** |
-
  ## Return value
 
  `fetch()` resolves to an array of URL entry objects. Each object reflects the fields present in the sitemap:
@@ -66,44 +106,6 @@ All field values are arrays (xml2js convention). Use `entry.loc[0]` to get the U
 
  Fields other than `loc` (`lastmod`, `changefreq`, `priority`, etc.) are included only when present in the source XML.
 
- ## CLI
-
- Run without installing via `npx`:
-
- ```sh
- npx sitemap-xml-parser <url> [options]
- ```
-
- Or, after installing globally (`npm install -g sitemap-xml-parser`):
-
- ```sh
- sitemap-xml-parser <url> [options]
- ```
-
- Fetched URLs are printed to stdout, one per line. Errors are printed to stderr. See [Options](#options) for available flags.
-
- ### Examples
-
- ```sh
- # Print all URLs
- npx sitemap-xml-parser https://example.com/sitemap.xml
-
- # No delay, higher concurrency
- npx sitemap-xml-parser https://example.com/sitemap.xml --delay 0 --limit 10
-
- # Save URLs to a file, errors to a log
- npx sitemap-xml-parser https://example.com/sitemap.xml > urls.txt 2> errors.log
-
- # Custom timeout
- npx sitemap-xml-parser https://example.com/sitemap.xml --timeout 10000
-
- # Output as TSV (includes lastmod, changefreq, priority)
- npx sitemap-xml-parser https://example.com/sitemap.xml --tsv
-
- # Save TSV to a file
- npx sitemap-xml-parser https://example.com/sitemap.xml --tsv > urls.tsv
- ```
-
  ## Limitations
 
  - **HTTP redirects are followed up to 5 times.** Status codes 301, 302, 303, 307, and 308 are handled automatically by following the `Location` header (relative URLs are resolved against the current URL). If the redirect chain exceeds 5 hops, an error is raised via `onError`.
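
Note on the `tsv` row in the README Options table: the described output (a header row, one row per entry, missing fields as empty strings) can be sketched as a small pure function. This is only an illustration, not the package's code; `TSV_FIELDS`, `toTsvRow`, and the sample `entries` are hypothetical names, and the entry shape (arrays per field, xml2js convention) is taken from the README text.

```javascript
// Hypothetical helper for illustration; the CLI itself writes rows inline in its onEntry callback.
const TSV_FIELDS = ['loc', 'lastmod', 'changefreq', 'priority'];

function toTsvRow(entry) {
  // Each field value is an array (xml2js convention); use the first item, or '' when the field is absent.
  return TSV_FIELDS.map((field) => entry[field]?.[0] ?? '').join('\t');
}

// Sample entries (made up) in the shape the README describes for fetch() results.
const entries = [
  { loc: ['https://example.com/'], lastmod: ['2024-01-01'], priority: ['0.8'] },
  { loc: ['https://example.com/blog/post'] },
];

// Header row first, then one row per entry.
const tsv = [TSV_FIELDS.join('\t'), ...entries.map(toTsvRow)].join('\n');
console.log(tsv);
```

Every row keeps four columns even when only `loc` is present, which is what lets the output be loaded into a spreadsheet without column drift.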
package/bin/cli.js CHANGED
@@ -8,10 +8,12 @@ function printUsage() {
      'Usage: sitemap-xml-parser <url> [options]',
      '',
      'Options:',
-     ' --delay <ms> Delay between batches in milliseconds (default: 3000)',
-     ' --limit <n> Concurrent fetches per batch (default: 5)',
+     ' --delay <ms> Delay between batches in milliseconds (default: 1000)',
+     ' --limit <n> Concurrent fetches per batch (default: 10)',
      ' --timeout <ms> Request timeout in milliseconds (default: 30000)',
+     ' --filter <str> Only output URLs that contain <str>',
      ' --tsv Output as tab-separated values with a header row',
+     ' --count Print only the total number of URLs',
      ' --help Show this help message',
      '',
    ].join('\n'));
@@ -19,9 +21,11 @@ function printUsage() {
 
  function parseArgs(argv) {
    const args = argv.slice(2);
-   const opts = { delay: 3000, limit: 5, timeout: 30000 };
+   const opts = { delay: 1000, limit: 10, timeout: 30000 };
    let url = null;
    let tsv = false;
+   let count = false;
+   let filter = null;
 
    for (let i = 0; i < args.length; i++) {
      const arg = args[i];
@@ -30,6 +34,14 @@ function parseArgs(argv) {
        process.exit(0);
      } else if (arg === '--tsv') {
        tsv = true;
+     } else if (arg === '--count') {
+       count = true;
+     } else if (arg === '--filter') {
+       if (++i >= args.length) {
+         process.stderr.write(`Error: --filter requires a value\n`);
+         process.exit(1);
+       }
+       filter = args[i];
      } else if (arg === '--delay') {
        if (++i >= args.length) {
          process.stderr.write(`Error: --delay requires a value\n`);
@@ -80,33 +92,45 @@ function parseArgs(argv) {
      process.exit(1);
    }
 
-   return { url, opts, tsv };
+   return { url, opts, tsv, count, filter };
  }
 
  (async () => {
-   const { url, opts, tsv } = parseArgs(process.argv);
+   const { url, opts, tsv, count, filter } = parseArgs(process.argv);
 
    const red = process.stderr.isTTY ? '\x1b[31m' : '';
    const reset = process.stderr.isTTY ? '\x1b[0m' : '';
 
-   if (tsv) {
+   if (tsv && !count) {
      process.stdout.write('loc\tlastmod\tchangefreq\tpriority\n');
    }
 
    let hasError = false;
+   let filteredCount = 0;
+
+   // onEntry is only skipped when count mode has no filter (result.length is sufficient).
+   const needsOnEntry = !count || filter !== null;
+
    const parser = new SitemapXMLParser(url, {
      ...opts,
-     onEntry: (entry) => {
+     onEntry: needsOnEntry ? (entry) => {
+       const loc = entry.loc?.[0] ?? '';
+       if (filter !== null && !loc.includes(filter)) return;
+
+       if (count) {
+         filteredCount++;
+         return;
+       }
+
        if (tsv) {
-         const loc = entry.loc?.[0] ?? '';
          const lastmod = entry.lastmod?.[0] ?? '';
          const changefreq = entry.changefreq?.[0] ?? '';
          const priority = entry.priority?.[0] ?? '';
          process.stdout.write(`${loc}\t${lastmod}\t${changefreq}\t${priority}\n`);
        } else {
-         process.stdout.write(entry.loc[0] + '\n');
+         process.stdout.write(loc + '\n');
        }
-     },
+     } : null,
      onError: (failedUrl, err) => {
        hasError = true;
        const msg = err.message.replace(/\r?\n/g, ' ').trim();
@@ -114,6 +138,7 @@ function parseArgs(argv) {
      },
    });
 
-   await parser.fetch();
+   const result = await parser.fetch();
+   if (count) process.stdout.write((filter !== null ? filteredCount : result.length) + '\n');
    if (hasError) process.exit(1);
  })();
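
Note on how `--filter` and `--count` combine in the `cli.js` change above: the filter is a plain substring test on `loc`, and count mode suppresses per-URL output and prints a single number. The sketch below restates that decision logic as a pure function over an already-fetched list. `summarize` and the sample `entries` are hypothetical and not part of the package; the real CLI streams entries through `onEntry` and, when no filter is given, takes the count from the length of the `fetch()` result instead.

```javascript
// Hypothetical pure function mirroring the CLI's filter/count behavior; not part of the package.
function summarize(entries, { filter = null, count = false } = {}) {
  const locs = entries
    // Same fallback as the CLI: a missing loc becomes an empty string.
    .map((entry) => entry.loc?.[0] ?? '')
    // Substring match, applied only when a filter was supplied.
    .filter((loc) => filter === null || loc.includes(filter));

  // Count mode returns only the total; otherwise one URL per line.
  return count ? String(locs.length) : locs.join('\n');
}

// Made-up sample data in the xml2js-style entry shape.
const entries = [
  { loc: ['https://example.com/'] },
  { loc: ['https://example.com/blog/a'] },
  { loc: ['https://example.com/blog/b'] },
];

console.log(summarize(entries, { filter: 'blog' }));
console.log(summarize(entries, { filter: 'blog', count: true }));
console.log(summarize(entries, { count: true }));
```

Because the match is a substring test rather than a pattern, a filter such as `"blog"` would also match a URL like `https://example.com/weblog/`.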
package/lib/sitemap.js CHANGED
@@ -10,8 +10,8 @@ const { URL } = require('url');
  class SitemapXMLParser {
    constructor(url, options = {}) {
      this.siteMapUrl = url;
-     this.delayTime = options.delay ?? 3000;
-     this.limit = options.limit ?? 5;
+     this.delayTime = options.delay ?? 1000;
+     this.limit = options.limit ?? 10;
      this.timeout = options.timeout ?? 30000;
      this.onError = options.onError || null;
      this.onEntry = options.onEntry || null;
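
Note on the default change in the constructor hunk above: because the numeric options are resolved with `??`, only `undefined` or `null` fall back to the default, so an explicit `delay: 0` is kept. Callers who relied on the 1.1.0 behavior can presumably pass `delay: 3000` and `limit: 5` explicitly. The sketch below isolates that resolution step; `resolveOptions` is a hypothetical stand-in for the constructor, using the 1.2.0 defaults shown in the diff.

```javascript
// Hypothetical stand-in for the constructor's option handling, using the 1.2.0 defaults from the diff.
function resolveOptions(options = {}) {
  return {
    // `??` falls back only on null/undefined, so 0 is a valid explicit value.
    delay: options.delay ?? 1000,
    limit: options.limit ?? 10,
    timeout: options.timeout ?? 30000,
  };
}

// No options: the new defaults apply.
console.log(resolveOptions());

// delay: 0 disables the wait between batches instead of being replaced by the default.
console.log(resolveOptions({ delay: 0 }));

// Passing the previous defaults explicitly restores the 1.1.0 pacing.
console.log(resolveOptions({ delay: 3000, limit: 5 }));
```

With `||` in place of `??`, `delay: 0` would be treated as falsy and silently replaced by `1000`.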
@@ -120,10 +120,11 @@ class SitemapXMLParser {
      return;
    }
    const chunks = [];
+   const contentEncoding = res.headers['content-encoding'];
    res.on('data', chunk => chunks.push(chunk));
    res.on('end', () => {
      const buf = Buffer.concat(chunks);
-     if (ext === '.gz') {
+     if (ext === '.gz' || contentEncoding === 'gzip') {
        zlib.gunzip(buf, (err, result) => {
          if (err) {
            failOnce(originalUrl, err);
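
Note on the gzip change above: the response body is now decompressed when either the URL ends in `.gz` or the server sends `Content-Encoding: gzip`. The sketch below shows that check in isolation. `decodeBody` is a hypothetical helper, the way `ext` is derived here (from the URL pathname) is an assumption since the diff does not show it, and `zlib.gunzipSync` is used for brevity where the package uses the callback form `zlib.gunzip`.

```javascript
const zlib = require('zlib');
const path = require('path');

// Hypothetical helper; the package performs this check inline in its response handler.
function decodeBody(buf, url, headers = {}) {
  // Assumption: the extension is taken from the URL pathname (not shown in the diff).
  const ext = path.extname(new URL(url).pathname);
  // Node lowercases header names, so the key is 'content-encoding'.
  const gzipped = ext === '.gz' || headers['content-encoding'] === 'gzip';
  return (gzipped ? zlib.gunzipSync(buf) : buf).toString('utf8');
}

const xml = '<urlset></urlset>';
const compressed = zlib.gzipSync(Buffer.from(xml));

// Compressed by the server, plain .xml URL: detected via the header (new in 1.2.0).
console.log(decodeBody(compressed, 'https://example.com/sitemap.xml', { 'content-encoding': 'gzip' }));

// .gz URL without the header: detected via the extension (as before).
console.log(decodeBody(compressed, 'https://example.com/sitemap.xml.gz'));

// Neither signal: the body is used as-is.
console.log(decodeBody(Buffer.from(xml), 'https://example.com/sitemap.xml'));
```

The header comparison is an exact match on the value `gzip`, so a response with a differently cased or multi-valued `Content-Encoding` would not take the gunzip path under this check.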
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "sitemap-xml-parser",
-   "version": "1.1.0",
+   "version": "1.2.0",
    "description": "Parses sitemap XML files and returns all listed URLs. Supports sitemap index files and gzip (.gz) compression.",
    "main": "index.js",
    "types": "index.d.ts",
@@ -19,7 +19,7 @@
    "scripts": {
      "test": "node test/test.js"
    },
-   "keywords": ["sitemap", "xml", "parse", "gzip", "sitemap-index", "cli"],
+   "keywords": ["sitemap", "xml", "parse", "gzip", "sitemap-index", "cli", "tsv"],
    "author": "shinkawax",
    "license": "MIT",
    "repository": {