@akotliar/sitemap-qa 1.0.0-alpha.3 → 1.0.0-alpha.5

package/README.md CHANGED
@@ -11,14 +11,39 @@ Sitemap-QA is a command-line tool that automatically discovers, parses, and anal
11
11
 
12
12
  ---
13
13
 
14
+ ## 📑 Table of Contents
15
+
16
+ - [Why Sitemap-QA?](#-why-sitemap-qa)
17
+ - [Quick Start](#-quick-start)
18
+ - [Installation](#installation)
19
+ - [Basic Usage](#basic-usage)
20
+ - [Features](#-features)
21
+ - [Automatic Sitemap Discovery](#automatic-sitemap-discovery)
22
+ - [Risk Detection Patterns](#risk-detection-patterns)
23
+ - [Customizing Risks](#customizing-risks)
24
+ - [Output Formats](#output-formats)
25
+ - [CLI Commands](#-cli-commands)
26
+ - [analyze](#analyze-command)
27
+ - [init](#init-command)
28
+ - [Configuration](#-configuration)
29
+ - [Configuration Options](#configuration-options)
30
+
31
+ - [License](#-license)
32
+ - [Acknowledgments](#-acknowledgments)
33
+ - [Support](#-support)
34
+
35
+ ---
36
+
14
37
  ## 🎯 Why Sitemap-QA?
15
38
 
16
- Unlike SEO-focused sitemap validators, Sitemap-QA is designed specifically for **QA validation and risk detection**:
39
+ Unlike SEO-focused sitemap validators, Sitemap-QA is designed specifically for **QA validation and risk detection** using a **Policy-as-Code** approach:
17
40
 
18
41
  - ✅ **Detect environment leakage** — Find staging, dev, or test URLs that shouldn't be in production sitemaps
19
42
  - ✅ **Identify exposed admin paths** — Catch `/admin`, `/dashboard`, and internal routes in public indexes
20
- - ✅ **Flag sensitive parameters** — Detect API keys, tokens, or passwords in sitemap URLs
21
- - ✅ **Validate domain consistency** — Find protocol mismatches and subdomain issues
43
+ - ✅ **Flag sensitive files** — Detect database backups, environment files, and archives
44
+ - ✅ **Domain Consistency** — Automatically flag URLs that point to external or incorrect domains (handles `www.` normalization)
45
+ - ✅ **Acceptable Patterns (Allowlist)** — Exclude known safe URLs from being flagged as risks
46
+ - ✅ **Fully Customizable** — Define your own risk categories and patterns using Literal, Glob, or Regex matching
22
47
  - ✅ **Fast and automated** — Analyze thousands of URLs in seconds with detailed reports
23
48
 
24
49
  Perfect for CI/CD pipelines, pre-release validation, and security audits.
@@ -37,14 +62,17 @@ npm install -g @akotliar/sitemap-qa@alpha
37
62
  ### Basic Usage
38
63
 
39
64
  ```bash
40
- # Analyze a website's sitemap
65
+ # Step 1: Initialize a configuration file (optional but recommended)
66
+ sitemap-qa init
67
+
68
+ # Step 2: Analyze a website's sitemap
41
69
  sitemap-qa analyze https://example.com
42
70
 
43
- # Generate JSON output for CI/CD
44
- sitemap-qa analyze https://example.com --output json > report.json
71
+ # Generate only the JSON report (useful for CI/CD)
72
+ sitemap-qa analyze https://example.com --output json
45
73
 
46
- # Increase verbosity for debugging
47
- sitemap-qa analyze https://example.com --verbose
74
+ # Use a custom configuration file
75
+ sitemap-qa analyze https://example.com --config ./custom-config.yaml
48
76
  ```
49
77
 
50
78
  ---
@@ -55,67 +83,102 @@ sitemap-qa analyze https://example.com --verbose
55
83
  - Checks `robots.txt` for sitemap declarations
56
84
  - Tests standard paths (`/sitemap.xml`, `/sitemap_index.xml`, etc.)
57
85
  - Recursively follows sitemap indexes
58
- - Handles multiple sitemaps and formats
59
- - Detects and processes malformed sitemap indexes (sitemaps listed in `<url>` blocks instead of `<sitemap>` blocks)
86
+ - Handles multiple sitemaps in supported formats (XML, compressed XML `.xml.gz`, and dynamically generated/PHP-based sitemaps)
60
87
 
61
88
  ### Risk Detection Patterns
62
89
 
63
- | Risk Category | Severity | Examples | Can Be Excluded |
64
- |--------------|----------|----------|-----------------|
65
- | **Environment Leakage** | High | `staging.example.com`, `/dev/`, `/test/` | ✅ Via patterns |
66
- | **Admin Paths** | High | `/admin`, `/dashboard`, `/config`, `/console` | ✅ Via patterns |
67
- | **Internal Content** | Medium | `/internal` paths | Via patterns |
68
- | **Sensitive Parameters** | High | `?token=`, `?apikey=`, `?password=` | Via patterns |
69
- | **Test Content** | Medium | `/test-`, `sample-`, `demo-` | Via patterns |
70
- | **Protocol Inconsistency** | Medium | HTTP URLs in HTTPS sitemaps | Always detected |
71
- | **Domain Mismatch** | Medium | Different domains in sitemap | ❌ Always detected |
90
+ The tool comes with a set of default policies, but you can fully customize them in your `sitemap-qa.yaml`.
91
+
92
+ | Risk Category | Description | Example Patterns |
93
+ |--------------|-------------|------------------|
94
+ | **Security & Admin** | Detects exposed administrative interfaces and sensitive configuration files. | `**/admin/**`, `**/.env*`, `/wp-admin` |
95
+ | **Environment Leakage** | Finds staging or development URLs that shouldn't be in production sitemaps. | `**/staging.**`, `**/dev.**` |
96
+ | **Sensitive Files** | Flags database backups, archives, and other sensitive file types. | `**/*.{sql,bak,zip,tar}`, `**/*.tar.gz` |
97
+ | **Domain Consistency** | Detects URLs that don't match the target domain (ignoring `www.` differences). | `example.com` vs `other.com` |
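The Domain Consistency row can be pictured as a hostname comparison that strips a leading `www.` from both sides. The TypeScript sketch below is illustrative only: `normalizeHost` and `isSameDomain` are hypothetical names, not functions from the package, and the tool's actual handling of subdomains, ports, and protocols may differ.

```typescript
// Hypothetical helper: lower-case the hostname and drop a leading "www.".
function normalizeHost(url: string): string {
  const host = new URL(url).hostname.toLowerCase();
  return host.startsWith("www.") ? host.slice(4) : host;
}

// Hypothetical check: true when both URLs share the same normalized hostname.
function isSameDomain(targetUrl: string, candidateUrl: string): boolean {
  return normalizeHost(targetUrl) === normalizeHost(candidateUrl);
}

const target = "https://example.com";
const urls = [
  "https://www.example.com/about", // same domain once "www." is ignored
  "https://other.com/landing",     // different domain, would be flagged
];

// Keep only the URLs whose domain does not match the target.
const flagged = urls.filter((u) => !isSameDomain(target, u));
console.log(flagged);
```

Comparing normalized hostnames rather than raw strings keeps `www.example.com` and `example.com` from being reported as a mismatch, which is the behavior the table describes.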
98
+
99
+ ### Customizing Risks
100
+
101
+ You can add your own categories and patterns to the `sitemap-qa.yaml` file. Patterns support `literal`, `glob`, and `regex` matching. See the [Configuration](#-configuration) section for details.
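To make the three pattern types concrete, here is a minimal TypeScript sketch of how a URL might be tested against `literal`, `glob`, and `regex` patterns. It rests on stated assumptions: `literal` is treated as a substring match, and globs are translated by a small hand-written converter (`globToRegExp`, a hypothetical helper) instead of the Micromatch library the package lists as a dependency, so edge cases such as brace sets are not covered.

```typescript
type PatternType = "literal" | "glob" | "regex";

interface Pattern {
  type: PatternType;
  value: string;
  reason: string;
}

// Escape characters that have special meaning in a regular expression.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Assumption: "**" spans any characters, "*" stays within one path segment.
function globToRegExp(glob: string): RegExp {
  const source = glob
    .split("**")
    .map((part) => part.split("*").map(escapeRegExp).join("[^/]*"))
    .join(".*");
  return new RegExp(`^${source}$`);
}

// Assumption: "literal" means the value appears somewhere in the URL.
function matches(pattern: Pattern, url: string): boolean {
  switch (pattern.type) {
    case "literal":
      return url.includes(pattern.value);
    case "glob":
      return globToRegExp(pattern.value).test(url);
    case "regex":
      return new RegExp(pattern.value).test(url);
  }
}

// Pattern values taken from the README's own configuration example.
const patterns: Pattern[] = [
  { type: "glob", value: "**/admin/**", reason: "Administrative interfaces should not be publicly indexed." },
  { type: "literal", value: "/wp-admin", reason: "WordPress admin paths are common attack vectors." },
  { type: "regex", value: ".*\\.php$", reason: "PHP file detected" },
];

function findRisks(url: string): Pattern[] {
  return patterns.filter((p) => matches(p, url));
}

console.log(findRisks("https://example.com/admin/login").map((p) => p.type));
console.log(findRisks("https://example.com/blog/post").length);
```

Keeping the pattern type explicit in each entry means a single list can mix cheap substring checks with more expressive globs and regular expressions.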
72
102
 
73
103
 
74
104
  ### Output Formats
75
105
 
76
106
  #### HTML Report (Interactive)
77
107
  The HTML report provides an interactive, visually appealing view with:
78
- - Expandable/collapsible sections by severity
79
- - Download buttons to export all URLs per category
108
+ - Expandable/collapsible sections by category
109
+ - Download buttons to export all URLs per finding
80
110
  - Clean, modern design with hover effects
81
111
  - Portable single-file format
82
112
 
83
113
  #### JSON Report (Machine-Readable)
84
114
  ```json
  {
-   "analysis_metadata": {
-     "base_url": "https://example.com",
-     "tool_version": "1.0.0",
-     "analysis_type": "rule-based analysis",
-     "analysis_timestamp": "2025-12-11T00:00:00.000Z",
-     "execution_time_ms": 4523
+   "metadata": {
+     "generatedAt": "2025-12-24T12:00:00.000Z",
+     "durationMs": 1240
+   },
+   "summary": {
+     "totalUrls": 895,
+     "totalRisks": 2,
+     "urlsWithRisksCount": 1,
+     "ignoredUrlsCount": 5
    },
-   "sitemaps_discovered": [
-     "https://example.com/sitemap.xml"
-   ],
-   "suspicious_groups": [
+   "findings": [
      {
-       "category": "environment_leakage",
-       "severity": "high",
-       "count": 3,
-       "rationale": "Production sitemap contains staging URLs",
-       "sample_urls": ["..."],
-       "recommended_action": "Verify sitemap generation excludes non-production environments"
+       "loc": "https://example.com/admin/login",
+       "risks": [
+         {
+           "category": "Security & Admin",
+           "pattern": "**/admin/**",
+           "type": "glob",
+           "reason": "Administrative interfaces should not be publicly indexed."
+         }
+       ]
      }
-   ],
-   "summary": {
-     "high_severity_count": 2,
-     "medium_severity_count": 1,
-     "low_severity_count": 0,
-     "total_risky_urls": 8,
-     "overall_status": "issues_found"
-   }
+   ]
  }
  ```
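As an illustration of how the new report shape could be consumed in a CI step, the TypeScript sketch below parses a report and decides whether the build should fail. The field names come from the sample above. The interface names, the `describeFindings` helper, and the pass/fail rule (`totalRisks > 0`) are assumptions made for this example, and the report is embedded inline because the README does not state the generated file name.

```typescript
interface Risk {
  category: string;
  pattern: string;
  type: string;
  reason: string;
}

interface Finding {
  loc: string;
  risks: Risk[];
}

interface Report {
  metadata: { generatedAt: string; durationMs: number };
  summary: {
    totalUrls: number;
    totalRisks: number;
    urlsWithRisksCount: number;
    ignoredUrlsCount: number;
  };
  findings: Finding[];
}

// Inline sample mirroring the README example; a real pipeline would read the generated file.
const raw = `{
  "metadata": { "generatedAt": "2025-12-24T12:00:00.000Z", "durationMs": 1240 },
  "summary": { "totalUrls": 895, "totalRisks": 2, "urlsWithRisksCount": 1, "ignoredUrlsCount": 5 },
  "findings": [
    {
      "loc": "https://example.com/admin/login",
      "risks": [
        {
          "category": "Security & Admin",
          "pattern": "**/admin/**",
          "type": "glob",
          "reason": "Administrative interfaces should not be publicly indexed."
        }
      ]
    }
  ]
}`;

const report = JSON.parse(raw) as Report;

// Hypothetical helper: one human-readable line per (URL, risk) pair.
function describeFindings(r: Report): string[] {
  const lines: string[] = [];
  for (const finding of r.findings) {
    for (const risk of finding.risks) {
      lines.push(`${finding.loc} [${risk.category}] ${risk.reason}`);
    }
  }
  return lines;
}

const lines = describeFindings(report);
// Assumed policy: any reported risk fails the build; a CI script would exit non-zero here.
const shouldFail = report.summary.totalRisks > 0;

lines.forEach((line) => console.log(line));
console.log(shouldFail ? "FAIL" : "PASS");
```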
115
141
 
116
142
  ---
117
143
 
118
- ## 🛠️ CLI Options
144
+ ## 🛠️ CLI Commands
145
+
146
+ Sitemap-QA provides two main commands: `init` and `analyze`.
147
+
148
+
149
+ ### init Command
150
+
151
+ Initialize a default `sitemap-qa.yaml` configuration file in the current directory.
152
+
153
+ ```
154
+ Usage: sitemap-qa init [options]
155
+
156
+ Initialize a default sitemap-qa.yaml configuration file
157
+
158
+ Options:
159
+ -h, --help Display help for command
160
+ ```
161
+
162
+ #### Example
163
+
164
+ ```bash
165
+ # Create a default configuration file
166
+ sitemap-qa init
167
+
168
+ # This creates sitemap-qa.yaml with:
169
+ # - Default risk policies (Security & Admin, Environment Leakage, Sensitive Files)
170
+ # - Example acceptable patterns
171
+ # - Default output settings
172
+ ```
173
+
174
+ **Note:** To prevent accidental overwrites, the `init` command fails if `sitemap-qa.yaml` already exists in the current directory.
175
+
176
+ ---
177
+
178
+
179
+ ### analyze Command
180
+
181
+ Analyze a website's sitemap for quality issues and security risks.
119
182
 
120
183
  ```
121
184
  Usage: sitemap-qa analyze <url> [options]
@@ -126,90 +189,84 @@ Arguments:
126
189
  url Base URL of the website to analyze
127
190
 
128
191
  Options:
-   --timeout <seconds>         HTTP request timeout in seconds (default: 30)
-   --output <format>           Output format: html or json (default: "html")
-   --output-dir <path>         Output directory for reports (default: "./sitemap-qa/report")
-   --output-file <path>        Custom output filename
-   --accepted-patterns <list>  Comma-separated patterns to exclude from risk detection
-   --verbose                   Enable verbose logging
+   -c, --config <path>         Path to sitemap-qa.yaml configuration file
+   -o, --output <format>       Output format: json, html, or all (default: "all")
+   -d, --out-dir <path>        Output directory for reports (default: ".")
    -h, --help                  Display help for command
136
196
  ```
137
197
 
138
- ### Examples
198
+ #### Examples
139
199
 
140
200
  ```bash
141
- # Basic analysis with HTML report (default)
201
+ # Basic analysis with both HTML and JSON reports (default)
142
202
  sitemap-qa analyze https://example.com
143
203
 
144
- # JSON output for CI/CD integration
204
+ # JSON output only
145
205
  sitemap-qa analyze https://example.com --output json
146
206
 
147
- # Custom output directory
148
- sitemap-qa analyze https://example.com --output-dir ./reports
207
+ # HTML output only
208
+ sitemap-qa analyze https://example.com --output html
149
209
 
150
- # Exclude specific URL patterns from detection
151
- sitemap-qa analyze https://example.com --accepted-patterns "internal-*,test-*"
210
+ # Custom output directory
211
+ sitemap-qa analyze https://example.com --out-dir ./reports
152
212
 
153
- # Increase timeout for slow servers
154
- sitemap-qa analyze https://example.com --timeout 60
213
+ # Use a specific configuration file
214
+ sitemap-qa analyze https://example.com --config ./custom-config.yaml
155
215
 
156
- # Verbose mode for debugging
157
- sitemap-qa analyze https://example.com --verbose
216
+ # Combine options
217
+ sitemap-qa analyze https://example.com --config ./custom-config.yaml --output json --out-dir ./reports
158
218
  ```
159
219
 
160
- ---
161
-
162
220
  ## 🔧 Configuration
163
221
 
164
- Create a `.sitemap-qa.config.json` file in your project root or `~/.sitemap-qa/config.json` for global settings:
165
-
166
- ```json
167
- {
168
- "timeout": 30,
169
- "concurrency": 10,
170
- "outputFormat": "html",
171
- "outputDir": "./sitemap-qa/report",
172
- "verbose": false,
173
- "acceptedPatterns": [
174
- "test-*",
175
- "staging-*"
176
- ]
177
- }
222
+ Create a `sitemap-qa.yaml` file in your project root to define your monitoring policies and tool settings:
223
+
224
+ ```yaml
+ # Tool Settings
+ # Default outDir is "."; this example uses a custom reports directory
+ outDir: "./sitemap-qa/report" # custom output directory
+ outputFormat: "all" # Options: json, html, all
+ enforceDomainConsistency: true # Flag URLs from other domains
+
+ # Monitoring Policies
+ acceptable_patterns:
+   - type: "literal"
+     value: "/acceptable-path"
+     reason: "Example of an acceptable path that should not be flagged."
+   - type: "glob"
+     value: "**/public-docs/**"
+     reason: "Public documentation is always acceptable."
+
+ policies:
+   - category: "Security & Admin"
+     patterns:
+       - type: "glob"
+         value: "**/admin/**"
+         reason: "Administrative interfaces should not be publicly indexed."
+       - type: "literal"
+         value: "/wp-admin"
+         reason: "WordPress admin paths are common attack vectors."
+       - type: "regex"
+         value: ".*\\.php$"
+         reason: "PHP file detected"
  ```
179
253
 
180
254
  ### Configuration Options
181
255
 
182
256
  | Option | Type | Default | Description |
183
257
  |--------|------|---------|-------------|
184
- | `timeout` | number | `30` | HTTP request timeout in seconds (1-300) |
185
- | `concurrency` | number | `10` | Number of concurrent HTTP requests |
186
- | `outputFormat` | string | `"html"` | Output format: `"html"` or `"json"` |
187
- | `outputDir` | string | `"./sitemap-qa/report"` | Directory for generated reports |
188
- | `verbose` | boolean | `false` | Enable detailed logging |
189
- | `acceptedPatterns` | string[] | `[]` | URL patterns to exclude from risk detection |
258
+ | `outDir` | string | `"."` | Directory for generated reports (current working directory by default) |
259
+ | `outputFormat` | string | `"all"` | Report types to generate: `json`, `html`, or `all` |
260
+ | `enforceDomainConsistency` | boolean | `true` | If true, flags URLs that don't match the root sitemap domain (ignoring `www.`) |
261
+ | `acceptable_patterns` | array | `[]` | List of patterns to exclude from risk analysis |
262
+ | `policies` | array | `[]` | List of monitoring policies with patterns |
190
263
 
191
- ### Accepted Patterns
192
264
 
193
- Exclude specific URLs from risk detection using wildcard patterns:
265
+ **Priority:** CLI options > Project config (`sitemap-qa.yaml`) > Defaults
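The priority order can be read as a layered merge in which later sources override earlier ones. The TypeScript sketch below shows one way to express that. `Settings` and `resolveSettings` are hypothetical names, the default values are taken from the Configuration Options table, and the sketch assumes that unset options are omitted, not present with an `undefined` value.

```typescript
interface Settings {
  outDir: string;
  outputFormat: "json" | "html" | "all";
  enforceDomainConsistency: boolean;
}

// Defaults as listed in the Configuration Options table.
const defaults: Settings = {
  outDir: ".",
  outputFormat: "all",
  enforceDomainConsistency: true,
};

// Hypothetical resolver: later spreads win, so CLI options override the
// project config, which in turn overrides the defaults.
function resolveSettings(
  fileConfig: Partial<Settings>,
  cliOptions: Partial<Settings>,
): Settings {
  return { ...defaults, ...fileConfig, ...cliOptions };
}

const resolved = resolveSettings(
  { outDir: "./sitemap-qa/report", outputFormat: "html" }, // from sitemap-qa.yaml
  { outputFormat: "json" },                                // from --output json
);
console.log(resolved);
```

In this example the output format comes from the CLI, the output directory from the project config, and `enforceDomainConsistency` falls back to its default.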
194
266
 
195
- ```json
196
- {
197
- "acceptedPatterns": [
198
- "testing-*", // Matches: test-player, test-player-stats
199
- "https://example.com/admin/*", // Matches: any URL under /admin/
200
- "/special-case" // Matches: exact path segment
201
- ]
202
- }
203
- ```
204
267
 
205
- **Pattern Syntax:**
206
- - Use `*` as a wildcard (matches any characters within a path segment)
207
- - Patterns are case-insensitive
208
- - Special characters are automatically escaped
209
- - Patterns match against the full URL
210
- - Full URLs or path fragments both work
211
268
 
212
- **Priority:** CLI options > Project config (`.sitemap-qa.config.json`) > Global config (`~/.sitemap-qa/config.json`) > Defaults
269
+ ---
213
270
 
214
271
  ## 📝 License
215
272
 
@@ -222,6 +279,10 @@ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file
222
279
  Built with:
223
280
  - [Commander.js](https://github.com/tj/commander.js) - CLI framework
224
281
  - [Chalk](https://github.com/chalk/chalk) - Terminal styling
282
+ - [Undici](https://github.com/nodejs/undici) - High-performance HTTP client
283
+ - [Fast-XML-Parser](https://github.com/NaturalIntelligence/fast-xml-parser) - Fast XML parsing
284
+ - [Zod](https://zod.dev/) - Schema validation
285
+ - [Micromatch](https://github.com/micromatch/micromatch) - Glob pattern matching
225
286
  - [Vitest](https://vitest.dev/) - Testing framework
226
287
  - [TypeScript](https://www.typescriptlang.org/) - Type safety
227
288
 
@@ -229,7 +290,7 @@ Built with:
229
290
 
230
291
  ## 📧 Support
231
292
 
232
- - **Issues**: [GitHub Issues](https://github.com/akotliar/sitemap-qa/issues)-
293
+ - **Issues**: [GitHub Issues](https://github.com/akotliar/sitemap-qa/issues)
233
294
 
234
295
  ---
235
296