@akotliar/sitemap-qa 1.0.0-alpha.0

package/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2025 Alex Kotliar
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,235 @@
+ # Sitemap-QA
+
+ > **Automated sitemap analysis for QA teams** — Detect test/qa/dev/staging URLs, admin paths, sensitive parameters, and URLs that shouldn't be publicly indexed.
+
+ [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
+ [![Node Version](https://img.shields.io/badge/node-%3E%3D20.0.0-brightgreen)](package.json)
+ [![TypeScript](https://img.shields.io/badge/TypeScript-5.0-blue)](https://www.typescriptlang.org/)
+
+ Sitemap-QA is a command-line tool that automatically discovers, parses, and analyzes website sitemaps to identify potential quality issues, security risks, and configuration problems. Built for QA teams to validate deployments, catch environment leakage, and identify URLs that shouldn't be publicly indexed.
+
+ ---
+
+ ## 🎯 Why Sitemap-QA?
+
+ Unlike SEO-focused sitemap validators, Sitemap-QA is designed specifically for **QA validation and risk detection**:
+
+ - ✅ **Detect environment leakage** — Find staging, dev, or test URLs that shouldn't be in production sitemaps
+ - ✅ **Identify exposed admin paths** — Catch `/admin`, `/dashboard`, and internal routes in public indexes
+ - ✅ **Flag sensitive parameters** — Detect API keys, tokens, or passwords in sitemap URLs
+ - ✅ **Validate domain consistency** — Find protocol mismatches and subdomain issues
+ - ✅ **Fast and automated** — Analyze thousands of URLs in seconds with detailed reports
+
+ Perfect for CI/CD pipelines, pre-release validation, and security audits.
+
+ ---
+
+ ## 🚀 Quick Start
+
+ ### Installation
+
+ ```bash
+ npm install -g @akotliar/sitemap-qa
+ ```
+
+ ### Basic Usage
+
+ ```bash
+ # Analyze a website's sitemap
+ sitemap-qa analyze https://example.com
+
+ # Generate JSON output for CI/CD
+ sitemap-qa analyze https://example.com --output json > report.json
+
+ # Increase verbosity for debugging
+ sitemap-qa analyze https://example.com --verbose
+ ```
+
+ ---
+
+ ## 📋 Features
+
+ ### Automatic Sitemap Discovery
+ - Checks `robots.txt` for sitemap declarations
+ - Tests standard paths (`/sitemap.xml`, `/sitemap_index.xml`, etc.)
+ - Recursively follows sitemap indexes
+ - Handles multiple sitemaps and formats
+ - Detects and processes malformed sitemap indexes (sitemaps listed in `<url>` blocks instead of `<sitemap>` blocks)
+
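The first two discovery steps listed above could be sketched roughly as follows. This is a minimal illustration, not the package's actual implementation: the function names `extractSitemapsFromRobots` and `candidateSitemapUrls` are hypothetical, and the fallback list contains only the two standard paths the README names explicitly.

```typescript
// Hypothetical helpers for illustration; the package's real code may differ.

/** Collect the URLs declared on `Sitemap:` lines of a robots.txt body. */
function extractSitemapsFromRobots(robotsTxt: string): string[] {
  const found: string[] = [];
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    // Drop trailing "# ..." comments, then match the case-insensitive field name.
    const line = rawLine.replace(/#.*$/, "").trim();
    const match = /^sitemap\s*:\s*(\S+)/i.exec(line);
    if (match) {
      found.push(match[1]);
    }
  }
  // Remove duplicates while keeping first-seen order.
  return found.filter((url, index) => found.indexOf(url) === index);
}

/** Build fallback candidates from the standard paths named in the README. */
function candidateSitemapUrls(baseUrl: string): string[] {
  const standardPaths = ["/sitemap.xml", "/sitemap_index.xml"];
  return standardPaths.map((path) => new URL(path, baseUrl).toString());
}
```

A caller would typically try the `robots.txt` declarations first and fall back to the candidate paths only when none are declared.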
+ ### Risk Detection Patterns
+
+ | Risk Category | Severity | Examples | Can Be Excluded |
+ |--------------|----------|----------|-----------------|
+ | **Environment Leakage** | High | `staging.example.com`, `/dev/`, `/test/` | ✅ Via patterns |
+ | **Admin Paths** | High | `/admin`, `/dashboard`, `/config`, `/console` | ✅ Via patterns |
+ | **Internal Content** | Medium | `/internal` paths | ✅ Via patterns |
+ | **Sensitive Parameters** | High | `?token=`, `?apikey=`, `?password=` | ✅ Via patterns |
+ | **Test Content** | Medium | `/test-`, `sample-`, `demo-` | ✅ Via patterns |
+ | **Protocol Inconsistency** | Medium | HTTP URLs in HTTPS sitemaps | ❌ Always detected |
+ | **Domain Mismatch** | Medium | Different domains in sitemap | ❌ Always detected |
+
+ **Note:** Admin path patterns now properly match URLs with query parameters (e.g., `/admin?id=123`).
+
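One way to get the query-parameter behavior described in the note is to parse the URL first and test only its path. The sketch below assumes this approach; the `ADMIN_SEGMENTS` list is copied from the table's examples and `isAdminPath` is a hypothetical name, so the package's real rules may differ.

```typescript
// Assumed, illustrative segment list taken from the table's examples.
const ADMIN_SEGMENTS = ["admin", "dashboard", "config", "console"];

function isAdminPath(rawUrl: string): boolean {
  // URL parsing separates the path from "?query" and "#fragment",
  // so "/admin?id=123" is checked as the path "/admin".
  const { pathname } = new URL(rawUrl);
  const segments = pathname
    .toLowerCase()
    .split("/")
    .filter((segment) => segment.length > 0);
  // Compare whole segments so "/administrator-guide" is not flagged.
  return segments.some((segment) => ADMIN_SEGMENTS.indexOf(segment) !== -1);
}
```

Matching whole path segments is a design choice made for this sketch: it avoids false positives on longer words, at the cost of missing variants such as `/admin-panel`.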
+ ### Output Formats
+
+ #### HTML Report (Interactive)
+ The HTML report provides an interactive, visually appealing view with:
+ - Expandable/collapsible sections by severity
+ - Download buttons to export all URLs per category
+ - Clean, modern design with hover effects
+ - Portable single-file format
+
+ #### JSON Report (Machine-Readable)
+ ```json
+ {
+   "analysis_metadata": {
+     "base_url": "https://example.com",
+     "tool_version": "1.0.0",
+     "analysis_type": "rule-based analysis",
+     "analysis_timestamp": "2025-12-11T00:00:00.000Z",
+     "execution_time_ms": 4523
+   },
+   "sitemaps_discovered": [
+     "https://example.com/sitemap.xml"
+   ],
+   "suspicious_groups": [
+     {
+       "category": "environment_leakage",
+       "severity": "high",
+       "count": 3,
+       "rationale": "Production sitemap contains staging URLs",
+       "sample_urls": ["..."],
+       "recommended_action": "Verify sitemap generation excludes non-production environments"
+     }
+   ],
+   "summary": {
+     "high_severity_count": 2,
+     "medium_severity_count": 1,
+     "low_severity_count": 0,
+     "total_risky_urls": 8,
+     "overall_status": "issues_found"
+   }
+ }
+ ```
+
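For CI/CD use, a pipeline step could read the JSON report and decide whether to fail the build. The sketch below relies only on the `summary` fields shown in the sample report above; the `shouldFailBuild` function and its `threshold` parameter are hypothetical and not part of the package.

```typescript
// Field names follow the sample report; only the fields used here are typed.
interface ReportSummary {
  high_severity_count: number;
  medium_severity_count: number;
  low_severity_count: number;
}

interface SitemapQaReport {
  summary: ReportSummary;
}

type Threshold = "high" | "medium" | "low";

/** Return true when the report has findings at or above the given severity. */
function shouldFailBuild(
  report: SitemapQaReport,
  threshold: Threshold = "high"
): boolean {
  const {
    high_severity_count: high,
    medium_severity_count: medium,
    low_severity_count: low,
  } = report.summary;
  if (threshold === "high") return high > 0;
  if (threshold === "medium") return high + medium > 0;
  return high + medium + low > 0;
}
```

In a pipeline, one might pass the result of `JSON.parse` on the report file to this function and exit with a non-zero status when it returns `true`.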
+ ---
+
+ ## 🛠️ CLI Options
+
+ ```
+ Usage: sitemap-qa analyze [options] <url>
+
+ Analyze a website's sitemap for quality issues
+
+ Arguments:
+   url                         Base URL of the website to analyze
+
+ Options:
+   --timeout <seconds>         HTTP request timeout in seconds (default: 30)
+   --output <format>           Output format: html or json (default: "html")
+   --output-dir <path>         Output directory for reports (default: "./sitemap-qa/report")
+   --output-file <path>        Custom output filename
+   --accepted-patterns <list>  Comma-separated patterns to exclude from risk detection
+   --verbose                   Enable verbose logging
+   -h, --help                  Display help for command
+ ```
+
+ ### Examples
+
+ ```bash
+ # Basic analysis with HTML report (default)
+ sitemap-qa analyze https://example.com
+
+ # JSON output for CI/CD integration
+ sitemap-qa analyze https://example.com --output json
+
+ # Custom output directory
+ sitemap-qa analyze https://example.com --output-dir ./reports
+
+ # Exclude specific URL patterns from detection
+ sitemap-qa analyze https://example.com --accepted-patterns "internal-*,test-*"
+
+ # Increase timeout for slow servers
+ sitemap-qa analyze https://example.com --timeout 60
+
+ # Verbose mode for debugging
+ sitemap-qa analyze https://example.com --verbose
+ ```
+
+ ---
+
+ ## 🔧 Configuration
+
+ Create a `.sitemap-qa.config.json` file in your project root or `~/.sitemap-qa/config.json` for global settings:
+
+ ```json
+ {
+   "timeout": 30,
+   "concurrency": 10,
+   "outputFormat": "html",
+   "outputDir": "./sitemap-qa/report",
+   "verbose": false,
+   "acceptedPatterns": [
+     "test-*",
+     "staging-*"
+   ]
+ }
+ ```
+
+ ### Configuration Options
+
+ | Option | Type | Default | Description |
+ |--------|------|---------|-------------|
+ | `timeout` | number | `30` | HTTP request timeout in seconds (1-300) |
+ | `concurrency` | number | `10` | Number of concurrent HTTP requests |
+ | `outputFormat` | string | `"html"` | Output format: `"html"` or `"json"` |
+ | `outputDir` | string | `"./sitemap-qa/report"` | Directory for generated reports |
+ | `verbose` | boolean | `false` | Enable detailed logging |
+ | `acceptedPatterns` | string[] | `[]` | URL patterns to exclude from risk detection |
+
+ ### Accepted Patterns
+
+ Exclude specific URLs from risk detection using wildcard patterns:
+
+ ```json
+ {
+   "acceptedPatterns": [
+     "test-*",                        // Matches: test-player, test-player-stats
+     "https://example.com/admin/*",   // Matches: any URL under /admin/
+     "/special-case"                  // Matches: exact path segment
+   ]
+ }
+ ```
+
+ **Pattern Syntax:**
+ - Use `*` as a wildcard (matches any characters within a path segment)
+ - Patterns are case-insensitive
+ - Special characters are automatically escaped
+ - Patterns match against the full URL
+ - Full URLs or path fragments both work
+
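The pattern syntax rules above can be read as a wildcard-to-regular-expression conversion. The sketch below is one possible interpretation, not the package's actual matcher: `patternToRegExp` and `isAccepted` are hypothetical names, and translating `*` to `[^/]*` is an assumption based on the "within a path segment" wording.

```typescript
/** Convert a wildcard pattern into a case-insensitive regular expression. */
function patternToRegExp(pattern: string): RegExp {
  // Escape every regex metacharacter except "*", so "." in a host name
  // matches a literal dot rather than any character.
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  // Assumption: "*" stays inside one path segment, so it does not cross "/".
  const source = escaped.replace(/\*/g, "[^/]*");
  // No anchors: the pattern may match anywhere in the full URL.
  return new RegExp(source, "i");
}

/** Return true when any accepted pattern matches the URL. */
function isAccepted(url: string, patterns: string[]): boolean {
  return patterns.some((pattern) => patternToRegExp(pattern).test(url));
}
```

Because the expression is unanchored, a pattern such as `https://example.com/admin/*` still matches deeper URLs under `/admin/`: the match only has to cover a prefix of the URL.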
+ **Priority:** CLI options > Project config (`.sitemap-qa.config.json`) > Global config (`~/.sitemap-qa/config.json`) > Defaults
+
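The priority order above amounts to a layered merge in which later layers override earlier ones. The following is a minimal sketch under that reading: the `Config` shape and default values come from the Configuration Options table, while `resolveConfig` and `dropUndefined` are hypothetical names. It also assumes `acceptedPatterns` is replaced wholesale rather than concatenated across layers, which the README does not specify.

```typescript
interface Config {
  timeout: number;
  concurrency: number;
  outputFormat: "html" | "json";
  outputDir: string;
  verbose: boolean;
  acceptedPatterns: string[];
}

// Defaults as listed in the Configuration Options table.
const DEFAULTS: Config = {
  timeout: 30,
  concurrency: 10,
  outputFormat: "html",
  outputDir: "./sitemap-qa/report",
  verbose: false,
  acceptedPatterns: [],
};

/** Remove keys whose value is undefined so they cannot mask lower layers. */
function dropUndefined(layer: Partial<Config>): Partial<Config> {
  const cleaned: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(layer)) {
    if (value !== undefined) cleaned[key] = value;
  }
  return cleaned as Partial<Config>;
}

/** Merge layers so that: defaults < global config < project config < CLI. */
function resolveConfig(
  globalConfig: Partial<Config>,
  projectConfig: Partial<Config>,
  cliOptions: Partial<Config>
): Config {
  // Later spreads win, so the spread order encodes the priority.
  return {
    ...DEFAULTS,
    ...dropUndefined(globalConfig),
    ...dropUndefined(projectConfig),
    ...dropUndefined(cliOptions),
  };
}
```

Dropping `undefined` values matters because a CLI parser may report an unset flag as `undefined`, which would otherwise overwrite a value from a config file.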
+ ## 📝 License
+
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+
+ ---
+
+ ## 🙏 Acknowledgments
+
+ Built with:
+ - [Commander.js](https://github.com/tj/commander.js) - CLI framework
+ - [Chalk](https://github.com/chalk/chalk) - Terminal styling
+ - [Vitest](https://vitest.dev/) - Testing framework
+ - [TypeScript](https://www.typescriptlang.org/) - Type safety
+
+ ---
+
+ ## 📧 Support
+
+ - **Issues**: [GitHub Issues](https://github.com/akotliar/sitemap-qa/issues)
+
+ ---
+
+ **Made with ❤️ for QA teams everywhere**