@akotliar/sitemap-qa 1.0.0-alpha.3 → 1.0.0-alpha.4

package/README.md CHANGED
@@ -13,12 +13,12 @@ Sitemap-QA is a command-line tool that automatically discovers, parses, and anal
 
  ## 🎯 Why Sitemap-QA?
 
- Unlike SEO-focused sitemap validators, Sitemap-QA is designed specifically for **QA validation and risk detection**:
+ Unlike SEO-focused sitemap validators, Sitemap-QA is designed specifically for **QA validation and risk detection** using a **Policy-as-Code** approach:
 
  - ✅ **Detect environment leakage** — Find staging, dev, or test URLs that shouldn't be in production sitemaps
  - ✅ **Identify exposed admin paths** — Catch `/admin`, `/dashboard`, and internal routes in public indexes
- - ✅ **Flag sensitive parameters** — Detect API keys, tokens, or passwords in sitemap URLs
- - ✅ **Validate domain consistency** — Find protocol mismatches and subdomain issues
+ - ✅ **Flag sensitive files** — Detect database backups, environment files, and archives
+ - ✅ **Fully Customizable** — Define your own risk categories and patterns using Literal, Glob, or Regex matching
  - ✅ **Fast and automated** — Analyze thousands of URLs in seconds with detailed reports
 
  Perfect for CI/CD pipelines, pre-release validation, and security audits.
@@ -42,9 +42,6 @@ sitemap-qa analyze https://example.com
 
  # Generate JSON output for CI/CD
  sitemap-qa analyze https://example.com --output json > report.json
-
- # Increase verbosity for debugging
- sitemap-qa analyze https://example.com --verbose
  ```
 
  ---
@@ -55,61 +52,66 @@ sitemap-qa analyze https://example.com --verbose
  - Checks `robots.txt` for sitemap declarations
  - Tests standard paths (`/sitemap.xml`, `/sitemap_index.xml`, etc.)
  - Recursively follows sitemap indexes
- - Handles multiple sitemaps and formats
- - Detects and processes malformed sitemap indexes (sitemaps listed in `<url>` blocks instead of `<sitemap>` blocks)
+ - Handles multiple sitemaps in supported formats (XML, compressed XML `.xml.gz`, and dynamically generated/PHP-based sitemaps)
 
  ### Risk Detection Patterns
 
- | Risk Category | Severity | Examples | Can Be Excluded |
- |--------------|----------|----------|-----------------|
- | **Environment Leakage** | High | `staging.example.com`, `/dev/`, `/test/` | ✅ Via patterns |
- | **Admin Paths** | High | `/admin`, `/dashboard`, `/config`, `/console` | ✅ Via patterns |
- | **Internal Content** | Medium | `/internal` paths | Via patterns |
- | **Sensitive Parameters** | High | `?token=`, `?apikey=`, `?password=` | Via patterns |
- | **Test Content** | Medium | `/test-`, `sample-`, `demo-` | Via patterns |
- | **Protocol Inconsistency** | Medium | HTTP URLs in HTTPS sitemaps | ❌ Always detected |
- | **Domain Mismatch** | Medium | Different domains in sitemap | ❌ Always detected |
+ The tool comes with a set of default policies, but you can fully customize them in your `sitemap-qa.yaml`.
+
+ | Risk Category | Description | Example Patterns |
+ |--------------|-------------|------------------|
+ | **Security & Admin** | Detects exposed administrative interfaces and sensitive configuration files. | `**/admin/**`, `**/.env*`, `/wp-admin` |
+ | **Environment Leakage** | Finds staging or development URLs that shouldn't be in production sitemaps. | `**/staging.**`, `**/dev.**` |
+ | **Sensitive Files** | Flags database backups, archives, and other sensitive file types. | `**/*.{sql,bak,zip,tar}`, `**/*.tar.gz` |
+
+ ### Customizing Risks
+
+ You can add your own categories and patterns to the `sitemap-qa.yaml` file. Patterns support `literal`, `glob`, and `regex` matching.
+
+ ```yaml
+ policies:
+   - category: "Internal API"
+     patterns:
+       - type: "glob"
+         value: "**/api/v1/internal/**"
+         reason: "Internal API version 1 should not be exposed."
+ ```
 
 
  ### Output Formats
 
  #### HTML Report (Interactive)
  The HTML report provides an interactive, visually appealing view with:
- - Expandable/collapsible sections by severity
- - Download buttons to export all URLs per category
+ - Expandable/collapsible sections by category
+ - Download buttons to export all URLs per finding
  - Clean, modern design with hover effects
  - Portable single-file format
 
  #### JSON Report (Machine-Readable)
  ```json
  {
-   "analysis_metadata": {
-     "base_url": "https://example.com",
-     "tool_version": "1.0.0",
-     "analysis_type": "rule-based analysis",
-     "analysis_timestamp": "2025-12-11T00:00:00.000Z",
-     "execution_time_ms": 4523
+   "metadata": {
+     "generatedAt": "2025-12-24T12:00:00.000Z",
+     "durationMs": 1240
    },
-   "sitemaps_discovered": [
-     "https://example.com/sitemap.xml"
-   ],
-   "suspicious_groups": [
+   "summary": {
+     "totalUrls": 895,
+     "totalRisks": 2,
+     "urlsWithRisksCount": 1
+   },
+   "findings": [
      {
-       "category": "environment_leakage",
-       "severity": "high",
-       "count": 3,
-       "rationale": "Production sitemap contains staging URLs",
-       "sample_urls": ["..."],
-       "recommended_action": "Verify sitemap generation excludes non-production environments"
+       "loc": "https://example.com/admin/login",
+       "risks": [
+         {
+           "category": "Security & Admin",
+           "pattern": "**/admin/**",
+           "type": "glob",
+           "reason": "Administrative interfaces should not be publicly indexed."
+         }
+       ]
      }
-   ],
-   "summary": {
-     "high_severity_count": 2,
-     "medium_severity_count": 1,
-     "low_severity_count": 0,
-     "total_risky_urls": 8,
-     "overall_status": "issues_found"
-   }
+   ]
  }
  ```
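For CI use, the new JSON report shape in this hunk could be consumed along the following lines. This is a minimal TypeScript sketch, not code from the package: the `SitemapReport`, `UrlFinding`, and `RiskMatch` interfaces are inferred from the README example alone (the real schema may carry more fields), and `describeRisks` and `shouldFailBuild` are hypothetical helper names.

```typescript
// Types inferred from the README's example report (assumption, not the package's own types).
interface RiskMatch {
  category: string;
  pattern: string;
  type: string;
  reason: string;
}

interface UrlFinding {
  loc: string;
  risks: RiskMatch[];
}

interface SitemapReport {
  metadata: { generatedAt: string; durationMs: number };
  summary: { totalUrls: number; totalRisks: number; urlsWithRisksCount: number };
  findings: UrlFinding[];
}

// Hypothetical helper: one printable line per (URL, risk) pair.
function describeRisks(report: SitemapReport): string[] {
  const lines: string[] = [];
  for (const finding of report.findings) {
    for (const risk of finding.risks) {
      lines.push(`${finding.loc} [${risk.category}] ${risk.reason}`);
    }
  }
  return lines;
}

// Hypothetical gate: fail the build when the summary reports any risk.
function shouldFailBuild(report: SitemapReport): boolean {
  return report.summary.totalRisks > 0;
}

// Illustrative sample modeled on the README example; the counts are made up.
const sampleReport: SitemapReport = {
  metadata: { generatedAt: "2025-12-24T12:00:00.000Z", durationMs: 1240 },
  summary: { totalUrls: 895, totalRisks: 1, urlsWithRisksCount: 1 },
  findings: [
    {
      loc: "https://example.com/admin/login",
      risks: [
        {
          category: "Security & Admin",
          pattern: "**/admin/**",
          type: "glob",
          reason: "Administrative interfaces should not be publicly indexed.",
        },
      ],
    },
  ],
};

for (const line of describeRisks(sampleReport)) {
  console.log(line);
}
console.log(shouldFailBuild(sampleReport) ? "FAIL" : "PASS");
```

In a pipeline, the object would come from parsing the file written by `sitemap-qa analyze <url> --output json`, and the gate result would decide the process exit code.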
 
@@ -126,90 +128,90 @@ Arguments:
    url                         Base URL of the website to analyze
 
  Options:
-   --timeout <seconds>         HTTP request timeout in seconds (default: 30)
-   --output <format>           Output format: html or json (default: "html")
-   --output-dir <path>         Output directory for reports (default: "./sitemap-qa/report")
-   --output-file <path>        Custom output filename
-   --accepted-patterns <list>  Comma-separated patterns to exclude from risk detection
-   --verbose                   Enable verbose logging
+   -c, --config <path>         Path to sitemap-qa.yaml
+   -o, --output <format>       Output format: json, html, or all (default: "all")
+   -d, --out-dir <path>        Output directory for reports (default: ".")
    -h, --help                  Display help for command
  ```
 
  ### Examples
 
  ```bash
- # Basic analysis with HTML report (default)
+ # Basic analysis with both HTML and JSON reports (default)
  sitemap-qa analyze https://example.com
 
- # JSON output for CI/CD integration
+ # JSON output only
  sitemap-qa analyze https://example.com --output json
 
  # Custom output directory
- sitemap-qa analyze https://example.com --output-dir ./reports
-
- # Exclude specific URL patterns from detection
- sitemap-qa analyze https://example.com --accepted-patterns "internal-*,test-*"
-
- # Increase timeout for slow servers
- sitemap-qa analyze https://example.com --timeout 60
+ sitemap-qa analyze https://example.com --out-dir ./reports
 
- # Verbose mode for debugging
- sitemap-qa analyze https://example.com --verbose
+ # Use a specific configuration file
+ sitemap-qa analyze https://example.com --config ./custom-config.yaml
  ```
 
  ---
 
  ## 🔧 Configuration
 
- Create a `.sitemap-qa.config.json` file in your project root or `~/.sitemap-qa/config.json` for global settings:
-
- ```json
- {
-   "timeout": 30,
-   "concurrency": 10,
-   "outputFormat": "html",
-   "outputDir": "./sitemap-qa/report",
-   "verbose": false,
-   "acceptedPatterns": [
-     "test-*",
-     "staging-*"
-   ]
- }
+ Create a `sitemap-qa.yaml` file in your project root to define your monitoring policies and tool settings:
+
+ ```yaml
+ # Tool Settings
+ # Default outDir is "."; this example uses a custom reports directory
+ outDir: "./sitemap-qa/report" # custom output directory
+ outputFormat: "all" # Options: json, html, all
+
+ # Monitoring Policies
+ policies:
+   - category: "Security & Admin"
+     patterns:
+       - type: "glob"
+         value: "**/admin/**"
+         reason: "Administrative interfaces should not be publicly indexed."
+       - type: "literal"
+         value: "/wp-admin"
+         reason: "WordPress admin paths are common attack vectors."
+       - type: "regex"
+         value: ".*\\.php$"
+         reason: "PHP file detected"
  ```
 
  ### Configuration Options
 
  | Option | Type | Default | Description |
  |--------|------|---------|-------------|
- | `timeout` | number | `30` | HTTP request timeout in seconds (1-300) |
- | `concurrency` | number | `10` | Number of concurrent HTTP requests |
- | `outputFormat` | string | `"html"` | Output format: `"html"` or `"json"` |
- | `outputDir` | string | `"./sitemap-qa/report"` | Directory for generated reports |
- | `verbose` | boolean | `false` | Enable detailed logging |
- | `acceptedPatterns` | string[] | `[]` | URL patterns to exclude from risk detection |
-
- ### Accepted Patterns
-
- Exclude specific URLs from risk detection using wildcard patterns:
-
- ```json
- {
-   "acceptedPatterns": [
-     "testing-*", // Matches: test-player, test-player-stats
-     "https://example.com/admin/*", // Matches: any URL under /admin/
-     "/special-case" // Matches: exact path segment
-   ]
- }
+ | `outDir` | string | `"."` | Directory for generated reports (current working directory by default) |
+ | `outputFormat` | string | `"all"` | Report types to generate: `json`, `html`, or `all` |
+ | `policies` | array | `[]` | List of monitoring policies with patterns |
+
+ > Note: The earlier `sitemap-qa.yaml` example sets `outDir: "./sitemap-qa/report"` as a recommended path. If you omit `outDir`, the default is `"."` (the current working directory).
+ ### Policy Patterns
+
+ Define patterns to detect risks in your sitemaps:
+
+ ```yaml
+ policies:
+   - category: "Custom Rules"
+     patterns:
+       - type: "literal"
+         value: "test"
+         reason: "Test URL found"
+       - type: "glob"
+         value: "**/internal/*"
+         reason: "Internal path exposed"
+       - type: "regex"
+         value: "api/v[0-9]/"
+         reason: "API versioning detected"
  ```
 
- **Pattern Syntax:**
- - Use `*` as a wildcard (matches any characters within a path segment)
- - Patterns are case-insensitive
- - Special characters are automatically escaped
- - Patterns match against the full URL
- - Full URLs or path fragments both work
+ **Rule Types:**
+ - `literal`: Exact string match
+ - `glob`: Wildcard patterns (e.g., `**/admin/**`)
+ - `regex`: Regular expression matching (patterns are YAML strings and must use proper escaping)
+ - When defining regex patterns in `sitemap-qa.yaml`, remember they are YAML strings, so you must escape backslashes (for example, `".*\\.php$"` in YAML corresponds to the regex `.*\.php$`).
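The escaping rule for `regex` values can be checked directly. The TypeScript sketch below is illustrative only: it assumes the tool passes the parsed `value` string to a regular-expression engine unchanged, and `matchesRegexRule` is a hypothetical helper, not part of sitemap-qa. For the `\\` sequence, JavaScript string literals behave like double-quoted YAML scalars (each `\\` becomes one backslash in the resulting string), so the literals here stand in for parsed YAML values.

```typescript
// Hypothetical helper: assumes a regex rule's parsed `value` is compiled as-is.
function matchesRegexRule(value: string, candidateUrl: string): boolean {
  return new RegExp(value).test(candidateUrl);
}

// Two backslashes in the source text yield ONE backslash in the parsed string,
// so the engine sees `.*\.php$` (an escaped, literal dot before "php").
const escapedDot = ".*\\.php$";

// Four backslashes yield TWO backslashes in the parsed string, so the engine
// sees `.*\\.php$`: a literal backslash, then any one character, then "php".
const literalBackslash = ".*\\\\.php$";

const phpUrl = "https://example.com/index.php";

console.log(matchesRegexRule(escapedDot, phpUrl));       // true
console.log(matchesRegexRule(literalBackslash, phpUrl)); // false
```

The two-backslash form is the one used in the `sitemap-qa.yaml` example in the Configuration section of this diff.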
 
- **Priority:** CLI options > Project config (`.sitemap-qa.config.json`) > Global config (`~/.sitemap-qa/config.json`) > Defaults
+ **Priority:** CLI options > Project config (`sitemap-qa.yaml`) > Defaults
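The new precedence rule (CLI options, then `sitemap-qa.yaml`, then defaults) can be read as a per-setting fallback chain. The following is a minimal TypeScript sketch under stated assumptions, not the package's implementation: `ToolSettings`, `DEFAULT_SETTINGS`, and `resolveSettings` are hypothetical names, each source is assumed to supply only the settings it actually sets, and the default values are the ones listed in the Configuration Options table.

```typescript
type OutputFormat = "json" | "html" | "all";

interface ToolSettings {
  outDir: string;
  outputFormat: OutputFormat;
}

// Defaults as documented in the Configuration Options table.
const DEFAULT_SETTINGS: ToolSettings = { outDir: ".", outputFormat: "all" };

// Hypothetical resolver: CLI value wins, then project config, then the default.
function resolveSettings(
  cli: Partial<ToolSettings>,
  projectConfig: Partial<ToolSettings>
): ToolSettings {
  return {
    outDir: cli.outDir ?? projectConfig.outDir ?? DEFAULT_SETTINGS.outDir,
    outputFormat:
      cli.outputFormat ?? projectConfig.outputFormat ?? DEFAULT_SETTINGS.outputFormat,
  };
}

// Example: `--output json` on the command line, `outDir` set only in sitemap-qa.yaml.
const resolved = resolveSettings(
  { outputFormat: "json" },
  { outDir: "./sitemap-qa/report", outputFormat: "html" }
);
console.log(resolved.outDir, resolved.outputFormat);
```

Resolving each setting independently means a CLI flag overrides only the setting it names, while the other settings still come from the project config or the defaults.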
 
  ## 📝 License
 
@@ -222,6 +224,10 @@ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file
  Built with:
  - [Commander.js](https://github.com/tj/commander.js) - CLI framework
  - [Chalk](https://github.com/chalk/chalk) - Terminal styling
+ - [Undici](https://github.com/nodejs/undici) - High-performance HTTP client
+ - [Fast-XML-Parser](https://github.com/NaturalIntelligence/fast-xml-parser) - Fast XML parsing
+ - [Zod](https://zod.dev/) - Schema validation
+ - [Micromatch](https://github.com/micromatch/micromatch) - Glob pattern matching
  - [Vitest](https://vitest.dev/) - Testing framework
  - [TypeScript](https://www.typescriptlang.org/) - Type safety