@akotliar/sitemap-qa 1.0.0-alpha.4 → 1.0.0-alpha.6

package/README.md CHANGED
@@ -11,6 +11,29 @@ Sitemap-QA is a command-line tool that automatically discovers, parses, and anal
 
  ---
 
+ ## 📑 Table of Contents
+
+ - [Why Sitemap-QA?](#-why-sitemap-qa)
+ - [Quick Start](#-quick-start)
+   - [Installation](#installation)
+   - [Basic Usage](#basic-usage)
+ - [Features](#-features)
+   - [Automatic Sitemap Discovery](#automatic-sitemap-discovery)
+   - [Risk Detection Patterns](#risk-detection-patterns)
+   - [Customizing Risks](#customizing-risks)
+   - [Output Formats](#output-formats)
+ - [CLI Commands](#-cli-commands)
+   - [analyze](#analyze-command)
+   - [init](#init-command)
+ - [Configuration](#-configuration)
+   - [Configuration Options](#configuration-options)
+
+ - [License](#-license)
+ - [Acknowledgments](#-acknowledgments)
+ - [Support](#-support)
+
+ ---
+
  ## 🎯 Why Sitemap-QA?
 
  Unlike SEO-focused sitemap validators, Sitemap-QA is designed specifically for **QA validation and risk detection** using a **Policy-as-Code** approach:
@@ -18,6 +41,8 @@ Unlike SEO-focused sitemap validators, Sitemap-QA is designed specifically for *
  - ✅ **Detect environment leakage** — Find staging, dev, or test URLs that shouldn't be in production sitemaps
  - ✅ **Identify exposed admin paths** — Catch `/admin`, `/dashboard`, and internal routes in public indexes
  - ✅ **Flag sensitive files** — Detect database backups, environment files, and archives
+ - ✅ **Domain Consistency** — Automatically flag URLs that point to external or incorrect domains (handles `www.` normalization)
+ - ✅ **Acceptable Patterns (Allowlist)** — Exclude known safe URLs from being flagged as risks
  - ✅ **Fully Customizable** — Define your own risk categories and patterns using Literal, Glob, or Regex matching
  - ✅ **Fast and automated** — Analyze thousands of URLs in seconds with detailed reports
 
@@ -37,11 +62,17 @@ npm install -g @akotliar/sitemap-qa@alpha
  ### Basic Usage
 
  ```bash
- # Analyze a website's sitemap
+ # Step 1: Initialize a configuration file (optional but recommended)
+ sitemap-qa init
+
+ # Step 2: Analyze a website's sitemap
  sitemap-qa analyze https://example.com
 
- # Generate JSON output for CI/CD
- sitemap-qa analyze https://example.com --output json > report.json
+ # Generate JSON output only for CI/CD
+ sitemap-qa analyze https://example.com --output json
+
+ # Use a custom configuration file
+ sitemap-qa analyze https://example.com --config ./custom-config.yaml
  ```
 
  ---
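The **Domain Consistency** bullet added in this release says the check handles `www.` normalization, but the README does not show how hosts are compared. The TypeScript below is only a minimal sketch of one way such a comparison could work; `normalizeHost` and `isSameDomain` are hypothetical names and not part of the package's API. In this sketch any other subdomain (for example `staging.example.com`) counts as a different domain, which may or may not match the tool's behaviour.

```typescript
// Hypothetical helper: lowercase the hostname and drop a leading "www.".
function normalizeHost(hostname: string): string {
  return hostname.toLowerCase().replace(/^www\./, "");
}

// Hypothetical helper: true when both URLs share a host after normalization,
// so "www.example.com" and "example.com" are treated as the same domain.
function isSameDomain(rootUrl: string, candidateUrl: string): boolean {
  const rootHost = normalizeHost(new URL(rootUrl).hostname);
  const candidateHost = normalizeHost(new URL(candidateUrl).hostname);
  return rootHost === candidateHost;
}

// Illustrative data only.
const siteRoot = "https://example.com";
const sitemapUrls = [
  "https://www.example.com/about",
  "https://example.com/pricing",
  "https://other.com/landing",
];

// Collect the URLs that would be flagged under this sketch.
const offDomain = sitemapUrls.filter((url) => !isSameDomain(siteRoot, url));
console.log(offDomain);
```

Only `https://other.com/landing` ends up in `offDomain` here, mirroring the `example.com` vs `other.com` case in the Risk Detection Patterns table.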
@@ -63,19 +94,11 @@ The tool comes with a set of default policies, but you can fully customize them
  | **Security & Admin** | Detects exposed administrative interfaces and sensitive configuration files. | `**/admin/**`, `**/.env*`, `/wp-admin` |
  | **Environment Leakage** | Finds staging or development URLs that shouldn't be in production sitemaps. | `**/staging.**`, `**/dev.**` |
  | **Sensitive Files** | Flags database backups, archives, and other sensitive file types. | `**/*.{sql,bak,zip,tar}`, `**/*.tar.gz` |
+ | **Domain Consistency** | Detects URLs that don't match the target domain (ignoring `www.` differences). | `example.com` vs `other.com` |
 
  ### Customizing Risks
 
- You can add your own categories and patterns to the `sitemap-qa.yaml` file. Patterns support `literal`, `glob`, and `regex` matching.
-
- ```yaml
- policies:
-   - category: "Internal API"
-     patterns:
-       - type: "glob"
-         value: "**/api/v1/internal/**"
-         reason: "Internal API version 1 should not be exposed."
- ```
+ You can add your own categories and patterns to the `sitemap-qa.yaml` file. Patterns support `literal`, `glob`, and `regex` matching. See the [Configuration](#-configuration) section for details.
 
 
  ### Output Formats
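For CI/CD use of `--output json`, one option is a small gate over the report's `summary` object. The sketch below assumes the field names that appear in the README's sample report (`totalUrls`, `totalRisks`, `urlsWithRisksCount`, `ignoredUrlsCount`); `ReportSummary`, `exceedsRiskBudget`, and the inline JSON string are illustrative only. The name of the report file the tool writes is not stated in this diff, so reading it from disk is left out.

```typescript
// Shape of the "summary" object, taken from the README's sample JSON report.
interface ReportSummary {
  totalUrls: number;
  totalRisks: number;
  urlsWithRisksCount: number;
  ignoredUrlsCount?: number; // new in this release; optional for older reports
}

// Hypothetical gate: true when the report has more risks than the budget allows.
function exceedsRiskBudget(summary: ReportSummary, maxRisks: number = 0): boolean {
  return summary.totalRisks > maxRisks;
}

// Stand-in for the report contents; a real pipeline would read the generated file.
const sampleReport =
  '{"summary": {"totalUrls": 895, "totalRisks": 2, "urlsWithRisksCount": 1, "ignoredUrlsCount": 5}, "findings": []}';

const parsed = JSON.parse(sampleReport) as { summary: ReportSummary };
const failed = exceedsRiskBudget(parsed.summary);

// A CI job would typically turn a failure into a non-zero exit code at this point.
console.log(failed ? `FAIL: ${parsed.summary.totalRisks} risk(s) found` : "PASS");
```

Making `maxRisks` a parameter lets a team start with a non-zero budget and tighten it over time.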
@@ -97,7 +120,8 @@ The HTML report provides an interactive, visually appealing view with:
    "summary": {
      "totalUrls": 895,
      "totalRisks": 2,
-     "urlsWithRisksCount": 1
+     "urlsWithRisksCount": 1,
+     "ignoredUrlsCount": 5
    },
    "findings": [
      {
@@ -117,7 +141,44 @@ The HTML report provides an interactive, visually appealing view with:
 
  ---
 
- ## 🛠️ CLI Options
+ ## 🛠️ CLI Commands
+
+ Sitemap-QA provides two main commands: `init` and `analyze`.
+
+
+ ### init Command
+
+ Initialize a default `sitemap-qa.yaml` configuration file in the current directory.
+
+ ```
+ Usage: sitemap-qa init [options]
+
+ Initialize a default sitemap-qa.yaml configuration file
+
+ Options:
+   -h, --help  Display help for command
+ ```
+
+ #### Example
+
+ ```bash
+ # Create a default configuration file
+ sitemap-qa init
+
+ # This creates sitemap-qa.yaml with:
+ # - Default risk policies (Security & Admin, Environment Leakage, Sensitive Files)
+ # - Example acceptable patterns
+ # - Default output settings
+ ```
+
+ **Note:** The `init` command will fail if `sitemap-qa.yaml` already exists in the current directory to prevent accidental overwrites.
+
+ ---
+
+
+ ### analyze Command
+
+ Analyze a website's sitemap for quality issues and security risks.
 
  ```
  Usage: sitemap-qa analyze <url> [options]
@@ -128,13 +189,13 @@ Arguments:
    url  Base URL of the website to analyze
 
  Options:
-   -c, --config <path>    Path to sitemap-qa.yaml
+   -c, --config <path>    Path to sitemap-qa.yaml configuration file
    -o, --output <format>  Output format: json, html, or all (default: "all")
    -d, --out-dir <path>   Output directory for reports (default: ".")
    -h, --help             Display help for command
  ```
 
- ### Examples
+ #### Examples
 
  ```bash
  # Basic analysis with both HTML and JSON reports (default)
@@ -143,14 +204,18 @@ sitemap-qa analyze https://example.com
  # JSON output only
  sitemap-qa analyze https://example.com --output json
 
+ # HTML output only
+ sitemap-qa analyze https://example.com --output html
+
  # Custom output directory
  sitemap-qa analyze https://example.com --out-dir ./reports
 
  # Use a specific configuration file
  sitemap-qa analyze https://example.com --config ./custom-config.yaml
- ```
 
- ---
+ # Combine options
+ sitemap-qa analyze https://example.com --config ./custom-config.yaml --output json --out-dir ./reports
+ ```
 
  ## 🔧 Configuration
 
@@ -161,8 +226,17 @@ Create a `sitemap-qa.yaml` file in your project root to define your monitoring p
  # Default outDir is "."; this example uses a custom reports directory
  outDir: "./sitemap-qa/report" # custom output directory
  outputFormat: "all" # Options: json, html, all
+ enforceDomainConsistency: true # Flag URLs from other domains
 
  # Monitoring Policies
+ acceptable_patterns:
+   - type: "literal"
+     value: "/acceptable-path"
+     reason: "Example of an acceptable path that should not be flagged."
+   - type: "glob"
+     value: "**/public-docs/**"
+     reason: "Public documentation is always acceptable."
+
  policies:
    - category: "Security & Admin"
      patterns:
@@ -183,35 +257,16 @@ policies:
  |--------|------|---------|-------------|
  | `outDir` | string | `"."` | Directory for generated reports (current working directory by default) |
  | `outputFormat` | string | `"all"` | Report types to generate: `json`, `html`, or `all` |
+ | `enforceDomainConsistency` | boolean | `true` | If true, flags URLs that don't match the root sitemap domain (ignoring `www.`) |
+ | `acceptable_patterns` | array | `[]` | List of patterns to exclude from risk analysis |
  | `policies` | array | `[]` | List of monitoring policies with patterns |
 
- > Note: The earlier `sitemap-qa.yaml` example sets `outDir: "./sitemap-qa/report"` as a recommended path. If you omit `outDir`, the default is `"."` (the current working directory).
- ### Policy Patterns
 
- Define patterns to detect risks in your sitemaps:
+ **Priority:** CLI options > Project config (`sitemap-qa.yaml`) > Defaults
 
- ```yaml
- policies:
-   - category: "Custom Rules"
-     patterns:
-       - type: "literal"
-         value: "test"
-         reason: "Test URL found"
-       - type: "glob"
-         value: "**/internal/*"
-         reason: "Internal path exposed"
-       - type: "regex"
-         value: "api/v[0-9]/"
-         reason: "API versioning detected"
- ```
 
- **Rule Types:**
- - `literal`: Exact string match
- - `glob`: Wildcard patterns (e.g., `**/admin/**`)
- - `regex`: Regular expression matching (patterns are YAML strings and must use proper escaping)
-   - When defining regex patterns in `sitemap-qa.yaml`, remember they are YAML strings, so you must escape backslashes (for example, `".*\\\\.php$"` in YAML corresponds to the regex `.*\.php$`).
 
- **Priority:** CLI options > Project config (`sitemap-qa.yaml`) > Defaults
+ ---
 
  ## 📝 License
 
@@ -235,7 +290,7 @@ Built with:
 
  ## 📧 Support
 
- - **Issues**: [GitHub Issues](https://github.com/akotliar/sitemap-qa/issues)-
+ - **Issues**: [GitHub Issues](https://github.com/akotliar/sitemap-qa/issues)
 
  ---
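The Configuration section's priority rule (CLI options > project config > defaults) can be read as a per-field fallback chain. The TypeScript below is a sketch under that assumption, not the package's actual code; `Settings` and `resolveSettings` are hypothetical names, and only the two options whose defaults are documented in the options table (`outDir`, `outputFormat`) are modelled.

```typescript
// Only the options with documented defaults are modelled in this sketch.
interface Settings {
  outDir: string;
  outputFormat: "json" | "html" | "all";
}

// Defaults as listed in the Configuration Options table.
const defaults: Settings = { outDir: ".", outputFormat: "all" };

// Hypothetical resolver: a CLI value wins, then the sitemap-qa.yaml value,
// then the default. `??` falls through only on undefined or null.
function resolveSettings(
  fileConfig: Partial<Settings>,
  cliOptions: Partial<Settings>,
): Settings {
  return {
    outDir: cliOptions.outDir ?? fileConfig.outDir ?? defaults.outDir,
    outputFormat:
      cliOptions.outputFormat ?? fileConfig.outputFormat ?? defaults.outputFormat,
  };
}

// Example: the config file sets outDir, the CLI passes --output json.
const resolved = resolveSettings(
  { outDir: "./sitemap-qa/report" },
  { outputFormat: "json" },
);
console.log(resolved);
```

A per-field chain is used instead of object spreading so that an option left undefined on the command line cannot overwrite a value from the config file.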