@houtini/seo-crawler-mcp 2.0.1
- package/.github/workflows/ci.yml +59 -0
- package/LICENSE +190 -0
- package/NOTICE +8 -0
- package/README.md +694 -0
- package/build/analyzers/QueryLoader.d.ts +30 -0
- package/build/analyzers/QueryLoader.d.ts.map +1 -0
- package/build/analyzers/QueryLoader.js +126 -0
- package/build/analyzers/QueryLoader.js.map +1 -0
- package/build/cli.d.ts +3 -0
- package/build/cli.d.ts.map +1 -0
- package/build/cli.js +190 -0
- package/build/cli.js.map +1 -0
- package/build/core/ContentExtractor.d.ts +30 -0
- package/build/core/ContentExtractor.d.ts.map +1 -0
- package/build/core/ContentExtractor.js +362 -0
- package/build/core/ContentExtractor.js.map +1 -0
- package/build/core/CrawlDatabase.d.ts +25 -0
- package/build/core/CrawlDatabase.d.ts.map +1 -0
- package/build/core/CrawlDatabase.js +603 -0
- package/build/core/CrawlDatabase.js.map +1 -0
- package/build/core/CrawlOrchestrator.d.ts +27 -0
- package/build/core/CrawlOrchestrator.d.ts.map +1 -0
- package/build/core/CrawlOrchestrator.js +279 -0
- package/build/core/CrawlOrchestrator.js.map +1 -0
- package/build/core/CrawlStorage.d.ts +33 -0
- package/build/core/CrawlStorage.d.ts.map +1 -0
- package/build/core/CrawlStorage.js +94 -0
- package/build/core/CrawlStorage.js.map +1 -0
- package/build/core/LinkExtractor.d.ts +14 -0
- package/build/core/LinkExtractor.d.ts.map +1 -0
- package/build/core/LinkExtractor.js +91 -0
- package/build/core/LinkExtractor.js.map +1 -0
- package/build/core/UrlManager.d.ts +21 -0
- package/build/core/UrlManager.d.ts.map +1 -0
- package/build/core/UrlManager.js +87 -0
- package/build/core/UrlManager.js.map +1 -0
- package/build/formatters/structured-report-format.d.ts +48 -0
- package/build/formatters/structured-report-format.d.ts.map +1 -0
- package/build/formatters/structured-report-format.js +145 -0
- package/build/formatters/structured-report-format.js.map +1 -0
- package/build/index.d.ts +3 -0
- package/build/index.d.ts.map +1 -0
- package/build/index.js +214 -0
- package/build/index.js.map +1 -0
- package/build/schema/index.d.ts +627 -0
- package/build/schema/index.d.ts.map +1 -0
- package/build/schema/index.js +159 -0
- package/build/schema/index.js.map +1 -0
- package/build/tools/analyze-seo.d.ts +44 -0
- package/build/tools/analyze-seo.d.ts.map +1 -0
- package/build/tools/analyze-seo.js +110 -0
- package/build/tools/analyze-seo.js.map +1 -0
- package/build/tools/list-queries.d.ts +28 -0
- package/build/tools/list-queries.d.ts.map +1 -0
- package/build/tools/list-queries.js +30 -0
- package/build/tools/list-queries.js.map +1 -0
- package/build/tools/query-seo-data.d.ts +15 -0
- package/build/tools/query-seo-data.d.ts.map +1 -0
- package/build/tools/query-seo-data.js +43 -0
- package/build/tools/query-seo-data.js.map +1 -0
- package/build/tools/run-seo-audit.d.ts +3 -0
- package/build/tools/run-seo-audit.d.ts.map +1 -0
- package/build/tools/run-seo-audit.js +54 -0
- package/build/tools/run-seo-audit.js.map +1 -0
- package/build/types/index.d.ts +158 -0
- package/build/types/index.d.ts.map +1 -0
- package/build/types/index.js +2 -0
- package/build/types/index.js.map +1 -0
- package/build/utils/debug.d.ts +2 -0
- package/build/utils/debug.d.ts.map +1 -0
- package/build/utils/debug.js +7 -0
- package/build/utils/debug.js.map +1 -0
- package/package.json +49 -0
- package/server.json +31 -0
- package/src/analyzers/QueryLoader.ts +175 -0
- package/src/analyzers/queries/README.md +228 -0
- package/src/analyzers/queries/content/duplicate-h1.sql +18 -0
- package/src/analyzers/queries/content/duplicate-meta-descriptions.sql +18 -0
- package/src/analyzers/queries/content/duplicate-titles.sql +19 -0
- package/src/analyzers/queries/content/missing-h1.sql +18 -0
- package/src/analyzers/queries/content/missing-meta-descriptions.sql +19 -0
- package/src/analyzers/queries/content/multiple-h1.sql +17 -0
- package/src/analyzers/queries/content/thin-content.sql +18 -0
- package/src/analyzers/queries/critical/404-errors.sql +14 -0
- package/src/analyzers/queries/critical/broken-internal-links.sql +20 -0
- package/src/analyzers/queries/critical/missing-titles.sql +17 -0
- package/src/analyzers/queries/critical/server-errors.sql +15 -0
- package/src/analyzers/queries/opportunities/high-external-links.sql +18 -0
- package/src/analyzers/queries/opportunities/meta-description-length.sql +27 -0
- package/src/analyzers/queries/opportunities/missing-images.sql +18 -0
- package/src/analyzers/queries/opportunities/no-outbound-links.sql +18 -0
- package/src/analyzers/queries/opportunities/title-equals-h1.sql +21 -0
- package/src/analyzers/queries/opportunities/title-length.sql +27 -0
- package/src/analyzers/queries/security/missing-csp.sql +16 -0
- package/src/analyzers/queries/security/missing-hsts.sql +17 -0
- package/src/analyzers/queries/security/missing-referrer-policy.sql +16 -0
- package/src/analyzers/queries/security/missing-x-frame-options.sql +16 -0
- package/src/analyzers/queries/security/protocol-relative-links.sql +16 -0
- package/src/analyzers/queries/security/unsafe-external-links.sql +17 -0
- package/src/analyzers/queries/technical/canonical-issues.sql +20 -0
- package/src/analyzers/queries/technical/heading-hierarchy-issues.sql +19 -0
- package/src/analyzers/queries/technical/non-https.sql +16 -0
- package/src/analyzers/queries/technical/orphan-pages.sql +21 -0
- package/src/analyzers/queries/technical/redirects.sql +15 -0
- package/src/cli.ts +224 -0
- package/src/core/ContentExtractor.ts +480 -0
- package/src/core/CrawlDatabase.ts +736 -0
- package/src/core/CrawlOrchestrator.ts +346 -0
- package/src/core/CrawlStorage.ts +148 -0
- package/src/core/LinkExtractor.ts +123 -0
- package/src/core/UrlManager.ts +114 -0
- package/src/formatters/structured-report-format.ts +254 -0
- package/src/index.ts +259 -0
- package/src/schema/index.ts +176 -0
- package/src/tools/analyze-seo.ts +184 -0
- package/src/tools/list-queries.ts +70 -0
- package/src/tools/query-seo-data.ts +77 -0
- package/src/tools/run-seo-audit.ts +83 -0
- package/src/types/index.ts +179 -0
- package/src/utils/debug.ts +12 -0
- package/tsconfig.json +26 -0
package/README.md
ADDED
|
@@ -0,0 +1,694 @@
|
|
|
1
|
+
# SEO Crawler MCP - Website Crawler & SEO Analyzer for LLMs
|
|
2
|
+
|
|
3
|
+
[](https://www.npmjs.com/package/@houtini/seo-crawler-mcp)
|
|
4
|
+
[](https://www.npmjs.com/package/@houtini/seo-crawler-mcp)
|
|
5
|
+
[](https://github.com/houtini-ai/seo-crawler-mcp/actions)
|
|
6
|
+
[](https://www.typescriptlang.org/)
|
|
7
|
+
[](https://www.npmjs.com/package/@houtini/seo-crawler-mcp)
|
|
8
|
+
[](https://snyk.io/test/github/houtini-ai/seo-crawler-mcp)
|
|
9
|
+
[](https://registry.modelcontextprotocol.io)
|
|
10
|
+
[](https://opensource.org/licenses/Apache-2.0)
|
|
11
|
+
|
|
12
|
+
**Crawl and analyse your website for errors and issues that probably affect your site's SEO**
|
|
13
|
+
|
|
14
|
+
I wanted to build on my experience working with the MCP protocol SDK to see just how far we can extend an AI assistant's capabilities. I decided that I'd quite like to build a crawler to check my site's "technical SEO" health and came across Crawlee - which seemed like the ideal library to base the crawl component of my MCP.
|
|
15
|
+
|
|
16
|
+
What's interesting is that MCP usually implies a connection to a remote server of some sort. That isn't the case with SEO Crawler MCP. The MCP protocol is more powerful than I realised: this is a self-contained application wrapped in the MCP SDK that handles everything locally:
|
|
17
|
+
|
|
18
|
+
- Smart request scheduling and queue management
|
|
19
|
+
- Automatic retry logic and error handling
|
|
20
|
+
- Respectful crawling with configurable delays
|
|
21
|
+
- Memory-efficient streaming for large sites
|
|
22
|
+
- Better-SQLite3 embedded database storing every crawled page's HTML, metadata, headers, link relationships, and site structure
|
|
23
|
+
- Custom SQL analysis engine with 28 specialised queries detecting content issues, technical SEO problems, security vulnerabilities, and optimisation opportunities
|
|
24
|
+
|
|
25
|
+
Claude (or your AI assistant of choice) can orchestrate this entire stack through simple function calls. The crawl runs asynchronously and stores everything in SQLite. Claude can then query that data through natural language - "analyse this crawl for SEO opportunities" or "report on internal broken links" - and the MCP server translates that into SQL analysis.
|
|
26
|
+
|
|
27
|
+
**You can also run crawls directly from the terminal** - perfect for large sites or background processing. The CLI mode lets you run a crawl, get the output directory, and then hand that over to Claude for AI-powered analysis via the MCP tools.
|
|
28
|
+
|
|
29
|
+
### Credits
|
|
30
|
+
|
|
31
|
+
The core crawling architecture is inspired by the logic and patterns from the [LibreCrawl](https://github.com/libre-crawl/core) project. We've adapted their proven crawling methodology for use within the MCP protocol whilst adding comprehensive SEO analysis capabilities.
|
|
32
|
+
|
|
33
|
+
---
|
|
34
|
+
|
|
35
|
+
## Installation
|
|
36
|
+
|
|
37
|
+
### For Beginners
|
|
38
|
+
|
|
39
|
+
If you're new to MCP servers, I'd recommend reading these first:
|
|
40
|
+
- [How to Add an MCP Server to Claude Desktop](https://houtini.com/how-to-add-an-mcp-server-to-claude-desktop/)
|
|
41
|
+
- [Claude Desktop Beginner's Guide](https://houtini.com/claude-desktop-beginners-guide/)
|
|
42
|
+
|
|
43
|
+
I'd also suggest installing [Desktop Commander](https://houtini.com/desktop-commander/) first - it's useful for working with the crawl output files. See the [Desktop Commander setup guide](https://github.com/wonderwhy-er/DesktopCommanderMCP) for details.
|
|
44
|
+
|
|
45
|
+
### Quick Install (NPX)
|
|
46
|
+
|
|
47
|
+
Add this to your Claude Desktop config file:
|
|
48
|
+
|
|
49
|
+
**Windows:** `C:\Users\[YourName]\AppData\Roaming\Claude\claude_desktop_config.json`
|
|
50
|
+
**Mac:** `~/Library/Application Support/Claude/claude_desktop_config.json`
|
|
51
|
+
|
|
52
|
+
```json
|
|
53
|
+
{
|
|
54
|
+
"mcpServers": {
|
|
55
|
+
"seo-crawler-mcp": {
|
|
56
|
+
"command": "npx",
|
|
57
|
+
"args": ["-y", "@houtini/seo-crawler-mcp"],
|
|
58
|
+
"env": {
|
|
59
|
+
"OUTPUT_DIR": "C:\\seo-audits"
|
|
60
|
+
}
|
|
61
|
+
}
|
|
62
|
+
}
|
|
63
|
+
}
|
|
64
|
+
```
|
|
65
|
+
|
|
66
|
+
Restart Claude Desktop. Four tools will be available:
|
|
67
|
+
- `seo-crawler-mcp:run_seo_audit`
|
|
68
|
+
- `seo-crawler-mcp:analyze_seo`
|
|
69
|
+
- `seo-crawler-mcp:query_seo_data`
|
|
70
|
+
- `seo-crawler-mcp:list_seo_queries`
|
|
71
|
+
|
|
72
|
+
### Development Install
|
|
73
|
+
|
|
74
|
+
```bash
|
|
75
|
+
cd C:\MCP\seo-crawler-mcp
|
|
76
|
+
npm install
|
|
77
|
+
npm run build
|
|
78
|
+
```
|
|
79
|
+
|
|
80
|
+
Then use the local path in your config:
|
|
81
|
+
|
|
82
|
+
```json
|
|
83
|
+
{
|
|
84
|
+
"mcpServers": {
|
|
85
|
+
"seo-crawler-mcp": {
|
|
86
|
+
"command": "node",
|
|
87
|
+
"args": ["C:\\MCP\\seo-crawler-mcp\\build\\index.js"],
|
|
88
|
+
"env": {
|
|
89
|
+
"OUTPUT_DIR": "C:\\seo-audits",
|
|
90
|
+
"DEBUG": "false"
|
|
91
|
+
}
|
|
92
|
+
}
|
|
93
|
+
}
|
|
94
|
+
}
|
|
95
|
+
```
|
|
96
|
+
|
|
97
|
+
**Environment Variables:**
|
|
98
|
+
- `OUTPUT_DIR`: Directory where crawl results are saved (required)
|
|
99
|
+
- `DEBUG`: Set to `"true"` to enable verbose debug logging (optional, default: `"false"`)
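To make the role of the two variables concrete, here is a minimal sketch that builds the same `mcpServers` entry in code. The helper name `buildServerEntry` is hypothetical and not part of the package; only the `npx` invocation and the `OUTPUT_DIR` / `DEBUG` names come from the config shown above.

```typescript
// Hypothetical helper: builds the Claude Desktop config entry shown above.
interface ServerEntry {
  command: string;
  args: string[];
  env: Record<string, string>;
}

function buildServerEntry(outputDir: string, debug = false): ServerEntry {
  if (outputDir.trim() === "") {
    // OUTPUT_DIR is documented as required, so fail early.
    throw new Error("OUTPUT_DIR is required");
  }
  return {
    command: "npx",
    args: ["-y", "@houtini/seo-crawler-mcp"],
    // Environment values are strings, so the boolean is serialised.
    env: { OUTPUT_DIR: outputDir, DEBUG: String(debug) },
  };
}

const config = {
  mcpServers: { "seo-crawler-mcp": buildServerEntry("C:\\seo-audits") },
};
console.log(JSON.stringify(config, null, 2));
```

`JSON.stringify` takes care of escaping the backslashes in the Windows path, which is an easy thing to get wrong when editing the config file by hand.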
|
|
100
|
+
|
|
101
|
+
**CLI Usage for Local Development:**
|
|
102
|
+
|
|
103
|
+
When running the CLI from a local build (not installed via npm), use `node` directly:
|
|
104
|
+
|
|
105
|
+
```bash
|
|
106
|
+
# Run crawl
|
|
107
|
+
node C:\MCP\seo-crawler-mcp\build\cli.js crawl https://example.com --max-pages=20
|
|
108
|
+
|
|
109
|
+
# Analyze results
|
|
110
|
+
node C:\MCP\seo-crawler-mcp\build\cli.js analyze C:\seo-audits\example.com_2026-02-02_abc123
|
|
111
|
+
|
|
112
|
+
# List queries
|
|
113
|
+
node C:\MCP\seo-crawler-mcp\build\cli.js queries --category=critical
|
|
114
|
+
```
|
|
115
|
+
|
|
116
|
+
---
|
|
117
|
+
|
|
118
|
+
## CLI Mode (Terminal Usage)
|
|
119
|
+
|
|
120
|
+
For large crawls or background processing, you can run crawls directly from the terminal.
|
|
121
|
+
|
|
122
|
+
**Note:** These examples use `npx`, which fetches the published package from npm on demand. For local development, see the "Development Install" section above.
|
|
134
|
+
|
|
135
|
+
### Run a Crawl
|
|
136
|
+
|
|
137
|
+
```bash
|
|
138
|
+
# Basic crawl
|
|
139
|
+
npx @houtini/seo-crawler-mcp crawl https://example.com
|
|
140
|
+
|
|
141
|
+
# Large crawl with custom settings
|
|
142
|
+
npx @houtini/seo-crawler-mcp crawl https://example.com --max-pages=5000 --depth=5
|
|
143
|
+
|
|
144
|
+
# Using Googlebot user agent
|
|
145
|
+
npx @houtini/seo-crawler-mcp crawl https://example.com --user-agent=googlebot
|
|
146
|
+
```
|
|
147
|
+
|
|
148
|
+
### Quick Analysis
|
|
149
|
+
|
|
150
|
+
```bash
|
|
151
|
+
# Show summary statistics
|
|
152
|
+
npx @houtini/seo-crawler-mcp analyze C:/seo-audits/example.com_2026-02-01_abc123
|
|
153
|
+
|
|
154
|
+
# Detailed JSON output
|
|
155
|
+
npx @houtini/seo-crawler-mcp analyze C:/seo-audits/example.com_2026-02-01_abc123 --format=detailed
|
|
156
|
+
```
|
|
157
|
+
|
|
158
|
+
### List Available Queries
|
|
159
|
+
|
|
160
|
+
```bash
|
|
161
|
+
# All queries
|
|
162
|
+
npx @houtini/seo-crawler-mcp queries
|
|
163
|
+
|
|
164
|
+
# Security queries only
|
|
165
|
+
npx @houtini/seo-crawler-mcp queries --category=security
|
|
166
|
+
|
|
167
|
+
# Critical priority queries
|
|
168
|
+
npx @houtini/seo-crawler-mcp queries --priority=CRITICAL
|
|
169
|
+
```
|
|
170
|
+
|
|
171
|
+
### Workflow: Terminal + Claude
|
|
172
|
+
|
|
173
|
+
1. **Run large crawl from terminal** (runs in background, can close terminal)
|
|
174
|
+
```bash
|
|
175
|
+
npx @houtini/seo-crawler-mcp crawl https://bigsite.com --max-pages=5000
|
|
176
|
+
```
|
|
177
|
+
|
|
178
|
+
2. **Get the output path** from the crawl results
|
|
179
|
+
```
|
|
180
|
+
Output Path: C:\seo-audits\bigsite.com_2026-02-02T10-15-30_abc123
|
|
181
|
+
```
|
|
182
|
+
|
|
183
|
+
3. **In Claude Desktop, analyze with AI**
|
|
184
|
+
```
|
|
185
|
+
Analyze the crawl at C:\seo-audits\bigsite.com_2026-02-02T10-15-30_abc123
|
|
186
|
+
Show me the critical issues
|
|
187
|
+
What are the biggest SEO problems?
|
|
188
|
+
Give me a detailed report on broken internal links
|
|
189
|
+
```
|
|
190
|
+
|
|
191
|
+
This workflow is perfect for:
|
|
192
|
+
- Large sites (1000+ pages) where you want the crawl to run overnight
|
|
193
|
+
- Multiple sites you want to crawl in batch
|
|
194
|
+
- Automated crawling via cron jobs or scheduled tasks
|
|
195
|
+
- Keeping a terminal-based workflow whilst using Claude for intelligent analysis
|
|
196
|
+
|
|
197
|
+
---
|
|
198
|
+
|
|
199
|
+
## How to Use This
|
|
200
|
+
|
|
201
|
+
### Complete SEO Audit
|
|
202
|
+
|
|
203
|
+
The typical workflow goes like this:
|
|
204
|
+
|
|
205
|
+
1. **Crawl the website**
|
|
206
|
+
```
|
|
207
|
+
Use seo-crawler-mcp to crawl https://example.com with maxPages=2000
|
|
208
|
+
```
|
|
209
|
+
|
|
210
|
+
2. **Run the analysis**
|
|
211
|
+
```
|
|
212
|
+
Analyse the crawl at C:/seo-audits/example.com_2026-02-01_abc123
|
|
213
|
+
```
|
|
214
|
+
|
|
215
|
+
3. **Investigate specific issues**
|
|
216
|
+
```
|
|
217
|
+
Show me the broken internal links from that crawl
|
|
218
|
+
```
|
|
219
|
+
|
|
220
|
+
Claude handles the rest - calling the right tools, parsing the results, and presenting everything in a readable format.
|
|
221
|
+
|
|
222
|
+
### Security Audit
|
|
223
|
+
|
|
224
|
+
If you're specifically worried about security headers:
|
|
225
|
+
|
|
226
|
+
1. **List available security queries**
|
|
227
|
+
```
|
|
228
|
+
What security checks can you run on an SEO crawl?
|
|
229
|
+
```
|
|
230
|
+
|
|
231
|
+
2. **Run security-focused analysis**
|
|
232
|
+
```
|
|
233
|
+
Check the security issues in crawl C:/seo-audits/example.com_2026-02-01_abc123
|
|
234
|
+
```
|
|
235
|
+
|
|
236
|
+
3. **Deep dive on specific problems**
|
|
237
|
+
```
|
|
238
|
+
Show me all pages with unsafe external links
|
|
239
|
+
```
|
|
240
|
+
|
|
241
|
+
---
|
|
242
|
+
|
|
243
|
+
## What Gets Detected
|
|
244
|
+
|
|
245
|
+
The analysis engine includes 28 comprehensive SEO checks across five categories:
|
|
246
|
+
|
|
247
|
+
### Critical Issues (4 checks)
|
|
248
|
+
- Missing title tags - pages without titles don't rank
|
|
249
|
+
- Broken internal links - 404/5xx responses that hurt crawlability
|
|
250
|
+
- Server errors - 5xx responses indicating site problems
|
|
251
|
+
- 404 errors - broken pages that need fixing or redirecting
|
|
252
|
+
|
|
253
|
+
### Content Quality (7 checks)
|
|
254
|
+
- Duplicate titles across different pages
|
|
255
|
+
- Duplicate meta descriptions
|
|
256
|
+
- Duplicate H1 tags
|
|
257
|
+
- Missing meta descriptions
|
|
258
|
+
- Missing H1 tags
|
|
259
|
+
- Multiple H1 tags on single pages
|
|
260
|
+
- Thin content - pages under 300 words
|
|
261
|
+
|
|
262
|
+
### Technical SEO (5 checks)
|
|
263
|
+
- Redirect chains and loops
|
|
264
|
+
- Orphan pages with no internal links
|
|
265
|
+
- Canonical URL mismatches
|
|
266
|
+
- Non-HTTPS pages still in use
|
|
267
|
+
- Heading hierarchy problems (H3 before H2, etc)
|
|
268
|
+
|
|
269
|
+
### Security (6 checks)
|
|
270
|
+
- Missing Content-Security-Policy headers
|
|
271
|
+
- Missing HSTS (Strict-Transport-Security)
|
|
272
|
+
- Missing X-Frame-Options (clickjacking protection)
|
|
273
|
+
- Missing Referrer-Policy
|
|
274
|
+
- Unsafe external links (target="_blank" without rel="noopener")
|
|
275
|
+
- Protocol-relative links (//example.com)
|
|
276
|
+
|
|
277
|
+
### Optimisation (6 checks)
|
|
278
|
+
- Title tags too long or too short
|
|
279
|
+
- Meta descriptions length issues
|
|
280
|
+
- Title matches H1 (opportunity for differentiation)
|
|
281
|
+
- Pages with no outbound links
|
|
282
|
+
- Pages with excessive external links
|
|
283
|
+
- Pages missing images
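As a quick cross-check, the per-category counts in the headings above can be tallied. This is plain arithmetic over those numbers; the object name is illustrative and not something the package exposes.

```typescript
// Check counts per category, as listed in the headings above.
const checksPerCategory: Record<string, number> = {
  critical: 4,
  content: 7,
  technical: 5,
  security: 6,
  opportunities: 6,
};

// Sum the counts across all five categories.
const totalChecks = Object.values(checksPerCategory).reduce((sum, n) => sum + n, 0);
console.log(totalChecks); // 28
```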
|
|
284
|
+
|
|
285
|
+
---
|
|
286
|
+
|
|
287
|
+
## Data Storage
|
|
288
|
+
|
|
289
|
+
The crawler stores everything in SQLite databases organised by domain and date:
|
|
290
|
+
|
|
291
|
+
```
|
|
292
|
+
C:/seo-audits/example.com_2026-02-01_abc123/
|
|
293
|
+
├── crawl-data.db # SQLite database
|
|
294
|
+
│ ├── pages # Every page crawled
|
|
295
|
+
│ ├── links # All link relationships
|
|
296
|
+
│ ├── errors # Crawl errors
|
|
297
|
+
│ └── crawl_metadata # Statistics
|
|
298
|
+
├── config.json # Crawl settings
|
|
299
|
+
└── crawl-export.csv # Optional CSV export
|
|
300
|
+
```
|
|
301
|
+
|
|
302
|
+
---
|
|
303
|
+
|
|
304
|
+
## Performance
|
|
305
|
+
|
|
306
|
+
Typical crawl performance metrics:
|
|
307
|
+
|
|
308
|
+
**Crawl Speed:**
|
|
309
|
+
- Medium site (1,500-2,000 pages): ~15 minutes
|
|
310
|
+
- 300,000+ link relationships tracked
|
|
311
|
+
- Database size: ~15MB for 2,000 pages
|
|
312
|
+
|
|
313
|
+
**Query Performance:**
|
|
314
|
+
- Simple queries: under 10ms
|
|
315
|
+
- Complex queries: under 100ms
|
|
316
|
+
- Join queries: under 200ms
|
|
317
|
+
- Full analysis: under 600ms
|
|
318
|
+
|
|
319
|
+
The SQLite approach works well here. Everything stays local, no API rate limits to worry about, and the query performance is more than adequate for SEO analysis.
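For planning a bigger crawl, the figures above can be turned into a back-of-the-envelope estimate. The sketch below assumes crawl time and database size scale linearly with page count, which is my assumption and not something measured here; real sites will vary with response times and page weight.

```typescript
// Reference figures quoted above: ~2,000 pages in ~15 minutes, ~15 MB database.
const REF_PAGES = 2000;
const REF_MINUTES = 15;
const REF_DB_MB = 15;

// Assumption: time and database size scale linearly with page count.
function estimateCrawl(pages: number): { minutes: number; dbMegabytes: number } {
  const scale = pages / REF_PAGES;
  return { minutes: REF_MINUTES * scale, dbMegabytes: REF_DB_MB * scale };
}

console.log(estimateCrawl(5000)); // { minutes: 37.5, dbMegabytes: 37.5 }
```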
|
|
320
|
+
|
|
321
|
+
---
|
|
322
|
+
|
|
323
|
+
## Limitations
|
|
324
|
+
|
|
325
|
+
There are 4 additional checks planned for v3.0:
|
|
326
|
+
|
|
327
|
+
- **Core Web Vitals** - requires Playwright for real browser metrics
|
|
328
|
+
- **Robots.txt validation** - needs parser library
|
|
329
|
+
- **Readability scoring** - requires text analysis library
|
|
330
|
+
- **Mobile rendering issues** - needs device emulation
|
|
331
|
+
|
|
332
|
+
The current 28 checks cover the most critical aspects of technical SEO that directly impact search engine crawling, indexing, and ranking.
|
|
333
|
+
|
|
334
|
+
---
|
|
335
|
+
|
|
336
|
+
## Technical Details
|
|
337
|
+
|
|
338
|
+
**Built with:**
|
|
339
|
+
- TypeScript 5.3
|
|
340
|
+
- Crawlee 3.7 (HttpCrawler)
|
|
341
|
+
- better-sqlite3 12.6
|
|
342
|
+
- Cheerio 1.0 (HTML parsing)
|
|
343
|
+
- MCP SDK 1.0
|
|
344
|
+
|
|
345
|
+
The code uses ES modules throughout, with proper Zod validation on inputs and comprehensive error handling. I've kept the architecture clean - separate modules for crawling, analysis, formatting, and tool definitions.
|
|
346
|
+
|
|
347
|
+
**Deployment:**
|
|
348
|
+
- Local MCP server via Node.js
|
|
349
|
+
- No external services required (everything runs locally)
|
|
350
|
+
- Configurable output directory
|
|
351
|
+
- Concurrent crawling (5 workers)
|
|
352
|
+
|
|
353
|
+
---
|
|
354
|
+
|
|
355
|
+
## MCP Tools Reference
|
|
356
|
+
|
|
357
|
+
### run_seo_audit
|
|
358
|
+
|
|
359
|
+
Crawl a website and extract comprehensive SEO data into SQLite.
|
|
360
|
+
|
|
361
|
+
**Parameters:**
|
|
362
|
+
- `url` (required) - Website URL to crawl
|
|
363
|
+
- `maxPages` (optional) - Maximum pages to crawl (default: 1000)
|
|
364
|
+
- `depth` (optional) - Maximum crawl depth (default: 3)
|
|
365
|
+
- `userAgent` (optional) - "chrome" or "googlebot" (default: "chrome")
|
|
366
|
+
|
|
367
|
+
**Example:**
|
|
368
|
+
```typescript
|
|
369
|
+
run_seo_audit({
|
|
370
|
+
url: "https://example.com",
|
|
371
|
+
maxPages: 2000,
|
|
372
|
+
depth: 5,
|
|
373
|
+
userAgent: "chrome"
|
|
374
|
+
})
|
|
375
|
+
```
|
|
376
|
+
|
|
377
|
+
**Returns:** Crawl ID and output path
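To show how the optional parameters resolve, here is a small sketch that applies the documented defaults. The `RunSeoAuditArgs` type and `withAuditDefaults` helper are hypothetical names for illustration; the package's own validation (it uses Zod) may well differ.

```typescript
type UserAgent = "chrome" | "googlebot";

// Mirrors the parameter list above; the type name itself is illustrative.
interface RunSeoAuditArgs {
  url: string;
  maxPages?: number;
  depth?: number;
  userAgent?: UserAgent;
}

// Hypothetical helper: fills in the documented defaults.
function withAuditDefaults(args: RunSeoAuditArgs): Required<RunSeoAuditArgs> {
  // Simple sanity check only; this is not the package's own validation.
  if (!/^https?:\/\//i.test(args.url)) {
    throw new Error(`Expected an http(s) URL, got: ${args.url}`);
  }
  return {
    url: args.url,
    maxPages: args.maxPages ?? 1000,
    depth: args.depth ?? 3,
    userAgent: args.userAgent ?? "chrome",
  };
}

console.log(withAuditDefaults({ url: "https://example.com", depth: 5 }));
```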
|
|
378
|
+
|
|
379
|
+
---
|
|
380
|
+
|
|
381
|
+
### analyze_seo
|
|
382
|
+
|
|
383
|
+
Run comprehensive SEO analysis on a completed crawl.
|
|
384
|
+
|
|
385
|
+
**Parameters:**
|
|
386
|
+
- `crawlPath` (required) - Path to crawl output directory
|
|
387
|
+
- `format` (optional) - "structured", "summary", or "detailed" (default: "structured")
|
|
388
|
+
- `includeCategories` (optional) - Filter by categories: "critical", "content", "technical", "security", "opportunities"
|
|
389
|
+
- `maxExamplesPerIssue` (optional) - Maximum example URLs per issue (default: 10)
|
|
390
|
+
|
|
391
|
+
**Example:**
|
|
392
|
+
```typescript
|
|
393
|
+
analyze_seo({
|
|
394
|
+
crawlPath: "C:/seo-audits/example.com_2026-02-01_abc123",
|
|
395
|
+
format: "structured",
|
|
396
|
+
includeCategories: ["critical", "security"],
|
|
397
|
+
maxExamplesPerIssue: 5
|
|
398
|
+
})
|
|
399
|
+
```
|
|
400
|
+
|
|
401
|
+
**Returns:** Structured report with issues, affected URLs, and fix recommendations
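If you build the `includeCategories` list dynamically, it can help to check the names before calling the tool. The sketch below separates known from unknown category names; `partitionCategories` is a hypothetical helper, and only the five category names are taken from the parameter list above.

```typescript
// The five categories listed for includeCategories.
const CATEGORIES = ["critical", "content", "technical", "security", "opportunities"] as const;
type Category = (typeof CATEGORIES)[number];

// Hypothetical helper: splits requested names into known and unknown categories.
function partitionCategories(requested: string[]): { valid: Category[]; unknown: string[] } {
  const valid: Category[] = [];
  const unknown: string[] = [];
  for (const name of requested) {
    const match = CATEGORIES.find((c) => c === name);
    if (match) valid.push(match);
    else unknown.push(name);
  }
  return { valid, unknown };
}

console.log(partitionCategories(["critical", "security", "speed"]));
```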
|
|
402
|
+
|
|
403
|
+
---
|
|
404
|
+
|
|
405
|
+
### query_seo_data
|
|
406
|
+
|
|
407
|
+
Execute specific SEO queries by name.
|
|
408
|
+
|
|
409
|
+
**Parameters:**
|
|
410
|
+
- `crawlPath` (required) - Path to crawl output directory
|
|
411
|
+
- `query` (required) - Query name (see list_seo_queries)
|
|
412
|
+
- `limit` (optional) - Maximum results (default: 100)
|
|
413
|
+
|
|
414
|
+
**Example:**
|
|
415
|
+
```typescript
|
|
416
|
+
query_seo_data({
|
|
417
|
+
crawlPath: "C:/seo-audits/example.com_2026-02-01_abc123",
|
|
418
|
+
query: "broken-internal-links",
|
|
419
|
+
limit: 50
|
|
420
|
+
})
|
|
421
|
+
```
|
|
422
|
+
|
|
423
|
+
**Returns:** Query results with affected URLs and context
|
|
424
|
+
|
|
425
|
+
---
|
|
426
|
+
|
|
427
|
+
### list_seo_queries
|
|
428
|
+
|
|
429
|
+
Discover available SEO analysis queries.
|
|
430
|
+
|
|
431
|
+
**Parameters:**
|
|
432
|
+
- `category` (optional) - Filter by category
|
|
433
|
+
- `priority` (optional) - Filter by priority level
|
|
434
|
+
|
|
435
|
+
**Example:**
|
|
436
|
+
```typescript
|
|
437
|
+
list_seo_queries({
|
|
438
|
+
category: "security",
|
|
439
|
+
priority: "HIGH"
|
|
440
|
+
})
|
|
441
|
+
```
|
|
442
|
+
|
|
443
|
+
**Returns:** List of available queries with descriptions and priorities
|
|
444
|
+
|
|
445
|
+
---
|
|
446
|
+
|
|
447
|
+
## Available Queries
|
|
448
|
+
|
|
449
|
+
The analysis engine includes 28 predefined SQL queries organised by category. Each query includes detailed impact analysis and fix recommendations.
|
|
450
|
+
|
|
451
|
+
### Critical Issues (4 queries)
|
|
452
|
+
|
|
453
|
+
**missing-titles**
|
|
454
|
+
- **What it finds:** Pages without title tags
|
|
455
|
+
- **Why it matters:** Title tags are the most important on-page SEO element. Without them, pages are essentially invisible to search engines.
|
|
456
|
+
- **Fix:** Add unique, descriptive title tags (50-60 characters) to all pages immediately.
|
|
457
|
+
|
|
458
|
+
**broken-internal-links**
|
|
459
|
+
- **What it finds:** Internal links pointing to 404/5xx error pages
|
|
460
|
+
- **Why it matters:** Broken links hurt crawlability and waste crawl budget. They create dead ends for users and search engines.
|
|
461
|
+
- **Fix:** Update or remove broken links. Add redirects for moved pages.
|
|
462
|
+
|
|
463
|
+
**server-errors**
|
|
464
|
+
- **What it finds:** Pages returning 5xx status codes
|
|
465
|
+
- **Why it matters:** Indicates server problems that prevent search engines from indexing content.
|
|
466
|
+
- **Fix:** Investigate server issues, check error logs, ensure adequate resources.
|
|
467
|
+
|
|
468
|
+
**not-found-errors**
|
|
469
|
+
- **What it finds:** Pages returning 404 status codes
|
|
470
|
+
- **Why it matters:** Lost indexing opportunities and poor user experience.
|
|
471
|
+
- **Fix:** Add 301 redirects or remove links to non-existent pages.
|
|
472
|
+
|
|
473
|
+
### Content Quality (7 queries)
|
|
474
|
+
|
|
475
|
+
**duplicate-titles**
|
|
476
|
+
- **What it finds:** Multiple pages sharing identical title tags
|
|
477
|
+
- **Why it matters:** Confuses search engines about which page to rank for queries.
|
|
478
|
+
- **Fix:** Make each page's title tag unique and descriptive of its specific content.
|
|
479
|
+
|
|
480
|
+
**duplicate-descriptions**
|
|
481
|
+
- **What it finds:** Multiple pages with identical meta descriptions
|
|
482
|
+
- **Why it matters:** Reduces click-through rates as snippets look identical in search results.
|
|
483
|
+
- **Fix:** Write unique meta descriptions (150-160 characters) for each page.
|
|
484
|
+
|
|
485
|
+
**duplicate-h1s**
|
|
486
|
+
- **What it finds:** Multiple pages sharing the same H1 heading
|
|
487
|
+
- **Why it matters:** H1 tags signal page topic - duplicates dilute topical clarity.
|
|
488
|
+
- **Fix:** Ensure each page has a unique H1 that accurately describes its content.
|
|
489
|
+
|
|
490
|
+
**missing-descriptions**
|
|
491
|
+
- **What it finds:** Pages without meta description tags
|
|
492
|
+
- **Why it matters:** Search engines create their own snippets, often poorly representing content.
|
|
493
|
+
- **Fix:** Add compelling meta descriptions (150-160 characters) for all important pages.
|
|
494
|
+
|
|
495
|
+
**missing-h1s**
|
|
496
|
+
- **What it finds:** Pages without H1 headings
|
|
497
|
+
- **Why it matters:** H1 is a primary signal of page topic and structure.
|
|
498
|
+
- **Fix:** Add descriptive H1 tags to all content pages.
|
|
499
|
+
|
|
500
|
+
**multiple-h1s**
|
|
501
|
+
- **What it finds:** Pages with more than one H1 tag
|
|
502
|
+
- **Why it matters:** Dilutes topical focus and confuses heading hierarchy.
|
|
503
|
+
- **Fix:** Use only one H1 per page. Convert other H1s to H2 or H3.
|
|
504
|
+
|
|
505
|
+
**thin-content**
|
|
506
|
+
- **What it finds:** Pages with less than 300 words of content
|
|
507
|
+
- **Why it matters:** Thin content provides little value and ranks poorly.
|
|
508
|
+
- **Fix:** Expand content with valuable information or consolidate into existing pages.
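The 300-word threshold is easy to sanity-check on your own copy. This is a minimal sketch that counts whitespace-separated words; how the package itself extracts and counts body text is not shown in this README, so treat the counting rule as an assumption.

```typescript
const THIN_CONTENT_WORDS = 300; // threshold quoted above

// Counts whitespace-separated tokens (assumed counting rule).
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// A page is "thin" when it has fewer than 300 words.
function isThinContent(text: string): boolean {
  return countWords(text) < THIN_CONTENT_WORDS;
}

console.log(isThinContent("A short page with only a handful of words."));
```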
|
|
509
|
+
|
|
510
|
+
### Technical SEO (5 queries)
|
|
511
|
+
|
|
512
|
+
**redirect-pages**
|
|
513
|
+
- **What it finds:** Pages that redirect to other URLs
|
|
514
|
+
- **Why it matters:** Multiple redirects waste crawl budget and slow page loads.
|
|
515
|
+
- **Fix:** Update internal links to point directly to final destination.
|
|
516
|
+
|
|
517
|
+
**redirect-chains**
|
|
518
|
+
- **What it finds:** URLs that redirect multiple times before reaching destination
|
|
519
|
+
- **Why it matters:** Each redirect adds latency and risks breaking the chain.
|
|
520
|
+
- **Fix:** Implement direct redirects from source to final destination.
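To illustrate what a redirect chain (or loop) looks like in data terms, here is a sketch that follows a source-to-target map and counts hops. The `RedirectMap` shape and `traceRedirects` name are assumptions for illustration; the package does this analysis in SQL and its schema is not reproduced here.

```typescript
// Maps a URL to the URL it redirects to (illustrative data shape).
type RedirectMap = Map<string, string>;

// Follows redirects from `start`; reports hop count, end URL, and loops.
function traceRedirects(
  start: string,
  redirects: RedirectMap,
): { hops: number; final: string; loop: boolean } {
  const seen = new Set<string>([start]);
  let current = start;
  let hops = 0;
  while (redirects.has(current)) {
    const next = redirects.get(current)!;
    hops += 1;
    // Revisiting a URL means the redirects form a loop.
    if (seen.has(next)) return { hops, final: next, loop: true };
    seen.add(next);
    current = next;
  }
  return { hops, final: current, loop: false };
}

const sample: RedirectMap = new Map([["/old", "/interim"], ["/interim", "/new"]]);
console.log(traceRedirects("/old", sample)); // { hops: 2, final: '/new', loop: false }
```

Anything with more than one hop is a chain worth flattening into a single redirect.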
|
|
521
|
+
|
|
522
|
+
**orphan-pages**
|
|
523
|
+
- **What it finds:** Pages with no internal links pointing to them
|
|
524
|
+
- **Why it matters:** Search engines may never discover orphan pages.
|
|
525
|
+
- **Fix:** Add internal links from relevant pages to connect orphans to site structure.
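The idea behind orphan detection can be sketched in a few lines: collect every link target, then report pages that never appear as one. The `Link` shape and `findOrphans` name below are assumptions for illustration and do not reflect the package's actual tables or SQL.

```typescript
// One internal link from a source page to a target page (illustrative shape).
interface Link {
  from: string;
  to: string;
}

// Returns crawled pages that no other page links to. The start URL is excluded.
function findOrphans(pages: string[], links: Link[], startUrl: string): string[] {
  const linked = new Set<string>();
  for (const { from, to } of links) {
    if (from !== to) linked.add(to); // ignore self-links
  }
  return pages.filter((p) => p !== startUrl && !linked.has(p));
}

const pages = ["/", "/about", "/old-landing"];
const links: Link[] = [{ from: "/", to: "/about" }];
console.log(findOrphans(pages, links, "/")); // [ '/old-landing' ]
```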
|
|
526
|
+
|
|
527
|
+
**canonical-issues**
|
|
528
|
+
- **What it finds:** Pages where canonical URL doesn't match actual URL
|
|
529
|
+
- **Why it matters:** Signals duplicate content or indexing preference conflicts.
|
|
530
|
+
- **Fix:** Ensure canonical tags point to the correct version of each page.
|
|
531
|
+
|
|
532
|
+
**non-https-pages**
|
|
533
|
+
- **What it finds:** Pages still using HTTP instead of HTTPS
|
|
534
|
+
- **Why it matters:** Security risk, ranking penalty, and browser warnings.
|
|
535
|
+
- **Fix:** Implement HTTPS across entire site with proper redirects.
|
|
536
|
+
|
|
537
|
+
### Security (6 queries)
|
|
538
|
+
|
|
539
|
+
**missing-csp**
|
|
540
|
+
- **What it finds:** Pages without Content-Security-Policy headers
|
|
541
|
+
- **Why it matters:** Vulnerability to XSS attacks and code injection.
|
|
542
|
+
- **Fix:** Implement CSP headers to control resource loading.
|
|
543
|
+
|
|
544
|
+
**missing-hsts**
|
|
545
|
+
- **What it finds:** Pages without Strict-Transport-Security headers
|
|
546
|
+
- **Why it matters:** Allows protocol downgrade attacks.
|
|
547
|
+
- **Fix:** Add HSTS headers to enforce HTTPS connections.
|
|
548
|
+
|
|
549
|
+
**missing-x-frame-options**
|
|
550
|
+
- **What it finds:** Pages without X-Frame-Options headers
|
|
551
|
+
- **Why it matters:** Vulnerability to clickjacking attacks.
|
|
552
|
+
- **Fix:** Add X-Frame-Options headers (DENY or SAMEORIGIN).
|
|
553
|
+
|
|
554
|
+
**missing-referrer-policy**
|
|
555
|
+
- **What it finds:** Pages without Referrer-Policy headers
|
|
556
|
+
- **Why it matters:** Potential privacy and security leakage.
|
|
557
|
+
- **Fix:** Implement appropriate referrer policy for your use case.
|
|
558
|
+
|
|
559
|
+
**unsafe-external-links**
|
|
560
|
+
- **What it finds:** Links with target="_blank" but without rel="noopener"
|
|
561
|
+
- **Why it matters:** Security vulnerability allowing opened page to control opener window.
|
|
562
|
+
- **Fix:** Add rel="noopener noreferrer" to all target="_blank" links.
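For checking individual links by hand, the rule can be sketched as below. The `Anchor` shape is illustrative. Note one deliberate difference from the description above: this sketch also accepts `rel="noreferrer"` on its own, since the HTML specification treats `noreferrer` as implying `noopener`; the package's own query may be stricter.

```typescript
// The attributes of an <a> element that matter for this check (illustrative shape).
interface Anchor {
  href: string;
  target?: string;
  rel?: string;
}

// True when the link opens a new window without noopener/noreferrer protection.
function isUnsafeBlankLink(a: Anchor): boolean {
  if ((a.target ?? "").toLowerCase() !== "_blank") return false;
  const relTokens = (a.rel ?? "").toLowerCase().split(/\s+/).filter(Boolean);
  return !relTokens.includes("noopener") && !relTokens.includes("noreferrer");
}

console.log(isUnsafeBlankLink({ href: "https://example.org", target: "_blank" })); // true
```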
|
|
563
|
+
|
|
564
|
+
**protocol-relative-links**
|
|
565
|
+
- **What it finds:** Links using // instead of https://
|
|
566
|
+
- **Why it matters:** Can cause mixed content issues and security warnings.
|
|
567
|
+
- **Fix:** Use absolute HTTPS URLs for all external resources.
|
|
568
|
+
|
|
569
|
+
### Optimisation Opportunities (6 queries)
|
|
570
|
+
|
|
571
|
+
**title-length-issues**
|
|
572
|
+
- **What it finds:** Title tags shorter than 30 characters or longer than 60
|
|
573
|
+
- **Why it matters:** Titles that are too short waste an opportunity; titles that are too long get truncated in search results.
|
|
574
|
+
- **Fix:** Aim for 50-60 characters for optimal display in search results.
|
|
575
|
+
|
|
576
|
+
**description-length-issues**
|
|
577
|
+
- **What it finds:** Meta descriptions shorter than 120 or longer than 160 characters
- **Why it matters:** Poor descriptions reduce click-through rates.
- **Fix:** Write descriptions of 150 to 160 characters for full display.

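
The length thresholds above (30 to 60 characters for titles, 120 to 160 for descriptions) can be sketched as a single range check. The function names and the extra `"missing"` verdict are assumptions for illustration; the package runs these checks as SQL queries, which are not shown here. Note that `length` counts UTF-16 code units, which can differ from visible characters for some scripts and emoji.

```typescript
// Hypothetical helpers; thresholds are taken from the query descriptions above.
type LengthVerdict = "missing" | "too-short" | "too-long" | "ok";

function checkLength(
  text: string | null | undefined,
  min: number,
  max: number,
): LengthVerdict {
  const length = (text ?? "").trim().length;
  if (length === 0) return "missing"; // assumption: empty values are reported separately
  if (length < min) return "too-short";
  if (length > max) return "too-long";
  return "ok";
}

const checkTitle = (title: string | null | undefined): LengthVerdict =>
  checkLength(title, 30, 60);

const checkDescription = (description: string | null | undefined): LengthVerdict =>
  checkLength(description, 120, 160);

console.log(checkTitle("Home"));
console.log(checkTitle("Technical SEO Audits for Growing Ecommerce Sites"));
console.log(checkDescription("Short description."));
```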
**title-equals-h1**
- **What it finds:** Pages where title tag matches H1 exactly
- **Why it matters:** Missed opportunity to target different keywords or angles.
- **Fix:** Make title and H1 complementary but not identical for broader keyword coverage.

**no-outbound-links**
- **What it finds:** Pages with zero external links
- **Why it matters:** Can appear spammy or siloed; linking to quality sources builds trust.
- **Fix:** Add relevant external links to authoritative sources where appropriate.

**high-external-links**
- **What it finds:** Pages with excessive external links (20+)
- **Why it matters:** Can appear spammy and leaks PageRank unnecessarily.
- **Fix:** Reduce external links to most relevant and valuable resources.

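
Both external-link checks above depend on counting links that leave the site. One way to sketch that count uses the standard `URL` class. Treating "external" as "different hostname" is an assumption here (so subdomains count as external), and the function names are hypothetical.

```typescript
// Hypothetical helpers; "external" is assumed to mean a different hostname.
function countExternalLinks(pageUrl: string, hrefs: string[]): number {
  const pageHost = new URL(pageUrl).hostname;
  let count = 0;
  for (const href of hrefs) {
    let target: URL;
    try {
      // Resolves relative and protocol-relative hrefs against the page URL.
      target = new URL(href, pageUrl);
    } catch {
      continue; // skip hrefs that cannot be parsed
    }
    const isHttp = target.protocol === "http:" || target.protocol === "https:";
    if (isHttp && target.hostname !== pageHost) count += 1;
  }
  return count;
}

type ExternalLinkVerdict = "no-outbound-links" | "high-external-links" | "ok";

// Thresholds follow the descriptions above: zero links, or 20 and more.
function classifyExternalLinks(count: number): ExternalLinkVerdict {
  if (count === 0) return "no-outbound-links";
  if (count >= 20) return "high-external-links";
  return "ok";
}

const externalCount = countExternalLinks("https://example.com/page", [
  "/about",
  "https://other.org/x",
  "//cdn.other.net/lib.js",
  "mailto:hello@example.com",
]);
console.log(externalCount, classifyExternalLinks(externalCount));
```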
**missing-images**
- **What it finds:** Pages without any images
- **Why it matters:** Images improve engagement and provide additional ranking signals.
- **Fix:** Add relevant, optimized images with proper alt text.

---

### Using Queries

**In Claude Desktop:**
```
List all available queries
Show me the critical queries only
Run the missing-titles query on my crawl
What does the orphan-pages query check for?
```

**In CLI:**
```bash
# List all queries
seo-crawler-mcp queries

# Filter by category
seo-crawler-mcp queries --category=security

# Filter by priority
seo-crawler-mcp queries --priority=CRITICAL
```

Each query returns:
- Affected URLs
- Relevant context (word count, status codes, etc.)
- Count of affected pages
- Results organized by severity

---

## Development

```bash
# Build
npm run build

# Development mode
npm run dev

# Run tests
npm test
```

---

## Version History

### v2.0.1 (2026-02-02)
- Fixed MemoryStorage cleanup bug (added explicit purge in finally block)
- Added CLI mode for terminal-based crawling
- Removed proprietary tool references from documentation
- Guarantees a fresh state between consecutive crawls

### v2.0.0 (2026-02-01)
- Added comprehensive SQL-based analysis engine
- 28 SEO queries covering industry-standard audit requirements
- Three analysis tools: analyze_seo, query_seo_data, list_seo_queries
- 86% coverage of standard SEO audit requirements

### v1.1.0 (2026-02-01)
- Enhanced data collection with security headers
- Heading structure validation (H1-H6)
- Link security analysis
- Response time accuracy improvements

### v1.0.0 (2026-01-31)
- Initial release with SQLite storage
- LibreCrawl pattern implementation
- Basic crawl tool (run_seo_audit)

---

## Licence

Apache License 2.0

Copyright 2026 Richard Baxter

This product includes software developed by Apify and the Crawlee project.
See NOTICE file for details.

---

## Support

**GitHub:** https://github.com/houtini-ai/seo-crawler-mcp
**Issues:** https://github.com/houtini-ai/seo-crawler-mcp/issues
**Author:** Richard Baxter <hello@houtini.com>

---

**Tags:** seo, crawler, audit, technical-seo, mcp, crawlee, sqlite, web-scraping, site-analysis