@houtini/seo-crawler-mcp 2.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (121)
  1. package/.github/workflows/ci.yml +59 -0
  2. package/LICENSE +190 -0
  3. package/NOTICE +8 -0
  4. package/README.md +694 -0
  5. package/build/analyzers/QueryLoader.d.ts +30 -0
  6. package/build/analyzers/QueryLoader.d.ts.map +1 -0
  7. package/build/analyzers/QueryLoader.js +126 -0
  8. package/build/analyzers/QueryLoader.js.map +1 -0
  9. package/build/cli.d.ts +3 -0
  10. package/build/cli.d.ts.map +1 -0
  11. package/build/cli.js +190 -0
  12. package/build/cli.js.map +1 -0
  13. package/build/core/ContentExtractor.d.ts +30 -0
  14. package/build/core/ContentExtractor.d.ts.map +1 -0
  15. package/build/core/ContentExtractor.js +362 -0
  16. package/build/core/ContentExtractor.js.map +1 -0
  17. package/build/core/CrawlDatabase.d.ts +25 -0
  18. package/build/core/CrawlDatabase.d.ts.map +1 -0
  19. package/build/core/CrawlDatabase.js +603 -0
  20. package/build/core/CrawlDatabase.js.map +1 -0
  21. package/build/core/CrawlOrchestrator.d.ts +27 -0
  22. package/build/core/CrawlOrchestrator.d.ts.map +1 -0
  23. package/build/core/CrawlOrchestrator.js +279 -0
  24. package/build/core/CrawlOrchestrator.js.map +1 -0
  25. package/build/core/CrawlStorage.d.ts +33 -0
  26. package/build/core/CrawlStorage.d.ts.map +1 -0
  27. package/build/core/CrawlStorage.js +94 -0
  28. package/build/core/CrawlStorage.js.map +1 -0
  29. package/build/core/LinkExtractor.d.ts +14 -0
  30. package/build/core/LinkExtractor.d.ts.map +1 -0
  31. package/build/core/LinkExtractor.js +91 -0
  32. package/build/core/LinkExtractor.js.map +1 -0
  33. package/build/core/UrlManager.d.ts +21 -0
  34. package/build/core/UrlManager.d.ts.map +1 -0
  35. package/build/core/UrlManager.js +87 -0
  36. package/build/core/UrlManager.js.map +1 -0
  37. package/build/formatters/structured-report-format.d.ts +48 -0
  38. package/build/formatters/structured-report-format.d.ts.map +1 -0
  39. package/build/formatters/structured-report-format.js +145 -0
  40. package/build/formatters/structured-report-format.js.map +1 -0
  41. package/build/index.d.ts +3 -0
  42. package/build/index.d.ts.map +1 -0
  43. package/build/index.js +214 -0
  44. package/build/index.js.map +1 -0
  45. package/build/schema/index.d.ts +627 -0
  46. package/build/schema/index.d.ts.map +1 -0
  47. package/build/schema/index.js +159 -0
  48. package/build/schema/index.js.map +1 -0
  49. package/build/tools/analyze-seo.d.ts +44 -0
  50. package/build/tools/analyze-seo.d.ts.map +1 -0
  51. package/build/tools/analyze-seo.js +110 -0
  52. package/build/tools/analyze-seo.js.map +1 -0
  53. package/build/tools/list-queries.d.ts +28 -0
  54. package/build/tools/list-queries.d.ts.map +1 -0
  55. package/build/tools/list-queries.js +30 -0
  56. package/build/tools/list-queries.js.map +1 -0
  57. package/build/tools/query-seo-data.d.ts +15 -0
  58. package/build/tools/query-seo-data.d.ts.map +1 -0
  59. package/build/tools/query-seo-data.js +43 -0
  60. package/build/tools/query-seo-data.js.map +1 -0
  61. package/build/tools/run-seo-audit.d.ts +3 -0
  62. package/build/tools/run-seo-audit.d.ts.map +1 -0
  63. package/build/tools/run-seo-audit.js +54 -0
  64. package/build/tools/run-seo-audit.js.map +1 -0
  65. package/build/types/index.d.ts +158 -0
  66. package/build/types/index.d.ts.map +1 -0
  67. package/build/types/index.js +2 -0
  68. package/build/types/index.js.map +1 -0
  69. package/build/utils/debug.d.ts +2 -0
  70. package/build/utils/debug.d.ts.map +1 -0
  71. package/build/utils/debug.js +7 -0
  72. package/build/utils/debug.js.map +1 -0
  73. package/package.json +49 -0
  74. package/server.json +31 -0
  75. package/src/analyzers/QueryLoader.ts +175 -0
  76. package/src/analyzers/queries/README.md +228 -0
  77. package/src/analyzers/queries/content/duplicate-h1.sql +18 -0
  78. package/src/analyzers/queries/content/duplicate-meta-descriptions.sql +18 -0
  79. package/src/analyzers/queries/content/duplicate-titles.sql +19 -0
  80. package/src/analyzers/queries/content/missing-h1.sql +18 -0
  81. package/src/analyzers/queries/content/missing-meta-descriptions.sql +19 -0
  82. package/src/analyzers/queries/content/multiple-h1.sql +17 -0
  83. package/src/analyzers/queries/content/thin-content.sql +18 -0
  84. package/src/analyzers/queries/critical/404-errors.sql +14 -0
  85. package/src/analyzers/queries/critical/broken-internal-links.sql +20 -0
  86. package/src/analyzers/queries/critical/missing-titles.sql +17 -0
  87. package/src/analyzers/queries/critical/server-errors.sql +15 -0
  88. package/src/analyzers/queries/opportunities/high-external-links.sql +18 -0
  89. package/src/analyzers/queries/opportunities/meta-description-length.sql +27 -0
  90. package/src/analyzers/queries/opportunities/missing-images.sql +18 -0
  91. package/src/analyzers/queries/opportunities/no-outbound-links.sql +18 -0
  92. package/src/analyzers/queries/opportunities/title-equals-h1.sql +21 -0
  93. package/src/analyzers/queries/opportunities/title-length.sql +27 -0
  94. package/src/analyzers/queries/security/missing-csp.sql +16 -0
  95. package/src/analyzers/queries/security/missing-hsts.sql +17 -0
  96. package/src/analyzers/queries/security/missing-referrer-policy.sql +16 -0
  97. package/src/analyzers/queries/security/missing-x-frame-options.sql +16 -0
  98. package/src/analyzers/queries/security/protocol-relative-links.sql +16 -0
  99. package/src/analyzers/queries/security/unsafe-external-links.sql +17 -0
  100. package/src/analyzers/queries/technical/canonical-issues.sql +20 -0
  101. package/src/analyzers/queries/technical/heading-hierarchy-issues.sql +19 -0
  102. package/src/analyzers/queries/technical/non-https.sql +16 -0
  103. package/src/analyzers/queries/technical/orphan-pages.sql +21 -0
  104. package/src/analyzers/queries/technical/redirects.sql +15 -0
  105. package/src/cli.ts +224 -0
  106. package/src/core/ContentExtractor.ts +480 -0
  107. package/src/core/CrawlDatabase.ts +736 -0
  108. package/src/core/CrawlOrchestrator.ts +346 -0
  109. package/src/core/CrawlStorage.ts +148 -0
  110. package/src/core/LinkExtractor.ts +123 -0
  111. package/src/core/UrlManager.ts +114 -0
  112. package/src/formatters/structured-report-format.ts +254 -0
  113. package/src/index.ts +259 -0
  114. package/src/schema/index.ts +176 -0
  115. package/src/tools/analyze-seo.ts +184 -0
  116. package/src/tools/list-queries.ts +70 -0
  117. package/src/tools/query-seo-data.ts +77 -0
  118. package/src/tools/run-seo-audit.ts +83 -0
  119. package/src/types/index.ts +179 -0
  120. package/src/utils/debug.ts +12 -0
  121. package/tsconfig.json +26 -0
package/README.md ADDED
@@ -0,0 +1,694 @@
1
+ # SEO Crawler MCP - Website Crawler & SEO Analyzer for LLMs
2
+
3
+ [![npm version](https://img.shields.io/npm/v/@houtini/seo-crawler-mcp.svg)](https://www.npmjs.com/package/@houtini/seo-crawler-mcp)
4
+ [![npm downloads](https://img.shields.io/npm/dm/@houtini/seo-crawler-mcp.svg)](https://www.npmjs.com/package/@houtini/seo-crawler-mcp)
5
+ [![Build Status](https://github.com/houtini-ai/seo-crawler-mcp/workflows/CI/badge.svg)](https://github.com/houtini-ai/seo-crawler-mcp/actions)
6
+ [![TypeScript](https://img.shields.io/badge/TypeScript-5.3-blue?logo=typescript&logoColor=white)](https://www.typescriptlang.org/)
7
+ [![Type Definitions](https://img.shields.io/badge/types-included-blue)](https://www.npmjs.com/package/@houtini/seo-crawler-mcp)
8
+ [![Known Vulnerabilities](https://snyk.io/test/github/houtini-ai/seo-crawler-mcp/badge.svg)](https://snyk.io/test/github/houtini-ai/seo-crawler-mcp)
9
+ [![MCP Registry](https://img.shields.io/badge/MCP-Registry-blue?style=flat-square)](https://registry.modelcontextprotocol.io)
10
+ [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
11
+
12
+ **Crawl and analyse your website for errors and issues that are likely to affect your site's SEO**
13
+
14
+ I wanted to build on my experience working with the MCP protocol SDK to see just how far we can extend an AI assistant's capabilities. I decided that I'd quite like to build a crawler to check my site's "technical SEO" health and came across Crawlee - which seemed like the ideal library to base the crawl component of my MCP.
15
+
16
+ What's interesting is that MCP usually indicates a server connection of some sort. This is not so with SEO Crawler MCP. The MCP protocol is probably more powerful than I realised - this is a self-contained application wrapped in the MCP SDK that handles everything locally:
17
+
18
+ - Smart request scheduling and queue management
19
+ - Automatic retry logic and error handling
20
+ - Respectful crawling with configurable delays
21
+ - Memory-efficient streaming for large sites
22
+ - Better-SQLite3 embedded database storing every crawled page's HTML, metadata, headers, link relationships, and site structure
23
+ - Custom SQL analysis engine with 28 specialised queries detecting content issues, technical SEO problems, security vulnerabilities, and optimisation opportunities
24
+
25
+ Claude (or your AI assistant of choice) can orchestrate this entire stack through simple function calls. The crawl runs asynchronously, stores everything in SQLite, and then Claude can query that data through natural language - "analyse this crawl for seo opportunities" or "report on internal broken links" - and the MCP server translates that into sophisticated SQL analysis.
26
+
27
+ **You can also run crawls directly from the terminal** - perfect for large sites or background processing. The CLI mode lets you run a crawl, get the output directory, and then hand that over to Claude for AI-powered analysis via the MCP tools.
28
+
29
+ ### Credits
30
+
31
+ The core crawling architecture is inspired by the logic and patterns from the [LibreCrawl](https://github.com/libre-crawl/core) project. We've adapted their proven crawling methodology for use within the MCP protocol whilst adding comprehensive SEO analysis capabilities.
32
+
33
+ ---
34
+
35
+ ## Installation
36
+
37
+ ### For Beginners
38
+
39
+ If you're new to MCP servers, I'd recommend reading these first:
40
+ - [How to Add an MCP Server to Claude Desktop](https://houtini.com/how-to-add-an-mcp-server-to-claude-desktop/)
41
+ - [Claude Desktop Beginner's Guide](https://houtini.com/claude-desktop-beginners-guide/)
42
+
43
+ I'd also suggest installing [Desktop Commander](https://houtini.com/desktop-commander/) first - it's useful for working with the crawl output files. See the [Desktop Commander setup guide](https://github.com/wonderwhy-er/DesktopCommanderMCP) for details.
44
+
45
+ ### Quick Install (NPX)
46
+
47
+ Add this to your Claude Desktop config file:
48
+
49
+ **Windows:** `C:\Users\[YourName]\AppData\Roaming\Claude\claude_desktop_config.json`
50
+ **Mac:** `~/Library/Application Support/Claude/claude_desktop_config.json`
51
+
52
+ ```json
53
+ {
54
+ "mcpServers": {
55
+ "seo-crawler-mcp": {
56
+ "command": "npx",
57
+ "args": ["-y", "@houtini/seo-crawler-mcp"],
58
+ "env": {
59
+ "OUTPUT_DIR": "C:\\seo-audits"
60
+ }
61
+ }
62
+ }
63
+ }
64
+ ```
65
+
66
+ Restart Claude Desktop. Four tools will be available:
67
+ - `seo-crawler-mcp:run_seo_audit`
68
+ - `seo-crawler-mcp:analyze_seo`
69
+ - `seo-crawler-mcp:query_seo_data`
70
+ - `seo-crawler-mcp:list_seo_queries`
71
+
72
+ ### Development Install
73
+
74
+ ```bash
75
+ cd C:\MCP\seo-crawler-mcp
76
+ npm install
77
+ npm run build
78
+ ```
79
+
80
+ Then use the local path in your config:
81
+
82
+ ```json
83
+ {
84
+ "mcpServers": {
85
+ "seo-crawler-mcp": {
86
+ "command": "node",
87
+ "args": ["C:\\MCP\\seo-crawler-mcp\\build\\index.js"],
88
+ "env": {
89
+ "OUTPUT_DIR": "C:\\seo-audits",
90
+ "DEBUG": "false"
91
+ }
92
+ }
93
+ }
94
+ }
95
+ ```
96
+
97
+ **Environment Variables:**
98
+ - `OUTPUT_DIR`: Directory where crawl results are saved (required)
99
+ - `DEBUG`: Set to `"true"` to enable verbose debug logging (optional, default: `"false"`)
100
+
101
+ **CLI Usage for Local Development:**
102
+
103
+ When running the CLI from a local build (not installed via npm), use `node` directly:
104
+
105
+ ```bash
106
+ # Run crawl
107
+ node C:\MCP\seo-crawler-mcp\build\cli.js crawl https://example.com --max-pages=20
108
+
109
+ # Analyze results
110
+ node C:\MCP\seo-crawler-mcp\build\cli.js analyze C:\seo-audits\example.com_2026-02-02_abc123
111
+
112
+ # List queries
113
+ node C:\MCP\seo-crawler-mcp\build\cli.js queries --category=critical
114
+ ```
115
+
116
+ ---
117
+
118
+ ## CLI Mode (Terminal Usage)
119
+
120
+ For large crawls or background processing, you can run crawls directly from the terminal.
121
+
122
+ **Note:** These examples use `npx`, which downloads and runs the published package without a separate install step. For local development, see the "Development Install" section above.
134
+
135
+ ### Run a Crawl
136
+
137
+ ```bash
138
+ # Basic crawl
139
+ npx @houtini/seo-crawler-mcp crawl https://example.com
140
+
141
+ # Large crawl with custom settings
142
+ npx @houtini/seo-crawler-mcp crawl https://example.com --max-pages=5000 --depth=5
143
+
144
+ # Using Googlebot user agent
145
+ npx @houtini/seo-crawler-mcp crawl https://example.com --user-agent=googlebot
146
+ ```
147
+
148
+ ### Quick Analysis
149
+
150
+ ```bash
151
+ # Show summary statistics
152
+ npx @houtini/seo-crawler-mcp analyze C:/seo-audits/example.com_2026-02-01_abc123
153
+
154
+ # Detailed JSON output
155
+ npx @houtini/seo-crawler-mcp analyze C:/seo-audits/example.com_2026-02-01_abc123 --format=detailed
156
+ ```
157
+
158
+ ### List Available Queries
159
+
160
+ ```bash
161
+ # All queries
162
+ npx @houtini/seo-crawler-mcp queries
163
+
164
+ # Security queries only
165
+ npx @houtini/seo-crawler-mcp queries --category=security
166
+
167
+ # Critical priority queries
168
+ npx @houtini/seo-crawler-mcp queries --priority=CRITICAL
169
+ ```
170
+
171
+ ### Workflow: Terminal + Claude
172
+
173
+ 1. **Run large crawl from terminal** (runs in background, can close terminal)
174
+ ```bash
175
+ npx @houtini/seo-crawler-mcp crawl https://bigsite.com --max-pages=5000
176
+ ```
177
+
178
+ 2. **Get the output path** from the crawl results
179
+ ```
180
+ Output Path: C:\seo-audits\bigsite.com_2026-02-02T10-15-30_abc123
181
+ ```
182
+
183
+ 3. **In Claude Desktop, analyze with AI**
184
+ ```
185
+ Analyze the crawl at C:\seo-audits\bigsite.com_2026-02-02T10-15-30_abc123
186
+ Show me the critical issues
187
+ What are the biggest SEO problems?
188
+ Give me a detailed report on broken internal links
189
+ ```
190
+
191
+ This workflow is perfect for:
192
+ - Large sites (1000+ pages) where you want the crawl to run overnight
193
+ - Multiple sites you want to crawl in batch
194
+ - Automated crawling via cron jobs or scheduled tasks
195
+ - Keeping terminal-based workflow whilst using Claude for intelligent analysis
196
+
197
+ ---
198
+
199
+ ## How to Use This
200
+
201
+ ### Complete SEO Audit
202
+
203
+ The typical workflow goes like this:
204
+
205
+ 1. **Crawl the website**
206
+ ```
207
+ Use seo-crawler-mcp to crawl https://example.com with maxPages=2000
208
+ ```
209
+
210
+ 2. **Run the analysis**
211
+ ```
212
+ Analyse the crawl at C:/seo-audits/example.com_2026-02-01_abc123
213
+ ```
214
+
215
+ 3. **Investigate specific issues**
216
+ ```
217
+ Show me the broken internal links from that crawl
218
+ ```
219
+
220
+ Claude handles the rest - calling the right tools, parsing the results, and presenting everything in readable format.
221
+
222
+ ### Security Audit
223
+
224
+ If you're specifically worried about security headers:
225
+
226
+ 1. **List available security queries**
227
+ ```
228
+ What security checks can you run on an SEO crawl?
229
+ ```
230
+
231
+ 2. **Run security-focused analysis**
232
+ ```
233
+ Check the security issues in crawl C:/seo-audits/example.com_2026-02-01_abc123
234
+ ```
235
+
236
+ 3. **Deep dive on specific problems**
237
+ ```
238
+ Show me all pages with unsafe external links
239
+ ```
240
+
241
+ ---
242
+
243
+ ## What Gets Detected
244
+
245
+ The analysis engine includes 28 comprehensive SEO checks across five categories:
246
+
247
+ ### Critical Issues (4 checks)
248
+ - Missing title tags - pages without titles don't rank
249
+ - Broken internal links - 404/5xx responses that hurt crawlability
250
+ - Server errors - 5xx responses indicating site problems
251
+ - 404 errors - broken pages that need fixing or redirecting
252
+
253
+ ### Content Quality (7 checks)
254
+ - Duplicate titles across different pages
255
+ - Duplicate meta descriptions
256
+ - Duplicate H1 tags
257
+ - Missing meta descriptions
258
+ - Missing H1 tags
259
+ - Multiple H1 tags on single pages
260
+ - Thin content - pages under 300 words
261
+
262
+ ### Technical SEO (5 checks)
263
+ - Redirect chains and loops
264
+ - Orphan pages with no internal links
265
+ - Canonical URL mismatches
266
+ - Non-HTTPS pages still in use
267
+ - Heading hierarchy problems (H3 before H2, etc)
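To make the heading hierarchy check concrete, here is a minimal sketch of how skipped levels could be detected from the ordered list of heading levels on a page. The function name `findHeadingSkips` and its return shape are hypothetical. The package itself runs this check as a SQL query, so treat this as an illustration of the idea rather than the actual implementation.

```typescript
// Hypothetical helper, not part of the package: flags headings that jump
// more than one level deeper than the heading before them (e.g. H1 -> H3).
interface HeadingSkip {
  index: number; // position of the offending heading in document order
  from: number;  // previous heading level (0 means no heading seen yet)
  to: number;    // level of the offending heading
}

function findHeadingSkips(levels: number[]): HeadingSkip[] {
  const skips: HeadingSkip[] = [];
  let previous = 0; // 0 = nothing seen yet, so a page should start at H1
  levels.forEach((level, index) => {
    // Going deeper by more than one level skips part of the hierarchy.
    if (level > previous + 1) {
      skips.push({ index, from: previous, to: level });
    }
    previous = level;
  });
  return skips;
}
```

Under these assumptions, a page whose headings run H1, H3, H2 yields one finding (the H3 at position 1), while H1, H2, H3, H2, H3 yields none, because moving back up the hierarchy is allowed.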
268
+
269
+ ### Security (6 checks)
270
+ - Missing Content-Security-Policy headers
271
+ - Missing HSTS (Strict-Transport-Security)
272
+ - Missing X-Frame-Options (clickjacking protection)
273
+ - Missing Referrer-Policy
274
+ - Unsafe external links (target="_blank" without rel="noopener")
275
+ - Protocol-relative links (//example.com)
276
+
277
+ ### Optimisation (6 checks)
278
+ - Title tags too long or too short
279
+ - Meta descriptions length issues
280
+ - Title matches H1 (opportunity for differentiation)
281
+ - Pages with no outbound links
282
+ - Pages with excessive external links
283
+ - Pages missing images
284
+
285
+ ---
286
+
287
+ ## Data Storage
288
+
289
+ The crawler stores everything in SQLite databases organised by domain and date:
290
+
291
+ ```
292
+ C:/seo-audits/example.com_2026-02-01_abc123/
293
+ ├── crawl-data.db # SQLite database
294
+ │ ├── pages # Every page crawled
295
+ │ ├── links # All link relationships
296
+ │ ├── errors # Crawl errors
297
+ │ └── crawl_metadata # Statistics
298
+ ├── config.json # Crawl settings
299
+ └── crawl-export.csv # Optional CSV export
300
+ ```
301
+
302
+ ---
303
+
304
+ ## Performance
305
+
306
+ Typical crawl performance metrics:
307
+
308
+ **Crawl Speed:**
309
+ - Medium site (1,500-2,000 pages): ~15 minutes
310
+ - 300,000+ link relationships tracked
311
+ - Database size: ~15MB for 2,000 pages
312
+
313
+ **Query Performance:**
314
+ - Simple queries: under 10ms
315
+ - Complex queries: under 100ms
316
+ - Join queries: under 200ms
317
+ - Full analysis: under 600ms
318
+
319
+ The SQLite approach works well here. Everything stays local, no API rate limits to worry about, and the query performance is more than adequate for SEO analysis.
320
+
321
+ ---
322
+
323
+ ## Limitations
324
+
325
+ There are 4 additional checks planned for v3.0:
326
+
327
+ - **Core Web Vitals** - requires Playwright for real browser metrics
328
+ - **Robots.txt validation** - needs parser library
329
+ - **Readability scoring** - requires text analysis library
330
+ - **Mobile rendering issues** - needs device emulation
331
+
332
+ The current 28 checks cover the most critical aspects of technical SEO that directly impact search engine crawling, indexing, and ranking.
333
+
334
+ ---
335
+
336
+ ## Technical Details
337
+
338
+ **Built with:**
339
+ - TypeScript 5.3
340
+ - Crawlee 3.7 (HttpCrawler)
341
+ - better-sqlite3 12.6
342
+ - Cheerio 1.0 (HTML parsing)
343
+ - MCP SDK 1.0
344
+
345
+ The code uses ES modules throughout, with proper Zod validation on inputs and comprehensive error handling. I've kept the architecture clean - separate modules for crawling, analysis, formatting, and tool definitions.
346
+
347
+ **Deployment:**
348
+ - Local MCP server via Node.js
349
+ - No external services or APIs required
350
+ - Configurable output directory
351
+ - Concurrent crawling (5 workers)
352
+
353
+ ---
354
+
355
+ ## MCP Tools Reference
356
+
357
+ ### run_seo_audit
358
+
359
+ Crawl a website and extract comprehensive SEO data into SQLite.
360
+
361
+ **Parameters:**
362
+ - `url` (required) - Website URL to crawl
363
+ - `maxPages` (optional) - Maximum pages to crawl (default: 1000)
364
+ - `depth` (optional) - Maximum crawl depth (default: 3)
365
+ - `userAgent` (optional) - "chrome" or "googlebot" (default: "chrome")
366
+
367
+ **Example:**
368
+ ```typescript
369
+ run_seo_audit({
370
+ url: "https://example.com",
371
+ maxPages: 2000,
372
+ depth: 5,
373
+ userAgent: "chrome"
374
+ })
375
+ ```
376
+
377
+ **Returns:** Crawl ID and output path
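The documented defaults for `run_seo_audit` can be sketched as a small normalisation step. This is only an illustration: the README says the package validates inputs with Zod, whereas the sketch below uses plain TypeScript so it stays self-contained. The names `AuditParams`, `ResolvedAuditParams` and `resolveAuditParams` are hypothetical, and the http/https restriction is my assumption. Only the parameter names and default values come from the list above.

```typescript
// Hypothetical sketch of applying the documented run_seo_audit defaults.
// The real package uses Zod schemas; none of these names are its API.
type UserAgent = "chrome" | "googlebot";

interface AuditParams {
  url: string;
  maxPages?: number;
  depth?: number;
  userAgent?: UserAgent;
}

interface ResolvedAuditParams {
  url: string;
  maxPages: number;
  depth: number;
  userAgent: UserAgent;
}

function resolveAuditParams(params: AuditParams): ResolvedAuditParams {
  // The URL constructor throws a TypeError on malformed input.
  const parsed = new URL(params.url);
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    throw new Error(`Unsupported protocol: ${parsed.protocol}`);
  }

  // Defaults taken from the parameter list above.
  const maxPages = params.maxPages ?? 1000;
  const depth = params.depth ?? 3;
  const userAgent = params.userAgent ?? "chrome";

  if (!Number.isInteger(maxPages) || maxPages < 1) {
    throw new Error("maxPages must be a positive integer");
  }
  if (!Number.isInteger(depth) || depth < 0) {
    throw new Error("depth must be a non-negative integer");
  }

  return { url: parsed.href, maxPages, depth, userAgent };
}
```

Calling `resolveAuditParams({ url: "https://example.com" })` fills in `maxPages: 1000`, `depth: 3` and `userAgent: "chrome"`, while explicitly passed values are kept as given.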
378
+
379
+ ---
380
+
381
+ ### analyze_seo
382
+
383
+ Run comprehensive SEO analysis on a completed crawl.
384
+
385
+ **Parameters:**
386
+ - `crawlPath` (required) - Path to crawl output directory
387
+ - `format` (optional) - "structured", "summary", or "detailed" (default: "structured")
388
+ - `includeCategories` (optional) - Filter by categories: "critical", "content", "technical", "security", "opportunities"
389
+ - `maxExamplesPerIssue` (optional) - Maximum example URLs per issue (default: 10)
390
+
391
+ **Example:**
392
+ ```typescript
393
+ analyze_seo({
394
+ crawlPath: "C:/seo-audits/example.com_2026-02-01_abc123",
395
+ format: "structured",
396
+ includeCategories: ["critical", "security"],
397
+ maxExamplesPerIssue: 5
398
+ })
399
+ ```
400
+
401
+ **Returns:** Structured report with issues, affected URLs, and fix recommendations
402
+
403
+ ---
404
+
405
+ ### query_seo_data
406
+
407
+ Execute specific SEO queries by name.
408
+
409
+ **Parameters:**
410
+ - `crawlPath` (required) - Path to crawl output directory
411
+ - `query` (required) - Query name (see list_seo_queries)
412
+ - `limit` (optional) - Maximum results (default: 100)
413
+
414
+ **Example:**
415
+ ```typescript
416
+ query_seo_data({
417
+ crawlPath: "C:/seo-audits/example.com_2026-02-01_abc123",
418
+ query: "broken-internal-links",
419
+ limit: 50
420
+ })
421
+ ```
422
+
423
+ **Returns:** Query results with affected URLs and context
424
+
425
+ ---
426
+
427
+ ### list_seo_queries
428
+
429
+ Discover available SEO analysis queries.
430
+
431
+ **Parameters:**
432
+ - `category` (optional) - Filter by category
433
+ - `priority` (optional) - Filter by priority level
434
+
435
+ **Example:**
436
+ ```typescript
437
+ list_seo_queries({
438
+ category: "security",
439
+ priority: "HIGH"
440
+ })
441
+ ```
442
+
443
+ **Returns:** List of available queries with descriptions and priorities
444
+
445
+ ---
446
+
447
+ ## Available Queries
448
+
449
+ The analysis engine includes 28 predefined SQL queries organised by category. Each query includes detailed impact analysis and fix recommendations.
450
+
451
+ ### Critical Issues (4 queries)
452
+
453
+ **missing-titles**
454
+ - **What it finds:** Pages without title tags
455
+ - **Why it matters:** Title tags are the most important on-page SEO element. Without them, pages are essentially invisible to search engines.
456
+ - **Fix:** Add unique, descriptive title tags (50-60 characters) to all pages immediately.
457
+
458
+ **broken-internal-links**
459
+ - **What it finds:** Internal links pointing to 404/5xx error pages
460
+ - **Why it matters:** Broken links hurt crawlability and waste crawl budget. They create dead ends for users and search engines.
461
+ - **Fix:** Update or remove broken links. Add redirects for moved pages.
462
+
463
+ **server-errors**
464
+ - **What it finds:** Pages returning 5xx status codes
465
+ - **Why it matters:** Indicates server problems that prevent search engines from indexing content.
466
+ - **Fix:** Investigate server issues, check error logs, ensure adequate resources.
467
+
468
+ **not-found-errors**
469
+ - **What it finds:** Pages returning 404 status codes
470
+ - **Why it matters:** Lost indexing opportunities and poor user experience.
471
+ - **Fix:** Add 301 redirects or remove links to non-existent pages.
472
+
473
+ ### Content Quality (7 queries)
474
+
475
+ **duplicate-titles**
476
+ - **What it finds:** Multiple pages sharing identical title tags
477
+ - **Why it matters:** Confuses search engines about which page to rank for queries.
478
+ - **Fix:** Make each page's title tag unique and descriptive of its specific content.
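As a rough illustration of what a duplicate-title check does, the sketch below groups pages by title and keeps only the groups with more than one URL. The package does this in SQL against its `pages` table. The `PageTitle` shape, the function name, and the choice to compare titles case-insensitively after trimming are all assumptions of mine.

```typescript
// Hypothetical in-memory version of a duplicate-title check.
// The real query runs in SQLite; this only illustrates the grouping logic.
interface PageTitle {
  url: string;
  title: string;
}

function findDuplicateTitles(pages: PageTitle[]): Map<string, string[]> {
  const urlsByTitle = new Map<string, string[]>();
  for (const page of pages) {
    // Assumption: titles differing only by case or padding count as equal.
    const key = page.title.trim().toLowerCase();
    if (key === "") continue; // empty titles belong to the missing-titles check
    const urls = urlsByTitle.get(key) ?? [];
    urls.push(page.url);
    urlsByTitle.set(key, urls);
  }

  // Keep only titles shared by two or more pages.
  const duplicates = new Map<string, string[]>();
  urlsByTitle.forEach((urls, title) => {
    if (urls.length > 1) duplicates.set(title, urls);
  });
  return duplicates;
}
```

The same group-and-filter pattern would apply to the duplicate meta description and duplicate H1 checks, swapping the field used as the key.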
479
+
480
+ **duplicate-descriptions**
481
+ - **What it finds:** Multiple pages with identical meta descriptions
482
+ - **Why it matters:** Reduces click-through rates as snippets look identical in search results.
483
+ - **Fix:** Write unique meta descriptions (150-160 characters) for each page.
484
+
485
+ **duplicate-h1s**
486
+ - **What it finds:** Multiple pages sharing the same H1 heading
487
+ - **Why it matters:** H1 tags signal page topic - duplicates dilute topical clarity.
488
+ - **Fix:** Ensure each page has a unique H1 that accurately describes its content.
489
+
490
+ **missing-descriptions**
491
+ - **What it finds:** Pages without meta description tags
492
+ - **Why it matters:** Search engines create their own snippets, often poorly representing content.
493
+ - **Fix:** Add compelling meta descriptions (150-160 characters) for all important pages.
494
+
495
+ **missing-h1s**
496
+ - **What it finds:** Pages without H1 headings
497
+ - **Why it matters:** H1 is a primary signal of page topic and structure.
498
+ - **Fix:** Add descriptive H1 tags to all content pages.
499
+
500
+ **multiple-h1s**
501
+ - **What it finds:** Pages with more than one H1 tag
502
+ - **Why it matters:** Dilutes topical focus and confuses heading hierarchy.
503
+ - **Fix:** Use only one H1 per page. Convert other H1s to H2 or H3.
504
+
505
+ **thin-content**
506
+ - **What it finds:** Pages with less than 300 words of content
507
+ - **Why it matters:** Thin content provides little value and ranks poorly.
508
+ - **Fix:** Expand content with valuable information or consolidate into existing pages.
509
+
510
+ ### Technical SEO (5 queries)
511
+
512
+ **redirect-pages**
513
+ - **What it finds:** Pages that redirect to other URLs
514
+ - **Why it matters:** Multiple redirects waste crawl budget and slow page loads.
515
+ - **Fix:** Update internal links to point directly to final destination.
516
+
517
+ **redirect-chains**
518
+ - **What it finds:** URLs that redirect multiple times before reaching destination
519
+ - **Why it matters:** Each redirect adds latency and risks breaking the chain.
520
+ - **Fix:** Implement direct redirects from source to final destination.
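To show what "redirects multiple times" means in practice, here is a small sketch that follows a redirect map from a starting URL, counts the hops, and stops if it detects a loop. The `traceRedirects` function and the `Map` of source to target URLs are hypothetical. How the package actually stores and queries redirects is not shown in this README.

```typescript
// Hypothetical redirect tracer: follows source -> target entries until the
// chain ends or revisits a URL. Not the package's actual implementation.
interface RedirectTrace {
  chain: string[]; // every URL visited, starting with the input URL
  hops: number;    // number of redirects followed
  loop: boolean;   // true if the next redirect pointed back into the chain
}

function traceRedirects(start: string, redirects: Map<string, string>): RedirectTrace {
  const chain: string[] = [start];
  const seen = new Set<string>([start]);
  let current = start;
  let loop = false;

  for (;;) {
    const next = redirects.get(current);
    if (next === undefined) break; // reached a URL that does not redirect
    if (seen.has(next)) {
      loop = true; // following this redirect would revisit a URL
      break;
    }
    chain.push(next);
    seen.add(next);
    current = next;
  }

  return { chain, hops: chain.length - 1, loop };
}
```

With this sketch, a result with `hops` greater than 1 would be a chain worth flattening into a single redirect, and `loop: true` marks a cycle that never resolves.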
521
+
522
+ **orphan-pages**
523
+ - **What it finds:** Pages with no internal links pointing to them
524
+ - **Why it matters:** Search engines may never discover orphan pages.
525
+ - **Fix:** Add internal links from relevant pages to connect orphans to site structure.
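The orphan page idea can be sketched as a set difference between crawled pages and link targets. This is a simplified in-memory version with hypothetical names (`InternalLink`, `findOrphanPages`). The package runs the equivalent logic as SQL over its `pages` and `links` tables, whose column names are not documented here. Excluding the start URL and ignoring self-links are my assumptions.

```typescript
// Hypothetical in-memory orphan check: a page is an orphan if no other
// crawled page links to it. Not the package's actual SQL.
interface InternalLink {
  source: string; // URL of the page containing the link
  target: string; // URL the link points to
}

function findOrphanPages(pages: string[], links: InternalLink[], startUrl: string): string[] {
  const linkedTo = new Set<string>();
  for (const link of links) {
    // Assumption: a page linking to itself does not count as an inbound link.
    if (link.source !== link.target) linkedTo.add(link.target);
  }
  // Assumption: the crawl entry point is never reported as an orphan.
  return pages.filter((url) => url !== startUrl && !linkedTo.has(url));
}
```

For example, given pages `/`, `/about` and `/old-promo` where only `/` links to `/about`, the function returns `/old-promo` as the single orphan.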
526
+
527
+ **canonical-issues**
528
+ - **What it finds:** Pages where canonical URL doesn't match actual URL
529
+ - **Why it matters:** Signals duplicate content or indexing preference conflicts.
530
+ - **Fix:** Ensure canonical tags point to the correct version of each page.
531
+
532
+ **non-https-pages**
533
+ - **What it finds:** Pages still using HTTP instead of HTTPS
534
+ - **Why it matters:** Security risk, ranking penalty, and browser warnings.
535
+ - **Fix:** Implement HTTPS across entire site with proper redirects.
536
+
537
+ ### Security (6 queries)
538
+
539
+ **missing-csp**
540
+ - **What it finds:** Pages without Content-Security-Policy headers
541
+ - **Why it matters:** Vulnerability to XSS attacks and code injection.
542
+ - **Fix:** Implement CSP headers to control resource loading.
543
+
544
+ **missing-hsts**
545
+ - **What it finds:** Pages without Strict-Transport-Security headers
546
+ - **Why it matters:** Allows protocol downgrade attacks.
547
+ - **Fix:** Add HSTS headers to enforce HTTPS connections.
548
+
549
+ **missing-x-frame-options**
550
+ - **What it finds:** Pages without X-Frame-Options headers
551
+ - **Why it matters:** Vulnerability to clickjacking attacks.
552
+ - **Fix:** Add X-Frame-Options headers (DENY or SAMEORIGIN).
553
+
554
+ **missing-referrer-policy**
555
+ - **What it finds:** Pages without Referrer-Policy headers
556
+ - **Why it matters:** Potential privacy and security leakage.
557
+ - **Fix:** Implement appropriate referrer policy for your use case.
558
+
559
+ **unsafe-external-links**
560
+ - **What it finds:** Links with target="_blank" but without rel="noopener"
561
+ - **Why it matters:** Security vulnerability allowing opened page to control opener window.
562
+ - **Fix:** Add rel="noopener noreferrer" to all target="_blank" links.
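The rule behind this check is simple enough to sketch directly. The `AnchorAttributes` shape and `isUnsafeBlankLink` function are hypothetical, and the sketch follows the description above literally: a link is flagged when it opens in a new tab and its `rel` attribute lacks the `noopener` token. How the package extracts and stores these attributes is not covered in this README.

```typescript
// Hypothetical check for target="_blank" links that lack rel="noopener".
// Mirrors the rule described above; not the package's actual code.
interface AnchorAttributes {
  href: string;
  target?: string;
  rel?: string;
}

function isUnsafeBlankLink(anchor: AnchorAttributes): boolean {
  // Only links that open a new browsing context are relevant.
  if ((anchor.target ?? "").toLowerCase() !== "_blank") return false;

  // rel is a space-separated token list, so split before comparing.
  const relTokens = (anchor.rel ?? "")
    .toLowerCase()
    .split(/\s+/)
    .filter((token) => token.length > 0);

  return !relTokens.some((token) => token === "noopener");
}
```

Splitting `rel` into tokens matters because values such as `rel="nofollow noopener"` are common, and a plain string comparison against `"noopener"` would wrongly flag them.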
563
+
564
+ **protocol-relative-links**
565
+ - **What it finds:** Links using // instead of https://
566
+ - **Why it matters:** Can cause mixed content issues and security warnings.
567
+ - **Fix:** Use absolute HTTPS URLs for all external resources.
568
+
569
+ ### Optimisation Opportunities (6 queries)
570
+
571
+ **title-length-issues**
572
+ - **What it finds:** Title tags shorter than 30 characters or longer than 60
573
+ - **Why it matters:** Titles that are too short waste an opportunity; titles that are too long get truncated in search results.
574
+ - **Fix:** Aim for 50-60 characters for optimal display in search results.
575
+
576
+ **description-length-issues**
577
+ - **What it finds:** Meta descriptions shorter than 120 or longer than 160 characters
578
+ - **Why it matters:** Poor descriptions reduce click-through rates.
579
+ - **Fix:** Write descriptions between 150-160 characters for full display.
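The two length checks above share the same shape, so they can be sketched with one helper. The thresholds (30 to 60 characters for titles, 120 to 160 for meta descriptions) come from the query descriptions above. The names `classifyLength`, `TITLE_BOUNDS` and `DESCRIPTION_BOUNDS` are hypothetical, and trimming whitespace before measuring is my assumption.

```typescript
// Hypothetical length classifier using the thresholds described above.
// The package implements these checks in SQL; this is only an illustration.
type LengthIssue = "too-short" | "too-long" | null;

interface LengthBounds {
  min: number; // shortest acceptable length, inclusive
  max: number; // longest acceptable length, inclusive
}

const TITLE_BOUNDS: LengthBounds = { min: 30, max: 60 };
const DESCRIPTION_BOUNDS: LengthBounds = { min: 120, max: 160 };

function classifyLength(text: string, bounds: LengthBounds): LengthIssue {
  // Assumption: surrounding whitespace does not count towards the length.
  const length = text.trim().length;
  if (length < bounds.min) return "too-short";
  if (length > bounds.max) return "too-long";
  return null; // within bounds, nothing to report
}
```

For instance, `classifyLength("Home", TITLE_BOUNDS)` returns `"too-short"`, and a 60-character title returns `null` because both bounds are treated as inclusive.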
580
+
581
+ **title-equals-h1**
582
+ - **What it finds:** Pages where title tag matches H1 exactly
583
+ - **Why it matters:** Missed opportunity to target different keywords or angles.
584
+ - **Fix:** Make title and H1 complementary but not identical for broader keyword coverage.
585
+
586
+ **no-outbound-links**
587
+ - **What it finds:** Pages with zero external links
588
+ - **Why it matters:** Can appear spammy or siloed; linking to quality sources builds trust.
589
+ - **Fix:** Add relevant external links to authoritative sources where appropriate.
590
+
591
+ **high-external-links**
592
+ - **What it finds:** Pages with excessive external links (20+)
593
+ - **Why it matters:** Can appear spammy and leaks PageRank unnecessarily.
594
+ - **Fix:** Reduce external links to most relevant and valuable resources.
595
+
596
+ **missing-images**
597
+ - **What it finds:** Pages without any images
598
+ - **Why it matters:** Images improve engagement and provide additional ranking signals.
599
+ - **Fix:** Add relevant, optimized images with proper alt text.
600
+
601
+ ---
602
+
603
+ ### Using Queries
604
+
605
+ **In Claude Desktop:**
606
+ ```
607
+ List all available queries
608
+ Show me the critical queries only
609
+ Run the missing-titles query on my crawl
610
+ What does the orphan-pages query check for?
611
+ ```
612
+
613
+ **In CLI:**
614
+ ```bash
615
+ # List all queries
616
+ seo-crawler-mcp queries
617
+
618
+ # Filter by category
619
+ seo-crawler-mcp queries --category=security
620
+
621
+ # Filter by priority
622
+ seo-crawler-mcp queries --priority=CRITICAL
623
+ ```
624
+
625
+ Each query returns:
626
+ - Affected URLs
627
+ - Relevant context (word count, status codes, etc.)
628
+ - Count of affected pages
629
+ - Results organised by severity
630
+
631
+ ---
632
+
633
+ ## Development
634
+
635
+ ```bash
636
+ # Build
637
+ npm run build
638
+
639
+ # Development mode
640
+ npm run dev
641
+
642
+ # Run tests
643
+ npm test
644
+ ```
645
+
646
+ ---
647
+
648
+ ## Version History
649
+
650
+ ### v2.0.1 (2026-02-02)
651
+ - Fixed MemoryStorage cleanup bug (added explicit purge in finally block)
652
+ - Added CLI mode for terminal-based crawling
653
+ - Removed proprietary tool references from documentation
654
+ - Guarantees a fresh state between consecutive crawls
655
+
656
+ ### v2.0.0 (2026-02-01)
657
+ - Added comprehensive SQL-based analysis engine
658
+ - 28 SEO queries covering industry-standard audit requirements
659
+ - Three analysis tools: analyze_seo, query_seo_data, list_seo_queries
660
+ - 86% coverage of standard SEO audit requirements
661
+
662
+ ### v1.1.0 (2026-02-01)
663
+ - Enhanced data collection with security headers
664
+ - Heading structure validation (H1-H6)
665
+ - Link security analysis
666
+ - Response time accuracy improvements
667
+
668
+ ### v1.0.0 (2026-01-31)
669
+ - Initial release with SQLite storage
670
+ - LibreCrawl pattern implementation
671
+ - Basic crawl tool (run_seo_audit)
672
+
673
+ ---
674
+
675
+ ## Licence
676
+
677
+ Apache License 2.0
678
+
679
+ Copyright 2026 Richard Baxter
680
+
681
+ This product includes software developed by Apify and the Crawlee project.
682
+ See NOTICE file for details.
683
+
684
+ ---
685
+
686
+ ## Support
687
+
688
+ **GitHub:** https://github.com/houtini-ai/seo-crawler-mcp
689
+ **Issues:** https://github.com/houtini-ai/seo-crawler-mcp/issues
690
+ **Author:** Richard Baxter <hello@houtini.com>
691
+
692
+ ---
693
+
694
+ **Tags:** seo, crawler, audit, technical-seo, mcp, crawlee, sqlite, web-scraping, site-analysis