pagesight 0.1.0 → 0.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,20 +1,22 @@
 # Pagesight
 
-Google's data + AI crawler intelligence, your AI assistant's hands.
+[![npm version](https://img.shields.io/npm/v/pagesight.svg)](https://www.npmjs.com/package/pagesight)
 
-An open-source MCP server that gives AI assistants direct access to Google Search Console, PageSpeed Insights, Chrome UX Report, and a robots.txt analyzer that audits 139+ AI crawlers. No made-up rules. No invented scores. Just data from authoritative sources.
+See your site the way search engines and AI see it.
 
-Most SEO tools flag "title over 60 characters" and "only one H1 allowed." [Google's own engineers say those rules don't exist.](#why-not-other-seo-tools) Pagesight skips the myths and asks the sources directly.
+```bash
+npm install pagesight
+```
 
-## Tools
+Most SEO tools flag "title over 60 characters" and "only one H1 allowed." [Google's own engineers say those rules don't exist.](#why-not-other-seo-tools) Pagesight skips the myths and goes to the sources.
 
-Eight tools. Three Google APIs. 139+ AI bots tracked. One install.
+## Tools
 
 ### `inspect`
 
 Ask Google: is this page indexed? What canonical did you choose? Any crawl errors? Structured data issues?
 
-Returns index status, canonical (yours vs Google's), crawl status, rich results validation, sitemaps, and referring URLs — directly from Google's index.
+Returns index status, canonical (yours vs Google's), crawl status, rich results validation, sitemaps, and referring URLs.
 
 ### `pagespeed`
 
@@ -22,10 +24,9 @@ Run Google Lighthouse on any URL:
 
 - **Scores**: performance, accessibility, best-practices, seo
 - **Core Web Vitals (lab)**: FCP, LCP, TBT, CLS, Speed Index, TTI
-- **CrUX field data**: real Chrome user metrics when available (page + origin)
+- **CrUX field data**: real Chrome user metrics (page + origin)
 - **Opportunities**: ranked by severity with potential savings
 - **Strategy**: `mobile` or `desktop`
-- **Locale**: localized results (e.g., `pt-BR`)
 
 ### `crux`
 
@@ -33,56 +34,50 @@ Real-world Core Web Vitals from Chrome users (28-day rolling window):
 
 - **Metrics**: LCP, FCP, INP, CLS, TTFB, RTT, navigation types, form factors
 - **Granularity**: by URL or origin, by device (DESKTOP, PHONE, TABLET)
-- **Data**: p75 values + histogram distributions (good/needs improvement/poor)
+- **Data**: p75 values + histogram distributions
 
 ### `crux_history`
 
 Core Web Vitals trends over time — up to 40 weekly data points (~10 months):
 
 - Trend detection (improved/stable/worse) with percentage change
-- Recent data points table for LCP, INP, CLS
+- Recent data points table for core metrics
 - Custom period count (1-40)
 
 ### `performance`
 
-Google Search Console search analytics with full API coverage:
+Google Search Console search analytics:
 
 - **Dimensions**: `query`, `page`, `country`, `device`, `date`, `searchAppearance`, `hour`
 - **Search types**: `web`, `image`, `video`, `news`, `discover`, `googleNews`
 - **Filters**: `equals`, `contains`, `notEquals`, `notContains`, `includingRegex`, `excludingRegex`
-- **Aggregation**: `auto`, `byPage`, `byProperty`, `byNewsShowcasePanel`
-- **Data freshness**: `all`, `final`, `hourly_all`
-- **Pagination**: up to 25,000 rows with offset
+- **Pagination**: up to 25,000 rows
 
 ### `robots`
 
-Fetch and analyze any site's robots.txt:
+Analyze any site's robots.txt:
 
 - **Syntax validation** per [RFC 9309](https://www.rfc-editor.org/rfc/rfc9309)
-- **AI crawler audit** — checks 139+ bots from the [ai-robots-txt](https://github.com/ai-robots-txt/ai.robots.txt) community registry
+- **AI crawler audit** — 139+ bots from the [ai-robots-txt](https://github.com/ai-robots-txt/ai.robots.txt) community registry
 - **Bot categories**: training scrapers, AI search crawlers, AI assistants, AI agents
-- **Per-bot status**: blocked or allowed, with the matched rule and group
-- **Path checking**: is a specific path allowed for a specific user-agent?
-- **Sitemaps**: lists all sitemaps declared in robots.txt
+- **Per-bot status**: blocked or allowed, with the matched rule
 
 ```
-=== robots.txt: https://www.cnn.com ===
-AI Crawlers: 55 blocked, 84 allowed (of 139 known)
-Source: github.com/ai-robots-txt/ai.robots.txt
+=== robots.txt: https://www.nytimes.com ===
+AI Crawlers: 35 blocked, 104 allowed (of 139 known)
 
 BLOCKED GPTBot (OpenAI) — GPT model training
 BLOCKED ClaudeBot (Anthropic) — Claude model training
 ALLOWED Claude-User (Anthropic) — User-initiated fetching
-BLOCKED PerplexityBot (Perplexity) — Search indexing
 ```
 
 ### `sitemaps`
 
 Search Console properties and sitemaps (read-only):
 
-- `list_sites` — all GSC properties with permission level
+- `list_sites` — all properties with permission level
 - `get_site` — details for a specific property
-- `list_sitemaps` — sitemaps with error/warning counts and content types
+- `list_sitemaps` — sitemaps with error/warning counts
 - `get_sitemap` — full details for a specific sitemap
 
 ### `setup`
@@ -95,23 +90,11 @@ Check auth status or walk through OAuth interactively.
 
 1. Go to [Google Cloud Console](https://console.cloud.google.com/)
 2. Create a project (or use existing)
-3. Enable three APIs:
-   - **Google Search Console API**
-   - **PageSpeed Insights API**
-   - **Chrome UX Report API**
+3. Enable: **Search Console API**, **PageSpeed Insights API**, **Chrome UX Report API**
 4. Create **OAuth client ID** (Desktop app) — for Search Console
 5. Create **API key** — for PageSpeed and CrUX
 
-### 2. Authorize Search Console
-
-Use the `setup` tool to walk through OAuth, or manually:
-
-1. Visit the auth URL with your client ID
-2. Authorize access to Search Console
-3. Copy the code from the redirect URL
-4. Exchange it for a refresh token
-
-### 3. Configure
+### 2. Configure
 
 ```env
 GSC_CLIENT_ID=your-client-id.apps.googleusercontent.com
@@ -120,11 +103,9 @@ GSC_REFRESH_TOKEN=your-refresh-token
 GOOGLE_API_KEY=your-api-key
 ```
 
-Note: The `robots` tool works without any credentials — it fetches the public `/robots.txt` file directly.
-
-## Usage
+The `robots` tool works without any credentials.
 
-Add to Claude Code, Cursor, or any MCP client:
+### 3. Use with your AI assistant
 
 ```json
 {
@@ -143,37 +124,33 @@ Add to Claude Code, Cursor, or any MCP client:
 }
 ```
 
-Then just talk to your AI assistant:
+Then just ask:
 
 ```
 "Is https://mysite.com indexed?"
-"What canonical did Google choose for this page?"
-"Run pagespeed on my homepage, mobile"
-"Show me CrUX data for my site on phones"
-"How have my Core Web Vitals changed over the last 10 months?"
-"Which queries bring traffic to this page?"
+"Run pagespeed on my homepage"
 "Which AI crawlers can access my site?"
-"Is GPTBot blocked on reddit.com?"
-"Any sitemap errors?"
+"How have my Core Web Vitals changed?"
+"Which queries bring traffic to this page?"
 ```
 
 ## Why not other SEO tools?
 
-We researched every common SEO "rule" against official Google documentation. Most are myths:
+We checked every common SEO "rule" against official Google documentation:
 
-- **"Title must be under 60 characters"** — Google: "there's no limit." Gary Illyes called it "an externally made-up metric."
+- **"Title must be under 60 characters"** — Google: "there's no limit." Gary Illyes: "an externally made-up metric."
 - **"Meta description must be 155 characters"** — Google: "there's no limit on how long a meta description can be."
 - **"Only one H1 per page"** — John Mueller: "You can use H1 tags as often as you want. There's no limit."
-- **"Minimum 300 words per page"** — Mueller: "the number of words on a page is not a quality factor, not a ranking factor."
+- **"Minimum 300 words per page"** — Mueller: "the number of words on a page is not a quality factor."
 - **"Text-to-HTML ratio matters"** — Mueller: "it makes absolutely no sense at all for SEO."
 
-Tools that flag these "issues" are reporting their opinions, not data. Pagesight only reports what authoritative sources actually return — Google's APIs for search data, RFC 9309 for robots.txt, and a community-maintained registry for AI crawlers.
+Tools that flag these are reporting their opinions. Pagesight only reports what the sources actually return.
 
 ## Development
 
 ```bash
-bun install    # install dependencies
-bun run start  # start MCP server
+bun install
+bun run start  # start server
 bun run lint   # biome check
 bun run format # biome format
 ```
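The per-bot blocked/allowed decisions promised in the README above follow RFC 9309's precedence rule: of all `allow`/`disallow` rules matching a path, the longest match wins, and ties go to `allow`. A minimal TypeScript sketch of that rule — illustrative only, not Pagesight's implementation (`isPathAllowed` and `Rule` are hypothetical names, and the sketch handles only literal path prefixes, not `*` wildcards or `$` anchors):

```typescript
// Hypothetical sketch of RFC 9309 rule precedence: longest matching
// rule wins; on equal length, "allow" beats "disallow"; no match = allowed.
type Rule = { type: "allow" | "disallow"; path: string };

function ruleMatches(rulePath: string, path: string): boolean {
  // Literal prefix match only; real parsers also handle "*" and "$".
  return path.startsWith(rulePath);
}

function isPathAllowed(rules: Rule[], path: string): boolean {
  let best: Rule | null = null;
  for (const rule of rules) {
    if (!ruleMatches(rule.path, path)) continue;
    if (
      best === null ||
      rule.path.length > best.path.length ||
      (rule.path.length === best.path.length && rule.type === "allow")
    ) {
      best = rule;
    }
  }
  // No matching rule means the path is crawlable by default.
  return best === null || best.type === "allow";
}

const rules: Rule[] = [
  { type: "disallow", path: "/private/" },
  { type: "allow", path: "/private/press/" },
];

console.log(isPathAllowed(rules, "/private/data"));       // false
console.log(isPathAllowed(rules, "/private/press/2024")); // true
console.log(isPathAllowed(rules, "/blog/post"));          // true
```

This is why `/private/press/2024` stays crawlable even though `/private/` is disallowed: the more specific `allow` rule outranks the shorter `disallow`.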
package/package.json CHANGED
@@ -1,7 +1,23 @@
 {
   "name": "pagesight",
-  "version": "0.1.0",
-  "description": "Google's data + AI crawler intelligence for AI assistants. MCP server for SEO, GEO, and web performance.",
+  "version": "0.2.1",
+  "description": "See your site the way search engines and AI see it.",
+  "keywords": [
+    "seo",
+    "geo",
+    "google-search-console",
+    "pagespeed-insights",
+    "core-web-vitals",
+    "web-performance",
+    "robots-txt",
+    "ai-crawlers",
+    "mcp"
+  ],
+  "files": [
+    "src",
+    "README.md",
+    "LICENSE"
+  ],
   "type": "module",
   "main": "src/index.ts",
   "bin": {
@@ -9,9 +25,9 @@
   },
   "repository": {
     "type": "git",
-    "url": "git+https://github.com/caiopizzol/sitelint.git"
+    "url": "git+https://github.com/caiopizzol/pagesight.git"
   },
-  "homepage": "https://github.com/caiopizzol/sitelint",
+  "homepage": "https://github.com/caiopizzol/pagesight",
   "author": "Caio Pizzol",
   "scripts": {
     "start": "bun run src/index.ts",
@@ -2,31 +2,15 @@ import type { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
 import { z } from "zod";
 import { auditAiCrawlers, type CrawlerStatus, fetchRobotsTxt, isAllowed, type RobotsTxt } from "../lib/robots.js";
 
-function formatCrawlerStatus(statuses: CrawlerStatus[]): string {
-  const lines: string[] = [];
-
-  // Group by category
-  const categories = new Map<string, CrawlerStatus[]>();
-  for (const s of statuses) {
-    const cat = s.category;
-    if (!categories.has(cat)) categories.set(cat, []);
-    categories.get(cat)?.push(s);
-  }
-
-  for (const [cat, bots] of categories) {
-    lines.push(`--- ${cat} (${bots.length}) ---`, "");
-
-    for (const bot of bots) {
-      const status = bot.allowed ? "ALLOWED" : "BLOCKED";
-      lines.push(` ${status} ${bot.name} (${bot.company})`);
-      if (bot.matchedRule) {
-        lines.push(` Rule: ${bot.matchedRule.type}: ${bot.matchedRule.path} (group: ${bot.matchedGroup})`);
-      }
-    }
-    lines.push("");
-  }
-
-  return lines.join("\n").trimEnd();
+// Normalize the messy registry categories into clean buckets
+function normalizeCategory(raw: string): string {
+  const lower = raw.toLowerCase();
+  if (lower.includes("training") || lower.includes("train") || lower.includes("scrape") || lower.includes("dataset"))
+    return "Training";
+  if (lower.includes("search") && !lower.includes("assistant")) return "Search";
+  if (lower.includes("assistant") || lower.includes("user prompt") || lower.includes("user quer")) return "Assistant";
+  if (lower.includes("agent")) return "Agent";
+  return "Other";
 }
 
 function formatRobotsAudit(origin: string, robots: RobotsTxt, statusCode: number, crawlers: CrawlerStatus[]): string {
@@ -63,17 +47,67 @@ function formatRobotsAudit(origin: string, robots: RobotsTxt, statusCode: number
   }
 
   // AI Crawler audit
-  const allowed = crawlers.filter((c) => c.allowed);
   const blocked = crawlers.filter((c) => !c.allowed);
+  const allowed = crawlers.filter((c) => c.allowed);
 
   lines.push(
     "",
     `--- AI Crawlers: ${blocked.length} blocked, ${allowed.length} allowed (of ${crawlers.length} known) ---`,
     "",
-    `Source: github.com/ai-robots-txt/ai.robots.txt (${crawlers.length} bots)`,
-    "",
+    `Source: github.com/ai-robots-txt/ai.robots.txt`,
   );
-  lines.push(formatCrawlerStatus(crawlers));
+
+  if (blocked.length === 0) {
+    lines.push("", "All 139 known AI crawlers are allowed. No bots are explicitly blocked.");
+  } else if (blocked.length === crawlers.length) {
+    lines.push("", "All known AI crawlers are blocked.");
+    // Show how they're blocked
+    const byGroup = new Map<string, string[]>();
+    for (const bot of blocked) {
+      const group = bot.matchedGroup ?? "wildcard";
+      if (!byGroup.has(group)) byGroup.set(group, []);
+      byGroup.get(group)?.push(bot.name);
+    }
+    for (const [group, bots] of byGroup) {
+      lines.push(` via group "${group}": ${bots.length} bots`);
+    }
+  } else {
+    // Mixed — show blocked bots in detail, grouped by normalized category
+    lines.push("");
+
+    const blockedByCategory = new Map<string, CrawlerStatus[]>();
+    for (const bot of blocked) {
+      const cat = normalizeCategory(bot.category);
+      if (!blockedByCategory.has(cat)) blockedByCategory.set(cat, []);
+      blockedByCategory.get(cat)?.push(bot);
+    }
+
+    const categoryOrder = ["Training", "Search", "Assistant", "Agent", "Other"];
+    for (const cat of categoryOrder) {
+      const bots = blockedByCategory.get(cat);
+      if (!bots) continue;
+
+      lines.push(`Blocked ${cat} (${bots.length}):`);
+      for (const bot of bots) {
+        lines.push(` BLOCKED ${bot.name} (${bot.company})`);
+      }
+      lines.push("");
+    }
+
+    // Summary of allowed by category
+    const allowedByCategory = new Map<string, number>();
+    for (const bot of allowed) {
+      const cat = normalizeCategory(bot.category);
+      allowedByCategory.set(cat, (allowedByCategory.get(cat) ?? 0) + 1);
+    }
+
+    const allowedSummary = categoryOrder
+      .filter((cat) => allowedByCategory.has(cat))
+      .map((cat) => `${cat}: ${allowedByCategory.get(cat)}`)
+      .join(", ");
+
+    lines.push(`Allowed (${allowed.length}): ${allowedSummary}`);
+  }
 
   return lines.join("\n");
 }
@@ -81,7 +115,7 @@ function formatRobotsAudit(origin: string, robots: RobotsTxt, statusCode: number
 export function registerRobotsTool(server: McpServer): void {
   server.tool(
     "robots",
-    "Fetch and analyze a site's robots.txt. Validates syntax per RFC 9309, audits AI crawler access (130+ bots from ai-robots-txt registry), lists sitemaps, and reports blocked vs allowed bots by category.",
+    "Fetch and analyze a site's robots.txt. Validates syntax per RFC 9309, audits AI crawler access (139+ bots), lists sitemaps. Shows blocked bots in detail, summarizes allowed.",
     {
       url: z
         .string()
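The `normalizeCategory` bucketing added in this diff can be exercised standalone. A sketch that copies the function verbatim and tallies buckets with the same `Map` pattern the audit uses — the sample category strings are hypothetical stand-ins, not actual ai-robots-txt registry values:

```typescript
// Standalone copy of normalizeCategory from the diff above.
function normalizeCategory(raw: string): string {
  const lower = raw.toLowerCase();
  if (lower.includes("training") || lower.includes("train") || lower.includes("scrape") || lower.includes("dataset"))
    return "Training";
  if (lower.includes("search") && !lower.includes("assistant")) return "Search";
  if (lower.includes("assistant") || lower.includes("user prompt") || lower.includes("user quer")) return "Assistant";
  if (lower.includes("agent")) return "Agent";
  return "Other";
}

// Hypothetical category strings, one per expected bucket.
const samples = [
  "AI Data Scraper",        // "scrape"      -> Training
  "AI Search Crawler",      // "search"      -> Search
  "Fetches on user prompt", // "user prompt" -> Assistant
  "Autonomous AI Agent",    // "agent"       -> Agent
  "Undocumented",           // no keyword    -> Other
];

// Tally per bucket, as the mixed-case audit branch does for allowed bots.
const counts = new Map<string, number>();
for (const s of samples) {
  const cat = normalizeCategory(s);
  counts.set(cat, (counts.get(cat) ?? 0) + 1);
}
console.log([...counts.entries()].map(([c, n]) => `${c}: ${n}`).join(", "));
// Training: 1, Search: 1, Assistant: 1, Agent: 1, Other: 1
```

Note the ordering of the checks matters: "AI Search Crawler" would also fail the Training test first, and the `!lower.includes("assistant")` guard keeps assistant-flavored search strings out of the Search bucket.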
@@ -1,40 +0,0 @@
-name: CI
-
-on:
-  push:
-    branches: [main]
-  pull_request:
-    branches: [main]
-
-jobs:
-  check:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v4
-      - uses: oven-sh/setup-bun@v2
-      - run: bun install --frozen-lockfile
-      - run: bunx biome check src/
-      - run: bunx tsc --noEmit
-
-  release:
-    needs: check
-    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
-    runs-on: ubuntu-latest
-    permissions:
-      contents: write
-      issues: write
-      pull-requests: write
-    steps:
-      - uses: actions/checkout@v4
-        with:
-          fetch-depth: 0
-      - uses: actions/setup-node@v4
-        with:
-          node-version: 22
-      - uses: oven-sh/setup-bun@v2
-      - run: bun install --frozen-lockfile
-      - run: npx semantic-release
-        env:
-          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-          NPM_TOKEN: ${{ secrets.NPM_TOKEN }}
-          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
package/.releaserc.json DELETED
@@ -1,10 +0,0 @@
-{
-  "branches": ["main"],
-  "plugins": [
-    "@semantic-release/commit-analyzer",
-    "semantic-release-ai-notes",
-    "@semantic-release/npm",
-    "@semantic-release/github",
-    "@semantic-release/git"
-  ]
-}
package/CLAUDE.md DELETED
@@ -1,52 +0,0 @@
1
- # Pagesight
2
-
3
- MCP server for SEO, GEO, and web performance analysis. npm package: `pagesight`.
4
-
5
- ## Stack
6
-
7
- - **Runtime**: Bun (not Node.js)
8
- - **Language**: TypeScript
9
- - **MCP SDK**: `@modelcontextprotocol/sdk`
10
- - **Linter**: Biome
11
- - **Git hooks**: Lefthook (pre-commit: biome + tsc)
12
-
13
- ## Architecture
14
-
15
- ```
16
- src/
17
- index.ts # MCP server entry, registers all tools
18
- lib/
19
- auth.ts # OAuth 2.0 + Service Account auth for GSC
20
- gsc.ts # Google Search Console API client
21
- psi.ts # PageSpeed Insights API client
22
- crux.ts # Chrome UX Report API client
23
- tools/
24
- inspect.ts # URL Inspection tool
25
- pagespeed.ts # PageSpeed Insights tool
26
- crux.ts # CrUX + CrUX History tools
27
- performance.ts # Search Analytics tool
28
- sitemaps.ts # Sites + Sitemaps tool
29
- setup.ts # Auth setup helper
30
- ```
31
-
32
- ## APIs Used
33
-
34
- | API | Auth | Env Var |
35
- |-----|------|---------|
36
- | Google Search Console | OAuth 2.0 / Service Account | `GSC_CLIENT_ID`, `GSC_CLIENT_SECRET`, `GSC_REFRESH_TOKEN` |
37
- | PageSpeed Insights v5 | API key (optional) | `GOOGLE_API_KEY` |
38
- | Chrome UX Report | API key (required) | `GOOGLE_API_KEY` |
39
-
40
- ## Commands
41
-
42
- - `bun run src/index.ts` — start MCP server
43
- - `bun run lint` — biome check
44
- - `bun run format` — biome format
45
- - `bun test` — run tests
46
-
47
- ## Conventions
48
-
49
- - All tools have try/catch error handling with clean error messages
50
- - Use Bun built-in APIs over third-party packages
51
- - No HTML parsing or on-page analysis — only authoritative data sources
52
- - Every check must be backed by an official API or standard (Google APIs, RFC 9309), not industry conventions
package/biome.json DELETED
@@ -1,14 +0,0 @@
1
- {
2
- "$schema": "https://biomejs.dev/schemas/2.4.10/schema.json",
3
- "files": {
4
- "includes": ["src/**"]
5
- },
6
- "formatter": {
7
- "indentStyle": "space",
8
- "indentWidth": 2,
9
- "lineWidth": 120
10
- },
11
- "linter": {
12
- "enabled": true
13
- }
14
- }