pagesight 0.1.0 → 0.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,20 +1,22 @@
 # Pagesight
 
-Google's data + AI crawler intelligence, your AI assistant's hands.
+[![npm version](https://img.shields.io/npm/v/pagesight.svg)](https://www.npmjs.com/package/pagesight)
 
-An open-source MCP server that gives AI assistants direct access to Google Search Console, PageSpeed Insights, Chrome UX Report, and a robots.txt analyzer that audits 139+ AI crawlers. No made-up rules. No invented scores. Just data from authoritative sources.
+See your site the way search engines and AI see it.
 
-Most SEO tools flag "title over 60 characters" and "only one H1 allowed." [Google's own engineers say those rules don't exist.](#why-not-other-seo-tools) Pagesight skips the myths and asks the sources directly.
+```bash
+npm install pagesight
+```
 
-## Tools
+Most SEO tools flag "title over 60 characters" and "only one H1 allowed." [Google's own engineers say those rules don't exist.](#why-not-other-seo-tools) Pagesight skips the myths and goes to the sources.
 
-Eight tools. Three Google APIs. 139+ AI bots tracked. One install.
+## Tools
 
 ### `inspect`
 
 Ask Google: is this page indexed? What canonical did you choose? Any crawl errors? Structured data issues?
 
-Returns index status, canonical (yours vs Google's), crawl status, rich results validation, sitemaps, and referring URLs — directly from Google's index.
+Returns index status, canonical (yours vs Google's), crawl status, rich results validation, sitemaps, and referring URLs.
 
 ### `pagespeed`
 
@@ -22,10 +24,9 @@ Run Google Lighthouse on any URL:
 
 - **Scores**: performance, accessibility, best-practices, seo
 - **Core Web Vitals (lab)**: FCP, LCP, TBT, CLS, Speed Index, TTI
-- **CrUX field data**: real Chrome user metrics when available (page + origin)
+- **CrUX field data**: real Chrome user metrics (page + origin)
 - **Opportunities**: ranked by severity with potential savings
 - **Strategy**: `mobile` or `desktop`
-- **Locale**: localized results (e.g., `pt-BR`)
 
 ### `crux`
 
@@ -33,56 +34,50 @@ Real-world Core Web Vitals from Chrome users (28-day rolling window):
 
 - **Metrics**: LCP, FCP, INP, CLS, TTFB, RTT, navigation types, form factors
 - **Granularity**: by URL or origin, by device (DESKTOP, PHONE, TABLET)
-- **Data**: p75 values + histogram distributions (good/needs improvement/poor)
+- **Data**: p75 values + histogram distributions
 
 ### `crux_history`
 
 Core Web Vitals trends over time — up to 40 weekly data points (~10 months):
 
 - Trend detection (improved/stable/worse) with percentage change
-- Recent data points table for LCP, INP, CLS
+- Recent data points table for core metrics
 - Custom period count (1-40)
 
 ### `performance`
 
-Google Search Console search analytics with full API coverage:
+Google Search Console search analytics:
 
 - **Dimensions**: `query`, `page`, `country`, `device`, `date`, `searchAppearance`, `hour`
 - **Search types**: `web`, `image`, `video`, `news`, `discover`, `googleNews`
 - **Filters**: `equals`, `contains`, `notEquals`, `notContains`, `includingRegex`, `excludingRegex`
-- **Aggregation**: `auto`, `byPage`, `byProperty`, `byNewsShowcasePanel`
-- **Data freshness**: `all`, `final`, `hourly_all`
-- **Pagination**: up to 25,000 rows with offset
+- **Pagination**: up to 25,000 rows
 
 ### `robots`
 
-Fetch and analyze any site's robots.txt:
+Analyze any site's robots.txt:
 
 - **Syntax validation** per [RFC 9309](https://www.rfc-editor.org/rfc/rfc9309)
-- **AI crawler audit** — checks 139+ bots from the [ai-robots-txt](https://github.com/ai-robots-txt/ai.robots.txt) community registry
+- **AI crawler audit** — 139+ bots from the [ai-robots-txt](https://github.com/ai-robots-txt/ai.robots.txt) community registry
 - **Bot categories**: training scrapers, AI search crawlers, AI assistants, AI agents
-- **Per-bot status**: blocked or allowed, with the matched rule and group
-- **Path checking**: is a specific path allowed for a specific user-agent?
-- **Sitemaps**: lists all sitemaps declared in robots.txt
+- **Per-bot status**: blocked or allowed, with the matched rule
 
 ```
-=== robots.txt: https://www.cnn.com ===
-AI Crawlers: 55 blocked, 84 allowed (of 139 known)
-Source: github.com/ai-robots-txt/ai.robots.txt
+=== robots.txt: https://www.nytimes.com ===
+AI Crawlers: 35 blocked, 104 allowed (of 139 known)
 
 BLOCKED GPTBot (OpenAI) — GPT model training
 BLOCKED ClaudeBot (Anthropic) — Claude model training
 ALLOWED Claude-User (Anthropic) — User-initiated fetching
-BLOCKED PerplexityBot (Perplexity) — Search indexing
 ```
 
 ### `sitemaps`
 
 Search Console properties and sitemaps (read-only):
 
-- `list_sites` — all GSC properties with permission level
+- `list_sites` — all properties with permission level
 - `get_site` — details for a specific property
-- `list_sitemaps` — sitemaps with error/warning counts and content types
+- `list_sitemaps` — sitemaps with error/warning counts
 - `get_sitemap` — full details for a specific sitemap
 
 ### `setup`
@@ -95,23 +90,11 @@ Check auth status or walk through OAuth interactively.
 
 1. Go to [Google Cloud Console](https://console.cloud.google.com/)
 2. Create a project (or use existing)
-3. Enable three APIs:
-   - **Google Search Console API**
-   - **PageSpeed Insights API**
-   - **Chrome UX Report API**
+3. Enable: **Search Console API**, **PageSpeed Insights API**, **Chrome UX Report API**
 4. Create **OAuth client ID** (Desktop app) — for Search Console
 5. Create **API key** — for PageSpeed and CrUX
 
-### 2. Authorize Search Console
-
-Use the `setup` tool to walk through OAuth, or manually:
-
-1. Visit the auth URL with your client ID
-2. Authorize access to Search Console
-3. Copy the code from the redirect URL
-4. Exchange it for a refresh token
-
-### 3. Configure
+### 2. Configure
 
 ```env
 GSC_CLIENT_ID=your-client-id.apps.googleusercontent.com
@@ -120,11 +103,9 @@ GSC_REFRESH_TOKEN=your-refresh-token
 GOOGLE_API_KEY=your-api-key
 ```
 
-Note: The `robots` tool works without any credentials — it fetches the public `/robots.txt` file directly.
-
-## Usage
+The `robots` tool works without any credentials.
 
-Add to Claude Code, Cursor, or any MCP client:
+### 3. Use with your AI assistant
 
 ```json
 {
@@ -143,37 +124,33 @@ Add to Claude Code, Cursor, or any MCP client:
 }
 ```
 
-Then just talk to your AI assistant:
+Then just ask:
 
 ```
 "Is https://mysite.com indexed?"
-"What canonical did Google choose for this page?"
-"Run pagespeed on my homepage, mobile"
-"Show me CrUX data for my site on phones"
-"How have my Core Web Vitals changed over the last 10 months?"
-"Which queries bring traffic to this page?"
+"Run pagespeed on my homepage"
 "Which AI crawlers can access my site?"
-"Is GPTBot blocked on reddit.com?"
-"Any sitemap errors?"
+"How have my Core Web Vitals changed?"
+"Which queries bring traffic to this page?"
 ```
 
 ## Why not other SEO tools?
 
-We researched every common SEO "rule" against official Google documentation. Most are myths:
+We checked every common SEO "rule" against official Google documentation:
 
-- **"Title must be under 60 characters"** — Google: "there's no limit." Gary Illyes called it "an externally made-up metric."
+- **"Title must be under 60 characters"** — Google: "there's no limit." Gary Illyes: "an externally made-up metric."
 - **"Meta description must be 155 characters"** — Google: "there's no limit on how long a meta description can be."
 - **"Only one H1 per page"** — John Mueller: "You can use H1 tags as often as you want. There's no limit."
-- **"Minimum 300 words per page"** — Mueller: "the number of words on a page is not a quality factor, not a ranking factor."
+- **"Minimum 300 words per page"** — Mueller: "the number of words on a page is not a quality factor."
 - **"Text-to-HTML ratio matters"** — Mueller: "it makes absolutely no sense at all for SEO."
 
-Tools that flag these "issues" are reporting their opinions, not data. Pagesight only reports what authoritative sources actually return — Google's APIs for search data, RFC 9309 for robots.txt, and a community-maintained registry for AI crawlers.
+Tools that flag these are reporting their opinions. Pagesight only reports what the sources actually return.
 
 ## Development
 
 ```bash
-bun install    # install dependencies
-bun run start  # start MCP server
+bun install
+bun run start  # start server
 bun run lint   # biome check
 bun run format # biome format
 ```
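The per-bot blocked/allowed decisions promised in the README above follow RFC 9309's precedence rule: of all `allow`/`disallow` rules matching a path, the longest match wins, and ties go to `allow`. A minimal TypeScript sketch of that rule — illustrative only, not Pagesight's implementation (`isPathAllowed` and `Rule` are hypothetical names, and the sketch handles only literal path prefixes, not `*` wildcards or `$` anchors):

```typescript
// Hypothetical sketch of RFC 9309 rule precedence: longest matching
// rule wins; on equal length, "allow" beats "disallow"; no match = allowed.
type Rule = { type: "allow" | "disallow"; path: string };

function ruleMatches(rulePath: string, path: string): boolean {
  // Literal prefix match only; real parsers also handle "*" and "$".
  return path.startsWith(rulePath);
}

function isPathAllowed(rules: Rule[], path: string): boolean {
  let best: Rule | null = null;
  for (const rule of rules) {
    if (!ruleMatches(rule.path, path)) continue;
    if (
      best === null ||
      rule.path.length > best.path.length ||
      (rule.path.length === best.path.length && rule.type === "allow")
    ) {
      best = rule;
    }
  }
  // No matching rule means the path is crawlable by default.
  return best === null || best.type === "allow";
}

const rules: Rule[] = [
  { type: "disallow", path: "/private/" },
  { type: "allow", path: "/private/press/" },
];

console.log(isPathAllowed(rules, "/private/data"));       // false
console.log(isPathAllowed(rules, "/private/press/2024")); // true
console.log(isPathAllowed(rules, "/blog/post"));          // true
```

This is why `/private/press/2024` stays crawlable even though `/private/` is disallowed: the more specific `allow` rule outranks the shorter `disallow`.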
package/package.json CHANGED
@@ -1,7 +1,23 @@
 {
   "name": "pagesight",
-  "version": "0.1.0",
-  "description": "Google's data + AI crawler intelligence for AI assistants. MCP server for SEO, GEO, and web performance.",
+  "version": "0.2.1",
+  "description": "See your site the way search engines and AI see it.",
+  "keywords": [
+    "seo",
+    "geo",
+    "google-search-console",
+    "pagespeed-insights",
+    "core-web-vitals",
+    "web-performance",
+    "robots-txt",
+    "ai-crawlers",
+    "mcp"
+  ],
+  "files": [
+    "src",
+    "README.md",
+    "LICENSE"
+  ],
   "type": "module",
   "main": "src/index.ts",
   "bin": {
@@ -9,9 +25,9 @@
   },
   "repository": {
     "type": "git",
-    "url": "git+https://github.com/caiopizzol/sitelint.git"
+    "url": "git+https://github.com/caiopizzol/pagesight.git"
   },
-  "homepage": "https://github.com/caiopizzol/sitelint",
+  "homepage": "https://github.com/caiopizzol/pagesight",
   "author": "Caio Pizzol",
   "scripts": {
     "start": "bun run src/index.ts",
@@ -2,31 +2,15 @@ import type { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
 import { z } from "zod";
 import { auditAiCrawlers, type CrawlerStatus, fetchRobotsTxt, isAllowed, type RobotsTxt } from "../lib/robots.js";
 
-function formatCrawlerStatus(statuses: CrawlerStatus[]): string {
-  const lines: string[] = [];
-
-  // Group by category
-  const categories = new Map<string, CrawlerStatus[]>();
-  for (const s of statuses) {
-    const cat = s.category;
-    if (!categories.has(cat)) categories.set(cat, []);
-    categories.get(cat)?.push(s);
-  }
-
-  for (const [cat, bots] of categories) {
-    lines.push(`--- ${cat} (${bots.length}) ---`, "");
-
-    for (const bot of bots) {
-      const status = bot.allowed ? "ALLOWED" : "BLOCKED";
-      lines.push(` ${status} ${bot.name} (${bot.company})`);
-      if (bot.matchedRule) {
-        lines.push(` Rule: ${bot.matchedRule.type}: ${bot.matchedRule.path} (group: ${bot.matchedGroup})`);
-      }
-    }
-    lines.push("");
-  }
-
-  return lines.join("\n").trimEnd();
+// Normalize the messy registry categories into clean buckets
+function normalizeCategory(raw: string): string {
+  const lower = raw.toLowerCase();
+  if (lower.includes("training") || lower.includes("train") || lower.includes("scrape") || lower.includes("dataset"))
+    return "Training";
+  if (lower.includes("search") && !lower.includes("assistant")) return "Search";
+  if (lower.includes("assistant") || lower.includes("user prompt") || lower.includes("user quer")) return "Assistant";
+  if (lower.includes("agent")) return "Agent";
+  return "Other";
 }
 
 function formatRobotsAudit(origin: string, robots: RobotsTxt, statusCode: number, crawlers: CrawlerStatus[]): string {
@@ -63,17 +47,67 @@ function formatRobotsAudit(origin: string, robots: RobotsTxt, statusCode: number
   }
 
   // AI Crawler audit
-  const allowed = crawlers.filter((c) => c.allowed);
   const blocked = crawlers.filter((c) => !c.allowed);
+  const allowed = crawlers.filter((c) => c.allowed);
 
   lines.push(
     "",
     `--- AI Crawlers: ${blocked.length} blocked, ${allowed.length} allowed (of ${crawlers.length} known) ---`,
     "",
-    `Source: github.com/ai-robots-txt/ai.robots.txt (${crawlers.length} bots)`,
-    "",
+    `Source: github.com/ai-robots-txt/ai.robots.txt`,
   );
-  lines.push(formatCrawlerStatus(crawlers));
+
+  if (blocked.length === 0) {
+    lines.push("", "All 139 known AI crawlers are allowed. No bots are explicitly blocked.");
+  } else if (blocked.length === crawlers.length) {
+    lines.push("", "All known AI crawlers are blocked.");
+    // Show how they're blocked
+    const byGroup = new Map<string, string[]>();
+    for (const bot of blocked) {
+      const group = bot.matchedGroup ?? "wildcard";
+      if (!byGroup.has(group)) byGroup.set(group, []);
+      byGroup.get(group)?.push(bot.name);
+    }
+    for (const [group, bots] of byGroup) {
+      lines.push(` via group "${group}": ${bots.length} bots`);
+    }
+  } else {
+    // Mixed — show blocked bots in detail, grouped by normalized category
+    lines.push("");
+
+    const blockedByCategory = new Map<string, CrawlerStatus[]>();
+    for (const bot of blocked) {
+      const cat = normalizeCategory(bot.category);
+      if (!blockedByCategory.has(cat)) blockedByCategory.set(cat, []);
+      blockedByCategory.get(cat)?.push(bot);
+    }
+
+    const categoryOrder = ["Training", "Search", "Assistant", "Agent", "Other"];
+    for (const cat of categoryOrder) {
+      const bots = blockedByCategory.get(cat);
+      if (!bots) continue;
+
+      lines.push(`Blocked ${cat} (${bots.length}):`);
+      for (const bot of bots) {
+        lines.push(` BLOCKED ${bot.name} (${bot.company})`);
+      }
+      lines.push("");
+    }
+
+    // Summary of allowed by category
+    const allowedByCategory = new Map<string, number>();
+    for (const bot of allowed) {
+      const cat = normalizeCategory(bot.category);
+      allowedByCategory.set(cat, (allowedByCategory.get(cat) ?? 0) + 1);
+    }
+
+    const allowedSummary = categoryOrder
+      .filter((cat) => allowedByCategory.has(cat))
+      .map((cat) => `${cat}: ${allowedByCategory.get(cat)}`)
+      .join(", ");
+
+    lines.push(`Allowed (${allowed.length}): ${allowedSummary}`);
+  }
 
   return lines.join("\n");
 }
@@ -81,7 +115,7 @@ function formatRobotsAudit(origin: string, robots: RobotsTxt, statusCode: number
 export function registerRobotsTool(server: McpServer): void {
   server.tool(
     "robots",
-    "Fetch and analyze a site's robots.txt. Validates syntax per RFC 9309, audits AI crawler access (130+ bots from ai-robots-txt registry), lists sitemaps, and reports blocked vs allowed bots by category.",
+    "Fetch and analyze a site's robots.txt. Validates syntax per RFC 9309, audits AI crawler access (139+ bots), lists sitemaps. Shows blocked bots in detail, summarizes allowed.",
     {
       url: z
         .string()
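The `normalizeCategory` bucketing added in this diff can be exercised standalone. A sketch that copies the function verbatim and tallies buckets with the same `Map` pattern the audit uses — the sample category strings are hypothetical stand-ins, not actual ai-robots-txt registry values:

```typescript
// Standalone copy of normalizeCategory from the diff above.
function normalizeCategory(raw: string): string {
  const lower = raw.toLowerCase();
  if (lower.includes("training") || lower.includes("train") || lower.includes("scrape") || lower.includes("dataset"))
    return "Training";
  if (lower.includes("search") && !lower.includes("assistant")) return "Search";
  if (lower.includes("assistant") || lower.includes("user prompt") || lower.includes("user quer")) return "Assistant";
  if (lower.includes("agent")) return "Agent";
  return "Other";
}

// Hypothetical category strings, one per expected bucket.
const samples = [
  "AI Data Scraper",        // "scrape"      -> Training
  "AI Search Crawler",      // "search"      -> Search
  "Fetches on user prompt", // "user prompt" -> Assistant
  "Autonomous AI Agent",    // "agent"       -> Agent
  "Undocumented",           // no keyword    -> Other
];

// Tally per bucket, as the mixed-case audit branch does for allowed bots.
const counts = new Map<string, number>();
for (const s of samples) {
  const cat = normalizeCategory(s);
  counts.set(cat, (counts.get(cat) ?? 0) + 1);
}
console.log([...counts.entries()].map(([c, n]) => `${c}: ${n}`).join(", "));
// Training: 1, Search: 1, Assistant: 1, Agent: 1, Other: 1
```

Note the ordering of the checks matters: "AI Search Crawler" would also fail the Training test first, and the `!lower.includes("assistant")` guard keeps assistant-flavored search strings out of the Search bucket.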
@@ -1,40 +0,0 @@
-name: CI
-
-on:
-  push:
-    branches: [main]
-  pull_request:
-    branches: [main]
-
-jobs:
-  check:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v4
-      - uses: oven-sh/setup-bun@v2
-      - run: bun install --frozen-lockfile
-      - run: bunx biome check src/
-      - run: bunx tsc --noEmit
-
-  release:
-    needs: check
-    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
-    runs-on: ubuntu-latest
-    permissions:
-      contents: write
-      issues: write
-      pull-requests: write
-    steps:
-      - uses: actions/checkout@v4
-        with:
-          fetch-depth: 0
-      - uses: actions/setup-node@v4
-        with:
-          node-version: 22
-      - uses: oven-sh/setup-bun@v2
-      - run: bun install --frozen-lockfile
-      - run: npx semantic-release
-        env:
-          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-          NPM_TOKEN: ${{ secrets.NPM_TOKEN }}
-          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
package/.releaserc.json DELETED
@@ -1,10 +0,0 @@
-{
-  "branches": ["main"],
-  "plugins": [
-    "@semantic-release/commit-analyzer",
-    "semantic-release-ai-notes",
-    "@semantic-release/npm",
-    "@semantic-release/github",
-    "@semantic-release/git"
-  ]
-}
package/CLAUDE.md DELETED
@@ -1,52 +0,0 @@
1
- # Pagesight
2
-
3
- MCP server for SEO, GEO, and web performance analysis. npm package: `pagesight`.
4
-
5
- ## Stack
6
-
7
- - **Runtime**: Bun (not Node.js)
8
- - **Language**: TypeScript
9
- - **MCP SDK**: `@modelcontextprotocol/sdk`
10
- - **Linter**: Biome
11
- - **Git hooks**: Lefthook (pre-commit: biome + tsc)
12
-
13
- ## Architecture
14
-
15
- ```
16
- src/
17
- index.ts # MCP server entry, registers all tools
18
- lib/
19
- auth.ts # OAuth 2.0 + Service Account auth for GSC
20
- gsc.ts # Google Search Console API client
21
- psi.ts # PageSpeed Insights API client
22
- crux.ts # Chrome UX Report API client
23
- tools/
24
- inspect.ts # URL Inspection tool
25
- pagespeed.ts # PageSpeed Insights tool
26
- crux.ts # CrUX + CrUX History tools
27
- performance.ts # Search Analytics tool
28
- sitemaps.ts # Sites + Sitemaps tool
29
- setup.ts # Auth setup helper
30
- ```
31
-
32
- ## APIs Used
33
-
34
- | API | Auth | Env Var |
35
- |-----|------|---------|
36
- | Google Search Console | OAuth 2.0 / Service Account | `GSC_CLIENT_ID`, `GSC_CLIENT_SECRET`, `GSC_REFRESH_TOKEN` |
37
- | PageSpeed Insights v5 | API key (optional) | `GOOGLE_API_KEY` |
38
- | Chrome UX Report | API key (required) | `GOOGLE_API_KEY` |
39
-
40
- ## Commands
41
-
42
- - `bun run src/index.ts` — start MCP server
43
- - `bun run lint` — biome check
44
- - `bun run format` — biome format
45
- - `bun test` — run tests
46
-
47
- ## Conventions
48
-
49
- - All tools have try/catch error handling with clean error messages
50
- - Use Bun built-in APIs over third-party packages
51
- - No HTML parsing or on-page analysis — only authoritative data sources
52
- - Every check must be backed by an official API or standard (Google APIs, RFC 9309), not industry conventions
package/biome.json DELETED
@@ -1,14 +0,0 @@
1
- {
2
- "$schema": "https://biomejs.dev/schemas/2.4.10/schema.json",
3
- "files": {
4
- "includes": ["src/**"]
5
- },
6
- "formatter": {
7
- "indentStyle": "space",
8
- "indentWidth": 2,
9
- "lineWidth": 120
10
- },
11
- "linter": {
12
- "enabled": true
13
- }
14
- }