mallmaverick-store-scraper 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md ADDED
@@ -0,0 +1,225 @@
1
+ # mall-scraper-mcp
2
+
3
+ Layered scraper for shopping-mall store directories. Works as:
4
+
5
+ - **MCP server** — coworkers drive scrapes from Claude Desktop / Claude Code
6
+ - **CLI** — direct command-line use (`node src/main.js`)
7
+
8
+ Both share the same v5 pipeline: deterministic hours extraction (JSON-LD →
9
+ DOM patterns → labeled section → sync-with-mall → focused LLM → external
10
+ follow), per-page image classification with logo/brand/storefront separation,
11
+ brand-site fallback for problematic logos.
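The control flow behind those layers can be sketched as a simple ordered-fallback loop (layer names below are illustrative stubs; the real implementation lives in `src/hoursPipeline.js`):

```javascript
// Illustrative sketch of layered fallback: try extractors in priority order
// and return the first non-null result, tagging which layer produced it.
async function runLayers(layers, pageData) {
  for (const [name, extract] of layers) {
    const hours = await extract(pageData);
    if (hours) return { hours, source: name };
  }
  return null; // every layer struck out
}

// Stubbed layers standing in for the real extractors:
const layers = [
  ['jsonld', (d) => d.jsonLdHours || null],
  ['dom-patterns', (d) => d.domHours || null],
  ['focused-llm', async (d) => d.llmHours || null],
];

runLayers(layers, { domHours: 'Mon-Fri 10am-9pm' }).then((r) => {
  console.log(`${r.source}: ${r.hours}`); // prints "dom-patterns: Mon-Fri 10am-9pm"
});
```

Cheap deterministic layers run first, so the LLM is only consulted when everything earlier failed, which is what keeps per-store cost low.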
12
+
13
+ ---
14
+
15
+ ## How coworkers install it
16
+
17
+ Once published to npm and the Cloudflare Worker is deployed, every coworker
18
+ runs **one command** in their terminal:
19
+
20
+ ```bash
21
+ claude mcp add mall-scraper \
22
+ --env MALL_SCRAPER_PROXY_URL=https://mall-scraper-openai-proxy.YOURSUB.workers.dev \
23
+ --env MALL_SCRAPER_TOKEN=YOUR_SHARED_SECRET \
24
+ -- npx -y mallmaverick-store-scraper@latest
25
+ ```
26
+
27
+ Then in Claude they say things like:
28
+
29
+ > Scrape https://grasslands.ca/store-directory/, first 10 stores.
30
+ > Save as CSV.
31
+
32
+ Claude calls the `scrape_directory` tool, returns the data, and Claude can
33
+ do follow-up analysis (write CSV, find missing fields, retry specific stores).
34
+
35
+ ### What requires no setup on coworker machines
36
+
37
+ - ❌ No git clone
38
+ - ❌ No OpenAI API key (it lives in your Worker)
39
+ - ❌ No zip to download or replace on updates
40
+ - ✅ npm/npx + Node 18+ (most have this; otherwise nodejs.org)
41
+ - ✅ The shared-secret token (you give them)
42
+
43
+ The first scrape downloads Chromium (~170 MB, one-time, automatic via Puppeteer).
44
+
45
+ ---
46
+
47
+ ## How YOU set it up (one-time)
48
+
49
+ ### 1. Deploy the Cloudflare Worker (10 min)
50
+
51
+ See `cloudflare-worker/README.md`. The short version:
52
+
53
+ ```bash
54
+ cd cloudflare-worker
55
+ npm install
56
+ npx wrangler login # browser auth to your Cloudflare account
57
+ npx wrangler deploy
58
+ npx wrangler secret put OPENAI_API_KEY # paste your real OpenAI key
59
+ npx wrangler secret put SHARED_SECRET # paste a long random string
60
+ ```
61
+
62
+ You now have:
63
+ - `MALL_SCRAPER_PROXY_URL` = https://mall-scraper-openai-proxy.YOURSUB.workers.dev
64
+ - `MALL_SCRAPER_TOKEN` = (whatever you put as SHARED_SECRET)
65
+
66
+ Free tier covers ~300 mall scrapes/day. Cost = whatever your OpenAI bill is
67
+ (~$0.005/store at gpt-5.4-mini).
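For orientation, the shape of such a proxy is roughly the following (a hypothetical sketch, not the shipped `cloudflare-worker/worker.js`; the function name and details are illustrative): reject any request whose bearer token doesn't match `SHARED_SECRET`, then forward to api.openai.com with the real key attached.

```javascript
// Hypothetical sketch of a shared-secret OpenAI proxy. Runs anywhere with
// WHATWG fetch/Request/Response globals (Workers runtime, Node 18+).
async function handleProxyRequest(request, env) {
  // Reject anything without the shared secret (clients send MALL_SCRAPER_TOKEN here).
  const auth = request.headers.get('authorization') || '';
  if (auth !== `Bearer ${env.SHARED_SECRET}`) {
    return new Response('unauthorized', { status: 401 });
  }
  // Re-point the request at OpenAI and swap in the real key.
  const upstream = new URL(request.url);
  upstream.protocol = 'https:';
  upstream.host = 'api.openai.com';
  const body = (request.method === 'GET' || request.method === 'HEAD')
    ? undefined
    : await request.text();
  return fetch(upstream, {
    method: request.method,
    headers: {
      authorization: `Bearer ${env.OPENAI_API_KEY}`,
      'content-type': 'application/json',
    },
    body,
  });
}
```

Because the OpenAI key only ever lives in the Worker's secrets, rotating it (or the shared secret) never requires touching coworker machines.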
68
+
69
+ ### 2. Publish the npm package
70
+
71
+ ```bash
72
+ # Log in to npm
73
+ npm login
74
+
75
+ # Sanity check
76
+ npm pack --dry-run # see exactly what would be published
77
+
78
+ # First publish
79
+ npm publish --access public
80
+ ```
81
+
82
+ If `mallmaverick-store-scraper` is taken, edit `package.json` `"name"` to something
83
+ available (or use a scope like `@yourname/mall-scraper` — make sure to
84
+ `npm publish --access public` for scoped public packages).
85
+
86
+ ### 3. Share the install command with coworkers
87
+
88
+ Send them the one-line `claude mcp add` command above, with your actual
89
+ proxy URL and shared secret pasted in.
90
+
91
+ ---
92
+
93
+ ## How you ship updates
94
+
95
+ This is the workflow that makes "easy updates" actually easy:
96
+
97
+ ```bash
98
+ # Make changes
99
+ git commit -am "improve hours layer 4 for X site"
100
+
101
+ # Bump the version
102
+ npm version patch # 0.1.0 → 0.1.1 (bug fixes)
103
+ npm version minor # 0.1.0 → 0.2.0 (new features)
104
+
105
+ # Publish
106
+ npm publish
107
+ ```
108
+
109
+ Coworkers get the new version automatically on their next Claude session
110
+ because the install command uses `npx -y mallmaverick-store-scraper@latest` — npx
111
+ re-resolves to the latest published version every time.
112
+
113
+ If you want stricter pinning (say you publish a buggy version and need time
114
+ to revert), tell them to use `mallmaverick-store-scraper@0.1.1` instead of `@latest`.
115
+
116
+ ### Worker updates (less frequent)
117
+
118
+ ```bash
119
+ cd cloudflare-worker
120
+ npx wrangler deploy
121
+ ```
122
+
123
+ Live in seconds. No coworker action needed.
124
+
125
+ ---
126
+
127
+ ## CLI usage (you, or fallback)
128
+
129
+ ```bash
130
+ cd path/to/mall-scraper-mcp
131
+ npm install
132
+ echo "OPENAI_API_KEY=sk-..." > .env # or set MALL_SCRAPER_* env vars
133
+ ./run.sh
134
+ ```
135
+
136
+ CLI prompts for: directory URL, model, max stores, concurrency, threshold,
137
+ vision yes/no. Output lands in `extracted_stores/`.
138
+
139
+ ---
140
+
141
+ ## MCP tools exposed
142
+
143
+ | Tool | Use when |
144
+ |---|---|
145
+ | `scrape_directory` | User wants the full per-store extraction across a directory listing |
146
+ | `get_store_hours` | Debugging — quick hours-only check on a single store URL |
147
+ | `validate_image_url` | A logo isn't loading in the CMS — confirm whether the URL itself is bad |
148
+
149
+ All three accept JSON inputs documented in their schemas; Claude figures out
150
+ the args from the conversation.
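For example, a `scrape_directory` invocation might carry arguments along these lines (field names here are illustrative only; the schema the server registers is authoritative):

```json
{
  "tool": "scrape_directory",
  "arguments": {
    "directory_url": "https://grasslands.ca/store-directory/",
    "max_stores": 10,
    "output_format": "csv"
  }
}
```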
151
+
152
+ ---
153
+
154
+ ## File layout
155
+
156
+ ```
157
+ mall-scraper-mcp/
158
+ ├── package.json ← bin entry → src/mcp-server.js
159
+ ├── src/
160
+ │ ├── mcp-server.js ← MCP stdio server (entry for `npx mallmaverick-store-scraper`)
161
+ │ ├── main.js ← CLI entry
162
+ │ ├── openai-proxy.js ← chooses direct OpenAI vs Worker proxy from env
163
+ │ ├── browser.js ← Puppeteer wrapper + XHR intercept
164
+ │ ├── discovery.js ← directory URL discovery + logo map
165
+ │ ├── hoursParser.js ← canonical hours parsing / validation
166
+ │ ├── hoursPipeline.js ← 7-layer hours extraction
167
+ │ ├── mallContext.js ← mall hours + socials + chrome images detection
168
+ │ ├── imageExtraction.js ← logo/brand/storefront classifier
169
+ │ ├── brandSiteFallback.js ← brand-site logo when mall has GIF/missing
170
+ │ ├── deterministic.js ← phone, socials, website, status flags
171
+ │ ├── storeExtractor.js ← LLM extraction for non-deterministic fields
172
+ │ ├── retryStrategy.js ← 3-attempt escalating page loads
173
+ │ ├── storeModel.js ← 40-field schema + CSV writer (CRLF/BOM)
174
+ │ └── output.js ← (legacy, unused by mcp server)
175
+ ├── cloudflare-worker/
176
+ │ ├── worker.js ← OpenAI proxy (30 LOC)
177
+ │ ├── wrangler.toml
178
+ │ └── README.md
179
+ └── test/
180
+ └── hoursParser.test.js ← 40+ unit tests
181
+ ```
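On the CRLF/BOM note for `storeModel.js`: Excel only detects UTF-8 reliably when the file starts with a byte-order mark, and expects CRLF row endings. A minimal sketch of that convention (illustrative, not the shipped writer):

```javascript
// Illustrative Excel-friendly CSV output: UTF-8 BOM prefix, CRLF line endings,
// quotes doubled, and any field containing a delimiter wrapped in quotes.
function toCsv(rows) {
  const escape = (v) => {
    const s = String(v ?? '');
    return /[",\r\n]/.test(s) ? '"' + s.replace(/"/g, '""') + '"' : s;
  };
  const body = rows.map((r) => r.map(escape).join(',')).join('\r\n');
  return '\ufeff' + body + '\r\n';
}

const csv = toCsv([['name', 'hours'], ['Joe "Z" Shop', 'Mon-Fri, 10-9']]);
```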
182
+
183
+ ---
184
+
185
+ ## Auth modes
186
+
187
+ The scraper supports two ways to reach OpenAI; it picks the first that's
188
+ configured:
189
+
190
+ 1. **Proxy mode (production / coworker default).**
191
+ `MALL_SCRAPER_PROXY_URL` + `MALL_SCRAPER_TOKEN` set → calls go through the
192
+ Cloudflare Worker, which holds your real OpenAI key.
193
+
194
+ 2. **Direct mode (your local dev fallback).**
195
+ `OPENAI_API_KEY` set → calls go straight to api.openai.com. Useful when
196
+ developing without spinning up the Worker.
197
+
198
+ If neither is set, the scraper refuses to start with a clear error.
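A sketch of that precedence (hypothetical helper name; the real selection logic lives in `src/openai-proxy.js`):

```javascript
// Illustrative sketch: pick proxy mode when both proxy vars are set,
// fall back to a direct OpenAI key, otherwise fail fast with a clear error.
function resolveOpenAIConfig(env) {
  if (env.MALL_SCRAPER_PROXY_URL && env.MALL_SCRAPER_TOKEN) {
    return { mode: 'proxy', baseURL: env.MALL_SCRAPER_PROXY_URL, apiKey: env.MALL_SCRAPER_TOKEN };
  }
  if (env.OPENAI_API_KEY) {
    return { mode: 'direct', baseURL: 'https://api.openai.com/v1', apiKey: env.OPENAI_API_KEY };
  }
  throw new Error(
    'No OpenAI access configured: set MALL_SCRAPER_PROXY_URL + MALL_SCRAPER_TOKEN, or OPENAI_API_KEY.'
  );
}

console.log(resolveOpenAIConfig({ OPENAI_API_KEY: 'sk-test' }).mode); // prints "direct"
```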
199
+
200
+ ---
201
+
202
+ ## Troubleshooting
203
+
204
+ **Logo URL returns HTML in coworker's CMS:**
205
+ Ask Claude to run `validate_image_url` on the failing URL. Confirms whether
206
+ the URL itself returns a real image. If it does, the issue is on the CMS
207
+ side (the shopcurrents-style empty `property_manager_id` case is a known
208
+ example).
209
+
210
+ **Coworker gets "unauthorized" from the Worker:**
211
+ Their `MALL_SCRAPER_TOKEN` doesn't match the current `SHARED_SECRET`. Either
212
+ update the token on their side, or run `npx wrangler secret put SHARED_SECRET` to match.
213
+
214
+ **First scrape takes 2-3 minutes:**
215
+ Puppeteer is downloading Chromium on first run (~170 MB). Subsequent scrapes
216
+ are normal speed.
217
+
218
+ **`npx mallmaverick-store-scraper` not found:**
219
+ They need Node 18+ in PATH. `node --version` to check.
220
+
221
+ ---
222
+
223
+ ## License
224
+
225
+ MIT.
package/package.json ADDED
@@ -0,0 +1,41 @@
1
+ {
2
+ "name": "mallmaverick-store-scraper",
3
+ "version": "0.1.0",
4
+ "description": "MCP server + CLI for scraping shopping mall store directories. Hours-first layered pipeline + image classification.",
5
+ "main": "src/main.js",
6
+ "type": "commonjs",
7
+ "bin": {
8
+ "mallmaverick-store-scraper": "src/mcp-server.js"
9
+ },
10
+ "scripts": {
11
+ "start": "node src/main.js",
12
+ "start:mcp": "node src/mcp-server.js",
13
+ "test:hours": "node test/hoursParser.test.js"
14
+ },
15
+ "files": [
16
+ "src/**/*.js",
17
+ "README.md",
18
+ "LICENSE"
19
+ ],
20
+ "dependencies": {
21
+ "@modelcontextprotocol/sdk": "^1.0.0",
22
+ "openai": "^4.52.0",
23
+ "puppeteer": "^24.15.0",
24
+ "dotenv": "^16.4.5",
25
+ "readline-sync": "^1.4.10",
26
+ "p-limit": "^3.1.0"
27
+ },
28
+ "engines": {
29
+ "node": ">=18.0.0"
30
+ },
31
+ "keywords": [
32
+ "mcp",
33
+ "claude",
34
+ "scraper",
35
+ "shopping-mall",
36
+ "store-directory",
37
+ "puppeteer"
38
+ ],
39
+ "author": "",
40
+ "license": "MIT"
41
+ }
package/src/brandSiteFallback.js ADDED
@@ -0,0 +1,272 @@
1
+ 'use strict';
2
+
3
+ /**
4
+ * Brand-site logo fallback.
5
+ *
6
+ * When the mall-hosted logo is missing or in a problematic format (e.g. GIF
7
+ * that the downstream CMS can't ingest), and the store has a known website
8
+ * field, load the brand homepage and try to extract a logo from there.
9
+ *
10
+ * Priority order:
11
+ * 1. JSON-LD `logo` property (Organization / LocalBusiness / WebSite schema)
12
+ * 2. <link rel="icon" sizes="..."> with the largest size — modern brand
13
+ * sites publish a high-res icon (192x192+) that's a usable logo
14
+ * 3. <img> with class/id/parent containing "logo"
15
+ * 4. og:image if it looks logo-shaped (square-ish, < 1000px)
16
+ *
17
+ * GIFs are skipped at every stage — the whole point of the fallback is to
18
+ * find a non-GIF.
19
+ *
20
+ * Returns { url, source } or null.
21
+ */
22
+
23
+ const { URL } = require('url');
24
+ const http = require('http');
25
+ const https = require('https');
26
+ const { loadPage, newPage } = require('./browser');
27
+
28
+ const TIMEOUT_MS = 15000;
29
+
30
+ /**
31
+ * HEAD-validate (following redirects) that a URL actually serves an image.
32
+ * Many brand sites have hardcoded `<link rel="apple-touch-icon" href="/foo.png">`
33
+ * pointing at files that don't exist — the server returns HTML 200 instead of 404.
34
+ *
35
+ * Returns the validated URL (possibly after redirect) or null.
36
+ */
37
+ function validateImageUrl(url, { timeout = 8000, allowGif = false } = {}) {
38
+ return new Promise((resolve) => {
39
+ let finalUrl = url;
40
+ let redirectsLeft = 3;
41
+ const attempt = (u) => {
42
+ const mod = u.startsWith('https') ? https : http;
43
+ const req = mod.request(u, {
44
+ method: 'HEAD',
45
+ timeout,
46
+ headers: {
47
+ 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
48
+ 'Accept': 'image/*,*/*;q=0.5',
49
+ },
50
+ }, (res) => {
51
+ // Follow redirects
52
+ if (res.statusCode >= 300 && res.statusCode < 400 && res.headers.location && redirectsLeft > 0) {
53
+ redirectsLeft--;
54
+ try {
55
+ const next = new URL(res.headers.location, u).toString();
56
+ finalUrl = next;
57
+ return attempt(next);
58
+ } catch { return resolve(null); }
59
+ }
60
+ if (res.statusCode !== 200) return resolve(null);
61
+ const ct = (res.headers['content-type'] || '').toLowerCase();
62
+ if (!ct.startsWith('image/')) return resolve(null);
63
+ if (!allowGif && ct.includes('gif')) return resolve(null);
64
+ // Content-length sanity (filter tiny tracking pixels)
65
+ const cl = parseInt(res.headers['content-length'] || '0', 10);
66
+ if (cl > 0 && cl < 200) return resolve(null);
67
+ resolve(finalUrl);
68
+ });
69
+ req.on('error', () => resolve(null));
70
+ req.on('timeout', () => { req.destroy(); resolve(null); });
71
+ req.end();
72
+ };
73
+ attempt(url);
74
+ });
75
+ }
76
+
77
+ async function fetchBrandLogo(browser, websiteUrl, storeName, { logger } = {}) {
78
+ if (!websiteUrl) return null;
79
+ let normalized;
80
+ // Tolerate scheme-less website values (e.g. "example.com") occasionally found upstream.
+ const withScheme = /^https?:\/\//i.test(websiteUrl) ? websiteUrl : `https://${websiteUrl}`;
+ try { normalized = new URL(withScheme).origin + '/'; } catch { return null; }
81
+
82
+ const page = await newPage(browser);
83
+ try {
84
+ let data;
85
+ try {
86
+ data = await loadPage(page, normalized, { waitUntil: 'domcontentloaded', timeout: TIMEOUT_MS });
87
+ } catch (err) {
88
+ if (logger) logger.warn(` ⚠ brand-site fetch failed (${normalized}): ${err.message}`);
89
+ return null;
90
+ }
91
+
92
+ const origin = normalized.replace(/\/+$/, '');
93
+
94
+ // Try each layer's candidates and return the first one that VALIDATES
95
+ // (HEAD request returns image/* content-type with HTTP 200).
96
+
97
+ // --- Layer 1: JSON-LD logo / image ---
98
+ const ldLogo = findJsonLdLogo(data.jsonLd, origin);
99
+ if (ldLogo && !isGif(ldLogo)) {
100
+ const valid = await validateImageUrl(ldLogo);
101
+ if (valid) return { url: valid, source: 'brand-jsonld' };
102
+ if (logger) logger.warn(` ⚠ brand JSON-LD logo failed validation: ${ldLogo}`);
103
+ }
104
+
105
+ // --- Layer 2: ranked <link rel="icon"> candidates ---
106
+ const iconCandidates = await page.evaluate(() => {
107
+ const links = Array.from(document.querySelectorAll('link[rel~="icon" i], link[rel~="apple-touch-icon" i], link[rel="mask-icon"]'));
108
+ return links.map(l => ({
109
+ href: l.href,
110
+ sizes: l.getAttribute('sizes') || '',
111
+ type: l.getAttribute('type') || '',
112
+ rel: l.getAttribute('rel') || '',
113
+ })).filter(o => o.href);
114
+ });
115
+ const rankedIcons = rankIcons(iconCandidates, origin);
116
+ for (const icon of rankedIcons) {
117
+ if (isGif(icon)) continue;
118
+ const valid = await validateImageUrl(icon);
119
+ if (valid) return { url: valid, source: 'brand-icon' };
120
+ }
121
+
122
+ // --- Layer 3: img with class/id/parent containing "logo" ---
123
+ const cleanName = (storeName || '').toLowerCase().replace(/[^a-z0-9]/g, '');
124
+ const imgLogo = await page.evaluate(({ origin, cleanName }) => {
125
+ const resolve = (src) => {
126
+ if (!src) return null;
127
+ if (src.startsWith('data:')) return null;
128
+ if (src.startsWith('http')) return src;
129
+ if (src.startsWith('//')) return 'https:' + src;
130
+ if (src.startsWith('/')) return origin + src;
131
+ return null;
132
+ };
133
+ const SKIP_FILE = /(banner|hero|cover|background|placeholder)/i;
134
+ const out = [];
135
+ document.querySelectorAll('img').forEach(img => {
136
+ const src = img.currentSrc || img.src || img.getAttribute('data-src') || img.getAttribute('data-lazy-src');
137
+ const url = resolve(src);
138
+ if (!url) return;
139
+ if (/\.gif(\?|$)/i.test(url)) return; // skip GIFs
140
+ const w = img.naturalWidth || 0;
141
+ const h = img.naturalHeight || 0;
142
+ if (w < 32 || h < 32) return;
143
+ const file = (() => { try { return new URL(url).pathname.split('/').pop().toLowerCase(); } catch { return ''; } })();
144
+ const cls = (img.className || '').toLowerCase();
145
+ const id = (img.id || '').toLowerCase();
146
+ const parent = img.parentElement;
147
+ const parentCls = parent ? (parent.className || '').toLowerCase() : '';
148
+ const alt = (img.getAttribute('alt') || '').toLowerCase();
149
+
150
+ let score = 0;
151
+ if (/\blogo\b/.test(cls)) score += 80;
152
+ if (/\blogo\b/.test(parentCls)) score += 60;
153
+ if (/\blogo\b/.test(id)) score += 60;
154
+ if (/logo/.test(file)) score += 50;
155
+ if (cleanName.length >= 4 && file.includes(cleanName.slice(0, 4))) score += 40;
156
+ if (cleanName.length >= 4 && alt.replace(/[^a-z0-9]/g, '').includes(cleanName.slice(0, 4))) score += 30;
157
+ // Square-ish aspect bonus
158
+ const aspect = h ? w / h : 1;
159
+ if (aspect >= 0.5 && aspect <= 2.0) score += 15;
160
+ // Reasonable logo size
161
+ if (w >= 60 && w <= 600 && h >= 60 && h <= 600) score += 15;
162
+ if (SKIP_FILE.test(file)) score -= 80;
163
+ // Header position bonus
164
+ const rect = img.getBoundingClientRect();
165
+ if (rect.top < 200) score += 10;
166
+ if (score > 0) out.push({ url, score, file, w, h });
167
+ });
168
+ return out.sort((a, b) => b.score - a.score).slice(0, 3);
169
+ }, { origin, cleanName });
170
+
171
+ for (const cand of (imgLogo || [])) {
172
+ if (cand.score < 50) break;
173
+ const valid = await validateImageUrl(cand.url);
174
+ if (valid) return { url: valid, source: 'brand-img' };
175
+ }
176
+
177
+ // --- Layer 4: og:image if it validates as an image ---
178
+ if (data.metaTags && data.metaTags['og:image']) {
179
+ const ogUrl = data.metaTags['og:image'];
180
+ if (!isGif(ogUrl)) {
181
+ const valid = await validateImageUrl(ogUrl);
182
+ if (valid) return { url: valid, source: 'brand-og' };
183
+ }
184
+ }
185
+
186
+ return null;
187
+ } finally {
188
+ try { await page.close(); } catch (_) {}
189
+ }
190
+ }
191
+
192
+ function isGif(url) {
193
+ return /\.gif(\?|$)/i.test(url || '');
194
+ }
195
+
196
+ function findJsonLdLogo(jsonLd, origin) {
197
+ if (!jsonLd) return null;
198
+ const arr = Array.isArray(jsonLd) ? jsonLd : [jsonLd];
199
+ const visit = (node) => {
200
+ if (!node || typeof node !== 'object') return null;
201
+ if (Array.isArray(node)) {
202
+ for (const n of node) { const r = visit(n); if (r) return r; }
203
+ return null;
204
+ }
205
+ // Direct logo property (Schema.org Organization)
206
+ if (node.logo) {
207
+ if (typeof node.logo === 'string') return resolveLogoUrl(node.logo, origin);
208
+ if (node.logo.url) return resolveLogoUrl(node.logo.url, origin);
209
+ }
210
+ // image fallback only when @type is Organization or similar
211
+ if (node['@type'] && /Organization|LocalBusiness|Restaurant|Store/i.test(String(node['@type']))) {
212
+ if (node.image) {
213
+ if (typeof node.image === 'string') return resolveLogoUrl(node.image, origin);
214
+ if (Array.isArray(node.image) && node.image[0]) {
215
+ return resolveLogoUrl(typeof node.image[0] === 'string' ? node.image[0] : node.image[0].url, origin);
216
+ }
217
+ if (node.image.url) return resolveLogoUrl(node.image.url, origin);
218
+ }
219
+ }
220
+ if (node['@graph']) { const r = visit(node['@graph']); if (r) return r; }
221
+ for (const k of Object.keys(node)) {
222
+ if (k.startsWith('@')) continue;
223
+ const v = node[k];
224
+ if (v && typeof v === 'object') { const r = visit(v); if (r) return r; }
225
+ }
226
+ return null;
227
+ };
228
+ for (const node of arr) { const r = visit(node); if (r) return r; }
229
+ return null;
230
+ }
231
+
232
+ function resolveLogoUrl(raw, origin) {
233
+ if (!raw) return null;
234
+ if (raw.startsWith('http')) return raw;
235
+ if (raw.startsWith('//')) return 'https:' + raw;
236
+ if (raw.startsWith('/')) return origin + raw;
237
+ return null;
238
+ }
239
+
240
+ function rankIcons(icons, origin) {
241
+ if (!icons || icons.length === 0) return [];
242
+ const score = (i) => {
243
+ let s = 0;
244
+ if (/\.ico(\?|$)/i.test(i.href) || i.type === 'image/x-icon') s -= 30;
245
+ if (i.rel && /apple-touch-icon/i.test(i.rel)) s += 30;
246
+ if (i.sizes) {
247
+ const m = String(i.sizes).match(/(\d{2,4})x\d{2,4}/);
248
+ if (m) {
249
+ const px = parseInt(m[1], 10);
250
+ if (px >= 192) s += 50;
251
+ else if (px >= 96) s += 30;
252
+ else if (px >= 64) s += 10;
253
+ else s -= 20;
254
+ }
255
+ }
256
+ if (/\.svg(\?|$)/i.test(i.href)) s += 20;
257
+ if (/\.png(\?|$)/i.test(i.href)) s += 15;
258
+ return s;
259
+ };
260
+ return icons
261
+ .map(i => ({ ...i, _s: score(i) }))
262
+ .filter(i => i._s > 0)
263
+ .sort((a, b) => b._s - a._s)
264
+ .map(i => {
265
+ if (i.href.startsWith('http')) return i.href;
266
+ if (i.href.startsWith('/')) return origin + i.href;
267
+ return null;
268
+ })
269
+ .filter(Boolean);
270
+ }
271
+
272
+ module.exports = { fetchBrandLogo };