mallmaverick-store-scraper 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md ADDED
@@ -0,0 +1,225 @@
1
+ # mall-scraper-mcp
2
+
3
+ Layered scraper for shopping-mall store directories. Works as:
4
+
5
+ - **MCP server** — coworkers drive scrapes from Claude Desktop / Claude Code
6
+ - **CLI** — direct command-line use (`node src/main.js`)
7
+
8
+ Both share the same v5 pipeline: deterministic hours extraction (JSON-LD →
9
+ DOM patterns → labeled section → sync-with-mall → focused LLM → external
10
+ follow), per-page image classification with logo/brand/storefront separation,
11
+ brand-site fallback for problematic logos.
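The control flow behind those layers can be sketched as a simple ordered-fallback loop (layer names below are illustrative stubs; the real implementation lives in `src/hoursPipeline.js`):

```javascript
// Illustrative sketch of layered fallback: try extractors in priority order
// and return the first non-null result, tagging which layer produced it.
async function runLayers(layers, pageData) {
  for (const [name, extract] of layers) {
    const hours = await extract(pageData);
    if (hours) return { hours, source: name };
  }
  return null; // every layer struck out
}

// Stubbed layers standing in for the real extractors:
const layers = [
  ['jsonld', (d) => d.jsonLdHours || null],
  ['dom-patterns', (d) => d.domHours || null],
  ['focused-llm', async (d) => d.llmHours || null],
];

runLayers(layers, { domHours: 'Mon-Fri 10am-9pm' }).then((r) => {
  console.log(`${r.source}: ${r.hours}`); // prints "dom-patterns: Mon-Fri 10am-9pm"
});
```

Cheap deterministic layers run first, so the LLM is only consulted when everything earlier failed, which is what keeps per-store cost low.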
12
+
13
+ ---
14
+
15
+ ## How coworkers install it
16
+
17
+ Once published to npm and the Cloudflare Worker is deployed, every coworker
18
+ runs **one command** in their terminal:
19
+
20
+ ```bash
21
+ claude mcp add mall-scraper \
22
+ --env MALL_SCRAPER_PROXY_URL=https://mall-scraper-openai-proxy.YOURSUB.workers.dev \
23
+ --env MALL_SCRAPER_TOKEN=YOUR_SHARED_SECRET \
24
+ -- npx -y mallmaverick-store-scraper@latest
25
+ ```
26
+
27
+ Then in Claude they say things like:
28
+
29
+ > Scrape https://grasslands.ca/store-directory/, first 10 stores.
30
+ > Save as CSV.
31
+
32
+ Claude calls the `scrape_directory` tool, returns the data, and Claude can
33
+ do follow-up analysis (write CSV, find missing fields, retry specific stores).
34
+
35
+ ### What requires no setup on coworker machines
36
+
37
+ - ❌ No git clone
38
+ - ❌ No OpenAI API key (it lives in your Worker)
39
+ - ❌ No zip to download or replace on updates
40
+ - ✅ npm/npx + Node 18+ (most have this; otherwise nodejs.org)
41
+ - ✅ The shared-secret token (you give them)
42
+
43
+ The first scrape downloads Chromium (~170 MB, one-time, automatic via Puppeteer).
44
+
45
+ ---
46
+
47
+ ## How YOU set it up (one-time)
48
+
49
+ ### 1. Deploy the Cloudflare Worker (10 min)
50
+
51
+ See `cloudflare-worker/README.md`. The short version:
52
+
53
+ ```bash
54
+ cd cloudflare-worker
55
+ npm install
56
+ npx wrangler login # browser auth to your Cloudflare account
57
+ npx wrangler deploy
58
+ npx wrangler secret put OPENAI_API_KEY # paste your real OpenAI key
59
+ npx wrangler secret put SHARED_SECRET # paste a long random string
60
+ ```
61
+
62
+ You now have:
63
+ - `MALL_SCRAPER_PROXY_URL` = https://mall-scraper-openai-proxy.YOURSUB.workers.dev
64
+ - `MALL_SCRAPER_TOKEN` = (whatever you put as SHARED_SECRET)
65
+
66
+ Free tier covers ~300 mall scrapes/day. Cost = whatever your OpenAI bill is
67
+ (~$0.005/store at gpt-5.4-mini).
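For orientation, the shape of such a proxy is roughly the following (a hypothetical sketch, not the shipped `cloudflare-worker/worker.js`; the function name and details are illustrative): reject any request whose bearer token doesn't match `SHARED_SECRET`, then forward to api.openai.com with the real key attached.

```javascript
// Hypothetical sketch of a shared-secret OpenAI proxy. Runs anywhere with
// WHATWG fetch/Request/Response globals (Workers runtime, Node 18+).
async function handleProxyRequest(request, env) {
  // Reject anything without the shared secret (clients send MALL_SCRAPER_TOKEN here).
  const auth = request.headers.get('authorization') || '';
  if (auth !== `Bearer ${env.SHARED_SECRET}`) {
    return new Response('unauthorized', { status: 401 });
  }
  // Re-point the request at OpenAI and swap in the real key.
  const upstream = new URL(request.url);
  upstream.protocol = 'https:';
  upstream.host = 'api.openai.com';
  const body = (request.method === 'GET' || request.method === 'HEAD')
    ? undefined
    : await request.text();
  return fetch(upstream, {
    method: request.method,
    headers: {
      authorization: `Bearer ${env.OPENAI_API_KEY}`,
      'content-type': 'application/json',
    },
    body,
  });
}
```

Because the OpenAI key only ever lives in the Worker's secrets, rotating it (or the shared secret) never requires touching coworker machines.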
68
+
69
+ ### 2. Publish the npm package
70
+
71
+ ```bash
72
+ # Log in to npm
73
+ npm login
74
+
75
+ # Sanity check
76
+ npm pack --dry-run # see exactly what would be published
77
+
78
+ # First publish
79
+ npm publish --access public
80
+ ```
81
+
82
+ If `mallmaverick-store-scraper` is taken, edit `package.json` `"name"` to something
83
+ available (or use a scope like `@yourname/mall-scraper` — make sure to
84
+ `npm publish --access public` for scoped public packages).
85
+
86
+ ### 3. Share the install command with coworkers
87
+
88
+ Send them the one-line `claude mcp add` command above, with your actual
89
+ proxy URL and shared secret pasted in.
90
+
91
+ ---
92
+
93
+ ## How you ship updates
94
+
95
+ This is the workflow that makes "easy updates" actually easy:
96
+
97
+ ```bash
98
+ # Make changes
99
+ git commit -am "improve hours layer 4 for X site"
100
+
101
+ # Bump the version
102
+ npm version patch # 0.1.0 → 0.1.1 (bug fixes)
103
+ npm version minor # 0.1.0 → 0.2.0 (new features)
104
+
105
+ # Publish
106
+ npm publish
107
+ ```
108
+
109
+ Coworkers get the new version automatically on their next Claude session
110
+ because the install command uses `npx -y mallmaverick-store-scraper@latest` — npx
111
+ re-resolves to the latest published version every time.
112
+
113
+ If you want stricter pinning (say you publish a buggy version and need time
114
+ to revert), tell them to use `mallmaverick-store-scraper@0.1.1` instead of `@latest`.
115
+
116
+ ### Worker updates (less frequent)
117
+
118
+ ```bash
119
+ cd cloudflare-worker
120
+ npx wrangler deploy
121
+ ```
122
+
123
+ Live in seconds. No coworker action needed.
124
+
125
+ ---
126
+
127
+ ## CLI usage (you, or fallback)
128
+
129
+ ```bash
130
+ cd path/to/mall-scraper-mcp
131
+ npm install
132
+ echo "OPENAI_API_KEY=sk-..." > .env # or set MALL_SCRAPER_* env vars
133
+ ./run.sh
134
+ ```
135
+
136
+ CLI prompts for: directory URL, model, max stores, concurrency, threshold,
137
+ vision yes/no. Output lands in `extracted_stores/`.
138
+
139
+ ---
140
+
141
+ ## MCP tools exposed
142
+
143
+ | Tool | Use when |
144
+ |---|---|
145
+ | `scrape_directory` | User wants the full per-store extraction across a directory listing |
146
+ | `get_store_hours` | Debugging — quick hours-only check on a single store URL |
147
+ | `validate_image_url` | A logo isn't loading in the CMS — confirm whether the URL itself is bad |
148
+
149
+ All three accept JSON inputs documented in their schemas; Claude figures out
150
+ the args from the conversation.
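For example, a `scrape_directory` invocation might carry arguments along these lines (field names here are illustrative only; the schema the server registers is authoritative):

```json
{
  "tool": "scrape_directory",
  "arguments": {
    "directory_url": "https://grasslands.ca/store-directory/",
    "max_stores": 10,
    "output_format": "csv"
  }
}
```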
151
+
152
+ ---
153
+
154
+ ## File layout
155
+
156
+ ```
157
+ mall-scraper-mcp/
158
+ ├── package.json ← bin entry → src/mcp-server.js
159
+ ├── src/
160
+ │ ├── mcp-server.js ← MCP stdio server (entry for `npx mallmaverick-store-scraper`)
161
+ │ ├── main.js ← CLI entry
162
+ │ ├── openai-proxy.js ← chooses direct OpenAI vs Worker proxy from env
163
+ │ ├── browser.js ← Puppeteer wrapper + XHR intercept
164
+ │ ├── discovery.js ← directory URL discovery + logo map
165
+ │ ├── hoursParser.js ← canonical hours parsing / validation
166
+ │ ├── hoursPipeline.js ← 7-layer hours extraction
167
+ │ ├── mallContext.js ← mall hours + socials + chrome images detection
168
+ │ ├── imageExtraction.js ← logo/brand/storefront classifier
169
+ │ ├── brandSiteFallback.js ← brand-site logo when mall has GIF/missing
170
+ │ ├── deterministic.js ← phone, socials, website, status flags
171
+ │ ├── storeExtractor.js ← LLM extraction for non-deterministic fields
172
+ │ ├── retryStrategy.js ← 3-attempt escalating page loads
173
+ │ ├── storeModel.js ← 40-field schema + CSV writer (CRLF/BOM)
174
+ │ └── output.js ← (legacy, unused by mcp server)
175
+ ├── cloudflare-worker/
176
+ │ ├── worker.js ← OpenAI proxy (30 LOC)
177
+ │ ├── wrangler.toml
178
+ │ └── README.md
179
+ └── test/
180
+ └── hoursParser.test.js ← 40+ unit tests
181
+ ```
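On the CRLF/BOM note for `storeModel.js`: Excel only detects UTF-8 reliably when the file starts with a byte-order mark, and expects CRLF row endings. A minimal sketch of that convention (illustrative, not the shipped writer):

```javascript
// Illustrative Excel-friendly CSV output: UTF-8 BOM prefix, CRLF line endings,
// quotes doubled, and any field containing a delimiter wrapped in quotes.
function toCsv(rows) {
  const escape = (v) => {
    const s = String(v ?? '');
    return /[",\r\n]/.test(s) ? '"' + s.replace(/"/g, '""') + '"' : s;
  };
  const body = rows.map((r) => r.map(escape).join(',')).join('\r\n');
  return '\ufeff' + body + '\r\n';
}

const csv = toCsv([['name', 'hours'], ['Joe "Z" Shop', 'Mon-Fri, 10-9']]);
```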
182
+
183
+ ---
184
+
185
+ ## Auth modes
186
+
187
+ The scraper supports two ways to reach OpenAI; it picks the first that's
188
+ configured:
189
+
190
+ 1. **Proxy mode (production / coworker default).**
191
+ `MALL_SCRAPER_PROXY_URL` + `MALL_SCRAPER_TOKEN` set → calls go through the
192
+ Cloudflare Worker, which holds your real OpenAI key.
193
+
194
+ 2. **Direct mode (your local dev fallback).**
195
+ `OPENAI_API_KEY` set → calls go straight to api.openai.com. Useful when
196
+ developing without spinning up the Worker.
197
+
198
+ If neither is set, the scraper refuses to start with a clear error.
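A sketch of that precedence (hypothetical helper name; the real selection logic lives in `src/openai-proxy.js`):

```javascript
// Illustrative sketch: pick proxy mode when both proxy vars are set,
// fall back to a direct OpenAI key, otherwise fail fast with a clear error.
function resolveOpenAIConfig(env) {
  if (env.MALL_SCRAPER_PROXY_URL && env.MALL_SCRAPER_TOKEN) {
    return { mode: 'proxy', baseURL: env.MALL_SCRAPER_PROXY_URL, apiKey: env.MALL_SCRAPER_TOKEN };
  }
  if (env.OPENAI_API_KEY) {
    return { mode: 'direct', baseURL: 'https://api.openai.com/v1', apiKey: env.OPENAI_API_KEY };
  }
  throw new Error(
    'No OpenAI access configured: set MALL_SCRAPER_PROXY_URL + MALL_SCRAPER_TOKEN, or OPENAI_API_KEY.'
  );
}

console.log(resolveOpenAIConfig({ OPENAI_API_KEY: 'sk-test' }).mode); // prints "direct"
```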
199
+
200
+ ---
201
+
202
+ ## Troubleshooting
203
+
204
+ **Logo URL returns HTML in coworker's CMS:**
205
+ Ask Claude to run `validate_image_url` on the failing URL. Confirms whether
206
+ the URL itself returns a real image. If it does, the issue is on the CMS
207
+ side (the shopcurrents-style empty `property_manager_id` case is a known
208
+ example).
209
+
210
+ **Coworker gets "unauthorized" from the Worker:**
211
+ Their `MALL_SCRAPER_TOKEN` doesn't match the current `SHARED_SECRET`. Either
212
+ update the token on their side, or run `npx wrangler secret put SHARED_SECRET` to match.
213
+
214
+ **First scrape takes 2-3 minutes:**
215
+ Puppeteer is downloading Chromium on first run (~170 MB). Subsequent scrapes
216
+ are normal speed.
217
+
218
+ **`npx mallmaverick-store-scraper` not found:**
219
+ They need Node 18+ in PATH. `node --version` to check.
220
+
221
+ ---
222
+
223
+ ## License
224
+
225
+ MIT.
package/package.json ADDED
@@ -0,0 +1,41 @@
1
+ {
2
+ "name": "mallmaverick-store-scraper",
3
+ "version": "0.1.0",
4
+ "description": "MCP server + CLI for scraping shopping mall store directories. Hours-first layered pipeline + image classification.",
5
+ "main": "src/main.js",
6
+ "type": "commonjs",
7
+ "bin": {
8
+ "mallmaverick-store-scraper": "src/mcp-server.js"
9
+ },
10
+ "scripts": {
11
+ "start": "node src/main.js",
12
+ "start:mcp": "node src/mcp-server.js",
13
+ "test:hours": "node test/hoursParser.test.js"
14
+ },
15
+ "files": [
16
+ "src/**/*.js",
17
+ "README.md",
18
+ "LICENSE"
19
+ ],
20
+ "dependencies": {
21
+ "@modelcontextprotocol/sdk": "^1.0.0",
22
+ "openai": "^4.52.0",
23
+ "puppeteer": "^24.15.0",
24
+ "dotenv": "^16.4.5",
25
+ "readline-sync": "^1.4.10",
26
+ "p-limit": "^3.1.0"
27
+ },
28
+ "engines": {
29
+ "node": ">=18.0.0"
30
+ },
31
+ "keywords": [
32
+ "mcp",
33
+ "claude",
34
+ "scraper",
35
+ "shopping-mall",
36
+ "store-directory",
37
+ "puppeteer"
38
+ ],
39
+ "author": "",
40
+ "license": "MIT"
41
+ }
package/src/brandSiteFallback.js ADDED
@@ -0,0 +1,272 @@
1
+ 'use strict';
2
+
3
+ /**
4
+ * Brand-site logo fallback.
5
+ *
6
+ * When the mall-hosted logo is missing or in a problematic format (e.g. GIF
7
+ * that the downstream CMS can't ingest), and the store has a known website
8
+ * field, load the brand homepage and try to extract a logo from there.
9
+ *
10
+ * Priority order:
11
+ * 1. JSON-LD `logo` property (Organization / LocalBusiness / WebSite schema)
12
+ * 2. <link rel="icon" sizes="..."> with the largest size — modern brand
13
+ * sites publish a high-res icon (192x192+) that's a usable logo
14
+ * 3. <img> with class/id/parent containing "logo"
15
+ * 4. og:image if it looks logo-shaped (square-ish, < 1000px)
16
+ *
17
+ * GIFs are skipped at every stage — the whole point of the fallback is to
18
+ * find a non-GIF.
19
+ *
20
+ * Returns { url, source } or null.
21
+ */
22
+
23
+ const { URL } = require('url');
24
+ const http = require('http');
25
+ const https = require('https');
26
+ const { loadPage, newPage } = require('./browser');
27
+
28
+ const TIMEOUT_MS = 15000;
29
+
30
+ /**
31
+ * HEAD-validate (following redirects) that a URL actually serves an image.
32
+ * Many brand sites have hardcoded `<link rel="apple-touch-icon" href="/foo.png">`
33
+ * pointing at files that don't exist — the server returns HTML 200 instead of 404.
34
+ *
35
+ * Returns the validated URL (possibly after redirect) or null.
36
+ */
37
+ function validateImageUrl(url, { timeout = 8000, allowGif = false } = {}) {
38
+ return new Promise((resolve) => {
39
+ let finalUrl = url;
40
+ let redirectsLeft = 3;
41
+ const attempt = (u) => {
42
+ const mod = u.startsWith('https') ? https : http;
43
+ const req = mod.request(u, {
44
+ method: 'HEAD',
45
+ timeout,
46
+ headers: {
47
+ 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
48
+ 'Accept': 'image/*,*/*;q=0.5',
49
+ },
50
+ }, (res) => {
51
+ // Follow redirects
52
+ if (res.statusCode >= 300 && res.statusCode < 400 && res.headers.location && redirectsLeft > 0) {
53
+ redirectsLeft--;
54
+ try {
55
+ const next = new URL(res.headers.location, u).toString();
56
+ finalUrl = next;
57
+ return attempt(next);
58
+ } catch { return resolve(null); }
59
+ }
60
+ if (res.statusCode !== 200) return resolve(null);
61
+ const ct = (res.headers['content-type'] || '').toLowerCase();
62
+ if (!ct.startsWith('image/')) return resolve(null);
63
+ if (!allowGif && ct.includes('gif')) return resolve(null);
64
+ // Content-length sanity (filter tiny tracking pixels)
65
+ const cl = parseInt(res.headers['content-length'] || '0', 10);
66
+ if (cl > 0 && cl < 200) return resolve(null);
67
+ resolve(finalUrl);
68
+ });
69
+ req.on('error', () => resolve(null));
70
+ req.on('timeout', () => { req.destroy(); resolve(null); });
71
+ req.end();
72
+ };
73
+ attempt(url);
74
+ });
75
+ }
76
+
77
+ async function fetchBrandLogo(browser, websiteUrl, storeName, { logger } = {}) {
78
+ if (!websiteUrl) return null;
79
+ let normalized;
80
+ // Tolerate scheme-less website values (e.g. "example.com") occasionally found upstream.
+ const withScheme = /^https?:\/\//i.test(websiteUrl) ? websiteUrl : `https://${websiteUrl}`;
+ try { normalized = new URL(withScheme).origin + '/'; } catch { return null; }
81
+
82
+ const page = await newPage(browser);
83
+ try {
84
+ let data;
85
+ try {
86
+ data = await loadPage(page, normalized, { waitUntil: 'domcontentloaded', timeout: TIMEOUT_MS });
87
+ } catch (err) {
88
+ if (logger) logger.warn(` ⚠ brand-site fetch failed (${normalized}): ${err.message}`);
89
+ return null;
90
+ }
91
+
92
+ const origin = normalized.replace(/\/+$/, '');
93
+
94
+ // Try each layer's candidates and return the first one that VALIDATES
95
+ // (HEAD request returns image/* content-type with HTTP 200).
96
+
97
+ // --- Layer 1: JSON-LD logo / image ---
98
+ const ldLogo = findJsonLdLogo(data.jsonLd, origin);
99
+ if (ldLogo && !isGif(ldLogo)) {
100
+ const valid = await validateImageUrl(ldLogo);
101
+ if (valid) return { url: valid, source: 'brand-jsonld' };
102
+ if (logger) logger.warn(` ⚠ brand JSON-LD logo failed validation: ${ldLogo}`);
103
+ }
104
+
105
+ // --- Layer 2: ranked <link rel="icon"> candidates ---
106
+ const iconCandidates = await page.evaluate(() => {
107
+ const links = Array.from(document.querySelectorAll('link[rel~="icon" i], link[rel~="apple-touch-icon" i], link[rel="mask-icon"]'));
108
+ return links.map(l => ({
109
+ href: l.href,
110
+ sizes: l.getAttribute('sizes') || '',
111
+ type: l.getAttribute('type') || '',
112
+ rel: l.getAttribute('rel') || '',
113
+ })).filter(o => o.href);
114
+ });
115
+ const rankedIcons = rankIcons(iconCandidates, origin);
116
+ for (const icon of rankedIcons) {
117
+ if (isGif(icon)) continue;
118
+ const valid = await validateImageUrl(icon);
119
+ if (valid) return { url: valid, source: 'brand-icon' };
120
+ }
121
+
122
+ // --- Layer 3: img with class/id/parent containing "logo" ---
123
+ const cleanName = (storeName || '').toLowerCase().replace(/[^a-z0-9]/g, '');
124
+ const imgLogo = await page.evaluate(({ origin, cleanName }) => {
125
+ const resolve = (src) => {
126
+ if (!src) return null;
127
+ if (src.startsWith('data:')) return null;
128
+ if (src.startsWith('http')) return src;
129
+ if (src.startsWith('//')) return 'https:' + src;
130
+ if (src.startsWith('/')) return origin + src;
131
+ return null;
132
+ };
133
+ const SKIP_FILE = /(banner|hero|cover|background|placeholder)/i;
134
+ const out = [];
135
+ document.querySelectorAll('img').forEach(img => {
136
+ const src = img.currentSrc || img.src || img.getAttribute('data-src') || img.getAttribute('data-lazy-src');
137
+ const url = resolve(src);
138
+ if (!url) return;
139
+ if (/\.gif(\?|$)/i.test(url)) return; // skip GIFs
140
+ const w = img.naturalWidth || 0;
141
+ const h = img.naturalHeight || 0;
142
+ if (w < 32 || h < 32) return;
143
+ const file = (() => { try { return new URL(url).pathname.split('/').pop().toLowerCase(); } catch { return ''; } })();
144
+ const cls = (img.className || '').toLowerCase();
145
+ const id = (img.id || '').toLowerCase();
146
+ const parent = img.parentElement;
147
+ const parentCls = parent ? (parent.className || '').toLowerCase() : '';
148
+ const alt = (img.getAttribute('alt') || '').toLowerCase();
149
+
150
+ let score = 0;
151
+ if (/\blogo\b/.test(cls)) score += 80;
152
+ if (/\blogo\b/.test(parentCls)) score += 60;
153
+ if (/\blogo\b/.test(id)) score += 60;
154
+ if (/logo/.test(file)) score += 50;
155
+ if (cleanName.length >= 4 && file.includes(cleanName.slice(0, 4))) score += 40;
156
+ if (cleanName.length >= 4 && alt.replace(/[^a-z0-9]/g, '').includes(cleanName.slice(0, 4))) score += 30;
157
+ // Square-ish aspect bonus
158
+ const aspect = h ? w / h : 1;
159
+ if (aspect >= 0.5 && aspect <= 2.0) score += 15;
160
+ // Reasonable logo size
161
+ if (w >= 60 && w <= 600 && h >= 60 && h <= 600) score += 15;
162
+ if (SKIP_FILE.test(file)) score -= 80;
163
+ // Header position bonus
164
+ const rect = img.getBoundingClientRect();
165
+ if (rect.top < 200) score += 10;
166
+ if (score > 0) out.push({ url, score, file, w, h });
167
+ });
168
+ return out.sort((a, b) => b.score - a.score).slice(0, 3);
169
+ }, { origin, cleanName });
170
+
171
+ for (const cand of (imgLogo || [])) {
172
+ if (cand.score < 50) break;
173
+ const valid = await validateImageUrl(cand.url);
174
+ if (valid) return { url: valid, source: 'brand-img' };
175
+ }
176
+
177
+ // --- Layer 4: og:image if it validates as an image ---
178
+ if (data.metaTags && data.metaTags['og:image']) {
179
+ const ogUrl = data.metaTags['og:image'];
180
+ if (!isGif(ogUrl)) {
181
+ const valid = await validateImageUrl(ogUrl);
182
+ if (valid) return { url: valid, source: 'brand-og' };
183
+ }
184
+ }
185
+
186
+ return null;
187
+ } finally {
188
+ try { await page.close(); } catch (_) {}
189
+ }
190
+ }
191
+
192
+ function isGif(url) {
193
+ return /\.gif(\?|$)/i.test(url || '');
194
+ }
195
+
196
+ function findJsonLdLogo(jsonLd, origin) {
197
+ if (!jsonLd) return null;
198
+ const arr = Array.isArray(jsonLd) ? jsonLd : [jsonLd];
199
+ const visit = (node) => {
200
+ if (!node || typeof node !== 'object') return null;
201
+ if (Array.isArray(node)) {
202
+ for (const n of node) { const r = visit(n); if (r) return r; }
203
+ return null;
204
+ }
205
+ // Direct logo property (Schema.org Organization)
206
+ if (node.logo) {
207
+ if (typeof node.logo === 'string') return resolveLogoUrl(node.logo, origin);
208
+ if (node.logo.url) return resolveLogoUrl(node.logo.url, origin);
209
+ }
210
+ // image fallback only when @type is Organization or similar
211
+ if (node['@type'] && /Organization|LocalBusiness|Restaurant|Store/i.test(String(node['@type']))) {
212
+ if (node.image) {
213
+ if (typeof node.image === 'string') return resolveLogoUrl(node.image, origin);
214
+ if (Array.isArray(node.image) && node.image[0]) {
215
+ return resolveLogoUrl(typeof node.image[0] === 'string' ? node.image[0] : node.image[0].url, origin);
216
+ }
217
+ if (node.image.url) return resolveLogoUrl(node.image.url, origin);
218
+ }
219
+ }
220
+ if (node['@graph']) { const r = visit(node['@graph']); if (r) return r; }
221
+ for (const k of Object.keys(node)) {
222
+ if (k.startsWith('@')) continue;
223
+ const v = node[k];
224
+ if (v && typeof v === 'object') { const r = visit(v); if (r) return r; }
225
+ }
226
+ return null;
227
+ };
228
+ for (const node of arr) { const r = visit(node); if (r) return r; }
229
+ return null;
230
+ }
231
+
232
+ function resolveLogoUrl(raw, origin) {
233
+ if (!raw) return null;
234
+ if (raw.startsWith('http')) return raw;
235
+ if (raw.startsWith('//')) return 'https:' + raw;
236
+ if (raw.startsWith('/')) return origin + raw;
237
+ return null;
238
+ }
239
+
240
+ function rankIcons(icons, origin) {
241
+ if (!icons || icons.length === 0) return [];
242
+ const score = (i) => {
243
+ let s = 0;
244
+ if (/\.ico(\?|$)/i.test(i.href) || i.type === 'image/x-icon') s -= 30;
245
+ if (i.rel && /apple-touch-icon/i.test(i.rel)) s += 30;
246
+ if (i.sizes) {
247
+ const m = String(i.sizes).match(/(\d{2,4})x\d{2,4}/);
248
+ if (m) {
249
+ const px = parseInt(m[1], 10);
250
+ if (px >= 192) s += 50;
251
+ else if (px >= 96) s += 30;
252
+ else if (px >= 64) s += 10;
253
+ else s -= 20;
254
+ }
255
+ }
256
+ if (/\.svg(\?|$)/i.test(i.href)) s += 20;
257
+ if (/\.png(\?|$)/i.test(i.href)) s += 15;
258
+ return s;
259
+ };
260
+ return icons
261
+ .map(i => ({ ...i, _s: score(i) }))
262
+ .filter(i => i._s > 0)
263
+ .sort((a, b) => b._s - a._s)
264
+ .map(i => {
265
+ if (i.href.startsWith('http')) return i.href;
266
+ if (i.href.startsWith('/')) return origin + i.href;
267
+ return null;
268
+ })
269
+ .filter(Boolean);
270
+ }
271
+
272
+ module.exports = { fetchBrandLogo };