@imenam/simple-scraper 1.0.6 → 1.0.8

package/README.md CHANGED
@@ -1,186 +1,335 @@
1
- # simple-scraper-mcp
2
-
3
- A Model Context Protocol (MCP) server for web scraping and JavaScript execution using a headless browser (Puppeteer). Includes an optional GUI for cookie management.
4
-
5
- ## Features
6
-
7
- - **`scrape_page`** — Navigate to a URL and return the full rendered HTML of the page
8
- - **`execute_js`** — Navigate to a URL and execute custom JavaScript in the page context
9
- - **`get_page_inputs`** — Extract all form inputs from a page as a structured JSON object
10
- - **`get_show_page`** — Parse a detail/show page and extract key-value blocks and tables as structured JSON
11
- - **Cookie support** — Load Netscape-format cookie files automatically before each request
12
- - **Optional GUI** Cookie manager interface, available when integrated with the [MCP proxy](https://www.npmjs.com/package/@imenam/mcp-http-gateway)
13
-
14
- ## Requirements
15
-
16
- - Node.js >= 18
17
- - Puppeteer will automatically download a compatible Chromium browser (~300 MB) on first install
18
-
19
- ## Installation
20
-
21
- ```bash
22
- npx @imenam/simple-scraper
23
- ```
24
-
25
- Or install globally:
26
-
27
- ```bash
28
- npm install -g @imenam/simple-scraper
29
- simple-scraper
30
- ```
31
-
32
- ## Environment Variables
33
-
34
- | Variable | Required | Default | Description |
35
- |----------|----------|---------|-------------|
36
- | `PUPPETEER_HEADLESS` | No | `true` | Run Chromium in headless mode. Set to `false` to display the browser window. |
37
- | `PUPPETEER_TIMEOUT` | No | `30000` | Default timeout in milliseconds for page navigation and waits. |
38
- | `COOKIES_DIR` | No | | Absolute path to a folder containing Netscape-format `.txt` cookie files. All files are loaded and merged automatically before each request. |
39
- | `PROXY_URL` | No | | Base URL of the [MCP HTTP Gateway](https://www.npmjs.com/package/@imenam/mcp-http-gateway). Required to enable the GUI. |
40
- | `PROXY_APP_PATH` | No | `/simple-scraper-mcp` | URL path under which the GUI is registered on the proxy. |
41
- | `PROXY_APP_NAME` | No | `Simple Scraper MCP` | Display name shown in the proxy's app list. |
42
-
43
- ## Configuration
44
-
45
- Copy `.env.example` to `.env` and configure the variables:
46
-
47
- ```env
48
- # Puppeteer options (optional)
49
- PUPPETEER_HEADLESS=true
50
- PUPPETEER_TIMEOUT=30000
51
-
52
- # Optional: path to a folder containing Netscape-format cookie files (.txt)
53
- # All files in this folder will be loaded automatically before each request.
54
- # COOKIES_DIR=/path/to/cookies
55
-
56
- # GUI (optional) — required to enable the cookie manager interface
57
- # PROXY_URL=http://localhost:3000
58
- # PROXY_APP_PATH=/simple-scraper-mcp
59
- # PROXY_APP_NAME=Simple Scraper MCP
60
- ```
61
-
62
- ## Usage with Claude Desktop
63
-
64
- Add the following to your `claude_desktop_config.json`. Full example with all available options:
65
-
66
- ```json
67
- {
68
- "mcpServers": {
69
- "simple-scraper": {
70
- "command": "npx",
71
- "args": ["@imenam/simple-scraper"],
72
- "env": {
73
- "PUPPETEER_HEADLESS": "true",
74
- "PUPPETEER_TIMEOUT": "30000",
75
- "COOKIES_DIR": "/path/to/your/cookies",
76
- "MCP_LOG_DIR": "/path/to/your/logs",
77
- "PROXY_URL": "http://localhost:4500",
78
- "PROXY_APP_PATH": "/simple-scraper",
79
- "PROXY_APP_NAME": "Simple Scraper"
80
- }
81
- }
82
- }
83
- }
84
- ```
85
-
86
- To load cookies automatically, add `COOKIES_DIR` pointing to a folder containing `.txt` files in [Netscape cookie format](http://www.cookiecentral.com/faq/#3.5):
87
-
88
- ```json
89
- {
90
- "mcpServers": {
91
- "simple-scraper": {
92
- "command": "npx",
93
- "args": ["@imenam/simple-scraper"],
94
- "env": {
95
- "PUPPETEER_HEADLESS": "true",
96
- "COOKIES_DIR": "/path/to/your/cookies"
97
- }
98
- }
99
- }
100
- }
101
- ```
102
-
103
- ## Usage with Cursor
104
-
105
- In Cursor, MCP servers are configured in `.cursor/mcp.json`. You can pass environment variables directly in the config. Full example with all available options:
106
-
107
- ```json
108
- {
109
- "mcpServers": {
110
- "simple-scraper": {
111
- "command": "npx",
112
- "args": ["-y", "@imenam/simple-scraper"],
113
- "env": {
114
- "PUPPETEER_HEADLESS": "true",
115
- "PUPPETEER_TIMEOUT": "30000",
116
- "COOKIES_DIR": "/path/to/your/cookies",
117
- "PROXY_URL": "http://localhost:4500",
118
- "PROXY_APP_PATH": "/simple-scraper",
119
- "PROXY_APP_NAME": "Simple Scraper"
120
- }
121
- }
122
- }
123
- }
124
- ```
125
-
126
- > **Note:** The `-y` flag in `args` avoids the interactive confirmation prompt when using `npx`.
127
-
128
- ## MCP Tools
129
-
130
- ### `scrape_page`
131
-
132
- Navigate to a URL and return the full rendered HTML.
133
-
134
- | Parameter | Type | Required | Description |
135
- |-----------|------|----------|-------------|
136
- | `url` | string | ✅ | URL of the page to scrape |
137
- | `wait_for` | string | | CSS selector to wait for before capturing HTML |
138
- | `timeout` | number | | Timeout in ms (default: 30000) |
139
-
140
- ### `execute_js`
141
-
142
- Navigate to a URL and execute custom JavaScript in the page context.
143
-
144
- | Parameter | Type | Required | Description |
145
- |-----------|------|----------|-------------|
146
- | `url` | string | ✅ | URL of the page |
147
- | `script` | string | ✅ | JavaScript code to execute |
148
- | `wait_for` | string | | CSS selector to wait for before executing |
149
- | `timeout` | number | | Timeout in ms (default: 30000) |
150
-
151
- ### `get_page_inputs`
152
-
153
- Extract all form inputs from a page as a structured JSON object.
154
-
155
- | Parameter | Type | Required | Description |
156
- |-----------|------|----------|-------------|
157
- | `url` | string | ✅ | URL of the page |
158
- | `selector` | string | | CSS selector to scope the search (e.g. `#my-form`) |
159
- | `wait_for` | string | | CSS selector to wait for before extracting |
160
- | `show_hidden` | boolean | | Include `input[type=hidden]` fields (default: false) |
161
- | `timeout` | number | | Timeout in ms (default: 30000) |
162
-
163
- ### `get_show_page`
164
-
165
- Parse a detail page and extract key-value blocks and tables as structured JSON.
166
-
167
- | Parameter | Type | Required | Description |
168
- |-----------|------|----------|-------------|
169
- | `url` | string | ✅ | URL of the page |
170
- | `keys_map` | object | | Map of HTML label → JS key for field name translation |
171
- | `box_selector` | string | | CSS selector for section containers (default: `.box.box-primary`) |
172
- | `tables_max_items` | number | | Max rows per table (default: 2) |
173
- | `wait_for` | string | | CSS selector to wait for before extraction |
174
- | `timeout` | number | | Timeout in ms (default: 30000) |
175
-
176
- ## Cookie Files
177
-
178
- Cookies are loaded from `.txt` files in [Netscape format](http://www.cookiecentral.com/faq/#3.5). Place them in the folder specified by `COOKIES_DIR`. All files in the folder are loaded and merged automatically before each request.
179
-
180
- ## GUI (Optional)
181
-
182
- When `PROXY_URL` is set, a cookie manager web interface is registered with the MCP proxy. It allows you to list, upload, and delete cookie files through a browser UI.
183
-
184
- ## License
185
-
186
- ISC
1
+ # simple-scraper-mcp
2
+
3
+ A Model Context Protocol (MCP) server for web scraping and JavaScript execution using a headless browser (Puppeteer). Includes an optional GUI for cookie management.
4
+
5
+ ## Features
6
+
7
+ - **`scrape_page`** — Navigate to a URL and return the full rendered HTML of the page
8
+ - **`execute_js`** — Navigate to a URL and execute custom JavaScript in the page context
9
+ - **`get_page_inputs`** — Extract all form inputs from a page as a structured JSON object
10
+ - **`get_show_page`** — Parse a detail/show page and extract key-value blocks and tables as structured JSON
11
+ - **Interactive sessions** — Keep a browser page alive across multiple tool calls to navigate, click, type, run JS, and screenshot the page in any desired state
12
+ - **`screenshot`** — Capture a screenshot of any page or active session, with inline or file output
13
+ - **Cookie support** — Load Netscape-format cookie files automatically before each request
14
+ - **Optional GUI** — Cookie manager interface, available when integrated with the [MCP proxy](https://www.npmjs.com/package/@imenam/mcp-http-gateway)
15
+
16
+ ## Requirements
17
+
18
+ - Node.js >= 18
19
+ - Puppeteer will automatically download a compatible Chromium browser (~300 MB) on first install
20
+
21
+ ## Installation
22
+
23
+ ```bash
24
+ npx @imenam/simple-scraper
25
+ ```
26
+
27
+ Or install globally:
28
+
29
+ ```bash
30
+ npm install -g @imenam/simple-scraper
31
+ simple-scraper
32
+ ```
33
+
34
+ ## Environment Variables
35
+
36
+ | Variable | Required | Default | Description |
37
+ |----------|----------|---------|-------------|
38
+ | `PUPPETEER_HEADLESS` | No | `true` | Run Chromium in headless mode. Set to `false` to display the browser window. |
39
+ | `PUPPETEER_TIMEOUT` | No | `30000` | Default timeout in milliseconds for page navigation and waits. |
40
+ | `COOKIES_DIR` | No | - | Absolute path to a folder containing Netscape-format `.txt` cookie files. All files are loaded and merged automatically before each request. |
41
+ | `MCP_LOG_DIR` | No | `.mcp-gui/logs` | Absolute path to the directory where log files are written. |
42
+ | `PROXY_URL` | No | - | Base URL of the [MCP HTTP Gateway](https://www.npmjs.com/package/@imenam/mcp-http-gateway). Required to enable the GUI. |
43
+ | `PROXY_APP_PATH` | No | `/simple-scraper-mcp` | URL path under which the GUI is registered on the proxy. |
44
+ | `PROXY_APP_NAME` | No | `Simple Scraper MCP` | Display name shown in the proxy's app list. |
45
+ | `SCRAPER_MAX_SESSIONS` | No | `5` | Maximum number of concurrent interactive sessions. |
46
+ | `SCRAPER_SESSION_TTL_MS` | No | `600000` | Inactivity TTL for sessions in milliseconds (default: 10 minutes). Sessions unused beyond this duration are closed automatically. |
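
The two `SCRAPER_*` variables describe a session cap and an inactivity timeout. A minimal sketch of how such a store could behave follows. It is an illustration only, not the package's implementation: `SessionStore`, `open`, `touch`, `sweep`, and the injectable `now` clock are hypothetical names, and only the defaults (`5` and `600000`) come from the table.

```javascript
// Hypothetical sketch of a capped session store with inactivity eviction.
// None of these names come from the package; only the two defaults do.
class SessionStore {
  constructor({ maxSessions = 5, ttlMs = 600000, now = Date.now } = {}) {
    this.maxSessions = maxSessions;
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, so the TTL can be exercised in tests
    this.sessions = new Map(); // session id -> { lastUsed }
    this.nextId = 1;
  }

  // Register a new session, refusing once the cap is reached.
  open() {
    this.sweep();
    if (this.sessions.size >= this.maxSessions) {
      throw new Error(`Session limit reached (${this.maxSessions})`);
    }
    const id = `s${this.nextId++}`;
    this.sessions.set(id, { lastUsed: this.now() });
    return id;
  }

  // Record activity so the inactivity timer restarts.
  touch(id) {
    const session = this.sessions.get(id);
    if (!session) throw new Error(`Unknown session: ${id}`);
    session.lastUsed = this.now();
  }

  // Drop every session idle for longer than the TTL; returns the evicted ids.
  sweep() {
    const evicted = [];
    for (const [id, session] of this.sessions) {
      if (this.now() - session.lastUsed > this.ttlMs) {
        this.sessions.delete(id);
        evicted.push(id);
      }
    }
    return evicted;
  }
}
```

In this sketch a session that is touched stays alive, while one left idle past the TTL is removed on the next `sweep()` and frees a slot under the cap.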
47
+
48
+ ## Configuration
49
+
50
+ Copy `.env.example` to `.env` and configure the variables:
51
+
52
+ ```env
53
+ # Puppeteer options (optional)
54
+ PUPPETEER_HEADLESS=true
55
+ PUPPETEER_TIMEOUT=30000
56
+
57
+ # Optional: path to a folder containing Netscape-format cookie files (.txt)
58
+ # All files in this folder will be loaded automatically before each request.
59
+ # COOKIES_DIR=/path/to/cookies
60
+
61
+ # GUI (optional) — required to enable the cookie manager interface
62
+ # PROXY_URL=http://localhost:3000
63
+ # PROXY_APP_PATH=/simple-scraper-mcp
64
+ # PROXY_APP_NAME=Simple Scraper MCP
65
+ ```
66
+
67
+ ## Usage with Claude Desktop
68
+
69
+ Add the following to your `claude_desktop_config.json`. Full example with all available options:
70
+
71
+ ```json
72
+ {
73
+ "mcpServers": {
74
+ "simple-scraper": {
75
+ "command": "npx",
76
+ "args": ["@imenam/simple-scraper"],
77
+ "env": {
78
+ "PUPPETEER_HEADLESS": "true",
79
+ "PUPPETEER_TIMEOUT": "30000",
80
+ "COOKIES_DIR": "/path/to/your/cookies",
81
+ "MCP_LOG_DIR": "/path/to/your/logs",
82
+ "PROXY_URL": "http://localhost:4500",
83
+ "PROXY_APP_PATH": "/simple-scraper",
84
+ "PROXY_APP_NAME": "Simple Scraper"
85
+ }
86
+ }
87
+ }
88
+ }
89
+ ```
90
+
91
+ To load cookies automatically, add `COOKIES_DIR` pointing to a folder containing `.txt` files in [Netscape cookie format](http://www.cookiecentral.com/faq/#3.5):
92
+
93
+ ```json
94
+ {
95
+ "mcpServers": {
96
+ "simple-scraper": {
97
+ "command": "npx",
98
+ "args": ["@imenam/simple-scraper"],
99
+ "env": {
100
+ "PUPPETEER_HEADLESS": "true",
101
+ "COOKIES_DIR": "/path/to/your/cookies"
102
+ }
103
+ }
104
+ }
105
+ }
106
+ ```
107
+
108
+ ## Usage with Cursor
109
+
110
+ In Cursor, MCP servers are configured in `.cursor/mcp.json`. You can pass environment variables directly in the config. Full example with all available options:
111
+
112
+ ```json
113
+ {
114
+ "mcpServers": {
115
+ "simple-scraper": {
116
+ "command": "npx",
117
+ "args": ["-y", "@imenam/simple-scraper"],
118
+ "env": {
119
+ "PUPPETEER_HEADLESS": "true",
120
+ "PUPPETEER_TIMEOUT": "30000",
121
+ "COOKIES_DIR": "/path/to/your/cookies",
122
+ "MCP_LOG_DIR": "/path/to/your/logs",
123
+ "PROXY_URL": "http://localhost:4500",
124
+ "PROXY_APP_PATH": "/simple-scraper",
125
+ "PROXY_APP_NAME": "Simple Scraper"
126
+ }
127
+ }
128
+ }
129
+ }
130
+ ```
131
+
132
+ > **Note:** The `-y` flag in `args` avoids the interactive confirmation prompt when using `npx`.
133
+
134
+ ## MCP Tools
135
+
136
+ ### `scrape_page`
137
+
138
+ Navigate to a URL and return the full rendered HTML.
139
+
140
+ | Parameter | Type | Required | Description |
141
+ |-----------|------|----------|-------------|
142
+ | `url` | string | ✅ | URL of the page to scrape |
143
+ | `wait_for` | string | | CSS selector to wait for before capturing HTML |
144
+ | `timeout` | number | | Timeout in ms (default: 30000) |
145
+
146
+ ### `execute_js`
147
+
148
+ Navigate to a URL and execute custom JavaScript in the page context.
149
+
150
+ The `script` parameter is executed as the body of a JavaScript function in the browser page context, equivalent to:
151
+
152
+ ```js
153
+ new Function(script)();
154
+ ```
155
+
156
+ To return data from the tool, the script must use an explicit `return`. A bare expression such as `document.title` will evaluate but the tool will receive `undefined`.
157
+
158
+ Example:
159
+
160
+ ```js
161
+ return {
162
+ title: document.title,
163
+ url: window.location.href,
164
+ text: document.body.innerText.slice(0, 500)
165
+ };
166
+ ```
167
+
168
+ For asynchronous work, return a promise, for example with an async IIFE:
169
+
170
+ ```js
171
+ return (async () => {
172
+ const response = await fetch('/api/data');
173
+ return await response.json();
174
+ })();
175
+ ```
176
+
177
+ Returned objects and arrays are serialized as formatted JSON. Primitive values are returned as text.
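
The function-body convention described above can be tried outside the browser. The sketch below runs in plain Node.js and uses arithmetic instead of DOM access. `formatResult` is a hypothetical helper, and the two-space indentation is an assumption about what "formatted JSON" means here.

```javascript
// Format a result as the README describes: objects and arrays as
// formatted JSON, primitives as text. `formatResult` is a hypothetical name,
// and the two-space indent is an assumption.
function formatResult(value) {
  return value !== null && typeof value === 'object'
    ? JSON.stringify(value, null, 2)
    : String(value);
}

// A bare expression is evaluated, but the function body returns nothing.
const bare = new Function('1 + 1')();

// An explicit `return` hands the value back to the caller.
const returned = new Function('return { sum: 1 + 1 };')();

// Async work: the body returns a promise that the caller can await.
const pending = new Function('return (async () => 42)();')();

console.log(bare);                       // undefined
console.log(formatResult(returned));     // the object as indented JSON
console.log(pending instanceof Promise); // true
```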
178
+
179
+ | Parameter | Type | Required | Description |
180
+ |-----------|------|----------|-------------|
181
+ | `url` | string | ✅ | URL of the page |
182
+ | `script` | string | ✅ | JavaScript function body to execute in the page context. Use `return` to send a result back to the tool. |
183
+ | `wait_for` | string | | CSS selector to wait for before executing |
184
+ | `timeout` | number | | Timeout in ms (default: 30000) |
185
+
186
+ ### `get_page_inputs`
187
+
188
+ Extract all form inputs from a page as a structured JSON object.
189
+
190
+ | Parameter | Type | Required | Description |
191
+ |-----------|------|----------|-------------|
192
+ | `url` | string | ✅ | URL of the page |
193
+ | `selector` | string | | CSS selector to scope the search (e.g. `#my-form`) |
194
+ | `wait_for` | string | | CSS selector to wait for before extracting |
195
+ | `show_hidden` | boolean | | Include `input[type=hidden]` fields (default: false) |
196
+ | `timeout` | number | | Timeout in ms (default: 30000) |
197
+
198
+ ### `get_show_page`
199
+
200
+ Parse a detail page and extract key-value blocks and tables as structured JSON.
201
+
202
+ | Parameter | Type | Required | Description |
203
+ |-----------|------|----------|-------------|
204
+ | `url` | string | ✅ | URL of the page |
205
+ | `keys_map` | object | | Map of HTML label → JS key for field name translation |
206
+ | `box_selector` | string | | CSS selector for section containers (default: `.box.box-primary`) |
207
+ | `tables_max_items` | number | | Max rows per table (default: 2) |
208
+ | `wait_for` | string | | CSS selector to wait for before extraction |
209
+ | `timeout` | number | | Timeout in ms (default: 30000) |
210
+
211
+ ---
212
+
213
+ ## Interactive Sessions
214
+
215
+ Interactive sessions let you keep a browser page alive across multiple tool calls, so you can bring the page into the exact state you need before extracting data or taking a screenshot.
216
+
217
+ ### Typical workflow
218
+
219
+ ```
220
+ open_session → session_click / session_type / session_evaluate → screenshot / session_html → close_session
221
+ ```
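
As an illustration of the workflow above, the sketch below lists one possible sequence of tool calls, shaped like the `name` and `arguments` fields of an MCP `tools/call` request. The URL, selectors, and typed text are placeholders, and `buildFollowUpCalls` is a hypothetical helper. Only the tool names and parameter names come from the README.

```javascript
// Hypothetical flow: the URL, selectors, and typed text are placeholders.
const openCall = {
  name: 'open_session',
  arguments: { url: 'https://example.com/login', wait_for: '#login-form' },
};

// Build the follow-up calls once `open_session` has returned a session_id.
function buildFollowUpCalls(sessionId) {
  return [
    {
      name: 'session_type',
      arguments: { session_id: sessionId, selector: '#username', text: 'demo', clear: true },
    },
    {
      name: 'session_click',
      arguments: { session_id: sessionId, selector: 'button[type=submit]' },
    },
    {
      name: 'session_wait_for',
      arguments: { session_id: sessionId, selector: '.dashboard' },
    },
    {
      name: 'screenshot',
      arguments: { session_id: sessionId, output: 'inline', full_page: true },
    },
    // Always close the session at the end to free its resources.
    { name: 'close_session', arguments: { session_id: sessionId } },
  ];
}
```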
222
+
223
+ ### `open_session`
224
+
225
+ Open a persistent browser session. Returns a `session_id` to use with all `session_*` tools and `screenshot`.
226
+
227
+ | Parameter | Type | Required | Description |
228
+ |-----------|------|----------|-------------|
229
+ | `url` | string | ✅ | URL to navigate to |
230
+ | `wait_for` | string | | CSS selector to wait for before the session is considered ready |
231
+ | `timeout` | number | | Timeout in ms (default: 30000) |
232
+
233
+ ### `close_session`
234
+
235
+ Close a session and free its resources. Always call this when you are done.
236
+
237
+ | Parameter | Type | Required | Description |
238
+ |-----------|------|----------|-------------|
239
+ | `session_id` | string | ✅ | Session ID returned by `open_session` |
240
+
241
+ ### `list_sessions`
242
+
243
+ List all currently active sessions with their IDs and timestamps. No parameters.
244
+
245
+ ### `session_goto`
246
+
247
+ Navigate the session to a new URL without closing it.
248
+
249
+ | Parameter | Type | Required | Description |
250
+ |-----------|------|----------|-------------|
251
+ | `session_id` | string | ✅ | Session ID |
252
+ | `url` | string | ✅ | URL to navigate to |
253
+ | `wait_for` | string | | CSS selector to wait for after navigation |
254
+ | `timeout` | number | | Timeout in ms (default: 30000) |
255
+
256
+ ### `session_click`
257
+
258
+ Click an element in the session page.
259
+
260
+ | Parameter | Type | Required | Description |
261
+ |-----------|------|----------|-------------|
262
+ | `session_id` | string | ✅ | Session ID |
263
+ | `selector` | string | ✅ | CSS selector of the element to click |
264
+ | `timeout` | number | | Timeout in ms to wait for the element (default: 30000) |
265
+
266
+ ### `session_type`
267
+
268
+ Type text into an input element in the session page.
269
+
270
+ | Parameter | Type | Required | Description |
271
+ |-----------|------|----------|-------------|
272
+ | `session_id` | string | ✅ | Session ID |
273
+ | `selector` | string | ✅ | CSS selector of the input element |
274
+ | `text` | string | ✅ | Text to type |
275
+ | `clear` | boolean | | Clear the field before typing (default: false) |
276
+ | `timeout` | number | | Timeout in ms to wait for the element (default: 30000) |
277
+
278
+ ### `session_wait_for`
279
+
280
+ Wait for a CSS selector to appear in the session page.
281
+
282
+ | Parameter | Type | Required | Description |
283
+ |-----------|------|----------|-------------|
284
+ | `session_id` | string | ✅ | Session ID |
285
+ | `selector` | string | ✅ | CSS selector to wait for |
286
+ | `timeout` | number | | Timeout in ms (default: 30000) |
287
+
288
+ ### `session_evaluate`
289
+
290
+ Execute JavaScript in the context of the session page. Same conventions as `execute_js`.
291
+
292
+ | Parameter | Type | Required | Description |
293
+ |-----------|------|----------|-------------|
294
+ | `session_id` | string | ✅ | Session ID |
295
+ | `script` | string | ✅ | JavaScript function body. Use `return` to get a result back. |
296
+ | `wait_for` | string | | CSS selector to wait for before executing |
297
+ | `timeout` | number | | Timeout in ms (default: 30000) |
298
+
299
+ ### `session_html`
300
+
301
+ Return the current full rendered HTML of the session page.
302
+
303
+ | Parameter | Type | Required | Description |
304
+ |-----------|------|----------|-------------|
305
+ | `session_id` | string | ✅ | Session ID |
306
+
307
+ ### `screenshot`
308
+
309
+ Capture a screenshot of a page. Use `session_id` to capture an active session in its current state, or `url` for a one-shot capture.
310
+
311
+ | Parameter | Type | Required | Description |
312
+ |-----------|------|----------|-------------|
313
+ | `session_id` | string | | Session ID. If provided, `url` is ignored and the current page state is captured. |
314
+ | `url` | string | | URL for a one-shot screenshot. Required if `session_id` is not provided. |
315
+ | `wait_for` | string | | CSS selector to wait for (one-shot mode only) |
316
+ | `timeout` | number | | Timeout in ms (default: 30000) |
317
+ | `selector` | string | | CSS selector of a specific element to capture |
318
+ | `full_page` | boolean | | Capture the full scrollable page height (default: false, ignored when `selector` is provided) |
319
+ | `format` | `png` \| `jpeg` | | Image format (default: `png`) |
320
+ | `output` | `inline` \| `file` \| `both` | ✅ | `inline` embeds the image in the response, `file` saves to disk and returns the path, `both` does both |
321
+ | `path` | string | | Absolute or relative path for the saved file. Defaults to `./screenshots/screenshot-<timestamp>.<format>` |
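
The precedence rules in this table can be summarized in code. The following is a sketch under stated assumptions, not the package's logic: `resolveScreenshotArgs` and its return shape are hypothetical, and the `<timestamp>` in the default path is assumed to be epoch milliseconds, which the README does not specify.

```javascript
// Hypothetical resolver mirroring the `screenshot` parameter table.
// Assumption: the default path's <timestamp> is epoch milliseconds.
function resolveScreenshotArgs(args, now = Date.now()) {
  const { session_id, url, selector, format = 'png', output } = args;
  if (!session_id && !url) {
    throw new Error('Either session_id or url is required');
  }
  const savesFile = output === 'file' || output === 'both';
  return {
    // session_id wins: url is ignored when a session is given.
    mode: session_id ? 'session' : 'one-shot',
    target: session_id ? session_id : url,
    format,
    // full_page is ignored when a specific element is captured.
    fullPage: selector ? false : Boolean(args.full_page),
    inline: output === 'inline' || output === 'both',
    // A path is only relevant when the image is written to disk.
    path: savesFile
      ? (args.path ?? `./screenshots/screenshot-${now}.${format}`)
      : null,
  };
}
```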
322
+
323
+ ---
324
+
325
+ ## Cookie Files
326
+
327
+ Cookies are loaded from `.txt` files in [Netscape format](http://www.cookiecentral.com/faq/#3.5). Place them in the folder specified by `COOKIES_DIR`. All files in the folder are loaded and merged automatically before each request.
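
For reference, a Netscape-format cookie file holds one cookie per line with seven tab-separated fields: domain, subdomain flag, path, secure flag, expiry (Unix seconds), name, and value. A minimal parser sketch follows. `parseNetscapeCookies` is a hypothetical name and is not taken from the package. The `#HttpOnly_` prefix handling follows the convention used by curl-style cookie files.

```javascript
// Hypothetical parser for Netscape-format cookie text (tab-separated fields:
// domain, subdomain flag, path, secure flag, expiry, name, value).
function parseNetscapeCookies(text) {
  const cookies = [];
  for (const rawLine of text.split(/\r?\n/)) {
    let line = rawLine;
    let httpOnly = false;
    if (line.startsWith('#HttpOnly_')) {
      // curl-style files mark HttpOnly cookies with this prefix.
      httpOnly = true;
      line = line.slice('#HttpOnly_'.length);
    } else if (line.trim() === '' || line.startsWith('#')) {
      continue; // blank line or comment
    }
    const fields = line.split('\t');
    if (fields.length !== 7) continue; // skip malformed rows
    const [domain, , path, secure, expires, name, value] = fields;
    cookies.push({
      domain,
      path,
      secure: secure === 'TRUE',
      expires: Number(expires),
      name,
      value,
      httpOnly,
    });
  }
  return cookies;
}
```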
328
+
329
+ ## GUI (Optional)
330
+
331
+ When `PROXY_URL` is set, a cookie manager web interface is registered with the MCP proxy. It allows you to list, upload, and delete cookie files through a browser UI.
332
+
333
+ ## License
334
+
335
+ ISC
package/dist/browser.js CHANGED
@@ -30,6 +30,11 @@ export async function closeBrowser() {
30
30
  }
31
31
  for (const sig of ['SIGTERM', 'SIGINT']) {
32
32
  process.on(sig, async () => {
33
+ // Sessions must be closed before the browser to avoid dangling page handles.
34
+ // closeAllSessions is imported dynamically to avoid a circular dependency
35
+ // (sessions.ts → browser.ts → sessions.ts).
36
+ const { closeAllSessions } = await import('./sessions.js');
37
+ await closeAllSessions();
33
38
  await closeBrowser();
34
39
  process.exit(0);
35
40
  });