@vakra-dev/reader 0.1.1 → 0.1.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (3)
  1. package/LICENSE +1 -16
  2. package/README.md +64 -64
  3. package/package.json +1 -1
package/LICENSE CHANGED
@@ -187,19 +187,4 @@
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
- limitations under the License.
-
- ---
-
- Attribution Requirement
-
- All distributions, publications, or public uses of this software, in whole or in
- part, must include the following attribution in a clearly visible location (such
- as documentation, a README file, or an "About" section in any user interface):
-
- "This product includes software developed by Nihal Kaul
- (https://www.linkedin.com/in/nihalwashere/) as part of the Reader project
- (https://github.com/vakra-dev/reader)."
-
- This attribution requirement is in addition to, and does not limit, any
- obligations imposed by the Apache License, Version 2.0.
+ limitations under the License.
package/README.md CHANGED
@@ -19,7 +19,11 @@
  </p>
 
  <p align="center">
- If you find Reader useful, please consider giving it a star on GitHub! It helps others discover the project.
+ <a href="https://docs.reader.dev">Docs</a> · <a href="https://docs.reader.dev/home/examples">Examples</a> · <a href="https://discord.gg/6tjkq7J5WV">Discord</a>
+ </p>
+
+ <p align="center">
+ <img src="./docs/assets/demo.gif" alt="Reader demo — scrape any URL to clean markdown" width="700" />
  </p>
 
  ## The Problem
@@ -63,6 +67,9 @@ console.log(`Found ${pages.urls.length} pages`);
 
  All the hard stuff, browser pooling, challenge detection, proxy rotation, retries, happens under the hood. You get clean markdown. Your agents get the web.
 
+ > [!TIP]
+ > If Reader is useful to you, a [star on GitHub](https://github.com/vakra-dev/reader) helps others discover the project.
+
  ## Features
 
  - **Cloudflare Bypass** - TLS fingerprinting, DNS over TLS, WebRTC masking
@@ -252,20 +259,20 @@ npx reader scrape https://example.com https://example.org -c 2
  npx reader scrape https://example.com -o output.md
  ```
 
- | Option | Type | Default | Description |
- | ------------------------ | ------ | ------------ | --------------------------------------------------------- |
- | `-f, --format <formats>` | string | `"markdown"` | Output formats (comma-separated: markdown,html) |
- | `-o, --output <file>` | string | stdout | Output file path |
- | `-c, --concurrency <n>` | number | `1` | Parallel requests |
- | `-t, --timeout <ms>` | number | `30000` | Request timeout in milliseconds |
- | `--batch-timeout <ms>` | number | `300000` | Total timeout for entire batch operation |
- | `--proxy <url>` | string | - | Proxy URL (e.g., http://user:pass@host:port) |
- | `--user-agent <string>` | string | - | Custom user agent string |
- | `--show-chrome` | flag | - | Show browser window for debugging |
- | `--no-main-content` | flag | - | Disable main content extraction (include full page) |
- | `--include-tags <sel>` | string | - | CSS selectors for elements to include (comma-separated) |
- | `--exclude-tags <sel>` | string | - | CSS selectors for elements to exclude (comma-separated) |
- | `-v, --verbose` | flag | - | Enable verbose logging |
+ | Option | Type | Default | Description |
+ | ------------------------ | ------ | ------------ | ------------------------------------------------------- |
+ | `-f, --format <formats>` | string | `"markdown"` | Output formats (comma-separated: markdown,html) |
+ | `-o, --output <file>` | string | stdout | Output file path |
+ | `-c, --concurrency <n>` | number | `1` | Parallel requests |
+ | `-t, --timeout <ms>` | number | `30000` | Request timeout in milliseconds |
+ | `--batch-timeout <ms>` | number | `300000` | Total timeout for entire batch operation |
+ | `--proxy <url>` | string | - | Proxy URL (e.g., http://user:pass@host:port) |
+ | `--user-agent <string>` | string | - | Custom user agent string |
+ | `--show-chrome` | flag | - | Show browser window for debugging |
+ | `--no-main-content` | flag | - | Disable main content extraction (include full page) |
+ | `--include-tags <sel>` | string | - | CSS selectors for elements to include (comma-separated) |
+ | `--exclude-tags <sel>` | string | - | CSS selectors for elements to exclude (comma-separated) |
+ | `-v, --verbose` | flag | - | Enable verbose logging |
 
  ### `reader crawl <url>`
 
@@ -355,26 +362,26 @@ await reader.close();
 
  Scrape one or more URLs. Can be used directly or via `ReaderClient`.
 
- | Option | Type | Required | Default | Description |
- | ------------------ | --------------------------------- | -------- | -------------- | --------------------------------------------------------------- |
- | `urls` | `string[]` | Yes | - | Array of URLs to scrape |
- | `formats` | `Array<"markdown" \| "html">` | No | `["markdown"]` | Output formats |
- | `onlyMainContent` | `boolean` | No | `true` | Extract only main content (removes nav/header/footer) |
- | `includeTags` | `string[]` | No | `[]` | CSS selectors for elements to keep |
- | `excludeTags` | `string[]` | No | `[]` | CSS selectors for elements to remove |
- | `userAgent` | `string` | No | - | Custom user agent string |
- | `timeoutMs` | `number` | No | `30000` | Request timeout in milliseconds |
- | `includePatterns` | `string[]` | No | `[]` | URL patterns to include (regex strings) |
- | `excludePatterns` | `string[]` | No | `[]` | URL patterns to exclude (regex strings) |
- | `batchConcurrency` | `number` | No | `1` | Number of URLs to process in parallel |
- | `batchTimeoutMs` | `number` | No | `300000` | Total timeout for entire batch operation |
- | `maxRetries` | `number` | No | `2` | Maximum retry attempts for failed URLs |
- | `onProgress` | `function` | No | - | Progress callback: `({ completed, total, currentUrl }) => void` |
- | `proxy` | `ProxyConfig` | No | - | Proxy configuration object |
- | `waitForSelector` | `string` | No | - | CSS selector to wait for before page is loaded |
- | `verbose` | `boolean` | No | `false` | Enable verbose logging |
- | `showChrome` | `boolean` | No | `false` | Show Chrome window for debugging |
- | `connectionToCore` | `any` | No | - | Connection to shared Hero Core (for production) |
+ | Option | Type | Required | Default | Description |
+ | ------------------ | ----------------------------- | -------- | -------------- | --------------------------------------------------------------- |
+ | `urls` | `string[]` | Yes | - | Array of URLs to scrape |
+ | `formats` | `Array<"markdown" \| "html">` | No | `["markdown"]` | Output formats |
+ | `onlyMainContent` | `boolean` | No | `true` | Extract only main content (removes nav/header/footer) |
+ | `includeTags` | `string[]` | No | `[]` | CSS selectors for elements to keep |
+ | `excludeTags` | `string[]` | No | `[]` | CSS selectors for elements to remove |
+ | `userAgent` | `string` | No | - | Custom user agent string |
+ | `timeoutMs` | `number` | No | `30000` | Request timeout in milliseconds |
+ | `includePatterns` | `string[]` | No | `[]` | URL patterns to include (regex strings) |
+ | `excludePatterns` | `string[]` | No | `[]` | URL patterns to exclude (regex strings) |
+ | `batchConcurrency` | `number` | No | `1` | Number of URLs to process in parallel |
+ | `batchTimeoutMs` | `number` | No | `300000` | Total timeout for entire batch operation |
+ | `maxRetries` | `number` | No | `2` | Maximum retry attempts for failed URLs |
+ | `onProgress` | `function` | No | - | Progress callback: `({ completed, total, currentUrl }) => void` |
+ | `proxy` | `ProxyConfig` | No | - | Proxy configuration object |
+ | `waitForSelector` | `string` | No | - | CSS selector to wait for before page is loaded |
+ | `verbose` | `boolean` | No | `false` | Enable verbose logging |
+ | `showChrome` | `boolean` | No | `false` | Show Chrome window for debugging |
+ | `connectionToCore` | `any` | No | - | Connection to shared Hero Core (for production) |
 
  **Returns:** `Promise<ScrapeResult>`
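
Note on the scrape options table in this hunk: its content is unchanged between 0.1.1 and 0.1.2 (only column padding differs), so the documented defaults are the same in both versions. As a rough illustration of how those defaults read in code, here is a minimal sketch. The `ScrapeOptionsSketch` type and the `resolveScrapeOptions` helper are hypothetical names invented for this note and are not exported by `@vakra-dev/reader`; options with no documented default (`userAgent`, `proxy`, `waitForSelector`, `onProgress`, `connectionToCore`) are left out.

```typescript
// Hypothetical mirror of the README options table; NOT the package's own type.
type Format = "markdown" | "html";

interface ScrapeOptionsSketch {
  urls: string[]; // the only option marked Required in the table
  formats?: Format[];
  onlyMainContent?: boolean;
  includeTags?: string[];
  excludeTags?: string[];
  timeoutMs?: number;
  includePatterns?: string[];
  excludePatterns?: string[];
  batchConcurrency?: number;
  batchTimeoutMs?: number;
  maxRetries?: number;
  verbose?: boolean;
  showChrome?: boolean;
}

// Fill every omitted option with the default value listed in the table.
function resolveScrapeOptions(
  options: ScrapeOptionsSketch
): Required<ScrapeOptionsSketch> {
  return {
    urls: options.urls,
    formats: options.formats ?? ["markdown"],
    onlyMainContent: options.onlyMainContent ?? true,
    includeTags: options.includeTags ?? [],
    excludeTags: options.excludeTags ?? [],
    timeoutMs: options.timeoutMs ?? 30000,
    includePatterns: options.includePatterns ?? [],
    excludePatterns: options.excludePatterns ?? [],
    batchConcurrency: options.batchConcurrency ?? 1,
    batchTimeoutMs: options.batchTimeoutMs ?? 300000,
    maxRetries: options.maxRetries ?? 2,
    verbose: options.verbose ?? false,
    showChrome: options.showChrome ?? false,
  };
}

// Usage: only `urls` and one override are supplied; the rest fall back
// to the documented defaults.
const resolved = resolveScrapeOptions({
  urls: ["https://example.com"],
  timeoutMs: 5000,
});
console.log(resolved.timeoutMs, resolved.formats, resolved.maxRetries);
```

Whether the library applies its defaults in this way internally is not shown in the diff; the sketch only restates the table.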
 
@@ -582,13 +589,13 @@ When you're scraping the web, you encounter messy, malformed HTML that breaks mo
 
  **What supermarkdown offers:**
 
- | Feature | Benefit |
- |---------|---------|
- | **Written in Rust** | Native performance with Node.js bindings via napi-rs |
- | **Full GFM support** | Tables, task lists, strikethrough, autolinks |
- | **LLM-optimized** | Clean output designed for AI consumption |
- | **Battle-tested** | Handles malformed HTML from real web pages |
- | **CSS selectors** | Include/exclude elements during conversion |
+ | Feature | Benefit |
+ | -------------------- | ---------------------------------------------------- |
+ | **Written in Rust** | Native performance with Node.js bindings via napi-rs |
+ | **Full GFM support** | Tables, task lists, strikethrough, autolinks |
+ | **LLM-optimized** | Clean output designed for AI consumption |
+ | **Battle-tested** | Handles malformed HTML from real web pages |
+ | **CSS selectors** | Include/exclude elements during conversion |
 
  supermarkdown is open source and available as both a Rust crate and npm package:
 
@@ -602,32 +609,24 @@ cargo add supermarkdown
 
  Check out the [supermarkdown repository](https://github.com/vakra-dev/supermarkdown) for examples and documentation.
 
- ## Documentation
+ ## Server Deployment
 
- | Guide | Description |
- | ------------------------------------------ | ------------------------------ |
- | [Getting Started](docs/getting-started.md) | Detailed setup and first steps |
- | [Architecture](docs/architecture.md) | System design and data flow |
- | [API Reference](docs/api-reference.md) | Complete API documentation |
- | [Troubleshooting](docs/troubleshooting.md) | Common errors and solutions |
+ Reader uses a real Chromium browser under the hood. On headless Linux servers (VPS, EC2, etc.), you need to install Chrome's system dependencies:
 
- ### Guides
+ ```bash
+ # Debian/Ubuntu
+ sudo apt-get install -y libnspr4 libnss3 libatk1.0-0 libatk-bridge2.0-0 \
+ libcups2 libxcb1 libatspi2.0-0 libx11-6 libxcomposite1 libxdamage1 \
+ libxext6 libxfixes3 libxrandr2 libgbm1 libcairo2 libpango-1.0-0 libasound2
+ ```
 
- | Guide | Description |
- | --------------------------------------------------------- | ----------------------------- |
- | [Cloudflare Bypass](docs/guides/cloudflare-bypass.md) | How antibot bypass works |
- | [Proxy Configuration](docs/guides/proxy-configuration.md) | Setting up proxies |
- | [Browser Pool](docs/guides/browser-pool.md) | Production browser management |
- | [Output Formats](docs/guides/output-formats.md) | Understanding output formats |
+ This is the same requirement that Puppeteer and Playwright have on headless Linux. macOS, Windows, and Linux desktops already have these libraries.
 
- ### Deployment
+ For Docker and production deployment guides, see the [deployment documentation](https://docs.reader.dev/documentation/guides/deployment).
+
+ ## Documentation
 
- | Guide | Description |
- | --------------------------------------------------------- | -------------------------- |
- | [Docker](docs/deployment/docker.md) | Container deployment |
- | [Production Server](docs/deployment/production-server.md) | Express + shared Hero Core |
- | [Job Queues](docs/deployment/job-queues.md) | BullMQ async scheduling |
- | [Serverless](docs/deployment/serverless.md) | Lambda, Vercel, Workers |
+ Full documentation is available at **[docs.reader.dev](https://docs.reader.dev)**, including guides for scraping, crawling, proxy configuration, browser pool management, and deployment.
 
  ### Examples
 
@@ -688,4 +687,5 @@ If you use Reader in your research or project, please cite it:
  ## Support
 
  - [GitHub Issues](https://github.com/vakra-dev/reader/issues)
- - [Documentation](https://github.com/vakra-dev/reader)
+ - [Documentation](https://docs.reader.dev)
+ - [Discord](https://discord.gg/6tjkq7J5WV)
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@vakra-dev/reader",
- "version": "0.1.1",
+ "version": "0.1.2",
  "description": "Open source, production grade web scraping engine for LLMs. Clean markdown output, ready for your agents.",
  "license": "Apache-2.0",
  "type": "module",