npm - @vakra-dev/reader - Versions diffs - 0.1.0 → 0.1.2 - Mend

@vakra-dev/reader 0.1.0 → 0.1.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (8)

package/LICENSE CHANGED

@@ -187,19 +187,4 @@
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
-   limitations under the License.
----
-Attribution Requirement
-All distributions, publications, or public uses of this software, in whole or in
-part, must include the following attribution in a clearly visible location (such
-as documentation, a README file, or an "About" section in any user interface):
-   "This product includes software developed by Nihal Kaul
-   (https://www.linkedin.com/in/nihalwashere/) as part of the Reader project
-   (https://github.com/vakra-dev/reader)."
-This attribution requirement is in addition to, and does not limit, any
-obligations imposed by the Apache License, Version 2.0.
+   limitations under the License.

package/README.md CHANGED

@@ -19,7 +19,11 @@
 </p>
 <p align="center">
-  If you find Reader useful, please consider giving it a star on GitHub! It helps others discover the project.
+  <a href="https://docs.reader.dev">Docs</a> · <a href="https://docs.reader.dev/home/examples">Examples</a> · <a href="https://discord.gg/6tjkq7J5WV">Discord</a>
+</p>
+<p align="center">
+  <img src="./docs/assets/demo.gif" alt="Reader demo — scrape any URL to clean markdown" width="700" />
 </p>
 ## The Problem
@@ -63,6 +67,9 @@ console.log(`Found ${pages.urls.length} pages`);
 All the hard stuff, browser pooling, challenge detection, proxy rotation, retries, happens under the hood. You get clean markdown. Your agents get the web.
+> [!TIP]
+> If Reader is useful to you, a [star on GitHub](https://github.com/vakra-dev/reader) helps others discover the project.
 ## Features
 - **Cloudflare Bypass** - TLS fingerprinting, DNS over TLS, WebRTC masking
@@ -252,20 +259,20 @@ npx reader scrape https://example.com https://example.org -c 2
 npx reader scrape https://example.com -o output.md
 ```
-| Option                   | Type   | Default      | Description                                               |
-| ------------------------ | ------ | ------------ | --------------------------------------------------------- |
-| `-f, --format <formats>` | string | `"markdown"` | Output formats (comma-separated: markdown,html)           |
-| `-o, --output <file>`    | string | stdout       | Output file path                                          |
-| `-c, --concurrency <n>`  | number | `1`          | Parallel requests                                         |
-| `-t, --timeout <ms>`     | number | `30000`      | Request timeout in milliseconds                           |
-| `--batch-timeout <ms>`   | number | `300000`     | Total timeout for entire batch operation                  |
-| `--proxy <url>`          | string | -            | Proxy URL (e.g., http://user:pass@host:port)              |
-| `--user-agent <string>`  | string | -            | Custom user agent string                                  |
-| `--show-chrome`          | flag   | -            | Show browser window for debugging                         |
-| `--no-main-content`      | flag   | -            | Disable main content extraction (include full page)       |
-| `--include-tags <sel>`   | string | -            | CSS selectors for elements to include (comma-separated)   |
-| `--exclude-tags <sel>`   | string | -            | CSS selectors for elements to exclude (comma-separated)   |
-| `-v, --verbose`          | flag   | -            | Enable verbose logging                                    |
+| Option                   | Type   | Default      | Description                                             |
+| ------------------------ | ------ | ------------ | ------------------------------------------------------- |
+| `-f, --format <formats>` | string | `"markdown"` | Output formats (comma-separated: markdown,html)         |
+| `-o, --output <file>`    | string | stdout       | Output file path                                        |
+| `-c, --concurrency <n>`  | number | `1`          | Parallel requests                                       |
+| `-t, --timeout <ms>`     | number | `30000`      | Request timeout in milliseconds                         |
+| `--batch-timeout <ms>`   | number | `300000`     | Total timeout for entire batch operation                |
+| `--proxy <url>`          | string | -            | Proxy URL (e.g., http://user:pass@host:port)            |
+| `--user-agent <string>`  | string | -            | Custom user agent string                                |
+| `--show-chrome`          | flag   | -            | Show browser window for debugging                       |
+| `--no-main-content`      | flag   | -            | Disable main content extraction (include full page)     |
+| `--include-tags <sel>`   | string | -            | CSS selectors for elements to include (comma-separated) |
+| `--exclude-tags <sel>`   | string | -            | CSS selectors for elements to exclude (comma-separated) |
+| `-v, --verbose`          | flag   | -            | Enable verbose logging                                  |
 ### `reader crawl <url>`
@@ -355,26 +362,26 @@ await reader.close();
 Scrape one or more URLs. Can be used directly or via `ReaderClient`.
-| Option             | Type                              | Required | Default        | Description                                                     |
-| ------------------ | --------------------------------- | -------- | -------------- | --------------------------------------------------------------- |
-| `urls`             | `string[]`                        | Yes      | -              | Array of URLs to scrape                                         |
-| `formats`          | `Array<"markdown" \| "html">`     | No       | `["markdown"]` | Output formats                                                  |
-| `onlyMainContent`  | `boolean`                         | No       | `true`         | Extract only main content (removes nav/header/footer)           |
-| `includeTags`      | `string[]`                        | No       | `[]`           | CSS selectors for elements to keep                              |
-| `excludeTags`      | `string[]`                        | No       | `[]`           | CSS selectors for elements to remove                            |
-| `userAgent`        | `string`                          | No       | -              | Custom user agent string                                        |
-| `timeoutMs`        | `number`                          | No       | `30000`        | Request timeout in milliseconds                                 |
-| `includePatterns`  | `string[]`                        | No       | `[]`           | URL patterns to include (regex strings)                         |
-| `excludePatterns`  | `string[]`                        | No       | `[]`           | URL patterns to exclude (regex strings)                         |
-| `batchConcurrency` | `number`                          | No       | `1`            | Number of URLs to process in parallel                           |
-| `batchTimeoutMs`   | `number`                          | No       | `300000`       | Total timeout for entire batch operation                        |
-| `maxRetries`       | `number`                          | No       | `2`            | Maximum retry attempts for failed URLs                          |
-| `onProgress`       | `function`                        | No       | -              | Progress callback: `({ completed, total, currentUrl }) => void` |
-| `proxy`            | `ProxyConfig`                     | No       | -              | Proxy configuration object                                      |
-| `waitForSelector`  | `string`                          | No       | -              | CSS selector to wait for before page is loaded                  |
-| `verbose`          | `boolean`                         | No       | `false`        | Enable verbose logging                                          |
-| `showChrome`       | `boolean`                         | No       | `false`        | Show Chrome window for debugging                                |
-| `connectionToCore` | `any`                             | No       | -              | Connection to shared Hero Core (for production)                 |
+| Option             | Type                          | Required | Default        | Description                                                     |
+| ------------------ | ----------------------------- | -------- | -------------- | --------------------------------------------------------------- |
+| `urls`             | `string[]`                    | Yes      | -              | Array of URLs to scrape                                         |
+| `formats`          | `Array<"markdown" \| "html">` | No       | `["markdown"]` | Output formats                                                  |
+| `onlyMainContent`  | `boolean`                     | No       | `true`         | Extract only main content (removes nav/header/footer)           |
+| `includeTags`      | `string[]`                    | No       | `[]`           | CSS selectors for elements to keep                              |
+| `excludeTags`      | `string[]`                    | No       | `[]`           | CSS selectors for elements to remove                            |
+| `userAgent`        | `string`                      | No       | -              | Custom user agent string                                        |
+| `timeoutMs`        | `number`                      | No       | `30000`        | Request timeout in milliseconds                                 |
+| `includePatterns`  | `string[]`                    | No       | `[]`           | URL patterns to include (regex strings)                         |
+| `excludePatterns`  | `string[]`                    | No       | `[]`           | URL patterns to exclude (regex strings)                         |
+| `batchConcurrency` | `number`                      | No       | `1`            | Number of URLs to process in parallel                           |
+| `batchTimeoutMs`   | `number`                      | No       | `300000`       | Total timeout for entire batch operation                        |
+| `maxRetries`       | `number`                      | No       | `2`            | Maximum retry attempts for failed URLs                          |
+| `onProgress`       | `function`                    | No       | -              | Progress callback: `({ completed, total, currentUrl }) => void` |
+| `proxy`            | `ProxyConfig`                 | No       | -              | Proxy configuration object                                      |
+| `waitForSelector`  | `string`                      | No       | -              | CSS selector to wait for before page is loaded                  |
+| `verbose`          | `boolean`                     | No       | `false`        | Enable verbose logging                                          |
+| `showChrome`       | `boolean`                     | No       | `false`        | Show Chrome window for debugging                                |
+| `connectionToCore` | `any`                         | No       | -              | Connection to shared Hero Core (for production)                 |
 **Returns:** `Promise<ScrapeResult>`
@@ -572,32 +579,54 @@ Reader uses [Ulixee Hero](https://ulixee.org/), a headless browser with advanced
 - **Health Monitoring** - Background health checks every 5 minutes
 - **Request Queuing** - Queues requests when pool is full (max 100)
-## Documentation
+### HTML to Markdown: supermarkdown
+Reader uses [**supermarkdown**](https://github.com/vakra-dev/supermarkdown) for HTML to Markdown conversion - a sister project we built from scratch specifically for web scraping and LLM pipelines.
-| Guide                                      | Description                    |
-| ------------------------------------------ | ------------------------------ |
-| [Getting Started](docs/getting-started.md) | Detailed setup and first steps |
-| [Architecture](docs/architecture.md)       | System design and data flow    |
-| [API Reference](docs/api-reference.md)     | Complete API documentation     |
-| [Troubleshooting](docs/troubleshooting.md) | Common errors and solutions    |
+**Why we built it:**
-### Guides
+When you're scraping the web, you encounter messy, malformed HTML that breaks most converters. And when you're feeding content to LLMs, you need clean output without artifacts or noise. We needed a converter that handles real-world HTML reliably while producing high-quality markdown.
-| Guide                                                     | Description                   |
-| --------------------------------------------------------- | ----------------------------- |
-| [Cloudflare Bypass](docs/guides/cloudflare-bypass.md)     | How antibot bypass works      |
-| [Proxy Configuration](docs/guides/proxy-configuration.md) | Setting up proxies            |
-| [Browser Pool](docs/guides/browser-pool.md)               | Production browser management |
-| [Output Formats](docs/guides/output-formats.md)           | Understanding output formats  |
+**What supermarkdown offers:**
-### Deployment
+| Feature              | Benefit                                              |
+| -------------------- | ---------------------------------------------------- |
+| **Written in Rust**  | Native performance with Node.js bindings via napi-rs |
+| **Full GFM support** | Tables, task lists, strikethrough, autolinks         |
+| **LLM-optimized**    | Clean output designed for AI consumption             |
+| **Battle-tested**    | Handles malformed HTML from real web pages           |
+| **CSS selectors**    | Include/exclude elements during conversion           |
+supermarkdown is open source and available as both a Rust crate and npm package:
+```bash
+# npm
+npm install @vakra-dev/supermarkdown
+# Rust
+cargo add supermarkdown
+```
+Check out the [supermarkdown repository](https://github.com/vakra-dev/supermarkdown) for examples and documentation.
+## Server Deployment
+Reader uses a real Chromium browser under the hood. On headless Linux servers (VPS, EC2, etc.), you need to install Chrome's system dependencies:
+```bash
+# Debian/Ubuntu
+sudo apt-get install -y libnspr4 libnss3 libatk1.0-0 libatk-bridge2.0-0 \
+  libcups2 libxcb1 libatspi2.0-0 libx11-6 libxcomposite1 libxdamage1 \
+  libxext6 libxfixes3 libxrandr2 libgbm1 libcairo2 libpango-1.0-0 libasound2
+```
+This is the same requirement that Puppeteer and Playwright have on headless Linux. macOS, Windows, and Linux desktops already have these libraries.
+For Docker and production deployment guides, see the [deployment documentation](https://docs.reader.dev/documentation/guides/deployment).
+## Documentation
-| Guide                                                     | Description                |
-| --------------------------------------------------------- | -------------------------- |
-| [Docker](docs/deployment/docker.md)                       | Container deployment       |
-| [Production Server](docs/deployment/production-server.md) | Express + shared Hero Core |
-| [Job Queues](docs/deployment/job-queues.md)               | BullMQ async scheduling    |
-| [Serverless](docs/deployment/serverless.md)               | Lambda, Vercel, Workers    |
+Full documentation is available at **[docs.reader.dev](https://docs.reader.dev)**, including guides for scraping, crawling, proxy configuration, browser pool management, and deployment.
 ### Examples
@@ -658,4 +687,5 @@ If you use Reader in your research or project, please cite it:
 ## Support
 - [GitHub Issues](https://github.com/vakra-dev/reader/issues)
-- [Documentation](https://github.com/vakra-dev/reader)
+- [Documentation](https://docs.reader.dev)
+- [Discord](https://discord.gg/6tjkq7J5WV)

package/dist/cli/index.js CHANGED

@@ -18,21 +18,15 @@ import { ConnectionToHeroCore } from "@ulixee/hero";
 import pLimit from "p-limit";
 // src/formatters/markdown.ts
-import TurndownService from "turndown";
-var turndownService = new TurndownService({
-  headingStyle: "atx",
-  hr: "---",
-  bulletListMarker: "-",
-  codeBlockStyle: "fenced",
-  fence: "```",
-  emDelimiter: "*",
-  strongDelimiter: "**",
-  linkStyle: "inlined",
-  linkReferenceStyle: "full"
-});
+import { convert } from "@vakra-dev/supermarkdown";
 function htmlToMarkdown(html) {
   try {
-    return turndownService.turndown(html);
+    return convert(html, {
+      headingStyle: "atx",
+      bulletMarker: "-",
+      codeFence: "`",
+      linkStyle: "inline"
+    });
   } catch (error) {
     console.warn("Error converting HTML to Markdown:", error);
     return html.replace(/<[^>]*>/g, "").trim();
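The hunk above swaps the converter but keeps the same safety net: if conversion throws, the function logs a warning and returns the input with its tags stripped. A minimal sketch of that fallback pattern follows, assuming the converter is passed in as a plain function. The name `htmlToMarkdownWith` and the `convertFn` parameter are hypothetical and exist only so the example runs without the package installed; the warning text and the regex are copied from the diff.

```javascript
// Sketch of the try/convert/fallback pattern shown in the hunk above.
// `convertFn` is a hypothetical stand-in for the real converter
// (turndown in 0.1.0, supermarkdown's `convert` in 0.1.2).
function htmlToMarkdownWith(convertFn, html) {
  try {
    return convertFn(html);
  } catch (error) {
    // Same fallback as the diff: warn, then strip anything that looks like a tag.
    console.warn("Error converting HTML to Markdown:", error);
    return html.replace(/<[^>]*>/g, "").trim();
  }
}

// A converter that always fails, to exercise the fallback path.
const failingConverter = () => {
  throw new Error("conversion failed");
};

const fallbackResult = htmlToMarkdownWith(
  failingConverter,
  "<p>Hello <b>world</b></p>"
);
console.log(fallbackResult); // Hello world
```

The fallback only removes tags, so it does not decode entities or preserve structure; it is a last resort for keeping some readable text when the converter fails.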
@@ -1723,7 +1717,7 @@ var EngineOrchestrator = class {
       return true;
     }
     if (error instanceof HttpError) {
-      return error.statusCode === 403 || error.statusCode === 429 || error.statusCode >= 500;
+      return error.statusCode === 403 || error.statusCode === 404 || error.statusCode === 429 || error.statusCode >= 500;
     }
     if (error instanceof EngineUnavailableError) {
       return true;
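
The only behavioral change in this hunk is that an `HttpError` with status 404 now satisfies the condition alongside 403, 429, and any 5xx code. The sketch below isolates that condition so the before/after difference is easy to check. It is illustrative only: the `HttpError` class here is a hypothetical stand-in (only the `statusCode` property is taken from the diff), and `matchesHttpCondition` is an invented name, since the enclosing method's name and purpose are not visible in the hunk.

```javascript
// Hypothetical stand-in for the package's HttpError; only `statusCode`
// is known from the diff.
class HttpError extends Error {
  constructor(statusCode) {
    super(`HTTP ${statusCode}`);
    this.statusCode = statusCode;
  }
}

// Mirrors the 0.1.2 condition from the hunk above for HttpError instances.
// Other error types are handled by separate branches in the real code and
// are simply reported as false here.
function matchesHttpCondition(error) {
  if (!(error instanceof HttpError)) return false;
  const { statusCode } = error;
  return (
    statusCode === 403 ||
    statusCode === 404 || // added in 0.1.2
    statusCode === 429 ||
    statusCode >= 500
  );
}

console.log(matchesHttpCondition(new HttpError(404))); // true
console.log(matchesHttpCondition(new HttpError(400))); // false
console.log(matchesHttpCondition(new HttpError(503))); // true
```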