@vakra-dev/reader 0.0.1

package/README.md ADDED
<p align="center">
<img src="docs/assets/logo.png" alt="Reader Logo" width="200" />
</p>

<h1 align="center">Reader</h1>

<p align="center">
<strong>Open-source, production-grade web scraping engine built for LLMs.</strong>
</p>

<p align="center">
Scrape and crawl the entire web into clean markdown, ready for your agents.
</p>

<p align="center">
<a href="https://opensource.org/licenses/Apache-2.0"><img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg" alt="License: Apache 2.0"></a>
<a href="https://www.npmjs.com/package/@vakra-dev/reader"><img src="https://img.shields.io/npm/v/@vakra-dev/reader.svg" alt="npm version"></a>
<a href="https://github.com/vakra-dev/reader/stargazers"><img src="https://img.shields.io/github/stars/vakra-dev/reader.svg?style=social" alt="GitHub stars"></a>
</p>

<p align="center">
If you find Reader useful, please consider giving it a star on GitHub! It helps others discover the project.
</p>

## The Problem

Building agents that need web access is frustrating. You piece together Puppeteer, add stealth plugins, fight Cloudflare, manage proxies, and it still breaks in production.

That's because production-grade web scraping isn't just about rendering a page and converting HTML to markdown. It's about everything underneath:

| Layer                    | What it actually takes                                              |
| ------------------------ | ------------------------------------------------------------------- |
| **Browser architecture** | Managing browser instances at scale, not one-off scripts            |
| **Anti-bot bypass**      | Cloudflare, Turnstile, and JS challenges all block naive scrapers   |
| **TLS fingerprinting**   | Real browsers have fingerprints. Puppeteer doesn't. Sites know.     |
| **Proxy infrastructure** | Datacenter vs. residential, rotation strategies, sticky sessions    |
| **Resource management**  | Browser pooling, memory limits, graceful recycling                  |
| **Reliability**          | Rate limiting, retries, timeouts, caching, graceful degradation     |

I built **Reader**, a production-grade web scraping engine on top of [Ulixee Hero](https://ulixee.org/), a headless browser designed for exactly this.

## The Solution

Two primitives. That's it.

```typescript
import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

// Scrape URLs → clean markdown
const result = await reader.scrape({ urls: ["https://example.com"] });
console.log(result.data[0].markdown);

// Crawl a site → discover + scrape pages
const pages = await reader.crawl({
  url: "https://example.com",
  depth: 2,
  scrape: true,
});
console.log(`Found ${pages.urls.length} pages`);
```

All the hard stuff (browser pooling, challenge detection, proxy rotation, retries) happens under the hood. You get clean markdown. Your agents get the web.

## Features

- **Cloudflare Bypass** - TLS fingerprinting, DNS over TLS, WebRTC masking
- **Multiple Formats** - Markdown, HTML, JSON, plain text
- **CLI & API** - Use from the command line or programmatically
- **Browser Pool** - Auto-recycling, health monitoring, queue management
- **Concurrent Scraping** - Parallel URL processing with progress tracking
- **Website Crawling** - BFS link discovery with depth/page limits
- **Proxy Support** - Datacenter and residential with sticky sessions

## Installation

```bash
npm install @vakra-dev/reader
```

**Requirements:** Node.js >= 18

## Quick Start

### Basic Scrape

```typescript
import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

const result = await reader.scrape({
  urls: ["https://example.com"],
  formats: ["markdown", "text"],
});

console.log(result.data[0].markdown);
console.log(result.data[0].text);

await reader.close();
```

### Batch Scraping with Concurrency

```typescript
import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

const result = await reader.scrape({
  urls: ["https://example.com", "https://example.org", "https://example.net"],
  formats: ["markdown"],
  batchConcurrency: 3,
  onProgress: (progress) => {
    console.log(`${progress.completed}/${progress.total}: ${progress.currentUrl}`);
  },
});

console.log(`Scraped ${result.batchMetadata.successfulUrls} URLs`);

await reader.close();
```

### Crawling

```typescript
import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

const result = await reader.crawl({
  url: "https://example.com",
  depth: 2,
  maxPages: 20,
  scrape: true,
});

console.log(`Discovered ${result.urls.length} URLs`);
console.log(`Scraped ${result.scraped?.batchMetadata.successfulUrls} pages`);

await reader.close();
```

### With Proxy

```typescript
import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

const result = await reader.scrape({
  urls: ["https://example.com"],
  formats: ["markdown"],
  proxy: {
    type: "residential",
    host: "proxy.example.com",
    port: 8080,
    username: "username",
    password: "password",
    country: "us",
  },
});

await reader.close();
```

### With Proxy Rotation

```typescript
import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient({
  proxies: [
    { host: "proxy1.example.com", port: 8080, username: "user", password: "pass" },
    { host: "proxy2.example.com", port: 8080, username: "user", password: "pass" },
  ],
  proxyRotation: "round-robin", // or "random"
});

const result = await reader.scrape({
  urls: ["https://example.com", "https://example.org"],
  formats: ["markdown"],
  batchConcurrency: 2,
});

await reader.close();
```

### With Browser Pool Configuration

```typescript
import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient({
  browserPool: {
    size: 5,                // 5 browser instances
    retireAfterPages: 50,   // Recycle after 50 pages
    retireAfterMinutes: 15, // Recycle after 15 minutes
  },
  verbose: true,
});

const result = await reader.scrape({
  urls: manyUrls,
  batchConcurrency: 5,
});

await reader.close();
```

## CLI Reference

### Daemon Mode

For multiple requests, start a daemon to keep the browser pool warm:

```bash
# Start daemon with browser pool
npx reader start --pool-size 5

# All subsequent commands auto-connect to the daemon
npx reader scrape https://example.com
npx reader crawl https://example.com -d 2

# Check daemon status
npx reader status

# Stop daemon
npx reader stop

# Force standalone mode (bypass daemon)
npx reader scrape https://example.com --standalone
```

### `reader scrape <urls...>`

Scrape one or more URLs.

```bash
# Scrape a single URL
npx reader scrape https://example.com

# Scrape with multiple formats
npx reader scrape https://example.com -f markdown,text

# Scrape multiple URLs concurrently
npx reader scrape https://example.com https://example.org -c 2

# Save to file
npx reader scrape https://example.com -o output.md
```

| Option                   | Type   | Default      | Description                                                |
| ------------------------ | ------ | ------------ | ---------------------------------------------------------- |
| `-f, --format <formats>` | string | `"markdown"` | Output formats (comma-separated: markdown,html,json,text)  |
| `-o, --output <file>`    | string | stdout       | Output file path                                           |
| `-c, --concurrency <n>`  | number | `1`          | Parallel requests                                          |
| `-t, --timeout <ms>`     | number | `30000`      | Request timeout in milliseconds                            |
| `--batch-timeout <ms>`   | number | `300000`     | Total timeout for entire batch operation                   |
| `--proxy <url>`          | string | -            | Proxy URL (e.g., `http://user:pass@host:port`)             |
| `--user-agent <string>`  | string | -            | Custom user agent string                                   |
| `--show-chrome`          | flag   | -            | Show browser window for debugging                          |
| `--no-metadata`          | flag   | -            | Exclude metadata from output                               |
| `-v, --verbose`          | flag   | -            | Enable verbose logging                                     |

### `reader crawl <url>`

Crawl a website to discover pages.

```bash
# Crawl with default settings
npx reader crawl https://example.com

# Crawl deeper with more pages
npx reader crawl https://example.com -d 3 -m 50

# Crawl and scrape content
npx reader crawl https://example.com -d 2 --scrape

# Filter URLs with patterns
npx reader crawl https://example.com --include "blog/*" --exclude "admin/*"
```

| Option                   | Type   | Default      | Description                                      |
| ------------------------ | ------ | ------------ | ------------------------------------------------ |
| `-d, --depth <n>`        | number | `1`          | Maximum crawl depth                              |
| `-m, --max-pages <n>`    | number | `20`         | Maximum pages to discover                        |
| `-s, --scrape`           | flag   | -            | Also scrape content of discovered pages          |
| `-f, --format <formats>` | string | `"markdown"` | Output formats when scraping (comma-separated)   |
| `-o, --output <file>`    | string | stdout       | Output file path                                 |
| `--delay <ms>`           | number | `1000`       | Delay between requests in milliseconds           |
| `-t, --timeout <ms>`     | number | -            | Total timeout for crawl operation                |
| `--include <patterns>`   | string | -            | URL patterns to include (comma-separated regex)  |
| `--exclude <patterns>`   | string | -            | URL patterns to exclude (comma-separated regex)  |
| `--proxy <url>`          | string | -            | Proxy URL (e.g., `http://user:pass@host:port`)   |
| `--user-agent <string>`  | string | -            | Custom user agent string                         |
| `--show-chrome`          | flag   | -            | Show browser window for debugging                |
| `-v, --verbose`          | flag   | -            | Enable verbose logging                           |

## API Reference

### `ReaderClient`

The recommended way to use Reader. It manages the HeroCore lifecycle automatically.

```typescript
import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient({ verbose: true });

// Scrape
const result = await reader.scrape({ urls: ["https://example.com"] });

// Crawl
const crawlResult = await reader.crawl({ url: "https://example.com", depth: 2 });

// Close when done (optional - auto-closes on exit)
await reader.close();
```

#### Constructor Options

| Option          | Type                | Default         | Description                                       |
| --------------- | ------------------- | --------------- | ------------------------------------------------- |
| `verbose`       | `boolean`           | `false`         | Enable verbose logging                            |
| `showChrome`    | `boolean`           | `false`         | Show browser window for debugging                 |
| `browserPool`   | `BrowserPoolConfig` | `undefined`     | Browser pool configuration (size, recycling)      |
| `proxies`       | `ProxyConfig[]`     | `undefined`     | Array of proxies for rotation                     |
| `proxyRotation` | `string`            | `"round-robin"` | Rotation strategy: `"round-robin"` or `"random"`  |

#### BrowserPoolConfig

| Option               | Type     | Default | Description                         |
| -------------------- | -------- | ------- | ----------------------------------- |
| `size`               | `number` | `2`     | Number of browser instances in pool |
| `retireAfterPages`   | `number` | `100`   | Recycle browser after N page loads  |
| `retireAfterMinutes` | `number` | `30`    | Recycle browser after N minutes     |
| `maxQueueSize`       | `number` | `100`   | Max pending requests in queue       |

#### Methods

| Method            | Description                        |
| ----------------- | ---------------------------------- |
| `scrape(options)` | Scrape one or more URLs            |
| `crawl(options)`  | Crawl a website to discover pages  |
| `start()`         | Pre-initialize HeroCore (optional) |
| `isReady()`       | Check if client is initialized     |
| `close()`         | Close client and release resources |

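For long-lived processes it can help to pay the startup cost once, up front. A minimal lifecycle sketch using the methods above (assuming `start()` resolves once HeroCore is up and `isReady()` returns a boolean):

```typescript
import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

// Pre-initialize HeroCore so the first scrape doesn't pay the startup cost.
await reader.start();
console.log("ready:", reader.isReady()); // assumed to return a boolean

try {
  const result = await reader.scrape({ urls: ["https://example.com"] });
  console.log(result.batchMetadata.successfulUrls, "URLs scraped");
} finally {
  // Release browsers and HeroCore resources.
  await reader.close();
}
```
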
### `scrape(options): Promise<ScrapeResult>`

Scrape one or more URLs. Can be used directly or via `ReaderClient`.

| Option             | Type                                              | Required | Default        | Description                                                      |
| ------------------ | ------------------------------------------------- | -------- | -------------- | ---------------------------------------------------------------- |
| `urls`             | `string[]`                                        | Yes      | -              | Array of URLs to scrape                                          |
| `formats`          | `Array<"markdown" \| "html" \| "json" \| "text">` | No       | `["markdown"]` | Output formats                                                   |
| `includeMetadata`  | `boolean`                                         | No       | `true`         | Include URL, title, and timestamp in output                      |
| `userAgent`        | `string`                                          | No       | -              | Custom user agent string                                         |
| `timeoutMs`        | `number`                                          | No       | `30000`        | Request timeout in milliseconds                                  |
| `includePatterns`  | `string[]`                                        | No       | `[]`           | URL patterns to include (regex strings)                          |
| `excludePatterns`  | `string[]`                                        | No       | `[]`           | URL patterns to exclude (regex strings)                          |
| `batchConcurrency` | `number`                                          | No       | `1`            | Number of URLs to process in parallel                            |
| `batchTimeoutMs`   | `number`                                          | No       | `300000`       | Total timeout for the entire batch operation                     |
| `maxRetries`       | `number`                                          | No       | `2`            | Maximum retry attempts for failed URLs                           |
| `onProgress`       | `function`                                        | No       | -              | Progress callback: `({ completed, total, currentUrl }) => void`  |
| `proxy`            | `ProxyConfig`                                     | No       | -              | Proxy configuration object                                       |
| `waitForSelector`  | `string`                                          | No       | -              | CSS selector to wait for before the page is considered loaded    |
| `verbose`          | `boolean`                                         | No       | `false`        | Enable verbose logging                                           |
| `showChrome`       | `boolean`                                         | No       | `false`        | Show Chrome window for debugging                                 |
| `connectionToCore` | `any`                                             | No       | -              | Connection to a shared Hero Core (for production)                |

**Returns:** `Promise<ScrapeResult>`

```typescript
interface ScrapeResult {
  data: WebsiteScrapeResult[];
  batchMetadata: BatchMetadata;
}

interface WebsiteScrapeResult {
  markdown?: string;
  html?: string;
  json?: string;
  text?: string;
  metadata: {
    baseUrl: string;
    totalPages: number;
    scrapedAt: string;
    duration: number;
    website: WebsiteMetadata;
  };
}

interface BatchMetadata {
  totalUrls: number;
  successfulUrls: number;
  failedUrls: number;
  scrapedAt: string;
  totalDuration: number;
  errors?: Array<{ url: string; error: string }>;
}
```

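Most of these options compose freely. A sketch of calling `scrape()` directly (the URLs and selector are placeholders) that waits for a selector, retries failures, and reports errors from `batchMetadata`:

```typescript
import { scrape } from "@vakra-dev/reader";

const result = await scrape({
  urls: ["https://example.com/docs", "https://example.com/blog"], // placeholders
  formats: ["markdown"],
  waitForSelector: "main", // wait for the main content area before extracting
  maxRetries: 3,           // retry URLs that fail transiently
  batchConcurrency: 2,
});

// Successful pages land in `data`; failures are summarized in `batchMetadata.errors`.
for (const page of result.data) {
  console.log(page.metadata.baseUrl, page.markdown?.length ?? 0, "chars");
}
for (const failure of result.batchMetadata.errors ?? []) {
  console.error(`Failed: ${failure.url} (${failure.error})`);
}
```
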
### `crawl(options): Promise<CrawlResult>`

Crawl a website to discover pages.

| Option              | Type                                              | Required | Default                | Description                                      |
| ------------------- | ------------------------------------------------- | -------- | ---------------------- | ------------------------------------------------ |
| `url`               | `string`                                          | Yes      | -                      | Single seed URL to start crawling from           |
| `depth`             | `number`                                          | No       | `1`                    | Maximum depth to crawl                           |
| `maxPages`          | `number`                                          | No       | `20`                   | Maximum pages to discover                        |
| `scrape`            | `boolean`                                         | No       | `false`                | Also scrape full content of discovered pages     |
| `delayMs`           | `number`                                          | No       | `1000`                 | Delay between requests in milliseconds           |
| `timeoutMs`         | `number`                                          | No       | -                      | Total timeout for entire crawl operation         |
| `includePatterns`   | `string[]`                                        | No       | -                      | URL patterns to include (regex strings)          |
| `excludePatterns`   | `string[]`                                        | No       | -                      | URL patterns to exclude (regex strings)          |
| `formats`           | `Array<"markdown" \| "html" \| "json" \| "text">` | No       | `["markdown", "html"]` | Output formats for scraped content               |
| `scrapeConcurrency` | `number`                                          | No       | `2`                    | Number of URLs to scrape in parallel             |
| `proxy`             | `ProxyConfig`                                     | No       | -                      | Proxy configuration object                       |
| `userAgent`         | `string`                                          | No       | -                      | Custom user agent string                         |
| `verbose`           | `boolean`                                         | No       | `false`                | Enable verbose logging                           |
| `showChrome`        | `boolean`                                         | No       | `false`                | Show Chrome window for debugging                 |
| `connectionToCore`  | `any`                                             | No       | -                      | Connection to shared Hero Core (for production)  |

**Returns:** `Promise<CrawlResult>`

```typescript
interface CrawlResult {
  urls: CrawlUrl[];
  scraped?: ScrapeResult;
  metadata: CrawlMetadata;
}

interface CrawlUrl {
  url: string;
  title: string;
  description: string | null;
}

interface CrawlMetadata {
  totalUrls: number;
  maxDepth: number;
  totalDuration: number;
  seedUrl: string;
}
```

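A sketch of a filtered crawl using the options and result shapes above (the seed URL and regex patterns are illustrative):

```typescript
import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

// Discover blog pages only, skip admin URLs, and scrape what was found.
const result = await reader.crawl({
  url: "https://example.com",
  depth: 2,
  maxPages: 50,
  includePatterns: ["blog/.*"], // regex strings, per the table above
  excludePatterns: ["admin/.*"],
  scrape: true,
  scrapeConcurrency: 2,
});

for (const page of result.urls) {
  console.log(`${page.title} -> ${page.url}`);
}
console.log(`Discovered ${result.metadata.totalUrls} URLs from ${result.metadata.seedUrl}`);

await reader.close();
```
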
### ProxyConfig

| Option     | Type                            | Required | Default | Description                                              |
| ---------- | ------------------------------- | -------- | ------- | -------------------------------------------------------- |
| `url`      | `string`                        | No       | -       | Full proxy URL (takes precedence over other fields)      |
| `type`     | `"datacenter" \| "residential"` | No       | -       | Proxy type                                               |
| `host`     | `string`                        | No       | -       | Proxy host                                               |
| `port`     | `number`                        | No       | -       | Proxy port                                               |
| `username` | `string`                        | No       | -       | Proxy username                                           |
| `password` | `string`                        | No       | -       | Proxy password                                           |
| `country`  | `string`                        | No       | -       | Country code for residential proxies (e.g., 'us', 'uk')  |

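Since `url` takes precedence over the individual fields, a single proxy URL is often all you need (the host and credentials below are placeholders):

```typescript
import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

// `url` takes precedence over the individual fields, so a full proxy URL is enough;
// the field-by-field form is shown in the "With Proxy" quick start above.
await reader.scrape({
  urls: ["https://example.com"],
  proxy: { url: "http://username:password@proxy.example.com:8080" },
});

await reader.close();
```
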
## Advanced Usage

### Browser Pool

For high-volume scraping, use the browser pool directly:

```typescript
import { BrowserPool } from "@vakra-dev/reader";

const pool = new BrowserPool({ size: 5 });
await pool.initialize();

// Use withBrowser for automatic acquire/release
const title = await pool.withBrowser(async (hero) => {
  await hero.goto("https://example.com");
  return await hero.document.title;
});

// Check pool health
const health = await pool.healthCheck();
console.log(`Pool healthy: ${health.healthy}`);

await pool.shutdown();
```

### Shared Hero Core (Production)

For production servers, use a shared Hero Core to avoid spawning a new Chrome instance for each request:

```typescript
import HeroCore from "@ulixee/hero-core";
import { TransportBridge } from "@ulixee/net";
import { ConnectionToHeroCore } from "@ulixee/hero";
import { scrape } from "@vakra-dev/reader";

// Initialize once at startup
const heroCore = new HeroCore();
await heroCore.start();

// Create a connection for each request
function createConnection() {
  const bridge = new TransportBridge();
  heroCore.addConnection(bridge.transportToClient);
  return new ConnectionToHeroCore(bridge.transportToCore);
}

// Use in requests
const result = await scrape({
  urls: ["https://example.com"],
  connectionToCore: createConnection(),
});

// Shutdown on exit
await heroCore.close();
```

### Cloudflare Challenge Detection

```typescript
import { detectChallenge, waitForChallengeResolution } from "@vakra-dev/reader";

// `hero` is an active Hero instance (for example, from pool.withBrowser()).
const detection = await detectChallenge(hero);

if (detection.isChallenge) {
  console.log(`Challenge detected: ${detection.type}`);

  const result = await waitForChallengeResolution(hero, {
    maxWaitMs: 45000,
    pollIntervalMs: 500,
    verbose: true,
    initialUrl: await hero.url,
  });

  if (result.resolved) {
    console.log(`Challenge resolved via ${result.method} in ${result.waitedMs}ms`);
  }
}
```

### Custom Formatters

```typescript
import { formatToMarkdown, formatToText, formatToHTML, formatToJson } from "@vakra-dev/reader";

// Format scraped pages into different outputs.
// `pages`, `baseUrl`, `scrapedAt`, `duration`, and `metadata` come from your own scrape run.
const markdown = formatToMarkdown(pages, baseUrl, scrapedAt, duration, metadata);
const text = formatToText(pages, baseUrl, scrapedAt, duration, metadata);
const html = formatToHTML(pages, baseUrl, scrapedAt, duration, metadata);
const json = formatToJson(pages, baseUrl, scrapedAt, duration, metadata);
```

## How It Works

### Cloudflare Bypass

Reader uses [Ulixee Hero](https://ulixee.org/), a headless browser with advanced anti-detection:

1. **TLS Fingerprinting** - Emulates real Chrome browser fingerprints
2. **DNS over TLS** - Uses Cloudflare DNS (1.1.1.1) to mimic Chrome behavior
3. **WebRTC IP Masking** - Prevents IP leaks
4. **Multi-Signal Detection** - Detects challenges using DOM elements and text patterns
5. **Dynamic Waiting** - Polls for challenge resolution with URL redirect detection

### Browser Pool

- **Auto-Recycling** - Browsers recycled after 100 requests or 30 minutes
- **Health Monitoring** - Background health checks every 5 minutes
- **Request Queuing** - Queues requests when pool is full (max 100)

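These limits map to the `BrowserPoolConfig` options documented in the API reference above; a sketch with illustrative overrides:

```typescript
import { ReaderClient } from "@vakra-dev/reader";

// Illustrative overrides of the documented defaults
// (size: 2, retireAfterPages: 100, retireAfterMinutes: 30, maxQueueSize: 100).
const reader = new ReaderClient({
  browserPool: {
    size: 4,
    retireAfterPages: 200,
    retireAfterMinutes: 60,
    maxQueueSize: 250,
  },
});
```
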
## Documentation

| Guide                                      | Description                    |
| ------------------------------------------ | ------------------------------ |
| [Getting Started](docs/getting-started.md) | Detailed setup and first steps |
| [Architecture](docs/architecture.md)       | System design and data flow    |
| [API Reference](docs/api-reference.md)     | Complete API documentation     |
| [Troubleshooting](docs/troubleshooting.md) | Common errors and solutions    |

### Guides

| Guide                                                     | Description                   |
| --------------------------------------------------------- | ----------------------------- |
| [Cloudflare Bypass](docs/guides/cloudflare-bypass.md)     | How anti-bot bypass works     |
| [Proxy Configuration](docs/guides/proxy-configuration.md) | Setting up proxies            |
| [Browser Pool](docs/guides/browser-pool.md)               | Production browser management |
| [Output Formats](docs/guides/output-formats.md)           | Understanding output formats  |

### Deployment

| Guide                                                     | Description                |
| --------------------------------------------------------- | -------------------------- |
| [Docker](docs/deployment/docker.md)                       | Container deployment       |
| [Production Server](docs/deployment/production-server.md) | Express + shared Hero Core |
| [Job Queues](docs/deployment/job-queues.md)               | BullMQ async scheduling    |
| [Serverless](docs/deployment/serverless.md)               | Lambda, Vercel, Workers    |

### Examples

| Example                                                      | Description                                 |
| ------------------------------------------------------------ | ------------------------------------------- |
| [Basic Scraping](examples/basic/basic-scrape.ts)             | Simple single-URL scraping                  |
| [Batch Scraping](examples/basic/batch-scrape.ts)             | Concurrent multi-URL scraping               |
| [Browser Pool Config](examples/basic/browser-pool-config.ts) | Configure browser pool for high throughput  |
| [Proxy Pool](examples/basic/proxy-pool.ts)                   | Proxy rotation with multiple proxies        |
| [Cloudflare Bypass](examples/basic/cloudflare-bypass.ts)     | Scrape Cloudflare-protected sites           |
| [All Formats](examples/basic/all-formats.ts)                 | Output in markdown, html, json, text        |
| [Crawl Website](examples/basic/crawl-website.ts)             | Crawl and discover pages                    |
| [AI Tools](examples/ai-tools/)                               | OpenAI, Anthropic, LangChain integrations   |
| [Production](examples/production/)                           | Express server, job queues                  |
| [Deployment](examples/deployment/)                           | Docker, Lambda, Vercel                      |

## Development

```bash
# Install dependencies
npm install

# Run linting
npm run lint

# Format code
npm run format

# Type check
npm run typecheck

# Find TODOs
npm run todo
```

## Contributing

Contributions welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

## License

[Apache 2.0](LICENSE) - See LICENSE for details.

## Citation

If you use Reader in your research or project, please cite it:

```bibtex
@software{reader2026,
  author = {Kaul, Nihal},
  title = {Reader: Open-source, production-grade web scraping engine built for LLMs},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/vakra-dev/reader}
}
```

## Support

- [GitHub Issues](https://github.com/vakra-dev/reader/issues)
- [Documentation](https://github.com/vakra-dev/reader)