llm-search-tools 1.1.0

Files changed (126)
  1. package/LICENSE +21 -0
  2. package/README.md +244 -0
  3. package/dist/index.d.ts +18 -0
  4. package/dist/index.js +40 -0
  5. package/dist/index.js.map +1 -0
  6. package/dist/integration.test.d.ts +1 -0
  7. package/dist/integration.test.js +237 -0
  8. package/dist/modules/answerbox.test.d.ts +1 -0
  9. package/dist/modules/answerbox.test.js +105 -0
  10. package/dist/modules/autocomplete.d.ts +11 -0
  11. package/dist/modules/autocomplete.js +159 -0
  12. package/dist/modules/autocomplete.test.d.ts +1 -0
  13. package/dist/modules/autocomplete.test.js +188 -0
  14. package/dist/modules/common.d.ts +26 -0
  15. package/dist/modules/common.js +263 -0
  16. package/dist/modules/common.test.d.ts +1 -0
  17. package/dist/modules/common.test.js +87 -0
  18. package/dist/modules/crawl.d.ts +9 -0
  19. package/dist/modules/crawl.js +117 -0
  20. package/dist/modules/crawl.test.d.ts +1 -0
  21. package/dist/modules/crawl.test.js +48 -0
  22. package/dist/modules/events.d.ts +8 -0
  23. package/dist/modules/events.js +129 -0
  24. package/dist/modules/events.test.d.ts +1 -0
  25. package/dist/modules/events.test.js +104 -0
  26. package/dist/modules/finance.d.ts +10 -0
  27. package/dist/modules/finance.js +20 -0
  28. package/dist/modules/finance.test.d.ts +1 -0
  29. package/dist/modules/finance.test.js +77 -0
  30. package/dist/modules/flights.d.ts +8 -0
  31. package/dist/modules/flights.js +135 -0
  32. package/dist/modules/flights.test.d.ts +1 -0
  33. package/dist/modules/flights.test.js +128 -0
  34. package/dist/modules/hackernews.d.ts +8 -0
  35. package/dist/modules/hackernews.js +87 -0
  36. package/dist/modules/hackernews.js.map +1 -0
  37. package/dist/modules/images.test.d.ts +1 -0
  38. package/dist/modules/images.test.js +145 -0
  39. package/dist/modules/integrations.test.d.ts +1 -0
  40. package/dist/modules/integrations.test.js +93 -0
  41. package/dist/modules/media.d.ts +11 -0
  42. package/dist/modules/media.js +132 -0
  43. package/dist/modules/media.test.d.ts +1 -0
  44. package/dist/modules/media.test.js +186 -0
  45. package/dist/modules/news.d.ts +3 -0
  46. package/dist/modules/news.js +39 -0
  47. package/dist/modules/news.test.d.ts +1 -0
  48. package/dist/modules/news.test.js +88 -0
  49. package/dist/modules/parser.d.ts +19 -0
  50. package/dist/modules/parser.js +361 -0
  51. package/dist/modules/parser.test.d.ts +1 -0
  52. package/dist/modules/parser.test.js +151 -0
  53. package/dist/modules/reddit.d.ts +21 -0
  54. package/dist/modules/reddit.js +107 -0
  55. package/dist/modules/scrape.d.ts +16 -0
  56. package/dist/modules/scrape.js +272 -0
  57. package/dist/modules/scrape.test.d.ts +1 -0
  58. package/dist/modules/scrape.test.js +232 -0
  59. package/dist/modules/scraper.d.ts +12 -0
  60. package/dist/modules/scraper.js +640 -0
  61. package/dist/modules/scrapers/anidb.d.ts +8 -0
  62. package/dist/modules/scrapers/anidb.js +156 -0
  63. package/dist/modules/scrapers/duckduckgo.d.ts +6 -0
  64. package/dist/modules/scrapers/duckduckgo.js +284 -0
  65. package/dist/modules/scrapers/google-news.d.ts +2 -0
  66. package/dist/modules/scrapers/google-news.js +60 -0
  67. package/dist/modules/scrapers/google.d.ts +6 -0
  68. package/dist/modules/scrapers/google.js +211 -0
  69. package/dist/modules/scrapers/searxng.d.ts +2 -0
  70. package/dist/modules/scrapers/searxng.js +93 -0
  71. package/dist/modules/scrapers/thetvdb.d.ts +3 -0
  72. package/dist/modules/scrapers/thetvdb.js +147 -0
  73. package/dist/modules/scrapers/tmdb.d.ts +3 -0
  74. package/dist/modules/scrapers/tmdb.js +172 -0
  75. package/dist/modules/scrapers/yahoo-finance.d.ts +2 -0
  76. package/dist/modules/scrapers/yahoo-finance.js +33 -0
  77. package/dist/modules/search.d.ts +5 -0
  78. package/dist/modules/search.js +45 -0
  79. package/dist/modules/search.js.map +1 -0
  80. package/dist/modules/search.test.d.ts +1 -0
  81. package/dist/modules/search.test.js +219 -0
  82. package/dist/modules/urbandictionary.d.ts +12 -0
  83. package/dist/modules/urbandictionary.js +26 -0
  84. package/dist/modules/webpage.d.ts +4 -0
  85. package/dist/modules/webpage.js +150 -0
  86. package/dist/modules/webpage.js.map +1 -0
  87. package/dist/modules/wikipedia.d.ts +5 -0
  88. package/dist/modules/wikipedia.js +85 -0
  89. package/dist/modules/wikipedia.js.map +1 -0
  90. package/dist/scripts/interactive-search.d.ts +1 -0
  91. package/dist/scripts/interactive-search.js +98 -0
  92. package/dist/test.d.ts +1 -0
  93. package/dist/test.js +179 -0
  94. package/dist/test.js.map +1 -0
  95. package/dist/testBraveSearch.d.ts +1 -0
  96. package/dist/testBraveSearch.js +34 -0
  97. package/dist/testDuckDuckGo.d.ts +1 -0
  98. package/dist/testDuckDuckGo.js +52 -0
  99. package/dist/testEcosia.d.ts +1 -0
  100. package/dist/testEcosia.js +57 -0
  101. package/dist/testSearchModule.d.ts +1 -0
  102. package/dist/testSearchModule.js +95 -0
  103. package/dist/testwebpage.d.ts +1 -0
  104. package/dist/testwebpage.js +81 -0
  105. package/dist/types.d.ts +174 -0
  106. package/dist/types.js +3 -0
  107. package/dist/types.js.map +1 -0
  108. package/dist/utils/createTestDocx.d.ts +1 -0
  109. package/dist/utils/createTestDocx.js +58 -0
  110. package/dist/utils/htmlcleaner.d.ts +20 -0
  111. package/dist/utils/htmlcleaner.js +172 -0
  112. package/docs/README.md +275 -0
  113. package/docs/autocomplete.md +73 -0
  114. package/docs/crawling.md +88 -0
  115. package/docs/events.md +58 -0
  116. package/docs/examples.md +158 -0
  117. package/docs/finance.md +60 -0
  118. package/docs/flights.md +71 -0
  119. package/docs/hackernews.md +121 -0
  120. package/docs/media.md +87 -0
  121. package/docs/news.md +75 -0
  122. package/docs/parser.md +197 -0
  123. package/docs/scraper.md +347 -0
  124. package/docs/search.md +106 -0
  125. package/docs/wikipedia.md +91 -0
  126. package/package.json +97 -0
package/docs/flights.md ADDED
@@ -0,0 +1,71 @@
1
+ # Flights Search Documentation
2
+
3
+ The flights module allows you to search for flight prices and schedules using Google Flights. It uses headless browser automation to navigate the interface and extract flight data.
4
+
5
+ ## Usage
6
+
7
+ ```typescript
8
+ import { searchFlights } from "llm-search-tools";
9
+
10
+ // Simple string query
11
+ const results = await searchFlights("flights from JFK to LHR");
12
+
13
+ // Structured query
14
+ const structuredResults = await searchFlights({
15
+   from: "SFO",
16
+   to: "HND",
17
+   departureDate: "2025-05-01",
18
+   returnDate: "2025-05-15",
19
+ });
20
+
21
+ results.flights.forEach((flight) => {
22
+   console.log(`${flight.airline}: ${flight.price} (${flight.duration})`);
23
+ });
24
+ ```
25
+
26
+ ## API Reference
27
+
28
+ ### `searchFlights(query, options?)`
29
+
30
+ - **query**: `string | FlightSearchOptions` - Either a natural language string or an options object
31
+ - **options**: `FlightSearchOptions` - Additional options (if query is a string)
32
+
33
+ ### Options (`FlightSearchOptions`)
34
+
35
+ | Option | Type | Description |
36
+ | --------------- | ----------------------- | ----------------------------------------------- |
37
+ | `from` | `string` | Origin airport code or city |
38
+ | `to` | `string` | Destination airport code or city |
39
+ | `departureDate` | `string` | Departure date (YYYY-MM-DD or natural language) |
40
+ | `returnDate` | `string` | Return date (YYYY-MM-DD or natural language) |
41
+ | `limit` | `number` | Maximum results (default: 10) |
42
+ | `proxy` | `string \| ProxyConfig` | Proxy configuration |
43
+
44
+ ## Output Structure
45
+
46
+ Returns a `Promise<FlightResult>`:
47
+
48
+ ```typescript
49
+ interface FlightResult {
50
+   flights: Flight[];
51
+   url: string; // Google Flights URL
52
+   source: string; // 'google-flights'
53
+ }
54
+
55
+ interface Flight {
56
+   airline: string;
57
+   departureTime: string;
58
+   arrivalTime: string;
59
+   duration: string;
60
+   price: string;
61
+   stops: string;
62
+   origin?: string; // Inferred if available
63
+   destination?: string; // Inferred if available
64
+ }
65
+ ```
66
+
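Because `price` is a string, comparing flights means parsing it first. Below is a minimal sketch, assuming prices look like `"$1,234"` (the real format depends on what Google Flights renders for your locale). `parsePrice` and `cheapest` are hypothetical helpers, not part of the package.

```typescript
// Mirrors the Flight interface documented above.
interface Flight {
  airline: string;
  departureTime: string;
  arrivalTime: string;
  duration: string;
  price: string;
  stops: string;
  origin?: string;
  destination?: string;
}

// Hypothetical helper: keeps only digits and the decimal point.
// Assumes "." is the decimal separator and "," a thousands separator.
function parsePrice(price: string): number | undefined {
  const cleaned = price.replace(/[^0-9.]/g, "");
  if (cleaned === "") return undefined;
  const value = Number(cleaned);
  return Number.isFinite(value) ? value : undefined;
}

// Hypothetical helper: returns the flight with the lowest parseable price.
// Flights whose price cannot be parsed are skipped.
function cheapest(flights: Flight[]): Flight | undefined {
  let best: Flight | undefined;
  let bestPrice = Infinity;
  for (const flight of flights) {
    const value = parsePrice(flight.price);
    if (value !== undefined && value < bestPrice) {
      best = flight;
      bestPrice = value;
    }
  }
  return best;
}
```

With real data you would pass `results.flights` to `cheapest`.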
67
+ ## Limitations
68
+
69
+ - Google Flights is a complex, dynamic application. Selectors are heavily obfuscated and change frequently.
70
+ - This module uses heuristics to extract data (e.g., regex for times and prices).
71
+ - Performance depends on network speed and proxy quality, as it loads a full React application.
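The regex heuristics mentioned above could look roughly like the sketch below. The patterns and the `extractFields` helper are illustrative assumptions, not the ones the module ships; they only show the general idea of pulling times and prices out of the visible text of a result row.

```typescript
// Illustrative patterns only; the module's real heuristics may differ.
const TIME_PATTERN = /\b\d{1,2}:\d{2}\s?(?:AM|PM)\b/gi;
const PRICE_PATTERN = /[$€£]\s?\d[\d,]*(?:\.\d{2})?/;

interface ExtractedFields {
  departureTime: string | undefined;
  arrivalTime: string | undefined;
  price: string | undefined;
}

// Hypothetical helper: takes the first two times and the first price found
// in the text of one result row.
function extractFields(rowText: string): ExtractedFields {
  const times: string[] = rowText.match(TIME_PATTERN) ?? [];
  const price = rowText.match(PRICE_PATTERN);
  return {
    departureTime: times[0],
    arrivalTime: times[1],
    price: price ? price[0] : undefined,
  };
}
```

Text-based heuristics like this survive class-name changes, but they break if the page switches to a 24-hour clock or a currency outside the pattern, which is one reason results can be incomplete.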
package/docs/hackernews.md ADDED
@@ -0,0 +1,121 @@
1
+ # HackerNews Module 💻
2
+
3
+ Get the latest tech news and discussions from Hacker News.
4
+
5
+ ## Functions
6
+
7
+ ### getTopStories(limit?: number)
8
+
9
+ Get the current top stories.
10
+
11
+ ```typescript
12
+ import { getTopStories } from 'llm-search-tools';
13
+
14
+ const stories = await getTopStories(10);
15
+ ```
16
+
17
+ ### getNewStories(limit?: number)
18
+
19
+ Get the newest stories.
20
+
21
+ ```typescript
22
+ import { getNewStories } from 'llm-search-tools';
23
+
24
+ const stories = await getNewStories(10);
25
+ ```
26
+
27
+ ### getBestStories(limit?: number)
28
+
29
+ Get the best stories (on Hacker News, "best" ranks the highest-voted recent submissions rather than an all-time list).
30
+
31
+ ```typescript
32
+ import { getBestStories } from 'llm-search-tools';
33
+
34
+ const stories = await getBestStories(10);
35
+ ```
36
+
37
+ ### getAskStories(limit?: number)
38
+
39
+ Get "Ask HN" posts.
40
+
41
+ ```typescript
42
+ import { getAskStories } from 'llm-search-tools';
43
+
44
+ const stories = await getAskStories(10);
45
+ ```
46
+
47
+ ### getShowStories(limit?: number)
48
+
49
+ Get "Show HN" posts.
50
+
51
+ ```typescript
52
+ import { getShowStories } from 'llm-search-tools';
53
+
54
+ const stories = await getShowStories(10);
55
+ ```
56
+
57
+ ### getJobStories(limit?: number)
58
+
59
+ Get job postings.
60
+
61
+ ```typescript
62
+ import { getJobStories } from 'llm-search-tools';
63
+
64
+ const stories = await getJobStories(10);
65
+ ```
66
+
67
+ ### getStoryById(id: number)
68
+
69
+ Get a specific story by its ID.
70
+
71
+ ```typescript
72
+ import { getStoryById } from 'llm-search-tools';
73
+
74
+ const story = await getStoryById(123456);
75
+ ```
76
+
77
+ ## Result Format
78
+
79
+ ```typescript
80
+ interface HackerNewsResult extends SearchResult {
81
+   points?: number; // story points/score
82
+   author?: string; // post author
83
+   comments?: number; // number of comments
84
+   time?: Date; // post timestamp
85
+ }
86
+ ```
87
+
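Every extra field on `HackerNewsResult` is optional, so consumers need defaults. Here is a minimal sketch; the `SearchResult` base shown is an assumed minimal shape (the package's type may carry more fields), and `withMinPoints` and `summarize` are hypothetical helpers.

```typescript
// Assumed minimal shape of the base type.
interface SearchResult {
  title: string;
  url: string;
}

interface HackerNewsResult extends SearchResult {
  points?: number;
  author?: string;
  comments?: number;
  time?: Date;
}

// Hypothetical helper: keeps stories with at least `minPoints` points.
// Stories without a score (e.g. job postings) count as 0 points.
function withMinPoints(stories: HackerNewsResult[], minPoints: number): HackerNewsResult[] {
  return stories.filter((story) => (story.points ?? 0) >= minPoints);
}

// Hypothetical helper: one-line summary that tolerates missing fields.
function summarize(story: HackerNewsResult): string {
  const points = story.points ?? 0;
  const comments = story.comments ?? 0;
  const author = story.author ?? "unknown";
  return `${story.title} (${points} points, ${comments} comments, by ${author})`;
}
```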
88
+ ## Error Handling
89
+
90
+ All functions throw a `SearchError` on failure:
91
+
92
+ ```typescript
93
+ try {
94
+   const stories = await getTopStories();
95
+ } catch (err) {
96
+   if (err.code === 'HN_TOP_ERROR') {
97
+     console.error('failed to get top stories:', err.message);
98
+   }
99
+ }
100
+ ```
101
+
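Under TypeScript's `strict` settings the variable in `catch (err)` is typed `unknown`, so reading `err.code` directly does not compile. A small type guard works around this. The sketch below assumes only the two fields the example above uses (`code` and `message`); `CodedError`, `hasErrorCode`, and `describeFailure` are hypothetical names, not package exports.

```typescript
// Assumed shape: only the fields the example above relies on.
interface CodedError {
  code: string;
  message: string;
}

// Type guard so that `catch (err)` works when `err` is typed as `unknown`.
function hasErrorCode(err: unknown): err is CodedError {
  if (typeof err !== "object" || err === null) return false;
  const candidate = err as Record<string, unknown>;
  return typeof candidate["code"] === "string" && typeof candidate["message"] === "string";
}

// Hypothetical helper: turns a caught value into a log-friendly message.
function describeFailure(err: unknown): string {
  if (hasErrorCode(err) && err.code === "HN_TOP_ERROR") {
    return `failed to get top stories: ${err.message}`;
  }
  return "unexpected error";
}
```

Inside a `catch` block you would call `console.error(describeFailure(err))`.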
102
+ ## Tips
103
+
104
+ - Default limit is 10 stories for all functions
105
+ - Stories are returned in descending order (highest score first)
106
+ - Use `getStoryById()` to fetch full details of a story
107
+ - Comments count might be null for very new stories
108
+ - URLs point to HN discussion if no external URL exists
109
+ - Job stories won't have points or comments
110
+ - "Ask HN" and "Show HN" often have interesting discussions
111
+
112
+ ## Story Types
113
+
114
+ Each function gets different types of content:
115
+
116
+ - `getTopStories()` - Currently trending stories
117
+ - `getNewStories()` - Most recent submissions
118
+ - `getBestStories()` - Highest-voted recent stories
119
+ - `getAskStories()` - Questions and discussions
120
+ - `getShowStories()` - Project showcases
121
+ - `getJobStories()` - Job listings
package/docs/media.md ADDED
@@ -0,0 +1,87 @@
1
+ # Media Search Module 🎬
2
+
3
+ The media search module provides unified access to metadata from **TMDB**, **TheTVDB**, and **AniDB** without requiring API keys. It uses a smart fallback strategy to find the best results for Movies, TV Shows, and Anime.
4
+
5
+ ## Functions
6
+
7
+ ### searchMedia(query: string, options?: MediaSearchOptions)
8
+
9
+ Main function that orchestrates the search based on the requested media type.
10
+
11
+ ```typescript
12
+ import { searchMedia } from "llm-search-tools";
13
+
14
+ // General search (defaults to TMDB -> TheTVDB)
15
+ const results = await searchMedia("Breaking Bad");
16
+
17
+ // Specific Anime search (AniDB -> TMDB)
18
+ const animeResults = await searchMedia("Neon Genesis Evangelion", {
19
+   type: "anime",
20
+ });
21
+ ```
22
+
23
+ ## Strategies
24
+
25
+ The module uses different strategies based on the `type` option:
26
+
27
+ 1. **Anime** (`type: "anime"`)
28
+    - **Primary**: AniDB (Best for anime metadata)
29
+    - **Fallback**: TMDB (filtered for Animation/TV)
30
+
31
+ 2. **TV** (`type: "tv"`)
32
+    - **Primary**: TMDB (Fast, good coverage)
33
+    - **Fallback**: TheTVDB (Specialized TV database)
34
+
35
+ 3. **Movie / General** (Default)
36
+    - **Primary**: TMDB
37
+    - **Fallback**: TheTVDB (Only if not explicitly searching for movies)
38
+
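The provider order described above can be restated as a small lookup. This is only a sketch of the documented strategy, not the module's internal code; `providerOrder` is a hypothetical name.

```typescript
type MediaType = "movie" | "tv" | "anime";
type Provider = "tmdb" | "thetvdb" | "anidb";

// Provider order per the strategies listed above: primary first, then fallback.
function providerOrder(type?: MediaType): Provider[] {
  switch (type) {
    case "anime":
      return ["anidb", "tmdb"];
    case "tv":
      return ["tmdb", "thetvdb"];
    case "movie":
      return ["tmdb"]; // explicit movie searches skip TheTVDB
    default:
      return ["tmdb", "thetvdb"]; // general search, no type given
  }
}
```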
39
+ ## Options
40
+
41
+ ```typescript
42
+ interface MediaSearchOptions extends ScraperOptions {
43
+   type?: "movie" | "tv" | "anime";
44
+
45
+   // Inherited from ScraperOptions
46
+   limit?: number;
47
+   proxy?: ProxyConfig | string;
48
+   forcePuppeteer?: boolean; // Useful if getting blocked
49
+ }
50
+ ```
51
+
52
+ ## Result Format
53
+
54
+ Returns a consistent `MediaResult` object regardless of the source.
55
+
56
+ ```typescript
57
+ interface MediaResult {
58
+   title: string;
59
+   url: string;
60
+   description?: string;
61
+   rating?: string; // e.g. "85%" or "8.5"
62
+   releaseDate?: string;
63
+   posterUrl?: string;
64
+   genres?: string[];
65
+   cast?: string[];
66
+
67
+   // Source identification
68
+   source: "tmdb" | "thetvdb" | "anidb";
69
+   mediaType: "movie" | "tv" | "anime";
70
+
71
+   // Availability (if scraped)
72
+   watchProviders?: {
73
+     name: string;
74
+     type: "stream" | "rent" | "buy";
75
+   }[];
76
+ }
77
+ ```
78
+
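Since `rating` can arrive as either `"85%"` or `"8.5"`, comparing results across sources needs a common scale. Below is a minimal sketch that assumes percentages are out of 100 and bare numbers are already out of 10. That scale assumption and the `normalizeRating` helper are mine, not something the package guarantees.

```typescript
// Hypothetical helper: converts "85%" or "8.5" to a number on a 0-10 scale.
// Returns undefined when the rating is missing or not numeric.
function normalizeRating(rating?: string): number | undefined {
  if (rating === undefined) return undefined;
  const trimmed = rating.trim();
  const isPercent = trimmed.endsWith("%");
  const numeric = (isPercent ? trimmed.slice(0, -1) : trimmed).trim();
  if (numeric === "") return undefined;
  const value = Number(numeric);
  if (!Number.isFinite(value)) return undefined;
  return isPercent ? value / 10 : value;
}
```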
79
+ ## Anti-Bot Features
80
+
81
+ - **AniDB**: Enforces strict rate limiting (2.5s between requests) and always uses a stealth browser to avoid IP bans.
82
+ - **TMDB/TheTVDB**: Starts with lightweight HTTP requests and automatically falls back to a stealth Puppeteer browser if bot protection is detected.
83
+
84
+ ## Tips
85
+
86
+ - **Anime**: Always specify `{ type: "anime" }` for anime queries. AniDB has very strict matching but high-quality data.
87
+ - **Rate Limiting**: If you are making many requests, add delays between them. The module handles internal rate limits for specific providers, but aggressive usage might still trigger captchas.
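One way to add delays between your own requests is a small gate that spaces calls by a minimum interval. This is a sketch of the general technique, not the module's internal limiter. `MinIntervalGate` and `spaced` are hypothetical names, and the 2500 ms figure in the usage note simply reuses the AniDB interval mentioned above.

```typescript
// Tracks when the next request may start so that calls are spaced at least
// `intervalMs` apart. The clock is injectable to keep the logic testable.
class MinIntervalGate {
  private nextAllowedAt = 0;
  private readonly intervalMs: number;
  private readonly now: () => number;

  constructor(intervalMs: number, now: () => number = () => Date.now()) {
    this.intervalMs = intervalMs;
    this.now = now;
  }

  // Reserves the next slot and returns how long the caller should wait (ms).
  reserve(): number {
    const current = this.now();
    const startAt = Math.max(current, this.nextAllowedAt);
    this.nextAllowedAt = startAt + this.intervalMs;
    return startAt - current;
  }
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical wrapper: runs `task` once its reserved slot arrives.
async function spaced<T>(gate: MinIntervalGate, task: () => Promise<T>): Promise<T> {
  await sleep(gate.reserve());
  return task();
}
```

For example, create `const gate = new MinIntervalGate(2500)` once and wrap each call as `spaced(gate, () => searchMedia("..."))`.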
package/docs/news.md ADDED
@@ -0,0 +1,75 @@
1
+ # News Module 📰
2
+
3
+ The news module provides specialized news search capabilities using Google News and DuckDuckGo News.
4
+
5
+ ## Functions
6
+
7
+ ### searchNews(query: string, options?: ScraperOptions)
8
+
9
+ Main news search function that orchestrates multiple providers for resilience:
10
+
11
+ 1. **Google News** (Primary source, rich metadata)
12
+ 2. **DuckDuckGo News** (Fallback source)
13
+
14
+ ```typescript
15
+ import { searchNews } from "llm-search-tools";
16
+
17
+ const results = await searchNews("technology trends", {
18
+   limit: 10,
19
+   timeout: 10000,
20
+ });
21
+ ```
22
+
23
+ ### searchGoogleNews(query: string, options?: ScraperOptions)
24
+
25
+ Search using Google News specifically. Uses the `google-news-scraper` library.
26
+
27
+ ```typescript
28
+ import { searchGoogleNews } from "llm-search-tools";
29
+
30
+ const results = await searchGoogleNews("AI developments");
31
+ ```
32
+
33
+ ## Options
34
+
35
+ ```typescript
36
+ interface ScraperOptions {
37
+   limit?: number; // max number of results (default: 10)
38
+   timeout?: number; // request timeout in ms (default: 10000)
39
+   safeSearch?: boolean; // enable safe search (default: true)
40
+
41
+   // Proxy configuration
42
+   proxy?: string | ProxyConfig;
43
+ }
44
+ ```
45
+
46
+ ## Result Format
47
+
48
+ The module returns `NewsResult` objects which extend the standard `SearchResult`:
49
+
50
+ ```typescript
51
+ interface NewsResult extends SearchResult {
52
+   source: "google-news" | "duckduckgo-news";
53
+   sourceName?: string; // The specific publisher (e.g., "The Verge", "CNN")
54
+   publishedAt?: string; // Publication time string (e.g., "2 hours ago")
55
+   imageUrl?: string; // URL to the article's thumbnail image
56
+ }
57
+ ```
58
+
59
+ ## Caching
60
+
61
+ The Google News provider implements in-memory caching (TTL: 30 minutes) to prevent rate limiting and improve performance for repeated queries.
62
+
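For readers unfamiliar with TTL caching, the idea can be sketched as follows. This is a generic in-memory cache with per-entry expiry, not the provider's actual implementation. `TtlCache` is a hypothetical name, and the 30-minute value in the usage note comes from the paragraph above.

```typescript
// Minimal in-memory cache with per-entry expiry.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached value, or undefined if missing or expired.
  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // drop stale entries lazily
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

A 30-minute cache would be `new TtlCache<NewsResult[]>(30 * 60 * 1000)`, keyed by the query string.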
63
+ ## Error Handling
64
+
65
+ Like the search module, `searchNews` aggregates errors and only throws `ALL_NEWS_ENGINES_FAILED` if all providers fail.
66
+
67
+ ```typescript
68
+ try {
69
+   const results = await searchNews("query");
70
+ } catch (err) {
71
+   if (err.code === "ALL_NEWS_ENGINES_FAILED") {
72
+     console.error("News search failed:", err.errors);
73
+   }
74
+ }
75
+ ```
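The primary/fallback orchestration and error aggregation described in this document can be pictured with the sketch below. It is an illustration under assumptions, not the module's code: `AggregateSearchError`, `NewsProvider`, and `searchWithFallback` are hypothetical names, and treating an empty result as a reason to try the next provider is my assumption.

```typescript
// Assumed error shape, based on the `code` and `errors` fields used above.
class AggregateSearchError extends Error {
  readonly code: string;
  readonly errors: unknown[];

  constructor(code: string, errors: unknown[]) {
    super(`all providers failed (${errors.length})`);
    this.code = code;
    this.errors = errors;
  }
}

type NewsProvider<T> = (query: string) => Promise<T[]>;

// Tries providers in order and returns the first non-empty result.
// Failures are collected and only thrown if every provider failed.
async function searchWithFallback<T>(
  query: string,
  providers: NewsProvider<T>[],
): Promise<T[]> {
  const errors: unknown[] = [];
  for (const provider of providers) {
    try {
      const results = await provider(query);
      if (results.length > 0) return results;
    } catch (err) {
      errors.push(err);
    }
  }
  if (providers.length > 0 && errors.length === providers.length) {
    throw new AggregateSearchError("ALL_NEWS_ENGINES_FAILED", errors);
  }
  return []; // at least one provider answered, but none had results
}
```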
package/docs/parser.md ADDED
@@ -0,0 +1,197 @@
1
+ # Parser Module Documentation
2
+
3
+ The parser module provides a unified interface for parsing various file types into text and structured data. It now supports both file paths and raw buffers as input.
4
+
5
+ ## Supported File Types
6
+
7
+ - PDF documents (`.pdf`)
8
+ - Word documents (`.docx`) via Mammoth
9
+ - CSV files (`.csv`)
10
+ - Images (`.png`, `.jpg`, `.jpeg`, `.bmp`, `.gif`) via OCR
11
+ - Plain text files (`.txt`)
12
+ - XML documents (`.xml`)
13
+ - JSON files (`.json`)
14
+
15
+ ## Installation
16
+
17
+ The module requires several dependencies:
18
+
19
+ ```bash
20
+ npm install pdf-parse mammoth csv-parse tesseract.js fast-xml-parser
21
+ ```
22
+
23
+ ## Usage
24
+
25
+ ### Basic Usage
26
+
27
+ ```typescript
28
+ import { parse } from "llm-search-tools";
29
+ import { readFileSync } from "node:fs";
30
+ // parse a file by path (ez mode)
31
+ const result = await parse("path/to/file.pdf");
32
+ console.log(result.text);
33
+
34
+ // parse a buffer with known filename (recommended for buffers)
35
+ const docxBuffer = readFileSync("path/to/file.docx");
36
+ const docxResult = await parse(docxBuffer, {}, "file.docx");
37
+ console.log(docxResult.text);
38
+
39
+ // parse a buffer without filename (we'll try our best to detect type)
40
+ const rawBuffer = someBuffer; // any Buffer you already have in memory
41
+ const rawResult = await parse(rawBuffer);
42
+ console.log(rawResult.text);
43
+ ```
44
+
45
+ ### Type Detection for Buffers
46
+
47
+ When parsing buffers, the parser attempts to detect the file type in several ways:
48
+
49
+ 1. Using the provided filename hint (most reliable)
50
+ 2. Checking file magic numbers for binary formats
51
+ 3. Attempting JSON parsing for potential JSON data
52
+ 4. Looking for CSV patterns
53
+ 5. Falling back to plain text if nothing else matches
54
+
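The detection order above can be sketched as follows. This is an illustration of the technique, not the parser's actual code: `detectType` and its helpers are hypothetical, the CSV check is a deliberately naive comma count, and the byte signatures are the standard ones for each format.

```typescript
type DetectedType = "pdf" | "docx" | "csv" | "image" | "text" | "xml" | "json";

const EXTENSIONS = new Map<string, DetectedType>([
  ["pdf", "pdf"], ["docx", "docx"], ["csv", "csv"], ["txt", "text"],
  ["xml", "xml"], ["json", "json"], ["png", "image"], ["jpg", "image"],
  ["jpeg", "image"], ["bmp", "image"], ["gif", "image"],
]);

// DOCX files are ZIP containers, so "PK" alone cannot tell them apart from
// other ZIP-based formats; short signatures like "BM" can also match text.
const SIGNATURES: { type: DetectedType; bytes: number[] }[] = [
  { type: "pdf", bytes: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
  { type: "docx", bytes: [0x50, 0x4b, 0x03, 0x04] }, // "PK\x03\x04" (ZIP)
  { type: "image", bytes: [0x89, 0x50, 0x4e, 0x47] }, // PNG
  { type: "image", bytes: [0xff, 0xd8, 0xff] }, // JPEG
  { type: "image", bytes: [0x47, 0x49, 0x46, 0x38] }, // "GIF8"
  { type: "image", bytes: [0x42, 0x4d] }, // "BM" (BMP)
];

function startsWithBytes(data: Uint8Array, prefix: number[]): boolean {
  return prefix.length <= data.length && prefix.every((byte, i) => data[i] === byte);
}

// Decodes bytes one-to-one into characters; enough for the ASCII checks below.
function toText(data: Uint8Array): string {
  let text = "";
  data.forEach((byte) => {
    text += String.fromCharCode(byte);
  });
  return text;
}

// Naive CSV check: at least two non-empty lines with the same comma count.
function looksLikeCsv(text: string): boolean {
  const lines = text.split(/\r?\n/).filter((line) => line.trim() !== "");
  if (lines.length < 2) return false;
  const counts = lines.map((line) => line.split(",").length);
  return counts.every((count) => count > 1 && count === counts[0]);
}

function detectType(data: Uint8Array, filename?: string): DetectedType {
  // 1. Filename hint
  if (filename !== undefined) {
    const hinted = EXTENSIONS.get(filename.toLowerCase().split(".").pop() ?? "");
    if (hinted !== undefined) return hinted;
  }
  // 2. Magic numbers for binary formats
  for (const signature of SIGNATURES) {
    if (startsWithBytes(data, signature.bytes)) return signature.type;
  }
  const text = toText(data).trim();
  // 3. JSON (only objects/arrays, since bare numbers also parse as JSON)
  if (text.startsWith("{") || text.startsWith("[")) {
    try {
      JSON.parse(text);
      return "json";
    } catch {
      // not valid JSON, keep going
    }
  }
  // 4. CSV patterns
  if (looksLikeCsv(text)) return "csv";
  // 5. Fallback
  return "text";
}
```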
55
+ ### With Options
56
+
57
+ ```typescript
58
+ // Parse CSV with custom options
59
+ const csvResult = await parse("data.csv", {
60
+   csv: {
61
+     delimiter: ";",
62
+     columns: true,
63
+   },
64
+ });
65
+
66
+ // OCR with different language
67
+ const imageResult = await parse("image.png", {
68
+   language: "spa", // Spanish
69
+ });
70
+ ```
71
+
72
+ ## Return Type
73
+
74
+ The parser returns a `ParseResult` object:
75
+
76
+ ```typescript
77
+ interface ParseResult {
78
+   type:
79
+     | "pdf"
80
+     | "docx"
81
+     | "csv"
82
+     | "image"
83
+     | "text"
84
+     | "xml"
85
+     | "json"
86
+     | "unknown";
87
+   text: string; // extracted text content
88
+   metadata?: Record<string, unknown>; // file metadata
89
+   data?: unknown; // structured data if we got it (xml/json/csv mostly)
90
+ }
91
+ ```
92
+
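Because `data` is `unknown` and `metadata` is optional, both need narrowing before use. Here is a minimal sketch; `csvRows` and `numericMeta` are hypothetical helpers, not package exports.

```typescript
// Mirrors the ParseResult interface documented above.
interface ParseResult {
  type: "pdf" | "docx" | "csv" | "image" | "text" | "xml" | "json" | "unknown";
  text: string;
  metadata?: Record<string, unknown>;
  data?: unknown;
}

// Hypothetical helper: returns CSV rows as an array, or [] when the result
// is not a CSV or carries no structured data.
function csvRows(result: ParseResult): unknown[] {
  if (result.type !== "csv" || !Array.isArray(result.data)) return [];
  return result.data;
}

// Hypothetical helper: reads a numeric metadata field such as a page count.
function numericMeta(result: ParseResult, key: string): number | undefined {
  const value = result.metadata?.[key];
  return typeof value === "number" ? value : undefined;
}
```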
93
+ ## File Type Specific Features
94
+
95
+ ### PDF Files
96
+
97
+ - Extracts text content
98
+ - Provides metadata (page count, PDF info, version)
99
+ - Preserves document structure
100
+
101
+ ### DOCX Files
102
+
103
+ - Extracts text content
104
+ - Enhanced document handling via Mammoth
105
+ - Supports both HTML and raw text extraction
106
+ - Handles document formatting and structure
107
+ - Automatically cleans and preserves formatting
108
+
109
+ ### CSV Files
110
+
111
+ - Extracts raw text
112
+ - Provides structured data array
113
+ - Supports custom delimiters
114
+ - Optional column headers
115
+
116
+ ### Images (OCR)
117
+
118
+ - Extracts text via Tesseract OCR
119
+ - Supports multiple languages
120
+ - Returns confidence scores
121
+ - Handles common image formats
122
+
123
+ ## Error Handling
124
+
125
+ ```typescript
126
+ try {
127
+   const result = await parse("file.pdf");
128
+ } catch (error) {
129
+   if (error.code === "PDF_PARSE_ERROR") {
130
+     // Handle PDF-specific error
131
+   } else if (error.code === "DOCX_PARSE_ERROR") {
132
+     // Handle DOCX-specific error
133
+   }
134
+   // Generic error handling
135
+   console.error(error.message);
136
+ }
137
+ ```
138
+
139
+ ## Supported Error Codes
140
+
141
+ - `PDF_PARSE_ERROR`: pdf parsing failed (ugh these pdfs i swear)
142
+ - `DOCX_PARSE_ERROR`: docx parsing failed (word docs are the worst)
143
+ - `CSV_PARSE_ERROR`: csv parsing went sideways
144
+ - `IMAGE_PARSE_ERROR`: ocr failed (probably bad image quality or smth)
145
+ - `TEXT_PARSE_ERROR`: somehow failed to parse plain text (how even?)
146
+ - `XML_PARSE_ERROR`: xml parsing went wrong (invalid format probs)
147
+ - `JSON_PARSE_ERROR`: json parsing died (check your brackets)
148
+ - `PARSE_ERROR`: generic error when everything else fails
149
+
150
+ ## Language Support
151
+
152
+ For OCR (image parsing), the following languages are supported:
153
+
154
+ - English (default, 'eng')
155
+ - Spanish ('spa')
156
+ - French ('fra')
157
+ - German ('deu')
158
+ - And [many more](https://tesseract-ocr.github.io/tessdoc/Data-Files#data-files-for-version-400-november-29-2016)
159
+
160
+ ## Performance Considerations
161
+
162
+ - Image OCR is CPU-intensive and may take longer
163
+ - Large PDFs may require more memory
164
+ - Consider using streams for large files
165
+ - CSV parsing is typically very fast
166
+
167
+ ## Examples
168
+
169
+ ### Parse PDF and Extract Text
170
+
171
+ ```typescript
172
+ const result = await parse("document.pdf");
173
+ console.log(`Pages: ${result.metadata?.pages}`);
174
+ console.log(`Text: ${result.text}`);
175
+ ```
176
+
177
+ ### Parse CSV with Custom Options
178
+
179
+ ```typescript
180
+ const result = await parse("data.csv", {
181
+   csv: {
182
+     delimiter: ";",
183
+     columns: true,
184
+   },
185
+ });
186
+ console.log(`Rows: ${result.metadata?.rowCount}`);
187
+ console.log(`Data:`, result.data);
188
+ ```
189
+
190
+ ### OCR Image in Different Language
191
+
192
+ ```typescript
193
+ const result = await parse("chinese-text.png", {
194
+   language: "chi_sim",
195
+ });
196
+ console.log(`Extracted Text: ${result.text}`);
197
+ ```