llm-search-tools 1.1.0
- package/LICENSE +21 -0
- package/README.md +244 -0
- package/dist/index.d.ts +18 -0
- package/dist/index.js +40 -0
- package/dist/index.js.map +1 -0
- package/dist/integration.test.d.ts +1 -0
- package/dist/integration.test.js +237 -0
- package/dist/modules/answerbox.test.d.ts +1 -0
- package/dist/modules/answerbox.test.js +105 -0
- package/dist/modules/autocomplete.d.ts +11 -0
- package/dist/modules/autocomplete.js +159 -0
- package/dist/modules/autocomplete.test.d.ts +1 -0
- package/dist/modules/autocomplete.test.js +188 -0
- package/dist/modules/common.d.ts +26 -0
- package/dist/modules/common.js +263 -0
- package/dist/modules/common.test.d.ts +1 -0
- package/dist/modules/common.test.js +87 -0
- package/dist/modules/crawl.d.ts +9 -0
- package/dist/modules/crawl.js +117 -0
- package/dist/modules/crawl.test.d.ts +1 -0
- package/dist/modules/crawl.test.js +48 -0
- package/dist/modules/events.d.ts +8 -0
- package/dist/modules/events.js +129 -0
- package/dist/modules/events.test.d.ts +1 -0
- package/dist/modules/events.test.js +104 -0
- package/dist/modules/finance.d.ts +10 -0
- package/dist/modules/finance.js +20 -0
- package/dist/modules/finance.test.d.ts +1 -0
- package/dist/modules/finance.test.js +77 -0
- package/dist/modules/flights.d.ts +8 -0
- package/dist/modules/flights.js +135 -0
- package/dist/modules/flights.test.d.ts +1 -0
- package/dist/modules/flights.test.js +128 -0
- package/dist/modules/hackernews.d.ts +8 -0
- package/dist/modules/hackernews.js +87 -0
- package/dist/modules/hackernews.js.map +1 -0
- package/dist/modules/images.test.d.ts +1 -0
- package/dist/modules/images.test.js +145 -0
- package/dist/modules/integrations.test.d.ts +1 -0
- package/dist/modules/integrations.test.js +93 -0
- package/dist/modules/media.d.ts +11 -0
- package/dist/modules/media.js +132 -0
- package/dist/modules/media.test.d.ts +1 -0
- package/dist/modules/media.test.js +186 -0
- package/dist/modules/news.d.ts +3 -0
- package/dist/modules/news.js +39 -0
- package/dist/modules/news.test.d.ts +1 -0
- package/dist/modules/news.test.js +88 -0
- package/dist/modules/parser.d.ts +19 -0
- package/dist/modules/parser.js +361 -0
- package/dist/modules/parser.test.d.ts +1 -0
- package/dist/modules/parser.test.js +151 -0
- package/dist/modules/reddit.d.ts +21 -0
- package/dist/modules/reddit.js +107 -0
- package/dist/modules/scrape.d.ts +16 -0
- package/dist/modules/scrape.js +272 -0
- package/dist/modules/scrape.test.d.ts +1 -0
- package/dist/modules/scrape.test.js +232 -0
- package/dist/modules/scraper.d.ts +12 -0
- package/dist/modules/scraper.js +640 -0
- package/dist/modules/scrapers/anidb.d.ts +8 -0
- package/dist/modules/scrapers/anidb.js +156 -0
- package/dist/modules/scrapers/duckduckgo.d.ts +6 -0
- package/dist/modules/scrapers/duckduckgo.js +284 -0
- package/dist/modules/scrapers/google-news.d.ts +2 -0
- package/dist/modules/scrapers/google-news.js +60 -0
- package/dist/modules/scrapers/google.d.ts +6 -0
- package/dist/modules/scrapers/google.js +211 -0
- package/dist/modules/scrapers/searxng.d.ts +2 -0
- package/dist/modules/scrapers/searxng.js +93 -0
- package/dist/modules/scrapers/thetvdb.d.ts +3 -0
- package/dist/modules/scrapers/thetvdb.js +147 -0
- package/dist/modules/scrapers/tmdb.d.ts +3 -0
- package/dist/modules/scrapers/tmdb.js +172 -0
- package/dist/modules/scrapers/yahoo-finance.d.ts +2 -0
- package/dist/modules/scrapers/yahoo-finance.js +33 -0
- package/dist/modules/search.d.ts +5 -0
- package/dist/modules/search.js +45 -0
- package/dist/modules/search.js.map +1 -0
- package/dist/modules/search.test.d.ts +1 -0
- package/dist/modules/search.test.js +219 -0
- package/dist/modules/urbandictionary.d.ts +12 -0
- package/dist/modules/urbandictionary.js +26 -0
- package/dist/modules/webpage.d.ts +4 -0
- package/dist/modules/webpage.js +150 -0
- package/dist/modules/webpage.js.map +1 -0
- package/dist/modules/wikipedia.d.ts +5 -0
- package/dist/modules/wikipedia.js +85 -0
- package/dist/modules/wikipedia.js.map +1 -0
- package/dist/scripts/interactive-search.d.ts +1 -0
- package/dist/scripts/interactive-search.js +98 -0
- package/dist/test.d.ts +1 -0
- package/dist/test.js +179 -0
- package/dist/test.js.map +1 -0
- package/dist/testBraveSearch.d.ts +1 -0
- package/dist/testBraveSearch.js +34 -0
- package/dist/testDuckDuckGo.d.ts +1 -0
- package/dist/testDuckDuckGo.js +52 -0
- package/dist/testEcosia.d.ts +1 -0
- package/dist/testEcosia.js +57 -0
- package/dist/testSearchModule.d.ts +1 -0
- package/dist/testSearchModule.js +95 -0
- package/dist/testwebpage.d.ts +1 -0
- package/dist/testwebpage.js +81 -0
- package/dist/types.d.ts +174 -0
- package/dist/types.js +3 -0
- package/dist/types.js.map +1 -0
- package/dist/utils/createTestDocx.d.ts +1 -0
- package/dist/utils/createTestDocx.js +58 -0
- package/dist/utils/htmlcleaner.d.ts +20 -0
- package/dist/utils/htmlcleaner.js +172 -0
- package/docs/README.md +275 -0
- package/docs/autocomplete.md +73 -0
- package/docs/crawling.md +88 -0
- package/docs/events.md +58 -0
- package/docs/examples.md +158 -0
- package/docs/finance.md +60 -0
- package/docs/flights.md +71 -0
- package/docs/hackernews.md +121 -0
- package/docs/media.md +87 -0
- package/docs/news.md +75 -0
- package/docs/parser.md +197 -0
- package/docs/scraper.md +347 -0
- package/docs/search.md +106 -0
- package/docs/wikipedia.md +91 -0
- package/package.json +97 -0
package/docs/flights.md
ADDED
|
@@ -0,0 +1,71 @@
# Flights Search Documentation

The flights module searches for flight prices and schedules using Google Flights. It uses headless browser automation to navigate the interface and extract flight data.

## Usage

```typescript
import { searchFlights } from "llm-search-tools";

// Simple string query
const quickResults = await searchFlights("flights from JFK to LHR");

// Structured query
const results = await searchFlights({
  from: "SFO",
  to: "HND",
  departureDate: "2025-05-01",
  returnDate: "2025-05-15",
});

results.flights.forEach((flight) => {
  console.log(`${flight.airline}: ${flight.price} (${flight.duration})`);
});
```

## API Reference

### `searchFlights(query, options?)`

- **query**: `string | FlightSearchOptions` - Either a natural language string or an options object
- **options**: `FlightSearchOptions` - Additional options (if query is a string)

### Options (`FlightSearchOptions`)

| Option          | Type                    | Description                                     |
| --------------- | ----------------------- | ----------------------------------------------- |
| `from`          | `string`                | Origin airport code or city                     |
| `to`            | `string`                | Destination airport code or city                |
| `departureDate` | `string`                | Departure date (YYYY-MM-DD or natural language) |
| `returnDate`    | `string`                | Return date (YYYY-MM-DD or natural language)    |
| `limit`         | `number`                | Maximum results (default: 10)                   |
| `proxy`         | `string \| ProxyConfig` | Proxy configuration                             |
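
To make the string form of `query` more concrete, here is a minimal sketch of how a phrase such as "flights from JFK to LHR" could be mapped onto the `from` and `to` options. The `parseFlightQuery` helper and its regular expression are hypothetical and written only for illustration; the module's own parsing is not shown in these docs and may differ.

```typescript
// Hypothetical helper, not part of llm-search-tools.
// Maps "… from X to Y" onto the `from` / `to` option fields.
interface ParsedFlightQuery {
  from: string;
  to: string;
}

function parseFlightQuery(query: string): ParsedFlightQuery | null {
  // The origin is captured lazily so the first " to " separates the two places.
  const match = /\bfrom\s+(.+?)\s+to\s+(.+)$/i.exec(query.trim());
  if (!match) return null; // caller should fall back to structured options
  return { from: match[1].trim(), to: match[2].trim() };
}

console.log(parseFlightQuery("flights from JFK to LHR"));
```

When the phrase does not match, the helper returns `null` rather than guessing, which is usually the safer choice before driving a browser session.
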

## Output Structure

Returns a `Promise<FlightResult>`:

```typescript
interface FlightResult {
  flights: Flight[];
  url: string; // Google Flights URL
  source: string; // 'google-flights'
}

interface Flight {
  airline: string;
  departureTime: string;
  arrivalTime: string;
  duration: string;
  price: string;
  stops: string;
  origin?: string; // Inferred if available
  destination?: string; // Inferred if available
}
```

## Limitations

- Google Flights is a complex, dynamic application. Selectors are heavily obfuscated and change frequently.
- This module uses heuristics to extract data (e.g., regex for times and prices).
- Performance depends on network speed and proxy quality, as it loads a full client-side web application.
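
The regex heuristics mentioned above can be pictured with a small sketch. The patterns and the `extractFlightFields` helper below are assumptions made for illustration, not the expressions the module actually uses; they only show why text-based extraction keeps working when class names change.

```typescript
// Illustrative patterns only; the module's real expressions are not shown in this document.
const TIME_PATTERN = /\b\d{1,2}:\d{2}\s?(?:AM|PM)\b/gi;
const PRICE_PATTERN = /[$€£]\s?\d[\d,]*(?:\.\d{2})?/;

interface ExtractedFlightFields {
  times: string[]; // every clock time found, in document order
  price: string | null; // first currency amount, or null if none
}

// Hypothetical helper: pulls times and a price out of one result card's visible text.
function extractFlightFields(cardText: string): ExtractedFlightFields {
  const times = Array.from(cardText.match(TIME_PATTERN) ?? []);
  const priceMatch = PRICE_PATTERN.exec(cardText);
  return { times, price: priceMatch ? priceMatch[0] : null };
}

console.log(extractFlightFields("10:30 AM - 2:45 PM United 7 hr 15 min Nonstop $1,234"));
```

Because the helper reads rendered text rather than selectors, it tolerates markup changes, but it can still misfire when the page shows extra times or prices inside a card.
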
|
package/docs/hackernews.md
ADDED
|
@@ -0,0 +1,121 @@
# HackerNews Module 💻

Get the latest tech news and discussions from Hacker News.

## Functions

### getTopStories(limit?: number)

Get the current top stories.

```typescript
import { getTopStories } from 'llm-search-tools';

const stories = await getTopStories(10);
```

### getNewStories(limit?: number)

Get the newest stories.

```typescript
import { getNewStories } from 'llm-search-tools';

const stories = await getNewStories(10);
```

### getBestStories(limit?: number)

Get the best stories of all time.

```typescript
import { getBestStories } from 'llm-search-tools';

const stories = await getBestStories(10);
```

### getAskStories(limit?: number)

Get "Ask HN" posts.

```typescript
import { getAskStories } from 'llm-search-tools';

const stories = await getAskStories(10);
```

### getShowStories(limit?: number)

Get "Show HN" posts.

```typescript
import { getShowStories } from 'llm-search-tools';

const stories = await getShowStories(10);
```

### getJobStories(limit?: number)

Get job postings.

```typescript
import { getJobStories } from 'llm-search-tools';

const stories = await getJobStories(10);
```

### getStoryById(id: number)

Get a specific story by its ID.

```typescript
import { getStoryById } from 'llm-search-tools';

const story = await getStoryById(123456);
```

## Result Format

```typescript
interface HackerNewsResult extends SearchResult {
  points?: number; // story points/score
  author?: string; // post author
  comments?: number; // number of comments
  time?: Date; // post timestamp
}
```

## Error Handling

All functions throw a `SearchError` on failure:

```typescript
try {
  const stories = await getTopStories();
} catch (err) {
  if (err.code === 'HN_TOP_ERROR') {
    console.error('failed to get top stories:', err.message);
  }
}
```
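
Under strict TypeScript settings the caught value is typed `unknown`, so reading `err.code` directly may not compile. One possible way around that is a small type guard. The `CodedError` shape and both helper names below are assumptions for illustration; the docs only show that the thrown error carries `code` and `message`.

```typescript
// Assumed minimal shape of the thrown error; only `code` and `message` appear in the docs.
interface CodedError {
  code: string;
  message: string;
}

// Hypothetical type guard: narrows an unknown caught value to CodedError.
function hasErrorCode(err: unknown): err is CodedError {
  return (
    typeof err === "object" &&
    err !== null &&
    typeof (err as { code?: unknown }).code === "string" &&
    typeof (err as { message?: unknown }).message === "string"
  );
}

// Hypothetical helper: turns a caught value into a log-friendly message.
function describeFailure(err: unknown): string {
  if (hasErrorCode(err) && err.code === "HN_TOP_ERROR") {
    return `failed to get top stories: ${err.message}`;
  }
  return "unexpected error";
}

console.log(describeFailure({ code: "HN_TOP_ERROR", message: "timeout" }));
```
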

## Tips

- Default limit is 10 stories for all functions
- Stories are returned in descending order (highest score first)
- Use `getStoryById()` to fetch full details of a story
- Comments count might be null for very new stories
- URLs point to HN discussion if no external URL exists
- Job stories won't have points or comments
- "Ask HN" and "Show HN" often have interesting discussions

## Story Types

Each function gets different types of content:

- `getTopStories()` - Currently trending stories
- `getNewStories()` - Most recent submissions
- `getBestStories()` - Highest voted all time
- `getAskStories()` - Questions and discussions
- `getShowStories()` - Project showcases
- `getJobStories()` - Job listings
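
Since job stories lack points and comments, and very new stories may have a null comment count, display code should treat those fields as optional. The sketch below uses a local `StoryLike` type as a stand-in for `HackerNewsResult`; the `title` and `url` fields are assumed to come from the base `SearchResult`, which this document does not show.

```typescript
// Local stand-in for the documented result shape (assumption: title/url come from SearchResult).
interface StoryLike {
  title: string;
  url: string;
  points?: number;
  comments?: number | null;
}

// Hypothetical formatter: prints "n/a" for missing values but keeps a genuine 0.
function formatStory(story: StoryLike): string {
  const points = story.points ?? "n/a";
  const comments = story.comments ?? "n/a";
  return `${story.title} (${points} points, ${comments} comments) ${story.url}`;
}

console.log(formatStory({ title: "Hiring engineers", url: "https://example.com/job" }));
```

Using `??` rather than `||` matters here: a story with zero comments should print `0`, not `n/a`.
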
|
package/docs/media.md
ADDED
|
@@ -0,0 +1,87 @@
# Media Search Module 🎬

The media search module provides unified access to metadata from **TMDB**, **TheTVDB**, and **AniDB** without requiring API keys. It uses a smart fallback strategy to find the best results for Movies, TV Shows, and Anime.

## Functions

### searchMedia(query: string, options?: MediaSearchOptions)

Main function that orchestrates the search based on the requested media type.

```typescript
import { searchMedia } from "llm-search-tools";

// General search (defaults to TMDB -> TheTVDB)
const results = await searchMedia("Breaking Bad");

// Specific Anime search (AniDB -> TMDB)
const animeResults = await searchMedia("Neon Genesis Evangelion", {
  type: "anime",
});
```

## Strategies

The module uses different strategies based on the `type` option:

1. **Anime** (`type: "anime"`)
   - **Primary**: AniDB (Best for anime metadata)
   - **Fallback**: TMDB (filtered for Animation/TV)

2. **TV** (`type: "tv"`)
   - **Primary**: TMDB (Fast, good coverage)
   - **Fallback**: TheTVDB (Specialized TV database)

3. **Movie / General** (Default)
   - **Primary**: TMDB
   - **Fallback**: TheTVDB (Only if not explicitly searching for movies)
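
The three strategies above amount to an ordered provider list per media type. A minimal sketch of that mapping follows; `providerOrder` is a hypothetical name chosen for illustration, and the module's internal structure may look different.

```typescript
type MediaType = "movie" | "tv" | "anime";
type Provider = "tmdb" | "thetvdb" | "anidb";

// Hypothetical helper: returns providers in the order they would be tried.
function providerOrder(type?: MediaType): Provider[] {
  switch (type) {
    case "anime":
      return ["anidb", "tmdb"]; // AniDB first, TMDB as fallback
    case "tv":
      return ["tmdb", "thetvdb"]; // TMDB first, TheTVDB as fallback
    case "movie":
      return ["tmdb"]; // explicit movie search: no TheTVDB fallback
    default:
      return ["tmdb", "thetvdb"]; // general search
  }
}

console.log(providerOrder("anime"));
```
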

## Options

```typescript
interface MediaSearchOptions extends ScraperOptions {
  type?: "movie" | "tv" | "anime";

  // Inherited from ScraperOptions
  limit?: number;
  proxy?: ProxyConfig | string;
  forcePuppeteer?: boolean; // Useful if getting blocked
}
```

## Result Format

Returns a consistent `MediaResult` object regardless of the source.

```typescript
interface MediaResult {
  title: string;
  url: string;
  description?: string;
  rating?: string; // e.g. "85%" or "8.5"
  releaseDate?: string;
  posterUrl?: string;
  genres?: string[];
  cast?: string[];

  // Source identification
  source: "tmdb" | "thetvdb" | "anidb";
  mediaType: "movie" | "tv" | "anime";

  // Availability (if scraped)
  watchProviders?: {
    name: string;
    type: "stream" | "rent" | "buy";
  }[];
}
```

## Anti-Bot Features

- **AniDB**: Enforces strict rate limiting (2.5s between requests) and always uses a stealth browser to avoid IP bans.
- **TMDB/TheTVDB**: Starts with lightweight HTTP requests and automatically falls back to a stealth Puppeteer browser if bot protection is detected.
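
A fixed gap between requests, such as the 2.5s AniDB spacing described above, can be enforced with a small gate that tells each caller how long to wait. The `MinIntervalGate` class below is a sketch written for illustration under that assumption; it is not the module's implementation. The clock is passed in so the behaviour is easy to check.

```typescript
// Hypothetical rate gate: guarantees at least `intervalMs` between reserved slots.
class MinIntervalGate {
  private readonly intervalMs: number;
  private nextAllowedAt = 0;

  constructor(intervalMs: number) {
    this.intervalMs = intervalMs;
  }

  // Reserves the next slot and returns how many ms the caller should wait first.
  reserve(now: number = Date.now()): number {
    const startAt = Math.max(now, this.nextAllowedAt);
    this.nextAllowedAt = startAt + this.intervalMs;
    return startAt - now;
  }
}

const gate = new MinIntervalGate(2500);
console.log(gate.reserve(1000), gate.reserve(1500));
```

In real use a caller would wait with something like `await new Promise((resolve) => setTimeout(resolve, gate.reserve()))` before sending each request.
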

## Tips

- **Anime**: Always specify `{ type: "anime" }` for anime queries. AniDB has very strict matching but high-quality data.
- **Rate Limiting**: If you are making many requests, add delays between them. The module handles internal rate limits for specific providers, but aggressive usage might still trigger captchas.
|
package/docs/news.md
ADDED
|
@@ -0,0 +1,75 @@
# News Module 📰

The news module provides specialized news search capabilities using Google News and DuckDuckGo News.

## Functions

### searchNews(query: string, options?: ScraperOptions)

Main news search function that orchestrates multiple providers for resilience:

1. **Google News** (Primary source, rich metadata)
2. **DuckDuckGo News** (Fallback source)

```typescript
import { searchNews } from "llm-search-tools";

const results = await searchNews("technology trends", {
  limit: 10,
  timeout: 10000,
});
```

### searchGoogleNews(query: string, options?: ScraperOptions)

Search using Google News specifically. Uses the `google-news-scraper` library.

```typescript
import { searchGoogleNews } from "llm-search-tools";

const results = await searchGoogleNews("AI developments");
```

## Options

```typescript
interface ScraperOptions {
  limit?: number; // max number of results (default: 10)
  timeout?: number; // request timeout in ms (default: 10000)
  safeSearch?: boolean; // enable safe search (default: true)

  // Proxy configuration
  proxy?: string | ProxyConfig;
}
```

## Result Format

The module returns `NewsResult` objects which extend the standard `SearchResult`:

```typescript
interface NewsResult extends SearchResult {
  source: "google-news" | "duckduckgo-news";
  sourceName?: string; // The specific publisher (e.g., "The Verge", "CNN")
  publishedAt?: string; // Publication time string (e.g., "2 hours ago")
  imageUrl?: string; // URL to the article's thumbnail image
}
```

## Caching

The Google News provider implements in-memory caching (TTL: 30 minutes) to prevent rate limiting and improve performance for repeated queries.
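
An in-memory cache with a time-to-live can be sketched in a few lines. The `TtlCache` class below is an illustration of the general technique, assuming a simple map keyed by query string; the provider's actual cache is not shown in these docs. The current time is a parameter so expiry can be demonstrated without waiting.

```typescript
// Hypothetical TTL cache: entries expire `ttlMs` after they were stored.
class TtlCache<V> {
  private readonly ttlMs: number;
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  get(key: string, now: number = Date.now()): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (now >= entry.expiresAt) {
      this.entries.delete(key); // drop stale entries lazily on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V, now: number = Date.now()): void {
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
  }
}

const THIRTY_MINUTES_MS = 30 * 60 * 1000;
const newsCache = new TtlCache<string[]>(THIRTY_MINUTES_MS);
newsCache.set("ai developments", ["headline one"]);
console.log(newsCache.get("ai developments"));
```
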

## Error Handling

Like the search module, `searchNews` aggregates errors and only throws `ALL_NEWS_ENGINES_FAILED` if all providers fail.

```typescript
try {
  const results = await searchNews("query");
} catch (err) {
  if (err.code === "ALL_NEWS_ENGINES_FAILED") {
    console.error("News search failed:", err.errors);
  }
}
```
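
The aggregate-then-throw behaviour described above can be sketched as a loop over providers that collects failures and only raises once every provider has been tried. `AggregateSearchError` and `firstSuccessful` are hypothetical names used for illustration; only the `ALL_NEWS_ENGINES_FAILED` code and the `errors` property come from the docs.

```typescript
// Hypothetical error type carrying the documented `code` and `errors` fields.
class AggregateSearchError extends Error {
  readonly code = "ALL_NEWS_ENGINES_FAILED";
  readonly errors: unknown[];

  constructor(errors: unknown[]) {
    super("all news providers failed");
    this.errors = errors;
  }
}

// Tries each provider in order; returns the first success, else throws the aggregate.
async function firstSuccessful<T>(providers: Array<() => Promise<T>>): Promise<T> {
  const errors: unknown[] = [];
  for (const provider of providers) {
    try {
      return await provider();
    } catch (err) {
      errors.push(err); // remember the failure and move on to the next provider
    }
  }
  throw new AggregateSearchError(errors);
}

firstSuccessful<string[]>([
  () => Promise.reject(new Error("primary unavailable")),
  () => Promise.resolve(["fallback headline"]),
]).then((headlines) => console.log(headlines));
```
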
|
package/docs/parser.md
ADDED
|
@@ -0,0 +1,197 @@
# Parser Module Documentation

The parser module provides a unified interface for parsing various file types into text and structured data. It now supports both file paths and raw buffers as input.

## Supported File Types

- PDF documents (`.pdf`)
- Word documents (`.docx`) via Mammoth
- CSV files (`.csv`)
- Images (`.png`, `.jpg`, `.jpeg`, `.bmp`, `.gif`) via OCR
- Plain text files (`.txt`)
- XML documents (`.xml`)
- JSON files (`.json`)

## Installation

The module requires several dependencies:

```bash
npm install pdf-parse mammoth csv-parse tesseract.js fast-xml-parser
```

## Usage

### Basic Usage

```typescript
import { readFileSync } from "node:fs";
import { parse } from "llm-search-tools";

// parse a file by path (ez mode)
const pdfResult = await parse("path/to/file.pdf");
console.log(pdfResult.text);

// parse a buffer with known filename (recommended for buffers)
const docxBuffer = readFileSync("path/to/file.docx");
const docxResult = await parse(docxBuffer, {}, "file.docx");
console.log(docxResult.text);

// parse a buffer without filename (we'll try our best to detect type)
// someBuffer stands for any Buffer you already have in hand
const unknownResult = await parse(someBuffer);
console.log(unknownResult.text);
```

### Type Detection for Buffers

When parsing buffers, the parser attempts to detect the file type in several ways:

1. Using the provided filename hint (most reliable)
2. Checking file magic numbers for binary formats
3. Attempting JSON parsing for potential JSON data
4. Looking for CSV patterns
5. Falling back to plain text if nothing else matches
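
The detection order above can be sketched as follows. The byte signatures are the well-known magic numbers for PDF, ZIP (the container DOCX uses), PNG, JPEG, and GIF, but the `sniffType` function, its CSV heuristic, and the exact checks are assumptions made for illustration rather than the parser's real code.

```typescript
type SniffedType = "pdf" | "docx" | "image" | "json" | "csv" | "xml" | "text";

const EXTENSION_TYPES = new Map<string, SniffedType>([
  ["pdf", "pdf"], ["docx", "docx"], ["csv", "csv"], ["txt", "text"],
  ["xml", "xml"], ["json", "json"], ["png", "image"], ["jpg", "image"],
  ["jpeg", "image"], ["bmp", "image"], ["gif", "image"],
]);

// Hypothetical sniffing helper following the five documented steps.
function sniffType(bytes: Uint8Array, filename?: string): SniffedType {
  // 1. A filename hint wins when its extension is recognised.
  const ext = filename?.split(".").pop()?.toLowerCase();
  const hinted = ext ? EXTENSION_TYPES.get(ext) : undefined;
  if (hinted) return hinted;

  // 2. Magic numbers for binary formats.
  const startsWith = (signature: number[]): boolean =>
    signature.every((byte, i) => bytes[i] === byte);
  if (startsWith([0x25, 0x50, 0x44, 0x46])) return "pdf"; // "%PDF"
  if (startsWith([0x50, 0x4b, 0x03, 0x04])) return "docx"; // ZIP header; other ZIP files match too
  if (startsWith([0x89, 0x50, 0x4e, 0x47])) return "image"; // PNG
  if (startsWith([0xff, 0xd8, 0xff])) return "image"; // JPEG
  if (startsWith([0x47, 0x49, 0x46, 0x38])) return "image"; // "GIF8"

  const text = new TextDecoder().decode(bytes).trim();

  // 3. JSON: only objects and arrays count, so a bare number stays plain text.
  if (text.startsWith("{") || text.startsWith("[")) {
    try {
      JSON.parse(text);
      return "json";
    } catch {
      // not valid JSON, keep checking
    }
  }

  // 4. CSV: at least two lines sharing the same non-zero comma count.
  const lines = text.split(/\r?\n/);
  const commas = lines.map((line) => line.split(",").length - 1);
  if (lines.length > 1 && commas[0] > 0 && commas.every((n) => n === commas[0])) return "csv";

  // 5. Nothing matched.
  return "text";
}

console.log(sniffType(new TextEncoder().encode('{"a": 1}')));
```

A ZIP signature alone cannot tell DOCX apart from other ZIP-based files, which is one reason the filename hint is the most reliable signal.
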

### With Options

```typescript
// Parse CSV with custom options
const csvResult = await parse("data.csv", {
  csv: {
    delimiter: ";",
    columns: true,
  },
});

// OCR with different language
const imageResult = await parse("image.png", {
  language: "spa", // Spanish
});
```

## Return Type

The parser returns a `ParseResult` object:

```typescript
interface ParseResult {
  type:
    | "pdf"
    | "docx"
    | "csv"
    | "image"
    | "text"
    | "xml"
    | "json"
    | "unknown";
  text: string; // extracted text content
  metadata?: Record<string, unknown>; // file metadata
  data?: unknown; // structured data if we got it (xml/json/csv mostly)
}
```

## File Type Specific Features

### PDF Files

- Extracts text content
- Provides metadata (page count, PDF info, version)
- Preserves document structure

### DOCX Files

- Extracts text content
- Enhanced document handling via Mammoth
- Supports both HTML and raw text extraction
- Handles document formatting and structure
- Automatically cleans and preserves formatting

### CSV Files

- Extracts raw text
- Provides structured data array
- Supports custom delimiters
- Optional column headers

### Images (OCR)

- Extracts text via Tesseract OCR
- Supports multiple languages
- Returns confidence scores
- Handles common image formats

## Error Handling

```typescript
try {
  const result = await parse("file.pdf");
} catch (error) {
  if (error.code === "PDF_PARSE_ERROR") {
    // Handle PDF-specific error
  } else if (error.code === "DOCX_PARSE_ERROR") {
    // Handle DOCX-specific error
  }
  // Generic error handling
  console.error(error.message);
}
```

## Supported Error Codes

- `PDF_PARSE_ERROR`: pdf parsing failed (ugh these pdfs i swear)
- `DOCX_PARSE_ERROR`: docx parsing failed (word docs are the worst)
- `CSV_PARSE_ERROR`: csv parsing went sideways
- `IMAGE_PARSE_ERROR`: ocr failed (probably bad image quality or smth)
- `TEXT_PARSE_ERROR`: somehow failed to parse plain text (how even?)
- `XML_PARSE_ERROR`: xml parsing went wrong (invalid format probs)
- `JSON_PARSE_ERROR`: json parsing died (check your brackets)
- `PARSE_ERROR`: generic error when everything else fails

## Language Support

For OCR (image parsing), the following languages are supported:

- English (default, 'eng')
- Spanish ('spa')
- French ('fra')
- German ('deu')
- And [many more](https://tesseract-ocr.github.io/tessdoc/Data-Files#data-files-for-version-400-november-29-2016)

## Performance Considerations

- Image OCR is CPU-intensive and may take longer
- Large PDFs may require more memory
- Consider using streams for large files
- CSV parsing is typically very fast

## Examples

### Parse PDF and Extract Text

```typescript
const result = await parse("document.pdf");
// metadata is optional on ParseResult, so guard the access
console.log(`Pages: ${result.metadata?.pages}`);
console.log(`Text: ${result.text}`);
```

### Parse CSV with Custom Options

```typescript
const result = await parse("data.csv", {
  csv: {
    delimiter: ";",
    columns: true,
  },
});
// metadata is optional on ParseResult, so guard the access
console.log(`Rows: ${result.metadata?.rowCount}`);
console.log(`Data:`, result.data);
```

### OCR Image in Different Language

```typescript
const result = await parse("chinese-text.png", {
  language: "chi_sim",
});
console.log(`Extracted Text: ${result.text}`);
```