pageindex 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +48 -0
- package/README.md +265 -0
- package/dist/cli.d.ts +7 -0
- package/dist/cli.d.ts.map +1 -0
- package/dist/cli.js +2178 -0
- package/dist/index.cjs +2282 -0
- package/dist/index.d.ts +15 -0
- package/dist/index.d.ts.map +1 -0
- package/dist/index.js +2209 -0
- package/dist/markdown.d.ts +73 -0
- package/dist/markdown.d.ts.map +1 -0
- package/dist/ocr.d.ts +57 -0
- package/dist/ocr.d.ts.map +1 -0
- package/dist/ocr.js +6161 -0
- package/dist/openai.d.ts +58 -0
- package/dist/openai.d.ts.map +1 -0
- package/dist/pageindex.d.ts +64 -0
- package/dist/pageindex.d.ts.map +1 -0
- package/dist/pdf.d.ts +47 -0
- package/dist/pdf.d.ts.map +1 -0
- package/dist/prompts.d.ts +69 -0
- package/dist/prompts.d.ts.map +1 -0
- package/dist/toc.d.ts +77 -0
- package/dist/toc.d.ts.map +1 -0
- package/dist/tree.d.ts +78 -0
- package/dist/tree.d.ts.map +1 -0
- package/dist/types.d.ts +83 -0
- package/dist/types.d.ts.map +1 -0
- package/dist/utils.d.ts +100 -0
- package/dist/utils.d.ts.map +1 -0
- package/package.json +81 -0
- package/src/cli.ts +272 -0
- package/src/index.ts +82 -0
- package/src/markdown.ts +600 -0
- package/src/ocr.ts +303 -0
- package/src/openai.ts +313 -0
- package/src/pageindex.ts +313 -0
- package/src/pdf-poppler.d.ts +42 -0
- package/src/pdf.ts +165 -0
- package/src/prompts.ts +350 -0
- package/src/toc.ts +468 -0
- package/src/tree.ts +531 -0
- package/src/types.ts +93 -0
- package/src/utils.ts +613 -0
package/LICENSE
ADDED
@@ -0,0 +1,48 @@
+MIT License
+
+Copyright (c) 2026 Antonio Oliveira
+
+This project is a TypeScript/Bun port of PageIndex by Vectify AI.
+Original work: https://github.com/VectifyAI/PageIndex
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+
+---
+
+Original PageIndex License (MIT):
+
+Copyright (c) 2025 Vectify AI
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
package/README.md
ADDED
@@ -0,0 +1,265 @@
+# bun-pageindex
+
+Bun-native vectorless, reasoning-based RAG for document understanding. A TypeScript port of [PageIndex](https://github.com/VectifyAI/PageIndex) optimized for the Bun runtime.
+
+## Features
+
+- **Vectorless RAG**: Uses LLM reasoning to build hierarchical document indices without vector databases
+- **PDF Support**: Extract structure and content from PDF documents
+- **OCR Mode**: Process scanned PDFs using GLM-OCR vision model (not in original PageIndex!)
+- **Markdown Support**: Convert markdown documents to tree structures
+- **LLM Agnostic**: Works with OpenAI, LM Studio, Ollama, or any OpenAI-compatible API
+- **Bun Native**: Optimized for Bun runtime with minimal dependencies
+- **CLI & API**: Use as a library or command-line tool
+
+## Installation
+
+```bash
+bun add bun-pageindex
+```
+
+### For OCR Mode (Scanned PDFs)
+
+OCR mode requires Poppler to be installed on your system:
+
+```bash
+# macOS
+brew install poppler
+
+# Ubuntu/Debian
+sudo apt-get install poppler-utils
+
+# Windows
+# Download from https://github.com/oschwartz10612/poppler-windows/releases
+```
+
+## Quick Start
+
+### As a Library
+
+```typescript
+import { PageIndex, indexPdf, mdToTree } from "bun-pageindex";
+
+// Process a PDF with OpenAI
+const result = await indexPdf("document.pdf", {
+  apiKey: process.env.OPENAI_API_KEY,
+  model: "gpt-4o",
+});
+
+console.log(result.structure);
+
+// Or use the PageIndex class for more control
+const pageIndex = new PageIndex({
+  model: "gpt-4o",
+  addNodeSummary: true,
+  addDocDescription: true,
+});
+
+const pdfResult = await pageIndex.fromPdf("document.pdf");
+
+// Process markdown
+const mdResult = await mdToTree("document.md", {
+  addNodeSummary: true,
+  thinning: true,
+  thinningThreshold: 5000,
+});
+```
+
+### Using LM Studio (Local LLMs)
+
+```typescript
+import { PageIndex } from "bun-pageindex";
+
+const pageIndex = new PageIndex({
+  model: "local-model", // Your LM Studio model name
+}).useLMStudio();
+
+const result = await pageIndex.fromPdf("document.pdf");
+```
+
+### Using Ollama
+
+```typescript
+import { PageIndex } from "bun-pageindex";
+
+const pageIndex = new PageIndex({
+  model: "llama3",
+}).useOllama();
+
+const result = await pageIndex.fromPdf("document.pdf");
+```
+
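Both local backends are described as OpenAI-compatible, so switching providers mostly comes down to the base URL. The sketch below is illustrative only: `Provider`, `PRESET_BASE_URLS`, and `presetBaseUrl` are hypothetical names, and the URLs are the commonly documented local defaults for LM Studio and Ollama, not values confirmed from this package's source.

```typescript
// Hypothetical provider presets; not taken from bun-pageindex itself.
type Provider = "openai" | "lmstudio" | "ollama";

// Assumed defaults: the usual OpenAI-compatible endpoints for each backend.
const PRESET_BASE_URLS: Record<Provider, string> = {
  openai: "https://api.openai.com/v1",
  lmstudio: "http://localhost:1234/v1",
  ollama: "http://localhost:11434/v1",
};

// An explicit override (the README's `baseUrl` option) wins over the preset.
function presetBaseUrl(provider: Provider, override?: string): string {
  return override ?? PRESET_BASE_URLS[provider];
}

console.log(presetBaseUrl("lmstudio")); // http://localhost:1234/v1
console.log(presetBaseUrl("ollama", "http://gpu-box:11434/v1")); // http://gpu-box:11434/v1
```

Keeping the presets in one table makes it easy to point the same indexing code at a remote machine by passing an override.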
+### OCR Mode for Scanned PDFs
+
+OCR mode converts PDF pages to images and uses a vision model (like GLM-OCR) to extract text, then processes with a reasoning model.
+
+```typescript
+import { PageIndex, indexPdfWithOcr, indexPdfWithLMStudioOcr } from "bun-pageindex";
+
+// Using OpenAI
+const result = await indexPdfWithOcr("scanned-document.pdf", {
+  apiKey: process.env.OPENAI_API_KEY,
+  reasoningModel: "gpt-4o",
+  ocrModel: "gpt-4o", // OpenAI vision model
+});
+
+// Using LM Studio with local models
+const result = await indexPdfWithLMStudioOcr(
+  "scanned-document.pdf",
+  "qwen/qwen3-vl-30b", // Reasoning model
+  "mlx-community/GLM-OCR-bf16" // OCR vision model
+);
+
+// Or use the PageIndex class directly
+const pageIndex = new PageIndex({
+  model: "qwen/qwen3-vl-30b",
+  extractionMode: "ocr",
+  ocrModel: "mlx-community/GLM-OCR-bf16",
+  imageDpi: 150,
+}).useLMStudio();
+
+const result = await pageIndex.fromPdf("scanned-document.pdf");
+```
+
+### CLI Usage
+
+```bash
+# Process a PDF
+bun-pageindex --pdf document.pdf
+
+# Process with LM Studio
+bun-pageindex --pdf document.pdf --lmstudio --model llama3
+
+# Process scanned PDF with OCR
+bun-pageindex --pdf scanned.pdf --ocr --lmstudio --model qwen/qwen3-vl-30b
+
+# Process markdown with options
+bun-pageindex --md README.md --add-doc-description --thinning
+
+# See all options
+bun-pageindex --help
+```
+
+## API Reference
+
+### PageIndex Class
+
+```typescript
+const pageIndex = new PageIndex(options);
+```
+
+**Options:**
+- `model`: LLM model to use (default: "gpt-4o-2024-11-20")
+- `apiKey`: OpenAI API key (default: from OPENAI_API_KEY env var)
+- `baseUrl`: Custom API base URL (for LM Studio, Ollama, etc.)
+- `tocCheckPageNum`: Pages to check for TOC (default: 20)
+- `maxPageNumEachNode`: Max pages per node (default: 10)
+- `maxTokenNumEachNode`: Max tokens per node (default: 20000)
+- `addNodeId`: Add node IDs (default: true)
+- `addNodeSummary`: Generate summaries (default: true)
+- `addDocDescription`: Add document description (default: false)
+- `addNodeText`: Include raw text (default: false)
+
+**OCR Options:**
+- `extractionMode`: "text" (default) or "ocr" for scanned PDFs
+- `ocrModel`: Vision model for OCR (default: "mlx-community/GLM-OCR-bf16")
+- `ocrPromptType`: "text", "formula", or "table" (default: "text")
+- `imageDpi`: DPI for PDF to image conversion (default: 150)
+- `imageFormat`: "png" or "jpeg" (default: "png")
+- `ocrConcurrency`: Concurrent OCR requests (default: 3)
+
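The documented defaults above can be collected into a single object and merged with caller overrides. This is a minimal sketch, not the package's implementation: `IndexOptions`, `README_DEFAULTS`, and `resolveOptions` are hypothetical names, and only the default values themselves come from the README lists (`apiKey` and `baseUrl` are left out because no literal default is given for them).

```typescript
// Hypothetical option shape mirroring the README's option lists.
interface IndexOptions {
  model: string;
  tocCheckPageNum: number;
  maxPageNumEachNode: number;
  maxTokenNumEachNode: number;
  addNodeId: boolean;
  addNodeSummary: boolean;
  addDocDescription: boolean;
  addNodeText: boolean;
  extractionMode: "text" | "ocr";
  ocrModel: string;
  ocrPromptType: "text" | "formula" | "table";
  imageDpi: number;
  imageFormat: "png" | "jpeg";
  ocrConcurrency: number;
}

// Default values copied from the README; the object itself is an illustration.
const README_DEFAULTS: IndexOptions = {
  model: "gpt-4o-2024-11-20",
  tocCheckPageNum: 20,
  maxPageNumEachNode: 10,
  maxTokenNumEachNode: 20000,
  addNodeId: true,
  addNodeSummary: true,
  addDocDescription: false,
  addNodeText: false,
  extractionMode: "text",
  ocrModel: "mlx-community/GLM-OCR-bf16",
  ocrPromptType: "text",
  imageDpi: 150,
  imageFormat: "png",
  ocrConcurrency: 3,
};

// Caller-supplied fields replace the defaults; everything else is kept.
function resolveOptions(overrides: Partial<IndexOptions> = {}): IndexOptions {
  return { ...README_DEFAULTS, ...overrides };
}

const resolved = resolveOptions({ extractionMode: "ocr", imageDpi: 200 });
console.log(resolved.extractionMode, resolved.imageDpi, resolved.ocrConcurrency); // ocr 200 3
```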
+**Methods:**
+- `fromPdf(input)`: Process a PDF file or buffer
+- `useLMStudio()`: Configure for LM Studio
+- `useOllama()`: Configure for Ollama
+- `useOcrMode(ocrModel?)`: Enable OCR mode
+- `setBaseUrl(url)`: Set custom API base URL
+
+### mdToTree Function
+
+```typescript
+const result = await mdToTree(path, options);
+```
+
+**Additional Options:**
+- `thinning`: Apply tree thinning (default: false)
+- `thinningThreshold`: Min tokens for thinning (default: 5000)
+- `summaryTokenThreshold`: Token threshold for summaries (default: 200)
+
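The README does not spell out what thinning does. One plausible reading, sketched below under that assumption, is that any subtree whose total token count falls below the threshold is collapsed into its parent node. `MdNode`, `estimateTokens`, `subtreeTokens`, `flattenText`, and `thinTree` are all hypothetical names, and the whitespace word count stands in for a real tokenizer.

```typescript
// Hypothetical markdown node; the package's own node type may differ.
interface MdNode {
  title: string;
  text: string;
  nodes?: MdNode[];
}

// Crude token estimate: whitespace-separated words (an assumption, not a real tokenizer).
function estimateTokens(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Tokens in a node's own text plus all of its descendants' text.
function subtreeTokens(node: MdNode): number {
  const own = estimateTokens(node.text);
  return (node.nodes ?? []).reduce((sum, child) => sum + subtreeTokens(child), own);
}

// Concatenate a node's text with its descendants' titles and text.
function flattenText(node: MdNode): string {
  const childText = (node.nodes ?? []).map((c) => `${c.title}\n${flattenText(c)}`);
  return [node.text, ...childText].filter((s) => s.trim() !== "").join("\n\n");
}

// Collapse every subtree smaller than minTokens into a single leaf node.
function thinTree(nodes: MdNode[], minTokens: number): MdNode[] {
  return nodes.map((node) => {
    const children = node.nodes ?? [];
    if (children.length === 0) return { title: node.title, text: node.text };
    if (subtreeTokens(node) < minTokens) {
      return { title: node.title, text: flattenText(node) };
    }
    return { title: node.title, text: node.text, nodes: thinTree(children, minTokens) };
  });
}

const tree: MdNode[] = [
  {
    title: "Guide",
    text: "intro words here",
    nodes: [
      { title: "Small", text: "a b", nodes: [{ title: "Tiny", text: "c d e" }] },
      { title: "Leaf", text: "f g h i j k" },
    ],
  },
];

// "Small" has 5 tokens in its subtree, so with a threshold of 6 it absorbs "Tiny".
const thinned = thinTree(tree, 6);
console.log(JSON.stringify(thinned, null, 2));
```

A small threshold is used here so the effect is visible on a toy tree; the README's default of 5000 would collapse this whole example into one node.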
+### Result Structure
+
+```typescript
+interface PageIndexResult {
+  docName: string;
+  docDescription?: string;
+  structure: TreeNode[];
+}
+
+interface TreeNode {
+  title: string;
+  nodeId?: string;
+  startIndex?: number;
+  endIndex?: number;
+  summary?: string;
+  prefixSummary?: string;
+  text?: string;
+  lineNum?: number;
+  nodes?: TreeNode[];
+}
+```
+
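Because `structure` is a recursive list of `TreeNode` values, consuming a result is mostly a depth-first walk. The sketch below reuses the `TreeNode` interface from the README; the `outline` helper and the sample data are hypothetical, and treating `startIndex` and `endIndex` as an inclusive range is an assumption.

```typescript
// TreeNode as declared in the README's Result Structure section.
interface TreeNode {
  title: string;
  nodeId?: string;
  startIndex?: number;
  endIndex?: number;
  summary?: string;
  prefixSummary?: string;
  text?: string;
  lineNum?: number;
  nodes?: TreeNode[];
}

// Hypothetical helper: render the tree as an indented outline, one line per node.
function outline(nodes: TreeNode[], depth = 0): string[] {
  const lines: string[] = [];
  for (const node of nodes) {
    const range =
      node.startIndex !== undefined && node.endIndex !== undefined
        ? ` [${node.startIndex}-${node.endIndex}]`
        : "";
    lines.push(`${"  ".repeat(depth)}${node.title}${range}`);
    if (node.nodes) lines.push(...outline(node.nodes, depth + 1));
  }
  return lines;
}

// Made-up sample data for illustration only.
const sample: TreeNode[] = [
  {
    title: "Introduction",
    nodeId: "0001",
    startIndex: 1,
    endIndex: 3,
    nodes: [{ title: "Background", nodeId: "0002", startIndex: 2, endIndex: 3 }],
  },
  { title: "Appendix" },
];

console.log(outline(sample).join("\n"));
// Introduction [1-3]
//   Background [2-3]
// Appendix
```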
+## Benchmarks
+
+Run benchmarks comparing Bun vs Python implementations:
+
+```bash
+# Requires LM Studio running on localhost:1234
+bun run benchmark
+```
+
+## Development
+
+```bash
+# Install dependencies
+bun install
+
+# Run tests
+bun test
+
+# Build
+bun run build
+```
+
+## How It Works
+
+PageIndex uses LLM reasoning to:
+
+1. **Detect Table of Contents**: Scans initial pages for TOC
+2. **Extract Structure**: Parses TOC or generates structure from content
+3. **Map Page Numbers**: Associates logical page numbers with physical pages
+4. **Build Tree**: Creates hierarchical tree structure
+5. **Generate Summaries**: Creates summaries for each node (optional)
+
+This approach provides human-like document understanding without the limitations of vector-based retrieval.
+
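Step 3 of the list above, mapping logical page numbers to physical pages, can be illustrated with a simple offset calculation. This is a sketch under stated assumptions rather than the package's algorithm: it supposes a few TOC entries have already been matched to physical page indices, takes the most frequent difference as the offset, and applies it to the remaining entries. `TocEntry`, `PageAnchor`, `mostCommonOffset`, and `applyOffset` are hypothetical names.

```typescript
// A TOC entry with the page number as printed in the document.
interface TocEntry {
  title: string;
  page: number;
}

// A TOC page number that has been matched to a physical page index.
interface PageAnchor {
  tocPage: number;
  physicalIndex: number;
}

// Pick the offset (physical - printed) that occurs most often among the anchors.
function mostCommonOffset(anchors: PageAnchor[]): number | undefined {
  const counts: Record<number, number> = {};
  for (const a of anchors) {
    const offset = a.physicalIndex - a.tocPage;
    counts[offset] = (counts[offset] ?? 0) + 1;
  }
  let best: number | undefined;
  let bestCount = 0;
  for (const key of Object.keys(counts)) {
    const offset = Number(key);
    const count = counts[offset] ?? 0;
    if (count > bestCount) {
      best = offset;
      bestCount = count;
    }
  }
  return best;
}

// Shift every printed page number by the chosen offset.
function applyOffset(entries: TocEntry[], offset: number): { title: string; physicalIndex: number }[] {
  return entries.map((e) => ({ title: e.title, physicalIndex: e.page + offset }));
}

// Two anchors agree on an offset of 12; one outlier suggests 11.
const anchors: PageAnchor[] = [
  { tocPage: 1, physicalIndex: 13 },
  { tocPage: 5, physicalIndex: 17 },
  { tocPage: 9, physicalIndex: 20 },
];
const offset = mostCommonOffset(anchors) ?? 0;
const mapped = applyOffset(
  [
    { title: "Intro", page: 1 },
    { title: "Methods", page: 7 },
  ],
  offset,
);
console.log(offset, mapped); // offset is 12; Intro -> 13, Methods -> 19
```

Using the most frequent offset instead of the first one tolerates an occasional mismatched anchor, such as the outlier in the sample data.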
+### OCR Mode (New in bun-pageindex)
+
+For scanned PDFs, OCR mode adds an additional step:
+
+1. **Convert PDF to Images**: Uses Poppler to render each page as an image
+2. **OCR Extraction**: Uses a vision model (GLM-OCR) to extract text from images
+3. **Standard Processing**: Continues with the same reasoning-based indexing
+
+This enables processing of scanned documents that the original Python PageIndex cannot handle.
+
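For the OCR extraction step, an OpenAI-compatible vision request carries the rendered page as a base64 data URL next to a text prompt. The sketch below only builds such a request body in the Chat Completions message format; it sends nothing. `buildOcrRequest`, the prompt wording, and the placeholder image data are hypothetical, and whether the package shapes its requests this way is an assumption.

```typescript
// Message parts in the OpenAI-compatible chat format for image input.
interface TextPart {
  type: "text";
  text: string;
}
interface ImagePart {
  type: "image_url";
  image_url: { url: string };
}
interface OcrRequest {
  model: string;
  messages: { role: "user"; content: [TextPart, ImagePart] }[];
}

// Hypothetical builder: wraps one already base64-encoded page image in a chat request.
function buildOcrRequest(
  model: string,
  base64Image: string,
  imageFormat: "png" | "jpeg" = "png",
  prompt = "Extract all text from this page.", // assumed prompt wording
): OcrRequest {
  return {
    model,
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: prompt },
          { type: "image_url", image_url: { url: `data:image/${imageFormat};base64,${base64Image}` } },
        ],
      },
    ],
  };
}

// "AAAA" is placeholder data, not a real image.
const ocrRequest = buildOcrRequest("mlx-community/GLM-OCR-bf16", "AAAA");
console.log(JSON.stringify(ocrRequest, null, 2));
```

One such request would be issued per page image, which is where a concurrency limit like `ocrConcurrency` becomes relevant.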
+## Credits
+
+This is a Bun/TypeScript port of [PageIndex](https://github.com/VectifyAI/PageIndex) by VectifyAI.
+
+## License
+
+MIT
+
+## Author
+
+Antonio Oliveira <antonio@oakoliver.com> ([oakoliver.com](https://oakoliver.com))
package/dist/cli.d.ts
ADDED
@@ -0,0 +1 @@
+{"version":3,"file":"cli.d.ts","sourceRoot":"","sources":["../src/cli.ts"],"names":[],"mappings":";AACA;;;GAGG"}