pageindex 1.0.0

package/LICENSE ADDED
@@ -0,0 +1,48 @@
+ MIT License
+
+ Copyright (c) 2026 Antonio Oliveira
+
+ This project is a TypeScript/Bun port of PageIndex by Vectify AI.
+ Original work: https://github.com/VectifyAI/PageIndex
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
+
+ ---
+
+ Original PageIndex License (MIT):
+
+ Copyright (c) 2025 Vectify AI
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,265 @@
+ # bun-pageindex
+
+ Bun-native vectorless, reasoning-based RAG for document understanding. A TypeScript port of [PageIndex](https://github.com/VectifyAI/PageIndex) optimized for the Bun runtime.
+
+ ## Features
+
+ - **Vectorless RAG**: Uses LLM reasoning to build hierarchical document indices without vector databases
+ - **PDF Support**: Extract structure and content from PDF documents
+ - **OCR Mode**: Process scanned PDFs using GLM-OCR vision model (not in original PageIndex!)
+ - **Markdown Support**: Convert markdown documents to tree structures
+ - **LLM Agnostic**: Works with OpenAI, LM Studio, Ollama, or any OpenAI-compatible API
+ - **Bun Native**: Optimized for Bun runtime with minimal dependencies
+ - **CLI & API**: Use as a library or command-line tool
+
+ ## Installation
+
+ ```bash
+ bun add bun-pageindex
+ ```
+
+ ### For OCR Mode (Scanned PDFs)
+
+ OCR mode requires Poppler to be installed on your system:
+
+ ```bash
+ # macOS
+ brew install poppler
+
+ # Ubuntu/Debian
+ sudo apt-get install poppler-utils
+
+ # Windows
+ # Download from https://github.com/oschwartz10612/poppler-windows/releases
+ ```
+
+ ## Quick Start
+
+ ### As a Library
+
+ ```typescript
+ import { PageIndex, indexPdf, mdToTree } from "bun-pageindex";
+
+ // Process a PDF with OpenAI
+ const result = await indexPdf("document.pdf", {
+   apiKey: process.env.OPENAI_API_KEY,
+   model: "gpt-4o",
+ });
+
+ console.log(result.structure);
+
+ // Or use the PageIndex class for more control
+ const pageIndex = new PageIndex({
+   model: "gpt-4o",
+   addNodeSummary: true,
+   addDocDescription: true,
+ });
+
+ const pdfResult = await pageIndex.fromPdf("document.pdf");
+
+ // Process markdown
+ const mdResult = await mdToTree("document.md", {
+   addNodeSummary: true,
+   thinning: true,
+   thinningThreshold: 5000,
+ });
+ ```
+
+ ### Using LM Studio (Local LLMs)
+
+ ```typescript
+ import { PageIndex } from "bun-pageindex";
+
+ const pageIndex = new PageIndex({
+   model: "local-model", // Your LM Studio model name
+ }).useLMStudio();
+
+ const result = await pageIndex.fromPdf("document.pdf");
+ ```
+
+ ### Using Ollama
+
+ ```typescript
+ import { PageIndex } from "bun-pageindex";
+
+ const pageIndex = new PageIndex({
+   model: "llama3",
+ }).useOllama();
+
+ const result = await pageIndex.fromPdf("document.pdf");
+ ```
+
+ ### OCR Mode for Scanned PDFs
+
+ OCR mode converts PDF pages to images, uses a vision model (such as GLM-OCR) to extract the text, and then processes that text with a reasoning model.
+
+ ```typescript
+ import { PageIndex, indexPdfWithOcr, indexPdfWithLMStudioOcr } from "bun-pageindex";
+
+ // Using OpenAI
+ const openAiResult = await indexPdfWithOcr("scanned-document.pdf", {
+   apiKey: process.env.OPENAI_API_KEY,
+   reasoningModel: "gpt-4o",
+   ocrModel: "gpt-4o", // OpenAI vision model
+ });
+
+ // Using LM Studio with local models
+ const lmStudioResult = await indexPdfWithLMStudioOcr(
+   "scanned-document.pdf",
+   "qwen/qwen3-vl-30b", // Reasoning model
+   "mlx-community/GLM-OCR-bf16" // OCR vision model
+ );
+
+ // Or use the PageIndex class directly
+ const pageIndex = new PageIndex({
+   model: "qwen/qwen3-vl-30b",
+   extractionMode: "ocr",
+   ocrModel: "mlx-community/GLM-OCR-bf16",
+   imageDpi: 150,
+ }).useLMStudio();
+
+ const classResult = await pageIndex.fromPdf("scanned-document.pdf");
+ ```
+
+ ### CLI Usage
+
+ ```bash
+ # Process a PDF
+ bun-pageindex --pdf document.pdf
+
+ # Process with LM Studio
+ bun-pageindex --pdf document.pdf --lmstudio --model llama3
+
+ # Process scanned PDF with OCR
+ bun-pageindex --pdf scanned.pdf --ocr --lmstudio --model qwen/qwen3-vl-30b
+
+ # Process markdown with options
+ bun-pageindex --md README.md --add-doc-description --thinning
+
+ # See all options
+ bun-pageindex --help
+ ```
+
+ ## API Reference
+
+ ### PageIndex Class
+
+ ```typescript
+ const pageIndex = new PageIndex(options);
+ ```
+
+ **Options:**
+ - `model`: LLM model to use (default: "gpt-4o-2024-11-20")
+ - `apiKey`: OpenAI API key (default: from OPENAI_API_KEY env var)
+ - `baseUrl`: Custom API base URL (for LM Studio, Ollama, etc.)
+ - `tocCheckPageNum`: Pages to check for TOC (default: 20)
+ - `maxPageNumEachNode`: Max pages per node (default: 10)
+ - `maxTokenNumEachNode`: Max tokens per node (default: 20000)
+ - `addNodeId`: Add node IDs (default: true)
+ - `addNodeSummary`: Generate summaries (default: true)
+ - `addDocDescription`: Add document description (default: false)
+ - `addNodeText`: Include raw text (default: false)
+
+ **OCR Options:**
+ - `extractionMode`: "text" (default) or "ocr" for scanned PDFs
+ - `ocrModel`: Vision model for OCR (default: "mlx-community/GLM-OCR-bf16")
+ - `ocrPromptType`: "text", "formula", or "table" (default: "text")
+ - `imageDpi`: DPI for PDF to image conversion (default: 150)
+ - `imageFormat`: "png" or "jpeg" (default: "png")
+ - `ocrConcurrency`: Concurrent OCR requests (default: 3)
+
+ **Methods:**
+ - `fromPdf(input)`: Process a PDF file or buffer
+ - `useLMStudio()`: Configure for LM Studio
+ - `useOllama()`: Configure for Ollama
+ - `useOcrMode(ocrModel?)`: Enable OCR mode
+ - `setBaseUrl(url)`: Set custom API base URL
+
+ ### mdToTree Function
+
+ ```typescript
+ const result = await mdToTree(path, options);
+ ```
+
+ **Additional Options:**
+ - `thinning`: Apply tree thinning (default: false)
+ - `thinningThreshold`: Min tokens for thinning (default: 5000)
+ - `summaryTokenThreshold`: Token threshold for summaries (default: 200)
+
+ ### Result Structure
+
+ ```typescript
+ interface PageIndexResult {
+   docName: string;
+   docDescription?: string;
+   structure: TreeNode[];
+ }
+
+ interface TreeNode {
+   title: string;
+   nodeId?: string;
+   startIndex?: number;
+   endIndex?: number;
+   summary?: string;
+   prefixSummary?: string;
+   text?: string;
+   lineNum?: number;
+   nodes?: TreeNode[];
+ }
+ ```
+
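As a reading aid for the `TreeNode` shape in the README's Result Structure section, here is a minimal sketch of walking a result's `structure` depth-first to print an indented outline. The field names are copied from the README; `flattenOutline` and the sample data are hypothetical and introduced only for illustration, they are not part of the package.

```typescript
// Field names mirror a subset of the README's TreeNode interface.
interface TreeNode {
  title: string;
  nodeId?: string;
  startIndex?: number;
  endIndex?: number;
  summary?: string;
  nodes?: TreeNode[];
}

// Hypothetical helper: depth-first walk returning one indented line per node.
function flattenOutline(nodes: TreeNode[], depth = 0): string[] {
  const lines: string[] = [];
  for (const node of nodes) {
    // Only show a page range when both ends are present.
    const pages =
      node.startIndex !== undefined && node.endIndex !== undefined
        ? ` (pages ${node.startIndex}-${node.endIndex})`
        : "";
    lines.push(`${"  ".repeat(depth)}${node.title}${pages}`);
    if (node.nodes) {
      lines.push(...flattenOutline(node.nodes, depth + 1));
    }
  }
  return lines;
}

// Made-up sample data, not output produced by the package.
const sampleStructure: TreeNode[] = [
  {
    title: "Introduction",
    nodeId: "0001",
    startIndex: 1,
    endIndex: 2,
    nodes: [{ title: "Background", nodeId: "0002", startIndex: 2, endIndex: 2 }],
  },
  { title: "Methods", nodeId: "0003", startIndex: 3, endIndex: 7 },
];

console.log(flattenOutline(sampleStructure).join("\n"));
```

Because `nodes` is optional and recursive, any consumer of the result has to handle both leaf nodes and arbitrarily deep nesting, which the recursion above does.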
+ ## Benchmarks
+
+ Run benchmarks comparing Bun vs Python implementations:
+
+ ```bash
+ # Requires LM Studio running on localhost:1234
+ bun run benchmark
+ ```
+
+ ## Development
+
+ ```bash
+ # Install dependencies
+ bun install
+
+ # Run tests
+ bun test
+
+ # Build
+ bun run build
+ ```
+
+ ## How It Works
+
+ PageIndex uses LLM reasoning to:
+
+ 1. **Detect Table of Contents**: Scans initial pages for TOC
+ 2. **Extract Structure**: Parses TOC or generates structure from content
+ 3. **Map Page Numbers**: Associates logical page numbers with physical pages
+ 4. **Build Tree**: Creates hierarchical tree structure
+ 5. **Generate Summaries**: Creates summaries for each node (optional)
+
+ This approach provides human-like document understanding without the limitations of vector-based retrieval.
+
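The "Build Tree" step in the README's How It Works list can be pictured with a small sketch. It assumes the earlier steps yield a flat, ordered list of entries that each carry a heading depth; the README does not describe the intermediate format, so that is an assumption. `FlatEntry`, `OutlineNode`, and `buildTree` are hypothetical names, not the package's API.

```typescript
// Assumed intermediate format: ordered headings with a depth.
interface FlatEntry {
  title: string;
  level: number; // 1 = top-level section, 2 = subsection, and so on
}

interface OutlineNode {
  title: string;
  nodes: OutlineNode[];
}

// Hypothetical helper: nests a flat, ordered heading list using a stack of open ancestors.
function buildTree(entries: FlatEntry[]): OutlineNode[] {
  const roots: OutlineNode[] = [];
  const stack: { level: number; node: OutlineNode }[] = [];

  for (const entry of entries) {
    const node: OutlineNode = { title: entry.title, nodes: [] };

    // Close every open node at the same depth or deeper than this entry.
    let top = stack[stack.length - 1];
    while (top !== undefined && top.level >= entry.level) {
      stack.pop();
      top = stack[stack.length - 1];
    }

    // Whatever remains on top of the stack is the parent; otherwise this is a root.
    if (top !== undefined) {
      top.node.nodes.push(node);
    } else {
      roots.push(node);
    }
    stack.push({ level: entry.level, node });
  }
  return roots;
}

// Made-up headings for illustration.
const tree = buildTree([
  { title: "1 Introduction", level: 1 },
  { title: "1.1 Background", level: 2 },
  { title: "1.2 Scope", level: 2 },
  { title: "2 Methods", level: 1 },
]);

console.log(JSON.stringify(tree, null, 2));
```

The stack keeps only the chain of currently open ancestors, so each entry is pushed and popped at most once and the whole pass is linear in the number of headings.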
+ ### OCR Mode (New in bun-pageindex)
+
+ For scanned PDFs, OCR mode adds two steps before the standard processing:
+
+ 1. **Convert PDF to Images**: Uses Poppler to render each page as an image
+ 2. **OCR Extraction**: Uses a vision model (GLM-OCR) to extract text from images
+ 3. **Standard Processing**: Continues with the same reasoning-based indexing
+
+ This enables processing of scanned documents that the original Python PageIndex cannot handle.
+
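The OCR extraction step runs a vision model over each page image, and the README's `ocrConcurrency` option (default 3) caps how many OCR requests run at once. The package's internals are not shown in this diff, so the following is only a sketch of one common way to bound concurrency, using a small pool of worker lanes. `mapWithConcurrency` and `fakeOcr` are hypothetical stand-ins, not functions exported by the package.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight,
// returning results in input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each lane repeatedly claims the next unprocessed index until none remain.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++; // no race: JavaScript runs this synchronously between awaits
      results[index] = await worker(items[index] as T, index);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}

// Stand-in for a vision-model request; a real call would send the page image to the model.
async function fakeOcr(pageImage: string): Promise<string> {
  await new Promise((resolve) => setTimeout(resolve, 10));
  return `text of ${pageImage}`;
}

// Made-up page image names for illustration.
const pageImages = ["page-1.png", "page-2.png", "page-3.png", "page-4.png"];

mapWithConcurrency(pageImages, 3, fakeOcr).then((texts) => console.log(texts));
```

Writing each result at its original index keeps page order stable even when later pages finish before earlier ones, which matters because the downstream indexing depends on page order.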
+ ## Credits
+
+ This is a Bun/TypeScript port of [PageIndex](https://github.com/VectifyAI/PageIndex) by VectifyAI.
+
+ ## License
+
+ MIT
+
+ ## Author
+
+ Antonio Oliveira <antonio@oakoliver.com> ([oakoliver.com](https://oakoliver.com))
package/dist/cli.d.ts ADDED
@@ -0,0 +1,7 @@
+ #!/usr/bin/env node
+ /**
+  * pageindex CLI
+  * Command-line interface for processing PDFs and Markdown documents
+  */
+ export {};
+ //# sourceMappingURL=cli.d.ts.map
@@ -0,0 +1 @@
+ {"version":3,"file":"cli.d.ts","sourceRoot":"","sources":["../src/cli.ts"],"names":[],"mappings":";AACA;;;GAGG"}