treedex 0.1.0

package/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 Mithun Gowda B
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,523 @@
1
+ # TreeDex
2
+
3
+ **Tree-based, vectorless document RAG framework.**
4
+
5
+ Index any document into a navigable tree structure, then retrieve relevant sections using **any LLM**. No vector databases, no embeddings — just structured tree retrieval.
6
+
7
+ Available for both **Python** and **Node.js**, with matching APIs (`snake_case` in Python, `camelCase` in Node.js) and a shared, cross-compatible index format.
8
+
9
+ [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mithun50/TreeDex/blob/main/treedex_demo.ipynb)
10
+ [![PyPI](https://img.shields.io/pypi/v/treedex)](https://pypi.org/project/treedex/)
11
+ [![npm](https://img.shields.io/npm/v/treedex)](https://www.npmjs.com/package/treedex)
12
+ [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
13
+ [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://python.org)
14
+ [![Node 18+](https://img.shields.io/badge/node-18+-green.svg)](https://nodejs.org)
15
+
16
+ ---
17
+
18
+ ## How It Works
19
+
20
+ <p align="center">
21
+ <img src="assets/how-treedex-works.svg" alt="How TreeDex Works" width="800"/>
22
+ </p>
23
+
24
+ 1. **Load** — Extract pages from any supported format
25
+ 2. **Index** — LLM analyzes page groups and extracts hierarchical structure
26
+ 3. **Build** — Flat sections become a tree with page ranges and embedded text
27
+ 4. **Query** — LLM selects relevant tree nodes for your question
28
+ 5. **Return** — Get context text, source pages, and reasoning
29
+
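To make the **Build** step more concrete, here is a minimal sketch of how flat sections could be nested into a tree using dotted structure numbers such as `"1"` and `"1.2"`. It is not TreeDex's actual implementation: the `Node` class, the `children` field, and the `build_tree` helper are hypothetical names chosen for illustration, and only the `structure` and `title` keys come from the index format shown in this README.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    structure: str  # hierarchical number such as "1" or "1.2"
    title: str
    children: list["Node"] = field(default_factory=list)  # assumed field name


def build_tree(flat_sections: list[dict]) -> list[Node]:
    """Nest flat sections under their parents (hypothetical helper)."""
    roots: list[Node] = []
    by_structure: dict[str, Node] = {}
    for section in flat_sections:
        node = Node(section["structure"], section["title"])
        by_structure[node.structure] = node
        # "1.2" -> parent "1"; a top-level "1" -> parent "" (no parent)
        parent_key = node.structure.rpartition(".")[0]
        parent = by_structure.get(parent_key)
        if parent is None:
            roots.append(node)
        else:
            parent.children.append(node)
    return roots


sections = [
    {"structure": "1", "title": "Introduction"},
    {"structure": "1.1", "title": "Background"},
    {"structure": "1.2", "title": "Scope"},
    {"structure": "2", "title": "Methods"},
]
tree = build_tree(sections)
```

The sketch assumes sections arrive in document order, so a parent is always seen before its children; a section whose parent is missing is simply treated as a root.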
30
+ ### Why TreeDex instead of a Vector DB?
31
+
32
+ <p align="center">
33
+ <img src="assets/treedex-vs-vectordb.svg" alt="TreeDex vs Vector DB" width="800"/>
34
+ </p>
35
+
36
+ ---
37
+
38
+ ## Supported LLM Providers
39
+
40
+ <p align="center">
41
+ <img src="assets/llm-providers.svg" alt="LLM Providers" width="800"/>
42
+ </p>
43
+
44
+ TreeDex ships ready-made backends for the providers listed below, plus universal backends for anything else. Pick what works for you:
45
+
46
+ ### One-liner backends (zero config)
47
+
48
+ | Backend | Provider | Default Model | Python Deps | Node.js Deps |
49
+ |---------|----------|---------------|-------------|-------------|
50
+ | `GeminiLLM` | Google | gemini-2.0-flash | `google-generativeai` | `@google/generative-ai` |
51
+ | `OpenAILLM` | OpenAI | gpt-4o | `openai` | `openai` |
52
+ | `ClaudeLLM` | Anthropic | claude-sonnet-4-20250514 | `anthropic` | `@anthropic-ai/sdk` |
53
+ | `MistralLLM` | Mistral AI | mistral-large-latest | `mistralai` | `@mistralai/mistralai` |
54
+ | `CohereLLM` | Cohere | command-r-plus | `cohere` | `cohere-ai` |
55
+ | `GroqLLM` | Groq | llama-3.3-70b-versatile | `groq` | `groq-sdk` |
56
+ | `TogetherLLM` | Together AI | Llama-3-70b-chat-hf | **None** | **None (fetch)** |
57
+ | `FireworksLLM` | Fireworks | llama-v3p1-70b-instruct | **None** | **None (fetch)** |
58
+ | `OpenRouterLLM` | OpenRouter | claude-sonnet-4 | **None** | **None (fetch)** |
59
+ | `DeepSeekLLM` | DeepSeek | deepseek-chat | **None** | **None (fetch)** |
60
+ | `CerebrasLLM` | Cerebras | llama-3.3-70b | **None** | **None (fetch)** |
61
+ | `SambanovaLLM` | SambaNova | Llama-3.1-70B-Instruct | **None** | **None (fetch)** |
62
+ | `HuggingFaceLLM` | HuggingFace | Mistral-7B-Instruct | **None** | **None (fetch)** |
63
+ | `OllamaLLM` | Ollama (local) | llama3 | **None** | **None (fetch)** |
64
+
65
+ ### Universal backends
66
+
67
+ | Backend | Use case | Dependencies |
68
+ |---------|----------|-------------|
69
+ | `OpenAICompatibleLLM` | **Any** OpenAI-compatible endpoint (URL + key) | **None** |
70
+ | `LiteLLM` | 100+ providers via litellm library (Python only) | `litellm` |
71
+ | `FunctionLLM` | Wrap any function | **None** |
72
+ | `BaseLLM` | Subclass to build your own | **None** |
73
+
74
+ ---
75
+
76
+ ## Quick Start
77
+
78
+ ### Install
79
+
80
+ <table>
81
+ <tr><th>Python</th><th>Node.js</th></tr>
82
+ <tr><td>
83
+
84
+ ```bash
85
+ pip install treedex
86
+
87
+ # With optional LLM SDK
88
+ pip install treedex[gemini]
89
+ pip install treedex[openai]
90
+ pip install treedex[claude]
91
+ pip install treedex[all]
92
+ ```
93
+
94
+ </td><td>
95
+
96
+ ```bash
97
+ npm install treedex
98
+
99
+ # With optional LLM SDK
100
+ npm install treedex openai
101
+ npm install treedex @google/generative-ai
102
+ npm install treedex @anthropic-ai/sdk
103
+ ```
104
+
105
+ </td></tr>
106
+ </table>
107
+
108
+ ### Pick your LLM and go
109
+
110
+ <table>
111
+ <tr><th>Python</th><th>Node.js / TypeScript</th></tr>
112
+ <tr><td>
113
+
114
+ ```python
115
+ from treedex import TreeDex, GeminiLLM
116
+
117
+ llm = GeminiLLM(api_key="YOUR_KEY")
118
+
119
+ index = TreeDex.from_file("doc.pdf", llm=llm)
120
+ result = index.query("What is the main argument?")
121
+
122
+ print(result.context)
123
+ print(result.pages_str) # "pages 5-8, 12-15"
124
+ ```
125
+
126
+ </td><td>
127
+
128
+ ```typescript
129
+ import { TreeDex, GeminiLLM } from "treedex";
130
+
131
+ const llm = new GeminiLLM("YOUR_KEY");
132
+
133
+ const index = await TreeDex.fromFile("doc.pdf", llm);
134
+ const result = await index.query("What is the main argument?");
135
+
136
+ console.log(result.context);
137
+ console.log(result.pagesStr); // "pages 5-8, 12-15"
138
+ ```
139
+
140
+ </td></tr>
141
+ </table>
142
+
143
+ ### All providers work the same way
144
+
145
+ <table>
146
+ <tr><th>Python</th><th>Node.js / TypeScript</th></tr>
147
+ <tr><td>
148
+
149
+ ```python
150
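+ from treedex import GeminiLLM, OpenAILLM, ClaudeLLM, GroqLLM, TogetherLLM, DeepSeekLLM, OpenRouterLLM, OllamaLLM, OpenAICompatibleLLM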
151
+
152
+ # Google Gemini
153
+ llm = GeminiLLM(api_key="YOUR_KEY")
154
+
155
+ # OpenAI
156
+ llm = OpenAILLM(api_key="sk-...")
157
+
158
+ # Claude
159
+ llm = ClaudeLLM(api_key="sk-ant-...")
160
+
161
+ # Groq (fast inference)
162
+ llm = GroqLLM(api_key="gsk_...")
163
+
164
+ # Together AI
165
+ llm = TogetherLLM(api_key="...")
166
+
167
+ # DeepSeek
168
+ llm = DeepSeekLLM(api_key="...")
169
+
170
+ # OpenRouter (access any model)
171
+ llm = OpenRouterLLM(api_key="...")
172
+
173
+ # Local Ollama
174
+ llm = OllamaLLM(model="llama3")
175
+
176
+ # Any OpenAI-compatible endpoint
177
+ llm = OpenAICompatibleLLM(
178
+ base_url="https://your-api.com/v1",
179
+ api_key="...",
180
+ model="model-name",
181
+ )
182
+ ```
183
+
184
+ </td><td>
185
+
186
+ ```typescript
187
+ import { GeminiLLM, OpenAILLM, ClaudeLLM, GroqLLM, TogetherLLM, DeepSeekLLM, OpenRouterLLM, OllamaLLM, OpenAICompatibleLLM } from "treedex";
188
+
189
+ // Google Gemini
190
+ const llm = new GeminiLLM("YOUR_KEY");
191
+
192
+ // OpenAI
193
+ const llm = new OpenAILLM("sk-...");
194
+
195
+ // Claude
196
+ const llm = new ClaudeLLM("sk-ant-...");
197
+
198
+ // Groq (fast inference)
199
+ const llm = new GroqLLM("gsk_...");
200
+
201
+ // Together AI
202
+ const llm = new TogetherLLM("...");
203
+
204
+ // DeepSeek
205
+ const llm = new DeepSeekLLM("...");
206
+
207
+ // OpenRouter (access any model)
208
+ const llm = new OpenRouterLLM("...");
209
+
210
+ // Local Ollama
211
+ const llm = new OllamaLLM("llama3");
212
+
213
+ // Any OpenAI-compatible endpoint
214
+ const llm = new OpenAICompatibleLLM({
215
+ baseUrl: "https://your-api.com/v1",
216
+ apiKey: "...",
217
+ model: "model-name",
218
+ });
219
+ ```
220
+
221
+ </td></tr>
222
+ </table>
223
+
224
+ ### Wrap any function
225
+
226
+ <table>
227
+ <tr><th>Python</th><th>Node.js / TypeScript</th></tr>
228
+ <tr><td>
229
+
230
+ ```python
231
+ from treedex import FunctionLLM
232
+
233
+ llm = FunctionLLM(lambda p: my_api(p))
234
+ ```
235
+
236
+ </td><td>
237
+
238
+ ```typescript
239
+ import { FunctionLLM } from "treedex";
240
+
241
+ const llm = new FunctionLLM((p) => myApi(p));
242
+ ```
243
+
244
+ </td></tr>
245
+ </table>
246
+
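Because a wrapped function is just a callable from prompt string to response string, ordinary Python tooling can be layered on top of it before handing it to `FunctionLLM`. The sketch below adds simple retries; `with_retries` and `flaky_api` are hypothetical names used for illustration, not part of TreeDex.

```python
import time
from typing import Callable


def with_retries(
    fn: Callable[[str], str], attempts: int = 3, delay: float = 0.0
) -> Callable[[str], str]:
    """Return a callable that retries fn up to `attempts` times (hypothetical helper)."""

    def wrapped(prompt: str) -> str:
        last_error = None
        for attempt in range(attempts):
            try:
                return fn(prompt)
            except Exception as exc:  # broad on purpose: any failure triggers a retry
                last_error = exc
                if attempt < attempts - 1:
                    time.sleep(delay)
        raise RuntimeError(f"all {attempts} attempts failed") from last_error

    return wrapped


# Stand-in for a real API call: fails twice, then succeeds.
calls = {"count": 0}


def flaky_api(prompt: str) -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return f"echo: {prompt}"


safe_api = with_retries(flaky_api, attempts=3)
reply = safe_api("hello")

# With treedex installed, the wrapped callable could then be passed along:
# llm = FunctionLLM(safe_api)
```

Keeping the retry logic outside the backend means the same wrapper works for any provider function.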
247
+ ### Build your own backend
248
+
249
+ <table>
250
+ <tr><th>Python</th><th>Node.js / TypeScript</th></tr>
251
+ <tr><td>
252
+
253
+ ```python
254
+ from treedex import BaseLLM
255
+
256
+ class MyLLM(BaseLLM):
257
+ def generate(self, prompt: str) -> str:
258
+ return my_api_call(prompt)
259
+ ```
260
+
261
+ </td><td>
262
+
263
+ ```typescript
264
+ import { BaseLLM } from "treedex";
265
+
266
+ class MyLLM extends BaseLLM {
267
+ async generate(prompt: string): Promise<string> {
268
+ return await myApiCall(prompt);
269
+ }
270
+ }
271
+ ```
272
+
273
+ </td></tr>
274
+ </table>
275
+
276
+ ### Swap LLM at query time
277
+
278
+ ```python
279
+ # Build index with one LLM
280
+ index = TreeDex.from_file("doc.pdf", llm=gemini_llm)
281
+
282
+ # Query with a different one — same index, different brain
283
+ result = index.query("...", llm=groq_llm)
284
+ ```
285
+
286
+ ### Save and load indexes
287
+
288
+ Indexes are saved as JSON. An index created in Python loads in Node.js and vice versa.
289
+
290
+ <table>
291
+ <tr><th>Python</th><th>Node.js / TypeScript</th></tr>
292
+ <tr><td>
293
+
294
+ ```python
295
+ # Save
296
+ index.save("my_index.json")
297
+
298
+ # Load
299
+ index = TreeDex.load("my_index.json", llm=llm)
300
+ ```
301
+
302
+ </td><td>
303
+
304
+ ```typescript
305
+ // Save
306
+ await index.save("my_index.json");
307
+
308
+ // Load
309
+ const index = await TreeDex.load("my_index.json", llm);
310
+ ```
311
+
312
+ </td></tr>
313
+ </table>
314
+
315
+ ---
316
+
317
+ ## Supported Document Formats
318
+
319
+ | Format | Loader | Python Deps | Node.js Deps |
320
+ |--------|--------|-------------|-------------|
321
+ | PDF | `PDFLoader` | `pymupdf` | `pdfjs-dist` (included) |
322
+ | TXT / MD | `TextLoader` | None | None |
323
+ | HTML | `HTMLLoader` | None (stdlib) | `htmlparser2` (optional, has fallback) |
324
+ | DOCX | `DOCXLoader` | `python-docx` | `mammoth` (optional) |
325
+
326
+ Use `auto_loader(path)` / `autoLoader(path)` for automatic format detection.
327
+
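One plausible way automatic format detection could work is a lookup on the file extension. The sketch below is an assumption for illustration only and is not necessarily how `auto_loader` is implemented; the loader names are taken from the table above, while `LOADER_BY_EXTENSION` and `guess_loader_name` are hypothetical.

```python
from pathlib import Path

# Loader names come from the table above; the mapping itself is an assumption.
LOADER_BY_EXTENSION = {
    ".pdf": "PDFLoader",
    ".txt": "TextLoader",
    ".md": "TextLoader",
    ".html": "HTMLLoader",
    ".docx": "DOCXLoader",
}


def guess_loader_name(path: str) -> str:
    """Map a file path to a loader name by extension (hypothetical helper)."""
    suffix = Path(path).suffix.lower()  # case-insensitive: "A.PDF" -> ".pdf"
    try:
        return LOADER_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"unsupported file extension: {suffix!r}") from None
```

For example, `guess_loader_name("report.PDF")` resolves to `"PDFLoader"`, and an unknown extension raises `ValueError` rather than silently falling back to plain text.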
328
+ ---
329
+
330
+ ## API Reference
331
+
332
+ ### `TreeDex`
333
+
334
+ | Method | Python | Node.js |
335
+ |--------|--------|---------|
336
+ | Build from file | `TreeDex.from_file(path, llm)` | `await TreeDex.fromFile(path, llm)` |
337
+ | Build from pages | `TreeDex.from_pages(pages, llm)` | `await TreeDex.fromPages(pages, llm)` |
338
+ | Create from tree | `TreeDex.from_tree(tree, pages)` | `TreeDex.fromTree(tree, pages)` |
339
+ | Query | `index.query(question)` | `await index.query(question)` |
340
+ | Save | `index.save(path)` | `await index.save(path)` |
341
+ | Load | `TreeDex.load(path, llm)` | `await TreeDex.load(path, llm)` |
342
+ | Show tree | `index.show_tree()` | `index.showTree()` |
343
+ | Stats | `index.stats()` | `index.stats()` |
344
+ | Find large | `index.find_large_sections()` | `index.findLargeSections()` |
345
+
346
+ ### `QueryResult`
347
+
348
+ | Property | Python | Node.js | Description |
349
+ |----------|--------|---------|-------------|
350
+ | Context | `.context` | `.context` | Concatenated text from relevant sections |
351
+ | Node IDs | `.node_ids` | `.nodeIds` | IDs of selected tree nodes |
352
+ | Page ranges | `.page_ranges` | `.pageRanges` | `[(start, end), ...]` page ranges |
353
+ | Pages string | `.pages_str` | `.pagesStr` | Human-readable: `"pages 5-8, 12-15"` |
354
+ | Reasoning | `.reasoning` | `.reasoning` | LLM's explanation for selection |
355
+
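The relationship between `page_ranges` and `pages_str` can be illustrated with a small formatter. This is a sketch of the idea, not TreeDex's own code: `format_pages` is a hypothetical helper, and rendering a single-page range as one number is an assumption.

```python
def format_pages(page_ranges: list[tuple[int, int]]) -> str:
    """Render (start, end) tuples as a human-readable string (hypothetical helper)."""
    parts = []
    for start, end in page_ranges:
        # Assumption: a one-page range is shown as a single number.
        parts.append(str(start) if start == end else f"{start}-{end}")
    return "pages " + ", ".join(parts)


print(format_pages([(5, 8), (12, 15)]))  # pages 5-8, 12-15
```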
356
+ ### Cross-language Index Compatibility
357
+
358
+ TreeDex uses the **same JSON index format** in both Python and Node.js. All field names use `snake_case` in the JSON:
359
+
360
+ ```json
361
+ {
362
+ "version": "1.0",
363
+ "framework": "TreeDex",
364
+ "tree": [{ "structure": "1", "title": "...", "node_id": "0001", ... }],
365
+ "pages": [{ "page_num": 0, "text": "...", "token_count": 123 }]
366
+ }
367
+ ```
368
+
369
+ Build an index with Python, query it from Node.js (or vice versa).
370
+
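Since the index is plain JSON, it can also be inspected without TreeDex at all. The sketch below parses a small hand-written index with the standard `json` module; the sample values are invented for illustration, and only the fields shown in the example above are used (any fields elided by `...` are left out).

```python
import json

# Hand-written sample that follows the documented shape; values are made up.
sample_index = """
{
  "version": "1.0",
  "framework": "TreeDex",
  "tree": [{"structure": "1", "title": "Introduction", "node_id": "0001"}],
  "pages": [
    {"page_num": 0, "text": "First page text", "token_count": 3},
    {"page_num": 1, "text": "Second page text", "token_count": 3}
  ]
}
"""

index = json.loads(sample_index)
if index.get("framework") != "TreeDex":
    raise ValueError("not a TreeDex index")

# Only top-level nodes are listed; nested fields are not documented here.
top_level_titles = [node["title"] for node in index["tree"]]
total_tokens = sum(page["token_count"] for page in index["pages"])
print(index["version"], top_level_titles, total_tokens)
```

To inspect a saved index instead, replace `json.loads(sample_index)` with `json.load` on the opened file.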
371
+ ---
372
+
373
+ ## Benchmarks
374
+
375
+ ### TreeDex vs Vector DB vs Naive Chunking
376
+
377
+ <p align="center">
378
+ <img src="assets/comparison.svg" alt="Comparison Benchmark" width="800"/>
379
+ </p>
380
+
381
+ The figures come from a real run on a single document (NCERT Electromagnetic Waves, 14 pages, 10 queries). All three methods retrieve from the same content; only the indexing and retrieval approach differs. **Auto-generated by CI on every push.**
382
+
383
+ ### TreeDex Stats
384
+
385
+ <p align="center">
386
+ <img src="assets/benchmarks.svg" alt="Benchmarks" width="800"/>
387
+ </p>
388
+
389
+ | Feature | TreeDex | Vector RAG | Naive Chunking |
390
+ |---------|---------|------------|----------------|
391
+ | **Page Attribution** | Exact source pages | Approximate | None |
392
+ | **Structure Preserved** | Full tree hierarchy | None | None |
393
+ | **Index Format** | Human-readable JSON | Opaque vectors | Text chunks |
394
+ | **Embedding Model** | Not needed | Required | Not needed |
395
+ | **Infrastructure** | None (JSON file) | Vector DB required | None |
396
+ | **Core Dependencies** | 2 | 5-8+ | 2-5 |
397
+
398
+ Run your own benchmarks:
399
+
400
+ ```bash
401
+ # Python
402
+ python benchmarks/run_benchmark.py
403
+
404
+ # Node.js
405
+ npx tsx benchmarks/node/run-benchmark.ts
406
+ ```
407
+
408
+ ---
409
+
410
+ ## Architecture
411
+
412
+ <p align="center">
413
+ <img src="assets/architecture.svg" alt="Architecture" width="800"/>
414
+ </p>
415
+
416
+ ---
417
+
418
+ ## Project Structure
419
+
420
+ ```
421
+ treedex/
422
+ ├── treedex/ # Python package
423
+ │ ├── core.py
424
+ │ ├── llm_backends.py
425
+ │ ├── loaders.py
426
+ │ ├── pdf_parser.py
427
+ │ ├── tree_builder.py
428
+ │ ├── tree_utils.py
429
+ │ └── prompts.py
430
+ ├── src/ # TypeScript source
431
+ │ ├── index.ts
432
+ │ ├── core.ts
433
+ │ ├── llm-backends.ts
434
+ │ ├── loaders.ts
435
+ │ ├── pdf-parser.ts
436
+ │ ├── tree-builder.ts
437
+ │ ├── tree-utils.ts
438
+ │ ├── prompts.ts
439
+ │ └── types.ts
440
+ ├── tests/ # Python tests (pytest)
441
+ ├── test/ # Node.js tests (vitest)
442
+ ├── examples/ # Python examples
443
+ ├── examples/node/ # Node.js examples
444
+ ├── benchmarks/ # Python benchmarks
445
+ ├── benchmarks/node/ # Node.js benchmarks
446
+ ├── pyproject.toml # Python package config
447
+ ├── package.json # npm package config
448
+ ├── tsconfig.json # TypeScript config
449
+ └── tsup.config.ts # Build config (ESM + CJS)
450
+ ```
451
+
452
+ ---
453
+
454
+ ## Running Tests
455
+
456
+ <table>
457
+ <tr><th>Python</th><th>Node.js</th></tr>
458
+ <tr><td>
459
+
460
+ ```bash
461
+ pip install -e ".[dev]"
462
+ pytest
463
+ pytest --cov=treedex
464
+ pytest tests/test_core.py -v
465
+ ```
466
+
467
+ </td><td>
468
+
469
+ ```bash
470
+ npm install
471
+ npm test
472
+ npm run test:watch
473
+ npm run typecheck
474
+ ```
475
+
476
+ </td></tr>
477
+ </table>
478
+
479
+ ---
480
+
481
+ ## Examples
482
+
483
+ ### Python
484
+
485
+ ```bash
486
+ python examples/quickstart.py path/to/document.pdf
487
+ python examples/multi_provider.py
488
+ python examples/custom_llm.py
489
+ python examples/save_load.py path/to/document.pdf
490
+ ```
491
+
492
+ ### Node.js
493
+
494
+ ```bash
495
+ npx tsx examples/node/quickstart.ts path/to/document.pdf
496
+ npx tsx examples/node/multi-provider.ts
497
+ npx tsx examples/node/custom-llm.ts
498
+ npx tsx examples/node/save-load.ts path/to/document.pdf
499
+ ```
500
+
501
+ ---
502
+
503
+ ## Contributing
504
+
505
+ ```bash
506
+ git clone https://github.com/mithun50/TreeDex.git
507
+ cd TreeDex
508
+
509
+ # Python development
510
+ pip install -e ".[dev]"
511
+ pytest
512
+
513
+ # Node.js development
514
+ npm install
515
+ npm run build
516
+ npm test
517
+ ```
518
+
519
+ ---
520
+
521
+ ## License
522
+
523
+ MIT License — Mithun Gowda B