treedex 0.1.4 → 0.1.5
- package/README.md +22 -4
- package/package.json +1 -1
- package/dist/index.cjs +0 -1253
- package/dist/index.cjs.map +0 -1
- package/dist/index.d.cts +0 -426
- package/dist/index.d.ts +0 -426
- package/dist/index.js +0 -1177
- package/dist/index.js.map +0 -1
package/README.md
CHANGED
@@ -22,10 +22,20 @@ Available for both **Python** and **Node.js** — same API, same index format, f
 </p>
 
 1. **Load** — Extract pages from any supported format
-2. **
-3. **
-4. **
-5. **
+2. **Detect** — Auto-extract PDF table of contents or detect headings via font-size analysis (`[H1]`/`[H2]`/`[H3]` markers)
+3. **Index** — If a PDF ToC is found, build the tree directly (no LLM needed). Otherwise, LLM analyzes page groups with heading hints to extract hierarchical structure
+4. **Build** — Flat sections become a tree with page ranges and embedded text. Orphaned subsections are auto-repaired
+5. **Query** — LLM selects relevant tree nodes for your question
+6. **Return** — Get context text, source pages, and reasoning
+
+### Smart Hierarchy Detection
+
+TreeDex uses multiple strategies to accurately extract document structure, especially for large (300+ page) documents:
+
+- **PDF ToC extraction** — If the PDF has bookmarks/outline, the tree is built directly from it — zero LLM calls needed
+- **Font-size heading detection** — Analyzes font sizes across the document and injects `[H1]`/`[H2]`/`[H3]` markers so the LLM knows exactly which level each heading belongs to
+- **Capped continuation context** — For multi-chunk documents, the LLM sees a summary of top-level sections + recent sections instead of the full history, preventing prompt bloat
+- **Orphan repair** — If the LLM outputs `"2.3.1"` without a `"2.3"` parent, synthetic parents are auto-inserted to maintain a valid tree
 
 ### Why TreeDex instead of Vector DB?
 
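Note on the "Orphan repair" bullet added in this hunk: the README describes the behavior but not the mechanics. The following is a minimal sketch of one way such a repair could work, not TreeDex's actual implementation. The function name `repair_orphans_sketch`, the `"structure"`/`"title"` keys, and the `"synthetic"` flag are all hypothetical and chosen only for illustration.

```python
def repair_orphans_sketch(sections):
    """Insert placeholder parents so every dotted number has its ancestors.

    `sections` is a list of dicts in document order, each with a
    "structure" key such as "2.3.1". This shape is an assumption for
    illustration, not TreeDex's real section schema.
    """
    # Every number that already exists somewhere in the input.
    known = {s["structure"] for s in sections}
    repaired = []
    for section in sections:
        parts = section["structure"].split(".")
        # Walk the ancestor chain from shallowest to deepest,
        # e.g. "2" and then "2.3" for the section "2.3.1".
        for depth in range(1, len(parts)):
            ancestor = ".".join(parts[:depth])
            if ancestor not in known:
                known.add(ancestor)
                repaired.append({
                    "structure": ancestor,
                    "title": f"Section {ancestor}",  # placeholder title
                    "synthetic": True,               # hypothetical marker
                })
        repaired.append(section)
    return repaired


sections = [
    {"structure": "1", "title": "Intro"},
    {"structure": "2", "title": "Methods"},
    {"structure": "2.3.1", "title": "Details"},
]
repaired = repair_orphans_sketch(sections)
print([s["structure"] for s in repaired])
```

In this sketch the missing `"2.3"` entry is inserted immediately before `"2.3.1"`, so a later tree-building pass always finds a parent for each child.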
@@ -390,6 +400,14 @@ Use `auto_loader(path)` / `autoLoader(path)` for automatic format detection.
 | Reasoning | `.reasoning` | `.reasoning` | LLM's explanation for selection |
 | Answer | `.answer` | `.answer` | LLM-generated answer (agentic mode only) |
 
+### Hierarchy Utilities
+
+| Function | Python | Node.js | Description |
+|----------|--------|---------|-------------|
+| ToC → sections | `toc_to_sections(toc)` | `tocToSections(toc)` | Convert PDF ToC entries to numbered sections |
+| Repair orphans | `repair_orphans(sections)` | `repairOrphans(sections)` | Insert synthetic parents for orphaned subsections |
+| Extract PDF ToC | `extract_toc(path)` | `await extractToc(path)` | Get ToC from PDF bookmarks (returns `None`/`null` if unavailable) |
+
 ### Cross-language Index Compatibility
 
 TreeDex uses the **same JSON index format** in both Python and Node.js. All field names use `snake_case` in the JSON:
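Note on the "ToC → sections" row added in this hunk: the table says ToC entries are converted to numbered sections but does not show the numbering scheme. The sketch below illustrates one plausible approach using per-level counters. It assumes each ToC entry is a `(level, title, page)` triple with 1-based levels. That input shape, the function name `toc_to_sections_sketch`, and the output keys are assumptions for illustration and may differ from what `toc_to_sections` really accepts and returns.

```python
def toc_to_sections_sketch(toc):
    """Turn (level, title, page) entries into dotted section numbers.

    The input and output shapes are assumptions for illustration only.
    """
    counters = []  # counters[i] is the current number at level i + 1
    sections = []
    for level, title, page in toc:
        # Going deeper: add a zeroed counter for each new level.
        # A skipped level keeps its zero, e.g. level 1 -> 3 gives "1.0.1".
        while len(counters) < level:
            counters.append(0)
        # Going shallower: forget the counters of deeper levels.
        del counters[level:]
        counters[level - 1] += 1
        sections.append({
            "structure": ".".join(str(n) for n in counters),
            "title": title,
            "page": page,
        })
    return sections


toc = [
    (1, "Intro", 1),
    (2, "Background", 2),
    (2, "Scope", 4),
    (1, "Methods", 6),
    (2, "Setup", 7),
]
for section in toc_to_sections_sketch(toc):
    print(section["structure"], section["title"], section["page"])
```

Because the numbers come from the outline's own nesting, this path needs no LLM call, which matches the "zero LLM calls" claim made for PDFs that ship with bookmarks.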