treedex 0.1.2 → 0.1.5

package/README.md CHANGED
@@ -22,10 +22,20 @@ Available for both **Python** and **Node.js** — same API, same index format, f
  </p>
 
  1. **Load** — Extract pages from any supported format
- 2. **Index** — LLM analyzes page groups and extracts hierarchical structure
- 3. **Build** — Flat sections become a tree with page ranges and embedded text
- 4. **Query** — LLM selects relevant tree nodes for your question
- 5. **Return** — Get context text, source pages, and reasoning
+ 2. **Detect** — Auto-extract PDF table of contents or detect headings via font-size analysis (`[H1]`/`[H2]`/`[H3]` markers)
+ 3. **Index** — If a PDF ToC is found, build the tree directly (no LLM needed). Otherwise, LLM analyzes page groups with heading hints to extract hierarchical structure
+ 4. **Build** — Flat sections become a tree with page ranges and embedded text. Orphaned subsections are auto-repaired
+ 5. **Query** — LLM selects relevant tree nodes for your question
+ 6. **Return** — Get context text, source pages, and reasoning
+
+ ### Smart Hierarchy Detection
+
+ TreeDex uses multiple strategies to accurately extract document structure, especially for large (300+ page) documents:
+
+ - **PDF ToC extraction** — If the PDF has bookmarks/outline, the tree is built directly from it — zero LLM calls needed
+ - **Font-size heading detection** — Analyzes font sizes across the document and injects `[H1]`/`[H2]`/`[H3]` markers so the LLM knows exactly which level each heading belongs to
+ - **Capped continuation context** — For multi-chunk documents, the LLM sees a summary of top-level sections + recent sections instead of the full history, preventing prompt bloat
+ - **Orphan repair** — If the LLM outputs `"2.3.1"` without a `"2.3"` parent, synthetic parents are auto-inserted to maintain a valid tree
 
  ### Why TreeDex instead of Vector DB?
 
@@ -390,6 +400,14 @@ Use `auto_loader(path)` / `autoLoader(path)` for automatic format detection.
  | Reasoning | `.reasoning` | `.reasoning` | LLM's explanation for selection |
  | Answer | `.answer` | `.answer` | LLM-generated answer (agentic mode only) |
 
+ ### Hierarchy Utilities
+
+ | Function | Python | Node.js | Description |
+ |----------|--------|---------|-------------|
+ | ToC → sections | `toc_to_sections(toc)` | `tocToSections(toc)` | Convert PDF ToC entries to numbered sections |
+ | Repair orphans | `repair_orphans(sections)` | `repairOrphans(sections)` | Insert synthetic parents for orphaned subsections |
+ | Extract PDF ToC | `extract_toc(path)` | `await extractToc(path)` | Get ToC from PDF bookmarks (returns `None`/`null` if unavailable) |
+
  ### Cross-language Index Compatibility
 
  TreeDex uses the **same JSON index format** in both Python and Node.js. All field names use `snake_case` in the JSON:
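
The README hunks above describe `toc_to_sections` and `repair_orphans` only through one-line table entries and the "Orphan repair" bullet. As a rough illustration of what such helpers might do, here is a minimal Python sketch. It is not TreeDex's actual implementation: the ToC entry shape `(level, title, page)` and the section keys `structure`, `title`, `page`, and `synthetic` are assumptions made for this example, and the real functions may use different inputs, field names, and edge-case handling.

```python
from typing import Any, Dict, List, Tuple

Section = Dict[str, Any]  # hypothetical shape: {"structure", "title", "page"}


def toc_to_sections(toc: List[Tuple[int, str, int]]) -> List[Section]:
    """Number ToC entries hierarchically: level 1 -> "1", level 2 -> "1.1", ..."""
    counters: List[int] = []
    sections: List[Section] = []
    for level, title, page in toc:
        # Clamp the level so a jump (e.g. level 1 straight to level 3)
        # is treated as going only one level deeper.
        level = min(max(level, 1), len(counters) + 1)
        del counters[level:]            # leaving a deeper level drops its counters
        while len(counters) < level:    # entering a deeper level starts a counter
            counters.append(0)
        counters[level - 1] += 1
        sections.append({
            "structure": ".".join(str(n) for n in counters),
            "title": title,
            "page": page,
        })
    return sections


def repair_orphans(sections: List[Section]) -> List[Section]:
    """Insert a placeholder parent before any section whose ancestor is missing."""
    existing = {str(s["structure"]) for s in sections}
    repaired: List[Section] = []
    for sec in sections:
        parts = str(sec["structure"]).split(".")
        # Walk ancestors from shallowest to deepest: "2", then "2.3", ...
        for depth in range(1, len(parts)):
            ancestor = ".".join(parts[:depth])
            if ancestor not in existing:
                existing.add(ancestor)
                repaired.append({
                    "structure": ancestor,
                    "title": f"Section {ancestor}",  # placeholder title
                    "page": sec["page"],             # inherit the child's page
                    "synthetic": True,
                })
        repaired.append(sec)
    return repaired


toc = [(1, "Intro", 1), (2, "Background", 2), (2, "Scope", 4),
       (1, "Methods", 6), (2, "Setup", 7)]
print([s["structure"] for s in toc_to_sections(toc)])
# prints ['1', '1.1', '1.2', '2', '2.1']

orphaned = [{"structure": "2", "title": "Methods", "page": 6},
            {"structure": "2.3.1", "title": "Details", "page": 12}]
print([s["structure"] for s in repair_orphans(orphaned)])
# prints ['2', '2.3', '2.3.1']
```

In this sketch the synthetic `"2.3"` parent is placed immediately before its first orphaned child and inherits that child's page, so a later tree-building pass would see every section's ancestors before the section itself.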
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "treedex",
- "version": "0.1.2",
+ "version": "0.1.5",
  "description": "Tree-based, vectorless document RAG framework. Connect any LLM via URL/API key.",
  "type": "module",
  "main": "dist/index.cjs",