pi-docparser 1.0.0
- package/CHANGELOG.md +17 -0
- package/LICENSE +21 -0
- package/README.md +219 -0
- package/THIRD_PARTY_NOTICES.md +29 -0
- package/assets/pi-docparser-preview.jpg +0 -0
- package/extensions/docparser/constants.ts +71 -0
- package/extensions/docparser/deps.ts +522 -0
- package/extensions/docparser/doctor.ts +291 -0
- package/extensions/docparser/index.ts +9 -0
- package/extensions/docparser/input.ts +230 -0
- package/extensions/docparser/request.ts +67 -0
- package/extensions/docparser/schema.ts +82 -0
- package/extensions/docparser/tool.ts +305 -0
- package/extensions/docparser/types.ts +100 -0
- package/licenses/LiteParse-APACHE-2.0.txt +201 -0
- package/package.json +66 -0
- package/skills/parse-document/SKILL.md +139 -0
@@ -0,0 +1,139 @@

---
name: parse-document
description: Use this skill when the user wants to parse local documents and extract content from PDFs, Word or DOCX files, PowerPoint or PPTX decks, Excel/XLSX or CSV spreadsheets, or images such as PNG, JPG, TIFF, and WebP. It helps the agent use the dedicated `document_parse` tool efficiently for plain text extraction, layout-aware JSON with coordinates and bounding boxes, OCR, page-limited parsing, and PDF screenshots.
license: MIT
compatibility: Requires a `document_parse` tool, such as the one provided by the pi-docparser package.
metadata:
  author: Maximilian Schwarzmüller + pi
  primary-interface: document_parse tool
---

# Parse Document

Use the dedicated `document_parse` tool as the default interface. Do not fall back to manual `lit` CLI commands unless the user explicitly asks for the raw command line workflow or the extension tool is unavailable.

## Efficient tool usage

### Choose the smallest useful output

- Use `format: "text"` when the user wants to read, summarize, quote, search, or review the document.
- Use `format: "json"` when the user needs structured page data, text positions, or bounding boxes.
- Avoid asking for JSON unless coordinates or programmatic structure actually matter.
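
The format rule above can be sketched as a small decision helper. This is only an illustration, not the extension's code: `OutputNeeds` and `chooseFormat` are hypothetical names, and only the `format` values `"text"` and `"json"` come from this skill.

```typescript
// Hypothetical description of what the user needs from the parse result.
interface OutputNeeds {
  needsCoordinates: boolean; // text positions or bounding boxes
  needsStructuredPages: boolean; // programmatic per-page structure
}

// The two `format` values named in this skill.
type ParseFormat = "text" | "json";

// Pick the smallest useful output: JSON only when structure or coordinates matter.
function chooseFormat(needs: OutputNeeds): ParseFormat {
  return needs.needsCoordinates || needs.needsStructuredPages ? "json" : "text";
}
```

Reading, summarizing, quoting, and searching all fall through to `"text"`, which keeps the output small.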

### Limit scope early

If the task concerns only part of a document, pass `targetPages` instead of parsing everything.

Examples:

- a single chapter
- a cited appendix
- a page range from the user
- a specific page mentioned in an error report or screenshot request
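
When the relevant pages arrive as a loose list of numbers, a helper like the following could turn them into a compact value for `targetPages`. This is a sketch under a stated assumption: this skill does not spell out the `targetPages` syntax, so the comma-separated range form (for example `1-3,7`) is assumed here, and `toPageRanges` is a hypothetical name.

```typescript
// Collapse page numbers into a compact range string such as "1-3,7".
// ASSUMPTION: `targetPages` accepts this comma-separated range syntax.
function toPageRanges(pages: number[]): string {
  // Keep positive whole numbers, drop duplicates, sort ascending.
  const sorted = pages
    .filter((p, i) => p > 0 && Math.floor(p) === p && pages.indexOf(p) === i)
    .sort((a, b) => a - b);

  const formatRun = (a: number, b: number): string =>
    a === b ? String(a) : a + "-" + b;

  const parts: string[] = [];
  let start: number | null = null;
  let end = 0;
  for (const page of sorted) {
    if (start !== null && page === end + 1) {
      end = page; // extend the current consecutive run
      continue;
    }
    if (start !== null) parts.push(formatRun(start, end));
    start = page; // begin a new run
    end = page;
  }
  if (start !== null) parts.push(formatRun(start, end));
  return parts.join(",");
}
```

An empty input yields an empty string, in which case it is safer to omit `targetPages` than to pass an empty value.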

### Use OCR deliberately

- Use `ocr: "off"` for native-text PDFs when OCR is unnecessary.
- Leave OCR on automatic behavior for scanned PDFs or image-heavy documents.
- Use `ocrLanguage` for a single OCR language.
- Use `ocrLanguages` only when you truly need multilingual OCR.
- Built-in Tesseract usually expects ISO 639-3 codes such as `eng`, `deu`, `fra`, or `jpn`.
- Many HTTP OCR servers instead expect ISO 639-1 codes such as `en`, `de`, `fr`, or `ja`.
- Increase `dpi` only when OCR quality or screenshot readability really needs it.
- Use `ocrServerUrl` only when the user already has or wants an external OCR server.

Note: on the first OCR run, LiteParse may download Tesseract language/model data.
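
The two language-code conventions above are easy to mix up, so a small lookup may help. The sketch below covers only the four languages listed in this section. `ocrLanguageFor` and its `usesHttpServer` flag are hypothetical, and a specific OCR server may expect something other than two-letter codes, so check its documentation first.

```typescript
// ISO 639-1 -> ISO 639-3 for the four languages named in this section.
const ISO_639_1_TO_3: { [code: string]: string } = {
  en: "eng",
  de: "deu",
  fr: "fra",
  ja: "jpn",
};

// Hypothetical helper: return the code in the convention the OCR backend likely expects.
// ASSUMPTION: built-in Tesseract wants three-letter codes and an HTTP server wants two-letter ones.
function ocrLanguageFor(code: string, usesHttpServer: boolean): string {
  const lower = code.toLowerCase();
  for (const twoLetter of Object.keys(ISO_639_1_TO_3)) {
    const threeLetter = ISO_639_1_TO_3[twoLetter];
    if (lower === twoLetter || lower === threeLetter) {
      return usesHttpServer ? twoLetter : threeLetter;
    }
  }
  // Unknown codes pass through unchanged rather than being guessed.
  return lower;
}
```

For example, `ocrLanguageFor("de", false)` gives the Tesseract-style `deu`, while `ocrLanguageFor("jpn", true)` gives `ja`.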

### Use screenshots only when text is not enough

Request `screenshotPages` when the task depends on visual layout, charts, handwriting, dense tables, or page appearance.

Do not request screenshots by default.

Important constraints:

- screenshots are currently generated only for PDF inputs
- screenshot output is PNG-based in this extension
- use `screenshotPages: "all"` to render all PDF pages when needed
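
The screenshot constraints above can be expressed as a guard that adds `screenshotPages` only when it can actually help. This is a sketch, not the tool's behavior: `withScreenshots` and `ScreenshotArgs` are hypothetical, the PDF check is a simple extension test, and any `screenshotPages` value other than `"all"` is an assumed page-selection string.

```typescript
// Hypothetical subset of the tool arguments relevant to screenshots.
interface ScreenshotArgs {
  path: string;
  screenshotPages?: string;
}

// Add `screenshotPages` only for PDF inputs and only when visual review is needed.
function withScreenshots(
  path: string,
  pages: string,
  visualReviewNeeded: boolean
): ScreenshotArgs {
  const isPdf = /\.pdf$/i.test(path); // extension check only; content is not inspected
  if (!visualReviewNeeded || !isPdf) {
    return { path: path };
  }
  return { path: path, screenshotPages: pages };
}
```

For a non-PDF input such as a PPTX deck the guard simply leaves `screenshotPages` out, which matches the constraint that screenshots are currently PDF-only.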

## Follow-up workflow

The tool writes parsed output to temporary files and returns their paths.

After calling it:

1. inspect the returned parsed output path with `read`
2. inspect screenshot paths with `read` when visual review is needed
3. only copy files into the project if the user actually wants persistent artifacts

Do not ask the tool to inline an entire large document into context. Let it save the full result, then inspect the returned files selectively.

## Important constraints and expectations

- Office documents and spreadsheets may require LibreOffice on the host machine.
- Image inputs may require ImageMagick on the host machine.
- The tool surfaces those missing dependencies as friendly errors; do not misdiagnose them as generic parser failures.
- Parsed outputs are temporary by default. If the user wants durable files in the repo or a chosen folder, copy the returned temp files afterward.
- Screenshot output is PNG-based in this extension.
- The tool accepts PI-style paths such as `@relative/file.pdf` and `~/Documents/file.pdf`.
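
To make the path conventions in the last bullet concrete, here is one way such inputs might be normalized. The tool handles this itself, so the sketch is purely illustrative: `normalizeInputPath` is a hypothetical function, and the rules shown (strip a leading `@`, expand a leading `~` to a supplied home directory) are assumptions, not the extension's actual logic.

```typescript
// Illustrative only. ASSUMPTION: a leading "@" is a marker to strip,
// and a leading "~" stands for the user's home directory.
function normalizeInputPath(input: string, homeDir: string): string {
  let p = input.trim();
  if (p.charAt(0) === "@") {
    p = p.slice(1); // "@relative/file.pdf" -> "relative/file.pdf"
  }
  if (p === "~") {
    return homeDir;
  }
  if (p.slice(0, 2) === "~/") {
    // Avoid a doubled slash when homeDir already ends with one.
    return homeDir.replace(/\/+$/, "") + p.slice(1);
  }
  return p;
}
```

The home directory is passed in as a parameter so the function stays pure and easy to test.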

## Good default patterns

### Summarize or review a document

Use:

- `format: "text"`
- `targetPages` if only part of the document matters
- `ocr: "off"` for clearly native-text PDFs

Then inspect the returned text file with `read`.

### Extract positions or bounding boxes

Use:

- `format: "json"`
- `targetPages` when possible

Then inspect the JSON file with `read`.

### Review a visually complex page

Use:

- text or JSON parsing as needed
- `screenshotPages` for the relevant PDF pages
- higher `dpi` only if readability is a problem

Then inspect the generated screenshots with `read`.

### Parse a scanned or image-based document

Use:

- OCR enabled
- `ocrLanguage` or `ocrLanguages` when the document language is known
- optionally higher `dpi`
- JSON only if positional data matters
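
The four patterns above could translate into argument objects along these lines. Treat this as a sketch: the parameter names come from this skill, but the `DocumentParseArgs` interface, the value types, the page-selection strings, and the file names are all assumptions made for illustration.

```typescript
// Hypothetical argument shape; names from this skill, value types assumed.
interface DocumentParseArgs {
  path: string;
  format?: "text" | "json";
  targetPages?: string; // ASSUMPTION: page-selection string
  screenshotPages?: string; // "all" is documented; other values are assumed
  ocr?: string; // "off" is documented; leaving it unset keeps automatic behavior
  ocrLanguage?: string;
  dpi?: number;
}

// Summarize or review a native-text PDF: text output, limited pages, OCR off.
const summarize: DocumentParseArgs = {
  path: "report.pdf",
  format: "text",
  targetPages: "1-5",
  ocr: "off",
};

// Extract positions or bounding boxes: JSON, still scoped to the pages that matter.
const boundingBoxes: DocumentParseArgs = {
  path: "invoice.pdf",
  format: "json",
  targetPages: "2",
};

// Review a visually complex page: parse as text and add a screenshot of that page.
const visualReview: DocumentParseArgs = {
  path: "slides.pdf",
  format: "text",
  targetPages: "4",
  screenshotPages: "4",
};

// Scanned document: leave `ocr` unset (automatic) and name the language.
const scanned: DocumentParseArgs = {
  path: "scan.pdf",
  format: "text",
  ocrLanguage: "deu",
};
```

None of the objects sets `dpi`, in line with the advice to raise it only when readability or OCR quality turns out to be a problem.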

## Parameter reminders

The tool supports these high-value parameters:

- `path`
- `format`
- `targetPages`
- `screenshotPages`
- `ocr`
- `ocrLanguage`
- `ocrLanguages`
- `ocrServerUrl`
- `numWorkers`
- `maxPages`
- `dpi`
- `preciseBoundingBox`
- `preserveSmallText`
- `preserveLayoutAlignmentAcrossPages`

Prefer a minimal parameter set. Only add advanced options when the task clearly benefits from them.
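
One way to keep the parameter set minimal is to build arguments from optional values and drop anything left unset before calling the tool. The helper below is a hypothetical sketch (`minimalArgs` is not part of the extension) and assumes that omitted parameters fall back to the tool's defaults.

```typescript
// Drop entries that were never set so only deliberately chosen options are sent.
// ASSUMPTION: omitted parameters fall back to the tool's own defaults.
function minimalArgs(args: { [key: string]: unknown }): { [key: string]: unknown } {
  const result: { [key: string]: unknown } = {};
  for (const key of Object.keys(args)) {
    const value = args[key];
    if (value !== undefined && value !== null) {
      result[key] = value;
    }
  }
  return result;
}
```

For instance, passing `{ path: "a.pdf", format: "text", dpi: undefined }` leaves only `path` and `format` in the result.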