pi-docparser 1.0.0

@@ -0,0 +1,139 @@
+ ---
+ name: parse-document
+ description: Use this skill when the user wants to parse local documents and extract content from PDFs, Word or DOCX files, PowerPoint or PPTX decks, Excel/XLSX or CSV spreadsheets, or images such as PNG, JPG, TIFF, and WebP. It helps the agent use the dedicated `document_parse` tool efficiently for plain text extraction, layout-aware JSON with coordinates and bounding boxes, OCR, page-limited parsing, and PDF screenshots.
+ license: MIT
+ compatibility: Requires a `document_parse` tool, such as the one provided by the pi-docparser package.
+ metadata:
+   author: Maximilian Schwarzmüller + pi
+   primary-interface: document_parse tool
+ ---
+
+ # Parse Document
+
+ Use the dedicated `document_parse` tool as the default interface. Do not fall back to manual `lit` CLI commands unless the user explicitly asks for the raw command-line workflow or the extension tool is unavailable.
+
+ ## Efficient tool usage
+
+ ### Choose the smallest useful output
+
+ - Use `format: "text"` when the user wants to read, summarize, quote, search, or review the document.
+ - Use `format: "json"` when the user needs structured page data, text positions, or bounding boxes.
+ - Avoid asking for JSON unless coordinates or programmatic structure actually matter.
+
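The format rule above can be written down as a small decision helper. This is only an illustrative sketch: `choose_format` and the example argument dictionaries are hypothetical names, not part of the `document_parse` tool itself.

```python
def choose_format(needs_coordinates: bool = False, needs_structure: bool = False) -> str:
    """Return "json" only when positions or programmatic structure matter."""
    return "json" if needs_coordinates or needs_structure else "text"


# Hypothetical argument sets for two kinds of request.
summary_args = {"path": "report.pdf", "format": choose_format()}
layout_args = {"path": "report.pdf", "format": choose_format(needs_coordinates=True)}
```

Defaulting to `"text"` keeps the output small unless the caller states a concrete need for structure.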
+ ### Limit scope early
+
+ If the task concerns only part of a document, pass `targetPages` instead of parsing everything.
+
+ Examples:
+
+ - a single chapter
+ - a cited appendix
+ - a page range from the user
+ - a specific page mentioned in an error report or screenshot request
+
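When the relevant pages arrive as a loose list of numbers, they can be collapsed into a compact range string before being passed on. The sketch below assumes `targetPages` accepts comma-separated pages and ranges such as `"1-3,7"`; the skill text does not spell out the syntax, so check the tool's schema before relying on it. `to_target_pages` is a hypothetical helper.

```python
def to_target_pages(pages):
    """Collapse page numbers into a range string, e.g. [3, 1, 2, 7] -> "1-3,7".

    The output syntax is an assumption about what `targetPages` accepts.
    """
    ordered = sorted(set(pages))
    if not ordered:
        return ""
    ranges = []
    start = prev = ordered[0]
    for page in ordered[1:]:
        if page == prev + 1:
            # Still inside a consecutive run; extend it.
            prev = page
            continue
        ranges.append((start, prev))
        start = prev = page
    ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)
```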
+ ### Use OCR deliberately
+
+ - Use `ocr: "off"` for native-text PDFs when OCR is unnecessary.
+ - Leave OCR on automatic behavior for scanned PDFs or image-heavy documents.
+ - Use `ocrLanguage` for a single OCR language.
+ - Use `ocrLanguages` only when you truly need multilingual OCR.
+ - Built-in Tesseract usually expects ISO 639-3 codes such as `eng`, `deu`, `fra`, or `jpn`.
+ - Many HTTP OCR servers instead expect ISO 639-1 codes such as `en`, `de`, `fr`, or `ja`.
+ - Increase `dpi` only when OCR quality or screenshot readability really needs it.
+ - Use `ocrServerUrl` only when the user already has or wants an external OCR server.
+
+ Note: on the first OCR run, LiteParse may download Tesseract language/model data.
+
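The two language-code conventions are easy to mix up, so a small translation step can help. The following is a minimal sketch that covers only the four languages named above. `ocr_language_args` is a hypothetical helper, and passing `ocrLanguages` as a list is an assumption about the tool's schema.

```python
# ISO 639-3 (built-in Tesseract) to ISO 639-1 (many HTTP OCR servers).
# Only the codes mentioned in this skill are listed; extend as needed.
ISO_639_3_TO_1 = {"eng": "en", "deu": "de", "fra": "fr", "jpn": "ja"}


def ocr_language_args(codes, uses_http_server=False):
    """Build OCR language arguments from ISO 639-3 codes.

    Assumption: `ocrLanguages` takes a list of codes.
    """
    codes = list(codes)
    if not codes:
        return {}  # leave language selection to the tool's default
    if uses_http_server:
        unknown = [code for code in codes if code not in ISO_639_3_TO_1]
        if unknown:
            raise ValueError(f"no ISO 639-1 mapping for: {unknown}")
        codes = [ISO_639_3_TO_1[code] for code in codes]
    if len(codes) == 1:
        return {"ocrLanguage": codes[0]}
    return {"ocrLanguages": codes}
```

A single language maps to `ocrLanguage`, so the multilingual parameter only appears when more than one code is actually requested.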
+ ### Use screenshots only when text is not enough
+
+ Request `screenshotPages` when the task depends on visual layout, charts, handwriting, dense tables, or page appearance.
+
+ Do not request screenshots by default.
+
+ Important constraints:
+
+ - screenshots are currently generated only for PDF inputs
+ - screenshot output is PNG-based in this extension
+ - use `screenshotPages: "all"` to render all PDF pages when needed
+
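Because screenshots are limited to PDF inputs, a guard before adding `screenshotPages` avoids requests that cannot be fulfilled. This is a sketch under the assumption that the input type can be judged from the file extension; `with_screenshots` is a hypothetical helper.

```python
from pathlib import PurePath


def with_screenshots(args, pages):
    """Return a copy of args with screenshotPages set, but only for PDF inputs."""
    if PurePath(args["path"]).suffix.lower() != ".pdf":
        # Non-PDF input: leave the request unchanged.
        return dict(args)
    return {**args, "screenshotPages": pages}
```

The helper returns a new dictionary rather than mutating its input, so the original text-only request stays reusable.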
+ ## Follow-up workflow
+
+ The tool writes parsed output to temporary files and returns their paths.
+
+ After calling it:
+
+ 1. inspect the returned parsed output path with `read`
+ 2. inspect screenshot paths with `read` when visual review is needed
+ 3. only copy files into the project if the user actually wants persistent artifacts
+
+ Do not ask the tool to inline an entire large document into context. Let it save the full result, then inspect the returned files selectively.
+
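Selective inspection can also be scripted when a saved text output is large. The sketch below is an illustration of the idea, not a replacement for `read`: `find_lines` is a hypothetical helper that streams the file and stops once enough matches are found, so the whole document never has to be loaded at once.

```python
def find_lines(path, keyword, limit=20):
    """Return up to `limit` (line_number, text) pairs that contain keyword.

    Matching is case-insensitive; line numbers start at 1.
    """
    hits = []
    needle = keyword.lower()
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if needle in line.lower():
                hits.append((number, line.rstrip("\n")))
                if len(hits) >= limit:
                    break
    return hits
```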
+ ## Important constraints and expectations
+
+ - Office documents and spreadsheets may require LibreOffice on the host machine.
+ - Image inputs may require ImageMagick on the host machine.
+ - The tool surfaces those missing dependencies as friendly errors; do not misdiagnose them as generic parser failures.
+ - Parsed outputs are temporary by default. If the user wants durable files in the repo or a chosen folder, copy the returned temp files afterward.
+ - Screenshot output is PNG-based in this extension.
+ - The tool accepts PI-style paths such as `@relative/file.pdf` and `~/Documents/file.pdf`.
+
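To interpret a dependency error correctly, it helps to know which host tool a given input is likely to need. The mapping below is a rough sketch based only on the file types named in this skill. The extension sets and `likely_dependency` are illustrative assumptions, not the tool's actual detection logic.

```python
from pathlib import PurePath

# Extensions drawn from the formats this skill mentions; not exhaustive.
OFFICE_SUFFIXES = {".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx", ".csv"}
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".webp"}


def likely_dependency(path):
    """Name the host dependency an input may require, or None for PDFs and others."""
    suffix = PurePath(path).suffix.lower()
    if suffix in OFFICE_SUFFIXES:
        return "LibreOffice"
    if suffix in IMAGE_SUFFIXES:
        return "ImageMagick"
    return None
```

If parsing `deck.pptx` fails, checking for LibreOffice first is more productive than treating the failure as a parser bug.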
+ ## Good default patterns
+
+ ### Summarize or review a document
+
+ Use:
+
+ - `format: "text"`
+ - `targetPages` if only part of the document matters
+ - `ocr: "off"` for clearly native-text PDFs
+
+ Then inspect the returned text file with `read`.
+
+ ### Extract positions or bounding boxes
+
+ Use:
+
+ - `format: "json"`
+ - `targetPages` when possible
+
+ Then inspect the JSON file with `read`.
+
+ ### Review a visually complex page
+
+ Use:
+
+ - text or JSON parsing as needed
+ - `screenshotPages` for the relevant PDF pages
+ - higher `dpi` only if readability is a problem
+
+ Then inspect the generated screenshots with `read`.
+
+ ### Parse a scanned or image-based document
+
+ Use:
+
+ - OCR enabled
+ - `ocrLanguage` or `ocrLanguages` when the document language is known
+ - optionally higher `dpi`
+ - JSON only if positional data matters
+
+ ## Parameter reminders
+
+ The tool supports these high-value parameters:
+
+ - `path`
+ - `format`
+ - `targetPages`
+ - `screenshotPages`
+ - `ocr`
+ - `ocrLanguage`
+ - `ocrLanguages`
+ - `ocrServerUrl`
+ - `numWorkers`
+ - `maxPages`
+ - `dpi`
+ - `preciseBoundingBox`
+ - `preserveSmallText`
+ - `preserveLayoutAlignmentAcrossPages`
+
+ Prefer a minimal parameter set. Only add advanced options when the task clearly benefits from them.
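
One way to keep requests minimal is to drop every option that was not explicitly set and to reject misspelled names early. This is a hedged sketch: `minimal_args` is a hypothetical helper, the name set mirrors the list above, and treating `path` as required is an assumption.

```python
KNOWN_PARAMETERS = {
    "path", "format", "targetPages", "screenshotPages", "ocr",
    "ocrLanguage", "ocrLanguages", "ocrServerUrl", "numWorkers",
    "maxPages", "dpi", "preciseBoundingBox", "preserveSmallText",
    "preserveLayoutAlignmentAcrossPages",
}


def minimal_args(**options):
    """Keep only explicitly set options; reject names outside the known list."""
    unknown = sorted(set(options) - KNOWN_PARAMETERS)
    if unknown:
        raise ValueError(f"unknown parameters: {unknown}")
    if options.get("path") is None:
        # Assumption: a request without a path is never useful.
        raise ValueError("path is required")
    return {name: value for name, value in options.items() if value is not None}
```

For example, `minimal_args(path="report.pdf", format="text", dpi=None)` yields a request with only `path` and `format`, leaving `dpi` at the tool's default.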