@opendataloader/pdf 1.0.1 → 1.0.2
- package/README.md +28 -28
- package/lib/opendataloader-pdf-cli.jar +0 -0
- package/package.json +1 -1
package/README.md
CHANGED
@@ -83,20 +83,20 @@ opendataloader_pdf.run(

 The main function to process PDFs.

-| Parameter | Type | Required | Default | Description
-|--------------------------| ------ | --------
-| `input_path` | `str` | ✅ Yes | — | Path to the input PDF file or folder.
-| `output_folder` | `str` | No | input folder | Path to the output folder.
-| `password` | `str` | No | `None` | Password for the PDF file.
-| `replace_invalid_chars` | `str` | No | `" "` | Character to replace invalid or unrecognized characters (e.g., �, \u0000)
-| `content_safety_off` | `str` | No | `None` | Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page, tiny. |
-| `generate_markdown` | `bool` | No | `False` | If `True`, generates a Markdown output file.
-| `generate_html` | `bool` | No | `False` | If `True`, generates an HTML output file.
-| `generate_annotated_pdf` | `bool` | No | `False` | If `True`, generates an annotated PDF output file.
-| `keep_line_breaks` | `bool` | No | `False` | If `True`, keeps line breaks in the output.
-| `html_in_markdown` | `bool` | No | `False` | If `True`, uses HTML in the Markdown output.
-| `add_image_to_markdown` | `bool` | No | `False` | If `True`, adds images to the Markdown output.
-| `debug` | `bool` | No | `False` | If `True`, prints CLI messages to the console during execution.
+| Parameter | Type | Required | Default | Description |
+|--------------------------| ------ | -------- |--------------|---------------------------------------------------------------------------------------------------------------------------------------------|
+| `input_path` | `str` | ✅ Yes | — | Path to the input PDF file or folder. |
+| `output_folder` | `str` | No | input folder | Path to the output folder. |
+| `password` | `str` | No | `None` | Password for the PDF file. |
+| `replace_invalid_chars` | `str` | No | `" "` | Character to replace invalid or unrecognized characters (e.g., �, \u0000) |
+| `content_safety_off` | `str` | No | `None` | Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page, tiny, hidden-ocg. |
+| `generate_markdown` | `bool` | No | `False` | If `True`, generates a Markdown output file. |
+| `generate_html` | `bool` | No | `False` | If `True`, generates an HTML output file. |
+| `generate_annotated_pdf` | `bool` | No | `False` | If `True`, generates an annotated PDF output file. |
+| `keep_line_breaks` | `bool` | No | `False` | If `True`, keeps line breaks in the output. |
+| `html_in_markdown` | `bool` | No | `False` | If `True`, uses HTML in the Markdown output. |
+| `add_image_to_markdown` | `bool` | No | `False` | If `True`, adds images to the Markdown output. |
+| `debug` | `bool` | No | `False` | If `True`, prints CLI messages to the console during execution. |

 <br/>

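Because `content_safety_off` is a single comma-separated string, a typo in a filter name is easy to miss. The following is a minimal sketch of a guard that builds the string from a list, assuming the only valid names are the five listed in the 1.0.2 table row. `KNOWN_FILTERS` and `build_content_safety_off` are hypothetical names introduced for illustration and are not part of the package.

```python
# Filter names copied from the 1.0.2 README row for `content_safety_off`.
KNOWN_FILTERS = ("all", "hidden-text", "off-page", "tiny", "hidden-ocg")


def build_content_safety_off(filters):
    """Join filter names into one comma-separated string.

    Hypothetical helper: raises ValueError for names the README does not list.
    """
    cleaned = [name.strip() for name in filters]
    unknown = [name for name in cleaned if name not in KNOWN_FILTERS]
    if unknown:
        raise ValueError(f"unknown content safety filter(s): {', '.join(unknown)}")
    # Drop duplicates while keeping the caller's order.
    return ",".join(dict.fromkeys(cleaned))


value = build_content_safety_off(["hidden-text", "hidden-ocg"])
print(value)

# Assumed usage with the documented parameters (requires the package itself):
# opendataloader_pdf.run(input_path="document.pdf", content_safety_off=value)
```

The call to `opendataloader_pdf.run` is left commented out so the sketch runs without the package installed. The keyword-argument form is an assumption based on the parameter names in the table.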
@@ -155,19 +155,19 @@ The main function to process PDFs.

 **RunOptions**

-| Property | Type | Default | Description
-| ----------------------- | --------- | -------------
-| `outputFolder` | `string` | `undefined` | Path to the output folder. If not set, output is saved next to the input.
-| `password` | `string` | `undefined` | Password for the PDF file.
-| `replaceInvalidChars` | `string` | `" "` | Character to replace invalid or unrecognized characters (e.g., , \u0000).
-| `contentSafetyOff` | `string` | `undefined` | Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page, tiny. |
-| `generateMarkdown` | `boolean` | `false` | If `true`, generates a Markdown output file.
-| `generateHtml` | `boolean` | `false` | If `true`, generates an HTML output file.
-| `generateAnnotatedPdf` | `boolean` | `false` | If `true`, generates an annotated PDF output file.
-| `keepLineBreaks` | `boolean` | `false` | If `true`, keeps line breaks in the output.
-| `htmlInMarkdown` | `boolean` | `false` | If `true`, uses HTML in the Markdown output.
-| `addImageToMarkdown` | `boolean` | `false` | If `true`, adds images to the Markdown output.
-| `debug` | `boolean` | `false` | If `true`, prints CLI messages to the console during execution.
+| Property | Type | Default | Description |
+| ----------------------- | --------- | ------------- |-------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `outputFolder` | `string` | `undefined` | Path to the output folder. If not set, output is saved next to the input. |
+| `password` | `string` | `undefined` | Password for the PDF file. |
+| `replaceInvalidChars` | `string` | `" "` | Character to replace invalid or unrecognized characters (e.g., , \u0000). |
+| `contentSafetyOff` | `string` | `undefined` | Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page, tiny, hidden-ocg. |
+| `generateMarkdown` | `boolean` | `false` | If `true`, generates a Markdown output file. |
+| `generateHtml` | `boolean` | `false` | If `true`, generates an HTML output file. |
+| `generateAnnotatedPdf` | `boolean` | `false` | If `true`, generates an annotated PDF output file. |
+| `keepLineBreaks` | `boolean` | `false` | If `true`, keeps line breaks in the output. |
+| `htmlInMarkdown` | `boolean` | `false` | If `true`, uses HTML in the Markdown output. |
+| `addImageToMarkdown` | `boolean` | `false` | If `true`, adds images to the Markdown output. |
+| `debug` | `boolean` | `false` | If `true`, prints CLI messages to the console during execution. |

 <br/>

@@ -309,7 +309,7 @@ The images are extracted from PDF as individual files and stored in a subfolder
 Options:
 -o,--output-dir <arg> Specifies the output directory for generated files
 --keep-line-breaks Preserves original line breaks in the extracted text
---content-safety-off <arg> Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page, tiny
+--content-safety-off <arg> Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page, tiny, hidden-ocg
 --markdown-with-html Sets the data extraction output format to Markdown with rendering complex elements like tables as HTML for better structure
 --markdown-with-images Sets the data extraction output format to Markdown with extracting images from the PDF and includes them as links
 --markdown Sets the data extraction output format to Markdown
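As an illustration of how the updated `--content-safety-off` argument combines with the other flags in this listing, here is a small sketch that assembles the option tokens. `build_cli_options` is a hypothetical helper, not part of the package. It only emits flags that appear in the listing above and deliberately omits the launcher command and the input path, since this excerpt does not show them.

```python
import shlex


def build_cli_options(output_dir=None, keep_line_breaks=False,
                      content_safety_off=None, markdown=False):
    """Assemble option tokens using only flags shown in the Options listing."""
    args = []
    if output_dir is not None:
        args += ["--output-dir", output_dir]
    if keep_line_breaks:
        args.append("--keep-line-breaks")
    if content_safety_off:
        # The CLI takes one comma-separated argument, e.g. "tiny,hidden-ocg".
        args += ["--content-safety-off", ",".join(content_safety_off)]
    if markdown:
        args.append("--markdown")
    return args


options = build_cli_options(
    output_dir="out",
    content_safety_off=["hidden-text", "hidden-ocg"],
    markdown=True,
)
# shlex.join quotes tokens where needed so the line is safe to paste into a shell.
print(shlex.join(options))
```

Keeping the tokens in a list rather than one string avoids quoting problems if the list is later handed to `subprocess.run` together with whatever launcher the installation uses.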
Binary file