@opendataloader/pdf 1.0.6 → 1.1.1
- package/README.md +37 -35
- package/lib/opendataloader-pdf-cli.jar +0 -0
- package/package.json +1 -1
package/README.md
CHANGED
@@ -75,47 +75,42 @@ pip install -U opendataloader-pdf
 
 ### Usage
 
-
-- If you don’t specify an output_folder, the output data will be saved in the same directory as the input document.
+input_path can be either the path to a single document or the path to a folder.
 
 ```python
 import opendataloader_pdf
 
-opendataloader_pdf.
-input_path="path/to/document.pdf",
-
-
-generate_html=True,
-generate_annotated_pdf=True,
-debug=True,
+opendataloader_pdf.convert(
+input_path=["path/to/document.pdf", "path/to/folder"],
+output_dir="path/to/output",
+format=["json", "html", "pdf", "markdown"]
 )
 ```
 
-
+If you want to run it via CLI, you can use the following command on the terminal:
 
-```
-opendataloader-pdf path/to/document.pdf
+```bash
+opendataloader-pdf path/to/document.pdf path/to/folder -o path/to/output -f json html pdf markdown
 ```
 
-### Function:
+### Function: convert()
 
 The main function to process PDFs.
 
-| Parameter | Type
-
-| `input_path` | `str`
-| `
-| `password` | `str`
-| `
-| `
-| `
-| `
-| `
-
-
-
-
-| `debug` | `bool` | No | `False` | If `True`, prints CLI messages to the console during execution. |
+| Parameter | Type | Required | Default | Description |
+|--------------------------|----------------| -------- |--------------|---------------------------------------------------------------------------------------------------------------------------------------------|
+| `input_path` | `List[str]` | ✅ Yes | — | One or more PDF file paths or directories to process. |
+| `output_dir` | `Optional[str]` | No | input folder | Directory where outputs are written. |
+| `password` | `Optional[str]` | No | `None` | Password used for encrypted PDFs. |
+| `format` | `Optional[List[str]]` | No | `None` | Output formats to generate (e.g. `"json"`, `"html"`, `"pdf"`, `"text"`, `"markdown"`, `"markdown-with-html"`, `"markdown-with-images"`). |
+| `quiet` | `bool` | No | `False` | Suppresses CLI logging output when `True`. |
+| `content_safety_off` | `Optional[List[str]]` | No | `None` | List of content safety filters to disable (e.g. `"all"`, `"hidden-text"`, `"off-page"`, `"tiny"`, `"hidden-ocg"`). |
+| `keep_line_breaks` | `bool` | No | `False` | Preserves line breaks in text output when `True`. |
+| `replace_invalid_chars` | `Optional[str]` | No | `None` | Replacement character for invalid or unrecognized characters (e.g., �, `\u0000`). |
+
+### Function: run()
+
+Deprecated.
 
 <br/>
 
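Reviewer note: the `convert()` parameters added in this hunk share their names with the CLI flags shown in the README's own command-line example and option list. The sketch below illustrates that correspondence; it is not code from the package. `build_cli_args` is a hypothetical helper. The flag spellings and the space-separated `-f` values are taken from the README text in this diff. `content_safety_off` is left out because the diff does not show how its list is written on the command line.

```python
import shlex
from typing import List, Optional


def build_cli_args(
    input_path: List[str],
    output_dir: Optional[str] = None,
    password: Optional[str] = None,
    format: Optional[List[str]] = None,
    quiet: bool = False,
    keep_line_breaks: bool = False,
    replace_invalid_chars: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: assemble an opendataloader-pdf argument list."""
    # Input paths come first, as in the README's CLI example.
    args = ["opendataloader-pdf", *input_path]
    if output_dir is not None:
        args += ["-o", output_dir]
    if password is not None:
        args += ["-p", password]
    if format:
        # The README example passes formats as space-separated values.
        args += ["-f", *format]
    if quiet:
        args.append("-q")
    if keep_line_breaks:
        args.append("--keep-line-breaks")
    if replace_invalid_chars is not None:
        args += ["--replace-invalid-chars", replace_invalid_chars]
    return args


example = build_cli_args(
    ["path/to/document.pdf", "path/to/folder"],
    output_dir="path/to/output",
    format=["json", "html", "pdf", "markdown"],
)
print(shlex.join(example))
```

With these arguments the printed command matches the CLI example added to the README in this release.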
@@ -346,16 +341,23 @@ The images are extracted from PDF as individual files and stored in a subfolder
 ```
 Options:
 -o,--output-dir <arg> Specifies the output directory for generated files
+-p,--password <arg> Specifies the password for an encrypted PDF
+-f,--format <arg> List of output formats to generate (json, text, html, pdf, markdown, markdown-with-html, markdown-with-images). Default: json
+-q,--quiet Suppresses console logging output
+--content-safety-off <arg> Disables one or more content safety filters. Accepts a list of filter names. Arguments: all, hidden-text, off-page, tiny, hidden-ocg
 --keep-line-breaks Preserves original line breaks in the extracted text
---
-
-
-
-
+--replace-invalid-chars <arg> Replaces invalid or unrecognized characters (e.g., �, \u0000) with the specified character
+```
+
+The legacy options (for backward compatibility):
+
+```
 --no-json Disables the JSON output format
-
+--html Sets the data extraction output format to HTML
 --pdf Generates a new PDF file where the extracted layout data is visualized as annotations
---
+--markdown Sets the data extraction output format to Markdown
+--markdown-with-html Sets the data extraction output format to Markdown with rendering complex elements like tables as HTML for better structure
+--markdown-with-images Sets the data extraction output format to Markdown with extracting images from the PDF and includes them as links
 ```
 
 ### Schema of the JSON output
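Reviewer note: the legacy flags kept for backward compatibility appear to correspond to values of the new `-f,--format` option. The sketch below shows one way to translate them when migrating scripts. It rests on assumptions that the diff does not confirm: JSON is emitted by default unless `--no-json` is given, and each legacy flag adds one format and does not replace the others. `legacy_flags_to_formats` is a hypothetical name and is not part of the package.

```python
from typing import Dict, List

# Legacy flag -> value accepted by -f/--format (names taken from the option list).
LEGACY_TO_FORMAT: Dict[str, str] = {
    "--html": "html",
    "--pdf": "pdf",
    "--markdown": "markdown",
    "--markdown-with-html": "markdown-with-html",
    "--markdown-with-images": "markdown-with-images",
}


def legacy_flags_to_formats(flags: List[str]) -> List[str]:
    """Hypothetical helper: translate legacy flags into a --format value list."""
    # Assumption: JSON is on by default and only --no-json turns it off.
    formats: List[str] = [] if "--no-json" in flags else ["json"]
    for flag in flags:
        if flag == "--no-json":
            continue
        if flag not in LEGACY_TO_FORMAT:
            raise ValueError(f"not a legacy format flag: {flag}")
        name = LEGACY_TO_FORMAT[flag]
        if name not in formats:  # keep first occurrence, drop duplicates
            formats.append(name)
    return formats


print(legacy_flags_to_formats(["--html", "--pdf"]))
print(legacy_flags_to_formats(["--no-json", "--markdown"]))
```

If the assumptions hold, a script that previously passed `--html --pdf` would pass `-f json html pdf` after migrating.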
Binary file