@opendataloader/pdf 1.0.6 → 1.1.0

package/README.md CHANGED
@@ -75,47 +75,42 @@ pip install -U opendataloader-pdf
 
  ### Usage
 
- - input_path can be either the path to a single document or the path to a folder.
- - If you don’t specify an output_folder, the output data will be saved in the same directory as the input document.
+ input_path can be either the path to a single document or the path to a folder.
 
  ```python
  import opendataloader_pdf
 
- opendataloader_pdf.run(
- input_path="path/to/document.pdf",
- output_folder="path/to/output",
- generate_markdown=True,
- generate_html=True,
- generate_annotated_pdf=True,
- debug=True,
+ opendataloader_pdf.convert(
+ input_path=["path/to/document.pdf", "path/to/folder"],
+ output_dir="path/to/output",
+ format=["json", "html", "pdf", "markdown"]
  )
  ```
 
- - If you want to run it via CLI, you can use the following command:
+ If you want to run it via CLI, you can use the following command on the terminal:
 
- ```sh
- opendataloader-pdf path/to/document.pdf --markdown --html --pdf
+ ```bash
+ opendataloader-pdf path/to/document.pdf path/to/folder -o path/to/output -f json html pdf markdown
  ```
 
- ### Function: run()
+ ### Function: convert()
 
  The main function to process PDFs.
 
- | Parameter | Type | Required | Default | Description |
- |--------------------------| ------ | -------- |--------------|---------------------------------------------------------------------------------------------------------------------------------------------|
- | `input_path` | `str` | ✅ Yes | — | Path to the input PDF file or folder. |
- | `output_folder` | `str` | No | input folder | Path to the output folder. |
- | `password` | `str` | No | `None` | Password for the PDF file. |
- | `replace_invalid_chars` | `str` | No | `" "` | Character to replace invalid or unrecognized characters (e.g., �, \u0000) |
- | `content_safety_off` | `str` | No | `None` | Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page, tiny, hidden-ocg. |
- | `generate_markdown` | `bool` | No | `False` | If `True`, generates a Markdown output file. |
- | `generate_html` | `bool` | No | `False` | If `True`, generates an HTML output file. |
- | `generate_annotated_pdf` | `bool` | No | `False` | If `True`, generates an annotated PDF output file. |
- | `keep_line_breaks` | `bool` | No | `False` | If `True`, keeps line breaks in the output. |
- | `html_in_markdown` | `bool` | No | `False` | If `True`, uses HTML in the Markdown output. |
- | `add_image_to_markdown` | `bool` | No | `False` | If `True`, adds images to the Markdown output. |
- | `no_json` | `bool` | No | `False` | If `True`, disables the JSON output. |
- | `debug` | `bool` | No | `False` | If `True`, prints CLI messages to the console during execution. |
+ | Parameter | Type | Required | Default | Description |
+ |--------------------------|----------------| -------- |--------------|---------------------------------------------------------------------------------------------------------------------------------------------|
+ | `input_path` | `List[str]` | ✅ Yes | — | One or more PDF file paths or directories to process. |
+ | `output_dir` | `Optional[str]` | No | input folder | Directory where outputs are written. |
+ | `password` | `Optional[str]` | No | `None` | Password used for encrypted PDFs. |
+ | `format` | `Optional[List[str]]` | No | `None` | Output formats to generate (e.g. `"json"`, `"html"`, `"pdf"`, `"text"`, `"markdown"`, `"markdown-with-html"`, `"markdown-with-images"`). |
+ | `quiet` | `bool` | No | `False` | Suppresses CLI logging output when `True`. |
+ | `content_safety_off` | `Optional[List[str]]` | No | `None` | List of content safety filters to disable (e.g. `"all"`, `"hidden-text"`, `"off-page"`, `"tiny"`, `"hidden-ocg"`). |
+ | `keep_line_breaks` | `bool` | No | `False` | Preserves line breaks in text output when `True`. |
+ | `replace_invalid_chars` | `Optional[str]` | No | `None` | Replacement character for invalid or unrecognized characters (e.g., �, `\u0000`). |
+
+ ### Function: run()
+
+ Deprecated.
 
  <br/>
 
@@ -346,16 +341,23 @@ The images are extracted from PDF as individual files and stored in a subfolder
  ```
  Options:
  -o,--output-dir <arg> Specifies the output directory for generated files
+ -p,--password <arg> Specifies the password for an encrypted PDF
+ -f,--format <arg> List of output formats to generate (json, text, html, pdf, markdown, markdown-with-html, markdown-with-images). Default: json
+ -q,--quiet Suppresses console logging output
+ --content-safety-off <arg> Disables one or more content safety filters. Accepts a list of filter names. Arguments: all, hidden-text, off-page, tiny, hidden-ocg
  --keep-line-breaks Preserves original line breaks in the extracted text
- --content-safety-off <arg> Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page, tiny, hidden-ocg
- --markdown-with-html Sets the data extraction output format to Markdown with rendering complex elements like tables as HTML for better structure
- --markdown-with-images Sets the data extraction output format to Markdown with extracting images from the PDF and includes them as links
- --markdown Sets the data extraction output format to Markdown
- --html Sets the data extraction output format to HTML
+ --replace-invalid-chars <arg> Replaces invalid or unrecognized characters (e.g., �, \u0000) with the specified character
+ ```
+
+ The legacy options (for backward compatibility):
+
+ ```
  --no-json Disables the JSON output format
- -p,--password <arg> Specifies the password for an encrypted PDF
+ --html Sets the data extraction output format to HTML
  --pdf Generates a new PDF file where the extracted layout data is visualized as annotations
- --replace-invalid-chars <arg> Replaces invalid or unrecognized characters (e.g., �, \u0000) with the specified character
+ --markdown Sets the data extraction output format to Markdown
+ --markdown-with-html Sets the data extraction output format to Markdown with rendering complex elements like tables as HTML for better structure
+ --markdown-with-images Sets the data extraction output format to Markdown with extracting images from the PDF and includes them as links
  ```
 
  ### Schema of the JSON output
Binary file
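Reviewer note on the README change above: `run()` is now marked deprecated and its boolean switches are replaced by the single `format` list on `convert()`. The sketch below shows one way a caller might translate old keyword arguments into the new list. The helper name `legacy_flags_to_formats` is hypothetical and not part of the package, and the flag-to-format mapping is an assumption inferred from the parameter and option names in the two tables, not confirmed behaviour.

```python
from typing import List


def legacy_flags_to_formats(
    generate_markdown: bool = False,
    generate_html: bool = False,
    generate_annotated_pdf: bool = False,
    html_in_markdown: bool = False,
    add_image_to_markdown: bool = False,
    no_json: bool = False,
) -> List[str]:
    """Translate 1.0.6-style run() booleans into a 1.1.0-style format list.

    Hypothetical helper: the mapping below is an assumption based on the
    names in the README tables, not on the package's source code.
    """
    formats: List[str] = []
    # JSON was produced unless explicitly disabled with no_json.
    if not no_json:
        formats.append("json")
    if generate_html:
        formats.append("html")
    if generate_annotated_pdf:
        formats.append("pdf")
    if generate_markdown:
        formats.append("markdown")
    # Assumed to correspond to the separate markdown variants.
    if html_in_markdown:
        formats.append("markdown-with-html")
    if add_image_to_markdown:
        formats.append("markdown-with-images")
    return formats


if __name__ == "__main__":
    # Flags from the removed README example (debug has no format equivalent).
    print(legacy_flags_to_formats(
        generate_markdown=True,
        generate_html=True,
        generate_annotated_pdf=True,
    ))
```

The resulting list could then be passed as `format=` to `opendataloader_pdf.convert(...)` together with `input_path` and `output_dir`, as in the updated README example. For the flags used in the removed example, the helper yields the same four formats that the new example lists.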
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@opendataloader/pdf",
- "version": "1.0.6",
+ "version": "1.1.0",
  "description": "A Node.js wrapper for the opendataloader-pdf Java CLI.",
  "main": "./dist/index.cjs",
  "module": "./dist/index.js",