npm - @opendataloader/pdf - Versions diffs - 1.0.6 → 1.1.0 - Mend

@opendataloader/pdf 1.0.6 → 1.1.0

Files changed (3)

package/README.md +37 -35
package/lib/opendataloader-pdf-cli.jar +0 -0
package/package.json +1 -1

package/README.md CHANGED

@@ -75,47 +75,42 @@ pip install -U opendataloader-pdf
 ### Usage
-- input_path can be either the path to a single document or the path to a folder.
-- If you don’t specify an output_folder, the output data will be saved in the same directory as the input document.
+input_path can be either the path to a single document or the path to a folder.
 ```python
 import opendataloader_pdf
-opendataloader_pdf.run(
-    input_path="path/to/document.pdf",
-    output_folder="path/to/output",
-    generate_markdown=True,
-    generate_html=True,
-    generate_annotated_pdf=True,
-    debug=True,
+opendataloader_pdf.convert(
+    input_path=["path/to/document.pdf", "path/to/folder"],
+    output_dir="path/to/output",
+    format=["json", "html", "pdf", "markdown"]
 )
 ```
-- If you want to run it via CLI, you can use the following command:
+If you want to run it via CLI, you can use the following command on the terminal:
-```sh
-opendataloader-pdf path/to/document.pdf --markdown --html --pdf
+```bash
+opendataloader-pdf path/to/document.pdf path/to/folder -o path/to/output -f json html pdf markdown
 ```
-### Function: run()
+### Function: convert()
 The main function to process PDFs.
-| Parameter                | Type   | Required | Default      | Description                                                                                                                                 |
-|--------------------------| ------ | -------- |--------------|---------------------------------------------------------------------------------------------------------------------------------------------|
-| `input_path`             | `str`  | ✅ Yes    | —            | Path to the input PDF file or folder.                                                                                                       |
-| `output_folder`          | `str`  | No       | input folder | Path to the output folder.                                                                                                                  |
-| `password`               | `str`  | No       | `None`       | Password for the PDF file.                                                                                                                  |
-| `replace_invalid_chars`  | `str`  | No       | `" "`       | Character to replace invalid or unrecognized characters (e.g., �, \u0000)                                                                   |
-| `content_safety_off`     | `str`  | No       | `None`       | Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page, tiny, hidden-ocg. |
-| `generate_markdown`      | `bool` | No       | `False`      | If `True`, generates a Markdown output file.                                                                                                |
-| `generate_html`          | `bool` | No       | `False`      | If `True`, generates an HTML output file.                                                                                                   |
-| `generate_annotated_pdf` | `bool` | No       | `False`      | If `True`, generates an annotated PDF output file.                                                                                          |
-| `keep_line_breaks`       | `bool` | No       | `False`      | If `True`, keeps line breaks in the output.                                                                                                 |
-| `html_in_markdown`       | `bool` | No       | `False`      | If `True`, uses HTML in the Markdown output.                                                                                                |
-| `add_image_to_markdown`  | `bool` | No       | `False`      | If `True`, adds images to the Markdown output.                                                                                              |
-| `no_json`                | `bool` | No       | `False`      | If `True`, disables the JSON output.                                                                                                        |
-| `debug`                  | `bool` | No       | `False`      | If `True`, prints CLI messages to the console during execution.                                                                             |
+| Parameter                | Type           | Required | Default      | Description                                                                                                                                 |
+|--------------------------|----------------| -------- |--------------|---------------------------------------------------------------------------------------------------------------------------------------------|
+| `input_path`             | `List[str]`     | ✅ Yes    | —            | One or more PDF file paths or directories to process.                                                                                       |
+| `output_dir`             | `Optional[str]` | No       | input folder | Directory where outputs are written.                                                                                                       |
+| `password`               | `Optional[str]` | No       | `None`       | Password used for encrypted PDFs.                                                                                                           |
+| `format`                 | `Optional[List[str]]` | No | `None`       | Output formats to generate (e.g. `"json"`, `"html"`, `"pdf"`, `"text"`, `"markdown"`, `"markdown-with-html"`, `"markdown-with-images"`).                                                             |
+| `quiet`                  | `bool`          | No       | `False`      | Suppresses CLI logging output when `True`.                                                                                                  |
+| `content_safety_off`     | `Optional[List[str]]` | No | `None`       | List of content safety filters to disable (e.g. `"all"`, `"hidden-text"`, `"off-page"`, `"tiny"`, `"hidden-ocg"`).                      |
+| `keep_line_breaks`       | `bool`          | No       | `False`      | Preserves line breaks in text output when `True`.                                                                                           |
+| `replace_invalid_chars`  | `Optional[str]` | No       | `None`       | Replacement character for invalid or unrecognized characters (e.g., �, `\u0000`).                                                           |
+### Function: run()
+Deprecated.
 <br/>
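
The hunk above replaces the boolean `generate_*` parameters of `run()` with a single `format` list on `convert()`. A minimal sketch of how existing call sites might translate their old flags is shown below. The helper name `legacy_flags_to_formats` is hypothetical and not part of the package, and the mapping of `html_in_markdown` and `add_image_to_markdown` to `"markdown-with-html"` and `"markdown-with-images"` is an assumption inferred from the format names in the new table.

```python
from typing import List


def legacy_flags_to_formats(
    generate_markdown: bool = False,
    generate_html: bool = False,
    generate_annotated_pdf: bool = False,
    html_in_markdown: bool = False,
    add_image_to_markdown: bool = False,
    no_json: bool = False,
) -> List[str]:
    """Translate 1.0.6-style run() flags into a 1.1.0-style format list.

    Hypothetical helper; the markdown variant mapping is an assumption.
    """
    formats: List[str] = []
    # JSON was produced unless explicitly disabled via no_json.
    if not no_json:
        formats.append("json")
    if generate_html:
        formats.append("html")
    # The annotated PDF output is named "pdf" in the new format list.
    if generate_annotated_pdf:
        formats.append("pdf")
    # Assumed mapping: each markdown modifier selects its own variant.
    if html_in_markdown:
        formats.append("markdown-with-html")
    if add_image_to_markdown:
        formats.append("markdown-with-images")
    # Plain markdown only when no variant was requested.
    if generate_markdown and not (html_in_markdown or add_image_to_markdown):
        formats.append("markdown")
    return formats
```

The returned list could then be passed as the `format` argument of `convert()`, alongside `input_path` and `output_dir` as listed in the new parameter table.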
@@ -346,16 +341,23 @@ The images are extracted from PDF as individual files and stored in a subfolder
 ```
 Options:
 -o,--output-dir <arg>           Specifies the output directory for generated files
+-p,--password <arg>             Specifies the password for an encrypted PDF
+-f,--format <arg>               List of output formats to generate (json, text, html, pdf, markdown, markdown-with-html, markdown-with-images). Default: json
+-q,--quiet                      Suppresses console logging output
+--content-safety-off <arg>      Disables one or more content safety filters. Accepts a list of filter names. Arguments: all, hidden-text, off-page, tiny, hidden-ocg
 --keep-line-breaks              Preserves original line breaks in the extracted text
---content-safety-off <arg>      Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page, tiny, hidden-ocg
---markdown-with-html            Sets the data extraction output format to Markdown with rendering complex elements like tables as HTML for better structure
---markdown-with-images          Sets the data extraction output format to Markdown with extracting images from the PDF and includes them as links
---markdown                      Sets the data extraction output format to Markdown
---html                          Sets the data extraction output format to HTML
+--replace-invalid-chars <arg>   Replaces invalid or unrecognized characters (e.g., �, \u0000) with the specified character
+```
+The legacy options (for backward compatibility):
+```
 --no-json                       Disables the JSON output format
--p,--password <arg>             Specifies the password for an encrypted PDF
+--html                          Sets the data extraction output format to HTML
 --pdf                           Generates a new PDF file where the extracted layout data is visualized as annotations
---replace-invalid-chars <arg>   Replaces invalid or unrecognized characters (e.g., �, \u0000) with the specified character
+--markdown                      Sets the data extraction output format to Markdown
+--markdown-with-html            Sets the data extraction output format to Markdown with rendering complex elements like tables as HTML for better structure
+--markdown-with-images          Sets the data extraction output format to Markdown with extracting images from the PDF and includes them as links
 ```
 ### Schema of the JSON output
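
The new options in this hunk (`-o`, `-f`, `-p`, `-q`) can be assembled programmatically. The sketch below builds an argument list in the shape of the README's 1.1.0 example command. The function name `build_cli_args` is hypothetical, and the ordering of options after the input paths is an assumption based on that single example. The sketch only constructs the command and does not run it.

```python
import shlex
from typing import List, Optional, Sequence


def build_cli_args(
    input_paths: Sequence[str],
    output_dir: Optional[str] = None,
    formats: Optional[Sequence[str]] = None,
    password: Optional[str] = None,
    quiet: bool = False,
    keep_line_breaks: bool = False,
) -> List[str]:
    """Build an argv list for the opendataloader-pdf CLI (hypothetical helper)."""
    # Input files and folders come first, as in the README example.
    args: List[str] = ["opendataloader-pdf", *input_paths]
    if output_dir:
        args += ["-o", output_dir]
    if formats:
        # The README example passes formats as space-separated values after -f.
        args += ["-f", *formats]
    if password:
        args += ["-p", password]
    if quiet:
        args.append("-q")
    if keep_line_breaks:
        args.append("--keep-line-breaks")
    return args


argv = build_cli_args(
    ["path/to/document.pdf", "path/to/folder"],
    output_dir="path/to/output",
    formats=["json", "html", "pdf", "markdown"],
)
# shlex.join renders the list as a shell-safe command string.
command = shlex.join(argv)
```

Keeping the command as a list until the last moment avoids quoting problems with paths that contain spaces; the list can be handed to `subprocess.run` directly if the CLI is installed.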

package/lib/opendataloader-pdf-cli.jar CHANGED

Binary file

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@opendataloader/pdf",
-  "version": "1.0.6",
+  "version": "1.1.0",
   "description": "A Node.js wrapper for the opendataloader-pdf Java CLI.",
   "main": "./dist/index.cjs",
   "module": "./dist/index.js",