PyPI - opendataloader-pdf - Versions diffs - 0.0.15__py3-none-any.whl → 0.0.16__py3-none-any.whl - Mend

opendataloader-pdf 0.0.15__py3-none-any.whl → 0.0.16__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of opendataloader-pdf might be problematic.

Files changed (6)

opendataloader_pdf/jar/opendataloader-pdf-cli.jar CHANGED

Binary file

opendataloader_pdf/wrapper.py CHANGED

@@ -12,11 +12,12 @@ def run(
     input_path: str,
     output_folder: str = None,
     password: str = None,
+    replace_invalid_chars: str = None,
     generate_markdown: bool = False,
     generate_html: bool = False,
     generate_annotated_pdf: bool = False,
     keep_line_breaks: bool = False,
-    find_hidden_text: bool = False,
+    content_safety_off: str = None,
     html_in_markdown: bool = False,
     add_image_to_markdown: bool = False,
     debug: bool = False,
@@ -28,6 +29,7 @@ def run(
         input_path: Path to the input PDF file or folder.
         output_folder: Path to the output folder. Defaults to the input folder.
         password: Password for the PDF file.
+        replace_invalid_chars: Character to replace invalid or unrecognized characters (e.g., �, \u0000) with.
         generate_markdown: If True, generates a Markdown output file.
         generate_html: If True, generates an HTML output file.
         generate_annotated_pdf: If True, generates an annotated PDF output file.
@@ -49,9 +51,11 @@ def run(
     args = []
     if output_folder:
-        args.extend(["--folder", output_folder])
+        args.extend(["--output-dir", output_folder])
     if password:
         args.extend(["--password", password])
+    if replace_invalid_chars:
+        args.extend(["--replace-invalid-chars", replace_invalid_chars])
     if generate_markdown:
         args.append("--markdown")
     if generate_html:
@@ -59,13 +63,13 @@ def run(
     if generate_annotated_pdf:
         args.append("--pdf")
     if keep_line_breaks:
-        args.append("--keeplinebreaks")
-    if find_hidden_text:
-        args.append("--findhiddentext")
+        args.append("--keep-line-breaks")
+    if content_safety_off:
+        args.append(["--content-safety-off", content_safety_off])
     if html_in_markdown:
-        args.append("--htmlinmarkdown")
+        args.append("--markdown-with-html")
     if add_image_to_markdown:
-        args.append("--addimagetomarkdown")
+        args.append("--markdown-with-images")
     args.append(input_path)
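
Note on the `--content-safety-off` branch in the hunk above: `list.append` adds its argument as a single element, so passing a two-item list nests that list inside `args`, whereas the neighbouring `--password` and `--replace-invalid-chars` branches use `extend`. The sketch below reproduces the difference in isolation. `build_args` and its `use_extend` switch are hypothetical stand-ins, not the package's function, and the sketch assumes the intent was to pass the flag and its value as two separate arguments.

```python
def build_args(input_path, content_safety_off=None, use_extend=True):
    """Hypothetical, trimmed-down version of the argument building in wrapper.py."""
    args = []
    if content_safety_off:
        pair = ["--content-safety-off", content_safety_off]
        if use_extend:
            # Flag and value become two flat string items.
            args.extend(pair)
        else:
            # As written in the 0.0.16 hunk: the whole list becomes one nested item.
            args.append(pair)
    args.append(input_path)
    return args


flat = build_args("doc.pdf", "hidden-text")
nested = build_args("doc.pdf", "hidden-text", use_extend=False)
print(flat)
print(nested)
```

If the resulting list is later handed to `subprocess`, a nested list item would typically be rejected, since each argument is expected to be a string or path. How the wrapper actually invokes the JAR is not shown in this hunk.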

{opendataloader_pdf-0.0.15.dist-info → opendataloader_pdf-0.0.16.dist-info}/METADATA RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: opendataloader-pdf
-Version: 0.0.15
+Version: 0.0.16
 Summary: A Python wrapper for the opendataloader-pdf Java CLI.
 Home-page: https://github.com/opendataloader-project/opendataloader-pdf
 Author: opendataloader-project
@@ -25,11 +25,12 @@ Dynamic: summary
 # OpenDataLoader PDF
 ![Pre-release](https://img.shields.io/badge/Pre--release-FFA500&logo=github)
-[![License](https://img.shields.io/pypi/l/opendataloader-pdf.svg)](https://pypi.org/project/opendataloader-pdf/)
+[![License](https://img.shields.io/pypi/l/opendataloader-pdf.svg)](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/LICENSE)
 ![Java](https://img.shields.io/badge/Java-11+-blue.svg)
 ![Python](https://img.shields.io/badge/Python-3.8+-blue.svg)
-[![Maven Central](https://img.shields.io/maven-central/v/io.github.opendataloader-project/opendataloader-pdf-core.svg)](https://search.maven.org/artifact/io.github.opendataloader-project/opendataloader-pdf-core)
+[![Maven Central](https://img.shields.io/maven-central/v/org.opendataloader/opendataloader-pdf-core.svg)](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core)
 [![PyPI version](https://img.shields.io/pypi/v/opendataloader-pdf.svg)](https://pypi.org/project/opendataloader-pdf/)
+[![npm version](https://img.shields.io/npm/v/@opendataloader/pdf.svg)](https://www.npmjs.com/package/@opendataloader/pdf)
 [![GHCR Version](https://ghcr-badge.egpl.dev/opendataloader-project/opendataloader-pdf-cli/latest_tag?trim=major&label=docker-image)](https://github.com/opendataloader-project/opendataloader-pdf/pkgs/container/opendataloader-pdf-cli)
 [![Coverage](https://codecov.io/gh/opendataloader-project/opendataloader-pdf/branch/main/graph/badge.svg)](https://app.codecov.io/gh/opendataloader-project/opendataloader-pdf)
 [![CLA assistant](https://cla-assistant.io/readme/badge/opendataloader-project/opendataloader-pdf)](https://cla-assistant.io/opendataloader-project/opendataloader-pdf)
@@ -50,10 +51,9 @@ AI-safety is enabled by default and automatically filters likely prompt-injectio
 - 🧾 **Rich, Structured Output** — JSON, Markdown or Html
 - 🧩 **Layout Reconstruction** — Headings, Lists, Tables, Images, Reading Order
-- 🔒 **Local-First Privacy** — Runs fully on your machine
 - ⚡ **Fast & Lightweight** — Rule-Based Heuristic, High-Throughput, No GPU
-- 🛡️ **AI-Safety** — Auto-Filters likely prompt-injection content
-- 👐 **Open-Source** — Free for commercial use
+- 🔒 **Local-First Privacy** — Runs fully on your machine
+- 🛡️ **AI-Safety** — Auto-Filters likely prompt-injection content - [Learn more about AI-Safety](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/docs/AI_SAFETY.md)
 - 🖍️ **Annotated PDF Visualization** — See detected structures overlaid on the original
 [Download Annotated PDF Sample](https://raw.githubusercontent.com/opendataloader-project/opendataloader-pdf/main/resources/1901.03003_annotated.pdf)
@@ -95,8 +95,8 @@ pip install -U opendataloader-pdf
 import opendataloader_pdf
 opendataloader_pdf.run(
-    input_path="path/to/document.pdf",
-    output_folder="path/to/output",
+    input_path="path-to-document.pdf",
+    output_folder="path-to-output",
     generate_markdown=True,
     generate_html=True,
     generate_annotated_pdf=True,
@@ -107,36 +107,115 @@ opendataloader_pdf.run(
 The main function to process PDFs.
-| Parameter               | Type   | Required | Default      | Description                                                     |
-| ----------------------- | ------ | -------- | ------------ | --------------------------------------------------------------- |
-| `input_path`            | `str`  | ✅ Yes    | —            | Path to the input PDF file or folder.                           |
-| `output_folder`         | `str`  | No       | input folder | Path to the output folder.                                      |
-| `password`              | `str`  | No       | `None`       | Password for the PDF file.                                      |
-| `generate_markdown`     | `bool` | No       | `False`      | If `True`, generates a Markdown output file.                    |
-| `generate_html`         | `bool` | No       | `False`      | If `True`, generates an HTML output file.                       |
-| `generate_annotated_pdf`| `bool` | No       | `False`      | If `True`, generates an annotated PDF output file.              |
-| `keep_line_breaks`      | `bool` | No       | `False`      | If `True`, keeps line breaks in the output.                     |
-| `find_hidden_text`      | `bool` | No       | `False`      | If `True`, finds hidden text in the PDF.                        |
-| `html_in_markdown`      | `bool` | No       | `False`      | If `True`, uses HTML in the Markdown output.                    |
-| `add_image_to_markdown` | `bool` | No       | `False`      | If `True`, adds images to the Markdown output.                  |
-| `debug`                 | `bool` | No       | `False`      | If `True`, prints CLI messages to the console during execution. |
+| Parameter                | Type   | Required | Default      | Description                                                                                                                         |
+|--------------------------| ------ | -------- |--------------|-------------------------------------------------------------------------------------------------------------------------------------|
+| `input_path`             | `str`  | ✅ Yes    | —            | Path to the input PDF file or folder.                                                                                               |
+| `output_folder`          | `str`  | No       | input folder | Path to the output folder.                                                                                                          |
+| `password`               | `str`  | No       | `None`       | Password for the PDF file.                                                                                                          |
+| `replace_invalid_chars`  | `str`  | No       | `" "`       | Character to replace invalid or unrecognized characters (e.g., �, \u0000)                                                           |
+| `content_safety_off`     | `str`  | No       | `None`       | Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page. |
+| `generate_markdown`      | `bool` | No       | `False`      | If `True`, generates a Markdown output file.                                                                                        |
+| `generate_html`          | `bool` | No       | `False`      | If `True`, generates an HTML output file.                                                                                           |
+| `generate_annotated_pdf` | `bool` | No       | `False`      | If `True`, generates an annotated PDF output file.                                                                                  |
+| `keep_line_breaks`       | `bool` | No       | `False`      | If `True`, keeps line breaks in the output.                                                                                         |
+| `html_in_markdown`       | `bool` | No       | `False`      | If `True`, uses HTML in the Markdown output.                                                                                        |
+| `add_image_to_markdown`  | `bool` | No       | `False`      | If `True`, adds images to the Markdown output.                                                                                      |
+| `debug`                  | `bool` | No       | `False`      | If `True`, prints CLI messages to the console during execution.                                                                     |
+<br/>
+## Node.js / NPM
+**Note:** This package is a wrapper around a Java CLI and is intended for use in a Node.js backend environment. It cannot be used in a browser-based frontend.
+### Prerequisites
+- Java 11 or higher must be installed and available in your system's PATH.
+### Installation
+```sh
+npm install @opendataloader/pdf
+```
+### Usage
+- `inputPath` can be either the path to a single document or the path to a folder.
+- If you don’t specify an `outputFolder`, the output data will be saved in the same directory as the input document.
+```typescript
+import { run } from '@opendataloader/pdf';
+async function main() {
+  try {
+    const output = await run('path-to-document.pdf', {
+      outputFolder: 'path-to-output',
+      generateMarkdown: true,
+      generateHtml: true,
+      generateAnnotatedPdf: true,
+      debug: true,
+    });
+    console.log('PDF processing complete.', output);
+  } catch (error) {
+    console.error('Error processing PDF:', error);
+  }
+}
+main();
+```
+### Function: run()
+`run(inputPath: string, options?: RunOptions): Promise<string>`
+The main function to process PDFs.
+**Parameters**
+| Parameter   | Type     | Required | Description                           |
+| ----------- | -------- | -------- | ------------------------------------- |
+| `inputPath` | `string` | ✅ Yes    | Path to the input PDF file or folder. |
+| `options`   | `RunOptions` | No       | Configuration options for the run.    |
+**RunOptions**
+| Property                | Type      | Default       | Description                                                                 |
+| ----------------------- | --------- | ------------- | --------------------------------------------------------------------------- |
+| `outputFolder`          | `string`  | `undefined`   | Path to the output folder. If not set, output is saved next to the input.   |
+| `password`              | `string`  | `undefined`   | Password for the PDF file.                                                  |
+| `replaceInvalidChars`   | `string`  | `" "`         | Character to replace invalid or unrecognized characters (e.g., �, \u0000).  |
+| `content_safety_off`     | `string`  | `undefined`   | Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page. |
+| `generateMarkdown`      | `boolean` | `false`       | If `true`, generates a Markdown output file.                                |
+| `generateHtml`          | `boolean` | `false`       | If `true`, generates an HTML output file.                                   |
+| `generateAnnotatedPdf`  | `boolean` | `false`       | If `true`, generates an annotated PDF output file.                          |
+| `keepLineBreaks`        | `boolean` | `false`       | If `true`, keeps line breaks in the output.                                 |
+| `htmlInMarkdown`        | `boolean` | `false`       | If `true`, uses HTML in the Markdown output.                                |
+| `addImageToMarkdown`    | `boolean` | `false`       | If `true`, adds images to the Markdown output.                              |
+| `debug`                 | `boolean` | `false`       | If `true`, prints CLI messages to the console during execution.             |
 <br/>
 ## Java
+For various example templates, including Gradle and Maven, please refer to https://github.com/opendataloader-project/opendataloader-pdf/tree/main/examples/java.
 ### Dependency
 To include OpenDataLoader PDF in your Maven project, add the dependency below to your `pom.xml` file.
-Check for the latest version on [Maven Central](https://search.maven.org/artifact/io.github.opendataloader-project/opendataloader-pdf-core).
+Check for the latest version on [Maven Central](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core).
 ```xml
-    <dependency>
-        <groupId>io.github.opendataloader-project</groupId>
-        <artifactId>opendataloader-pdf-core</artifactId>
-        <version>0.0.12</version>
-    </dependency>
+<project>
+    <!-- other configurations... -->
+    <dependencies>
+        <dependency>
+            <groupId>org.opendataloader</groupId>
+            <artifactId>opendataloader-pdf-core</artifactId>
+            <version>0.0.15</version>
+        </dependency>
+    </dependencies>
     <repositories>
         <repository>
@@ -158,6 +237,9 @@ Check for the latest version on [Maven Central](https://search.maven.org/artifac
             <url>https://artifactory.openpreservation.org/artifactory/vera-dev</url>
         </pluginRepository>
     </pluginRepositories>
+    <!-- other configurations... -->
+</project>
 ```
@@ -166,51 +248,22 @@ Check for the latest version on [Maven Central](https://search.maven.org/artifac
 To integrate Layout recognition API into Java code, one can follow the sample code below.
 ```java
-import com.hancom.opendataloader.pdf.api.Config;
-import com.hancom.opendataloader.pdf.api.OpenDataLoaderPDF;
+import org.opendataloader.pdf.api.Config;
+import org.opendataloader.pdf.api.OpenDataLoaderPDF;
 import java.io.IOException;
 public class Sample {
     public static void main(String[] args) {
-        //create default config
         Config config = new Config();
-        //set output folder relative to the input PDF
-        //if the output folder is not set, the current folder of the input PDF is used
-        config.setOutputFolder("output");
-        //generating pdf output file
+        config.setOutputFolder("path/to/output");
         config.setGeneratePDF(true);
-        //set password of input pdf file
-        config.setPassword("password");
-        //generate markdown output file
         config.setGenerateMarkdown(true);
-        //generate html output file
         config.setGenerateHtml(true);
-        //enable html in markdown output file
-        config.setUseHTMLInMarkdown(true);
-        //add images to markdown output file
-        config.setAddImageToMarkdown(true);
-        //disable json output file
-        config.setGenerateJSON(false);
-        //keep line breaks
-        config.setKeepLineBreaks(true);
-        //find hidden text
-        config.setFindHiddenText(true);
         try {
-            //process pdf file
-            OpenDataLoaderPDF.processFile("input.pdf", config);
+            OpenDataLoaderPDF.processFile("path/to/document.pdf", config);
         } catch (Exception exception) {
             //exception during processing
         }
@@ -220,7 +273,7 @@ public class Sample {
 ### API Documentation
-The full API documentation is available at [javadoc](https://javadoc.io/doc/io.github.opendataloader-project/opendataloader-pdf-core/latest/)
+The full API documentation is available at [javadoc](https://javadoc.io/doc/org.opendataloader/opendataloader-pdf-core/latest/)
 <br/>
@@ -267,25 +320,27 @@ Additionally, annotated PDF with recognized structures, Markdown and Html are ge
 By default all line breaks and hyphenation characters are removed, the Markdown does not include any images and does not use any HTML.
-The option `--keeplinebreaks` to preserve the original line breaks text content in JSON and Markdown output.
-The option `--htmlinmarkdown` enables use of HTML in Markdown, which may improve Markdown preview in processors that support HTML tags.
-The option `--addimagetomarkdown` enables inclusion of image references into the output Markdown.
+The option `--keep-line-breaks` to preserve the original line breaks text content in JSON and Markdown output.
+The option `--content-safety-off` disables one or more content safety filters. Accepts a comma-separated list of filter names.
+The option `--markdown-with-html` enables use of HTML in Markdown, which may improve Markdown preview in processors that support HTML tags.
+The option `--markdown-with-images` enables inclusion of image references into the output Markdown.
+The option `--replace-invalid-chars` replaces invalid or unrecognized characters (e.g., �, \u0000) with the specified character.
 The images are extracted from PDF as individual files and stored in a subfolder next to the Markdown output.
 #### Available options:
 ```
 Options:
--f,--folder <arg>          Specify output folder (default the folder of the input PDF)
--klb,--keeplinebreaks      Keep line breaks
--ht,--findhiddentext       Find hidden text
--htmlmd,--htmlinmarkdown   Use html in markdown
--im,--addimagetomarkdown   Add images to markdown
--markdown,--markdown       Generates markdown output
--html,--html               Generates html output
--p,--password <arg>        Specifies password
--pdf,--pdf                 Generates pdf output
+-o,--output-dir <arg>           Specifies the output directory for generated files
+--keep-line-breaks              Preserves original line breaks in the extracted text
+--content-safety-off <arg>      Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page
+--markdown-with-html            Sets the data extraction output format to Markdown with rendering complex elements like tables as HTML for better structure
+--markdown-with-images          Sets the data extraction output format to Markdown with extracting images from the PDF and includes them as links
+--markdown                      Sets the data extraction output format to Markdown
+--html                          Sets the data extraction output format to HTML
+-p,--password <arg>             Specifies the password for an encrypted PDF
+--pdf                           Generates a new PDF file where the extracted layout data is visualized as annotations
+--replace-invalid-chars <arg>   Replaces invalid or unrecognized characters (e.g., �, \u0000) with the specified character
 ```
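
The option listing above renames several long flags and drops `--findhiddentext`. For callers that invoke the CLI directly, the renames can be sketched as a small translation step. `migrate_args` and the two tables are hypothetical helpers built only from the flag names in this diff. Short aliases such as `-f` or `-klb` are not handled, and the diff does not say whether `--content-safety-off` is a drop-in replacement for `--findhiddentext`, so the sketch only reports that flag instead of rewriting it.

```python
# Long flags from 0.0.15 mapped to the 0.0.16 names shown in the diff.
RENAMED_FLAGS = {
    "--folder": "--output-dir",
    "--keeplinebreaks": "--keep-line-breaks",
    "--htmlinmarkdown": "--markdown-with-html",
    "--addimagetomarkdown": "--markdown-with-images",
}

# Flags that no longer appear in the 0.0.16 option listing.
REMOVED_FLAGS = {"--findhiddentext"}


def migrate_args(old_args):
    """Return (new_args, dropped): renamed arguments and any removed flags found."""
    new_args, dropped = [], []
    for arg in old_args:
        if arg in REMOVED_FLAGS:
            dropped.append(arg)  # report it so the caller can decide what to do
        else:
            new_args.append(RENAMED_FLAGS.get(arg, arg))
    return new_args, dropped


new_args, dropped = migrate_args(
    ["--folder", "out", "--keeplinebreaks", "--findhiddentext", "--markdown", "in.pdf"]
)
print(new_args)
print(dropped)
```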
 ### Schema of the JSON output

{opendataloader_pdf-0.0.15.dist-info → opendataloader_pdf-0.0.16.dist-info}/RECORD RENAMED

@@ -1,7 +1,7 @@
 opendataloader_pdf/LICENSE,sha256=rxdbnZbuk8IaA2FS4bkFsLlTBNSujCySHHYJEAuo334,15921
 opendataloader_pdf/NOTICE.md,sha256=Uxc6sEbVz2hfsDinzzSNMtmsjx9HsQUod0yy0cswUwg,562
 opendataloader_pdf/__init__.py,sha256=T5RV-dcgjNCm8klNy_EH-IgOeodcPg6Yc34HHXtuAmQ,44
-opendataloader_pdf/wrapper.py,sha256=YuCPVrqZdoA6kg-_MiXYo9KvIkmRIY_QxDqem8Sd8V0,4666
+opendataloader_pdf/wrapper.py,sha256=bPy-wNmQfJpmCg9dVx9uNTrGfW446GdGNrlJnt0cosA,4960
 opendataloader_pdf/THIRD_PARTY/THIRD_PARTY_LICENSES.md,sha256=QRYYiXFS2zBDGdmWRo_SrRfGhrdRBwhiRo1SdUKfrQo,11235
 opendataloader_pdf/THIRD_PARTY/THIRD_PARTY_NOTICES.md,sha256=pB2ZitFM1u0x3rIDpMHsLxOe4OFNCZRqkzeR-bfpFzE,8911
 opendataloader_pdf/THIRD_PARTY/licenses/Apache-2.0.txt,sha256=z8d0m5b2O9McPEK1xHG_dWgUBT6EfBDz6wA0F7xSPTA,11358
@@ -13,8 +13,8 @@ opendataloader_pdf/THIRD_PARTY/licenses/LICENSE-JJ2000.txt,sha256=itSesIy3XiNWgJ
 opendataloader_pdf/THIRD_PARTY/licenses/MIT.txt,sha256=JPCdbR3BU0uO_KypOd3sGWnKwlVHGq4l0pmrjoGtop8,1078
 opendataloader_pdf/THIRD_PARTY/licenses/MPL-2.0.txt,sha256=CGF6Fx5WV7DJmRZJ8_6w6JEt2N9bu4p6zDo18fTHHRw,15818
 opendataloader_pdf/THIRD_PARTY/licenses/Plexus Classworlds License.txt,sha256=ZQuKXwVz4FeC34ApB20vYg8kPTwgIUKRzEk5ew74-hU,1937
-opendataloader_pdf/jar/opendataloader-pdf-cli.jar,sha256=GCTahEYOHGxpId3ce3pbkB4C2CVf2VbHMY78WjvzIk4,22126046
-opendataloader_pdf-0.0.15.dist-info/METADATA,sha256=7J_lFR5yzyXMywass6JZaUh6GSt3UG4nBInfMlS_c5c,18727
-opendataloader_pdf-0.0.15.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
-opendataloader_pdf-0.0.15.dist-info/top_level.txt,sha256=xee0qFQd6HPfS50E2NLICGuR6cq9C9At5SJ81yv5HkY,19
-opendataloader_pdf-0.0.15.dist-info/RECORD,,
+opendataloader_pdf/jar/opendataloader-pdf-cli.jar,sha256=DI0_vONCuUqmvKnVwkzUcRoA4HSv4B8EWqs27vb8u2w,22126375
+opendataloader_pdf-0.0.16.dist-info/METADATA,sha256=FjpkSNX7uz8YdehHMeZenaWi7ZVQKgJnJ-4RXAR_ITI,23689
+opendataloader_pdf-0.0.16.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
+opendataloader_pdf-0.0.16.dist-info/top_level.txt,sha256=xee0qFQd6HPfS50E2NLICGuR6cq9C9At5SJ81yv5HkY,19
+opendataloader_pdf-0.0.16.dist-info/RECORD,,
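
Each RECORD line above has the form `path,sha256=<digest>,<size>`. In the wheel format the digest is, to the best of my knowledge, the URL-safe base64 encoding of the raw SHA-256 hash with the trailing `=` padding removed. The sketch below shows how such an entry could be recomputed for a file's bytes. `record_hash` and `record_entry` are hypothetical helper names, not part of any packaging tool.

```python
import base64
import hashlib


def record_hash(data: bytes) -> str:
    """Return a RECORD-style hash: URL-safe base64 of SHA-256, padding stripped."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return "sha256=" + encoded


def record_entry(path: str, data: bytes) -> str:
    """Return one RECORD line: path, hash, and size in bytes."""
    return f"{path},{record_hash(data)},{len(data)}"


print(record_entry("pkg/empty.py", b""))
```

Comparing a recomputed entry for `wrapper.py` or the JAR against the lines in this hunk is one way to check that an unpacked wheel matches what the registry published.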

{opendataloader_pdf-0.0.15.dist-info → opendataloader_pdf-0.0.16.dist-info}/WHEEL RENAMED

File without changes

{opendataloader_pdf-0.0.15.dist-info → opendataloader_pdf-0.0.16.dist-info}/top_level.txt RENAMED

File without changes
