@opendataloader/pdf 0.0.0 → 0.0.16

This diff shows the contents of publicly available package versions released to a supported registry. It is provided for informational purposes only and reflects the changes between versions as they appear in their public registries.

Files changed (3)

package/README.md +112 -62
package/lib/opendataloader-pdf-cli.jar +0 -0
package/package.json +1 -1

package/README.md CHANGED

@@ -4,8 +4,9 @@
 [![License](https://img.shields.io/pypi/l/opendataloader-pdf.svg)](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/LICENSE)
 ![Java](https://img.shields.io/badge/Java-11+-blue.svg)
 ![Python](https://img.shields.io/badge/Python-3.8+-blue.svg)
-[![Maven Central](https://img.shields.io/maven-central/v/io.github.opendataloader-project/opendataloader-pdf-core.svg)](https://search.maven.org/artifact/io.github.opendataloader-project/opendataloader-pdf-core)
+[![Maven Central](https://img.shields.io/maven-central/v/org.opendataloader/opendataloader-pdf-core.svg)](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core)
 [![PyPI version](https://img.shields.io/pypi/v/opendataloader-pdf.svg)](https://pypi.org/project/opendataloader-pdf/)
+[![npm version](https://img.shields.io/npm/v/@opendataloader/pdf.svg)](https://www.npmjs.com/package/@opendataloader/pdf)
 [![GHCR Version](https://ghcr-badge.egpl.dev/opendataloader-project/opendataloader-pdf-cli/latest_tag?trim=major&label=docker-image)](https://github.com/opendataloader-project/opendataloader-pdf/pkgs/container/opendataloader-pdf-cli)
 [![Coverage](https://codecov.io/gh/opendataloader-project/opendataloader-pdf/branch/main/graph/badge.svg)](https://app.codecov.io/gh/opendataloader-project/opendataloader-pdf)
 [![CLA assistant](https://cla-assistant.io/readme/badge/opendataloader-project/opendataloader-pdf)](https://cla-assistant.io/opendataloader-project/opendataloader-pdf)
@@ -70,8 +71,8 @@ pip install -U opendataloader-pdf
 import opendataloader_pdf
 opendataloader_pdf.run(
-    input_path="path/to/document.pdf",
-    output_folder="path/to/output",
+    input_path="path-to-document.pdf",
+    output_folder="path-to-output",
     generate_markdown=True,
     generate_html=True,
     generate_annotated_pdf=True,
@@ -82,37 +83,115 @@ opendataloader_pdf.run(
 The main function to process PDFs.
-| Parameter               | Type   | Required | Default      | Description                                                                 |
-| ----------------------- | ------ | -------- |--------------|-----------------------------------------------------------------------------|
-| `input_path`            | `str`  | ✅ Yes    | —            | Path to the input PDF file or folder.                                       |
-| `output_folder`         | `str`  | No       | input folder | Path to the output folder.                                                  |
-| `password`              | `str`  | No       | `None`       | Password for the PDF file.                                                  |
-| `replace_invalid_chars` | `str`  | No       | `None`       | Character to replace invalid or unrecognized characters (e.g., �, \u0000)   |
-| `generate_markdown`     | `bool` | No       | `False`      | If `True`, generates a Markdown output file.                                |
-| `generate_html`         | `bool` | No       | `False`      | If `True`, generates an HTML output file.                                   |
-| `generate_annotated_pdf`| `bool` | No       | `False`      | If `True`, generates an annotated PDF output file.                          |
-| `keep_line_breaks`      | `bool` | No       | `False`      | If `True`, keeps line breaks in the output.                                 |
-| `find_hidden_text`      | `bool` | No       | `False`      | If `True`, finds hidden text in the PDF.                                    |
-| `html_in_markdown`      | `bool` | No       | `False`      | If `True`, uses HTML in the Markdown output.                                |
-| `add_image_to_markdown` | `bool` | No       | `False`      | If `True`, adds images to the Markdown output.                              |
-| `debug`                 | `bool` | No       | `False`      | If `True`, prints CLI messages to the console during execution.             |
+| Parameter                | Type   | Required | Default      | Description                                                                                                                         |
+|--------------------------| ------ | -------- |--------------|-------------------------------------------------------------------------------------------------------------------------------------|
+| `input_path`             | `str`  | ✅ Yes    | —            | Path to the input PDF file or folder.                                                                                               |
+| `output_folder`          | `str`  | No       | input folder | Path to the output folder.                                                                                                          |
+| `password`               | `str`  | No       | `None`       | Password for the PDF file.                                                                                                          |
+| `replace_invalid_chars`  | `str`  | No       | `" "`       | Character to replace invalid or unrecognized characters (e.g., �, \u0000)                                                           |
+| `content_safety_off`     | `str`  | No       | `None`       | Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page. |
+| `generate_markdown`      | `bool` | No       | `False`      | If `True`, generates a Markdown output file.                                                                                        |
+| `generate_html`          | `bool` | No       | `False`      | If `True`, generates an HTML output file.                                                                                           |
+| `generate_annotated_pdf` | `bool` | No       | `False`      | If `True`, generates an annotated PDF output file.                                                                                  |
+| `keep_line_breaks`       | `bool` | No       | `False`      | If `True`, keeps line breaks in the output.                                                                                         |
+| `html_in_markdown`       | `bool` | No       | `False`      | If `True`, uses HTML in the Markdown output.                                                                                        |
+| `add_image_to_markdown`  | `bool` | No       | `False`      | If `True`, adds images to the Markdown output.                                                                                      |
+| `debug`                  | `bool` | No       | `False`      | If `True`, prints CLI messages to the console during execution.                                                                     |
+<br/>
+## Node.js / NPM
+**Note:** This package is a wrapper around a Java CLI and is intended for use in a Node.js backend environment. It cannot be used in a browser-based frontend.
+### Prerequisites
+- Java 11 or higher must be installed and available in your system's PATH.
+### Installation
+```sh
+npm install @opendataloader/pdf
+```
+### Usage
+- `inputPath` can be either the path to a single document or the path to a folder.
+- If you don’t specify an `outputFolder`, the output data will be saved in the same directory as the input document.
+```typescript
+import { run } from '@opendataloader/pdf';
+async function main() {
+  try {
+    const output = await run('path-to-document.pdf', {
+      outputFolder: 'path-to-output',
+      generateMarkdown: true,
+      generateHtml: true,
+      generateAnnotatedPdf: true,
+      debug: true,
+    });
+    console.log('PDF processing complete.', output);
+  } catch (error) {
+    console.error('Error processing PDF:', error);
+  }
+}
+main();
+```
+### Function: run()
+`run(inputPath: string, options?: RunOptions): Promise<string>`
+The main function to process PDFs.
+**Parameters**
+| Parameter   | Type     | Required | Description                           |
+| ----------- | -------- | -------- | ------------------------------------- |
+| `inputPath` | `string` | ✅ Yes    | Path to the input PDF file or folder. |
+| `options`   | `RunOptions` | No       | Configuration options for the run.    |
+**RunOptions**
+| Property                | Type      | Default       | Description                                                                 |
+| ----------------------- | --------- | ------------- | --------------------------------------------------------------------------- |
+| `outputFolder`          | `string`  | `undefined`   | Path to the output folder. If not set, output is saved next to the input.   |
+| `password`              | `string`  | `undefined`   | Password for the PDF file.                                                  |
+| `replaceInvalidChars`   | `string`  | `" "`         | Character to replace invalid or unrecognized characters (e.g., �, \u0000). |
+| `content_safety_off`     | `string`  | `undefined`   | Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page. |
+| `generateMarkdown`      | `boolean` | `false`       | If `true`, generates a Markdown output file.                                |
+| `generateHtml`          | `boolean` | `false`       | If `true`, generates an HTML output file.                                   |
+| `generateAnnotatedPdf`  | `boolean` | `false`       | If `true`, generates an annotated PDF output file.                          |
+| `keepLineBreaks`        | `boolean` | `false`       | If `true`, keeps line breaks in the output.                                 |
+| `htmlInMarkdown`        | `boolean` | `false`       | If `true`, uses HTML in the Markdown output.                                |
+| `addImageToMarkdown`    | `boolean` | `false`       | If `true`, adds images to the Markdown output.                              |
+| `debug`                 | `boolean` | `false`       | If `true`, prints CLI messages to the console during execution.             |
 <br/>
 ## Java
+For various example templates, including Gradle and Maven, please refer to https://github.com/opendataloader-project/opendataloader-pdf/tree/main/examples/java.
 ### Dependency
 To include OpenDataLoader PDF in your Maven project, add the dependency below to your `pom.xml` file.
-Check for the latest version on [Maven Central](https://search.maven.org/artifact/io.github.opendataloader-project/opendataloader-pdf-core).
+Check for the latest version on [Maven Central](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core).
 ```xml
-    <dependency>
-        <groupId>io.github.opendataloader-project</groupId>
-        <artifactId>opendataloader-pdf-core</artifactId>
-        <version>0.0.12</version>
-    </dependency>
+<project>
+    <!-- other configurations... -->
+    <dependencies>
+        <dependency>
+            <groupId>org.opendataloader</groupId>
+            <artifactId>opendataloader-pdf-core</artifactId>
+            <version>0.0.15</version>
+        </dependency>
+    </dependencies>
     <repositories>
         <repository>
@@ -134,6 +213,9 @@ Check for the latest version on [Maven Central](https://search.maven.org/artifac
             <url>https://artifactory.openpreservation.org/artifactory/vera-dev</url>
         </pluginRepository>
     </pluginRepositories>
+    <!-- other configurations... -->
+</project>
 ```
@@ -142,54 +224,22 @@ Check for the latest version on [Maven Central](https://search.maven.org/artifac
 To integrate Layout recognition API into Java code, one can follow the sample code below.
 ```java
-import com.hancom.opendataloader.pdf.api.Config;
-import com.hancom.opendataloader.pdf.api.OpenDataLoaderPDF;
+import org.opendataloader.pdf.api.Config;
+import org.opendataloader.pdf.api.OpenDataLoaderPDF;
 import java.io.IOException;
 public class Sample {
     public static void main(String[] args) {
-        //create default config
         Config config = new Config();
-        //set output folder relative to the input PDF
-        //if the output folder is not set, the current folder of the input PDF is used
-        config.setOutputFolder("output");
-        //generating pdf output file
+        config.setOutputFolder("path/to/output");
         config.setGeneratePDF(true);
-        //set password of input pdf file
-        config.setPassword("password");
-        //generate markdown output file
         config.setGenerateMarkdown(true);
-        //generate html output file
         config.setGenerateHtml(true);
-        //enable html in markdown output file
-        config.setUseHTMLInMarkdown(true);
-        //add images to markdown output file
-        config.setAddImageToMarkdown(true);
-        //disable json output file
-        config.setGenerateJSON(false);
-        //keep line breaks
-        config.setKeepLineBreaks(true);
-        //find hidden text
-        config.setFindHiddenText(true);
-        //replace invalid chars with specified character
-        config.setReplaceInvalidChars("character");
         try {
-            //process pdf file
-            OpenDataLoaderPDF.processFile("input.pdf", config);
+            OpenDataLoaderPDF.processFile("path/to/document.pdf", config);
         } catch (Exception exception) {
             //exception during processing
         }
@@ -199,7 +249,7 @@ public class Sample {
 ### API Documentation
-The full API documentation is available at [javadoc](https://javadoc.io/doc/io.github.opendataloader-project/opendataloader-pdf-core/latest/)
+The full API documentation is available at [javadoc](https://javadoc.io/doc/org.opendataloader/opendataloader-pdf-core/latest/)
 <br/>
@@ -247,7 +297,7 @@ Additionally, annotated PDF with recognized structures, Markdown and Html are ge
 By default all line breaks and hyphenation characters are removed, the Markdown does not include any images and does not use any HTML.
 The option `--keep-line-breaks` to preserve the original line breaks text content in JSON and Markdown output.
+The option `--content-safety-off` disables one or more content safety filters. Accepts a comma-separated list of filter names.
 The option `--markdown-with-html` enables use of HTML in Markdown, which may improve Markdown preview in processors that support HTML tags.
 The option `--markdown-with-images` enables inclusion of image references into the output Markdown.
 The option `--replace-invalid-chars` replaces invalid or unrecognized characters (e.g., �, \u0000) with the specified character.
@@ -259,7 +309,7 @@ The images are extracted from PDF as individual files and stored in a subfolder
 Options:
 -o,--output-dir <arg>           Specifies the output directory for generated files
 --keep-line-breaks              Preserves original line breaks in the extracted text
--ht,--findhiddentext            Find hidden text
+--content-safety-off <arg>      Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page
 --markdown-with-html            Sets the data extraction output format to Markdown with rendering complex elements like tables as HTML for better structure
 --markdown-with-images          Sets the data extraction output format to Markdown with extracting images from the PDF and includes them as links
 --markdown                      Sets the data extraction output format to Markdown
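
The `--content-safety-off` option listed above takes a comma-separated list of filter names drawn from `all`, `hidden-text`, and `off-page`. As a minimal sketch, assuming a caller wants to validate names before handing them to the CLI or wrapper, a helper along these lines could build that value. The names `CONTENT_SAFETY_FILTERS`, `isContentSafetyFilter`, and `buildContentSafetyOff` are hypothetical and are not part of the package.

```typescript
// Hypothetical helper, not part of @opendataloader/pdf.
// The filter names are copied from the option description in the README diff.
const CONTENT_SAFETY_FILTERS = ['all', 'hidden-text', 'off-page'] as const;
type ContentSafetyFilter = (typeof CONTENT_SAFETY_FILTERS)[number];

// Type guard: true only for one of the three documented filter names.
function isContentSafetyFilter(name: string): name is ContentSafetyFilter {
  return (CONTENT_SAFETY_FILTERS as readonly string[]).includes(name);
}

// Trims and lowercases each name, rejects unknown names, removes duplicates,
// and joins the rest with commas in first-seen order.
function buildContentSafetyOff(names: string[]): string {
  const seen = new Set<ContentSafetyFilter>();
  for (const raw of names) {
    const name = raw.trim().toLowerCase();
    if (!isContentSafetyFilter(name)) {
      throw new Error(`Unknown content safety filter: ${raw}`);
    }
    seen.add(name);
  }
  return Array.from(seen).join(',');
}
```

The returned string is meant to be used as the `<arg>` of `--content-safety-off`. Rejecting unknown names early is a design choice of this sketch, not documented behavior of the CLI.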

package/lib/opendataloader-pdf-cli.jar CHANGED

Binary file
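
The README in this diff describes the package as a wrapper around a Java CLI and lists Java 11 or higher on the PATH as a prerequisite. As a rough sketch, assuming a caller wants to check that prerequisite up front, the text printed by `java -version` could be parsed as below. The function names `parseJavaMajorVersion` and `meetsJavaPrerequisite` are hypothetical and are not part of the package.

```typescript
// Hypothetical helper, not part of @opendataloader/pdf.
// Extracts the major Java version from the text printed by `java -version`.
function parseJavaMajorVersion(versionOutput: string): number | null {
  const match = /version "(\d+)(?:\.(\d+))?/.exec(versionOutput);
  if (match === null) return null;
  const first = Number(match[1]);
  // Java 8 and earlier report themselves as "1.x", for example "1.8.0_301".
  if (first === 1 && match[2] !== undefined) return Number(match[2]);
  return first;
}

// True when the reported version satisfies the Java 11+ prerequisite
// stated in the README; the minimum can be overridden by the caller.
function meetsJavaPrerequisite(versionOutput: string, minimum = 11): boolean {
  const major = parseJavaMajorVersion(versionOutput);
  return major !== null && major >= minimum;
}
```

Note that `java -version` writes its report to standard error, so a caller capturing the output of that command would read stderr and pass it to `meetsJavaPrerequisite`.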

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@opendataloader/pdf",
-  "version": "0.0.0",
+  "version": "0.0.16",
   "description": "A Node.js wrapper for the opendataloader-pdf Java CLI.",
   "main": "./dist/index.cjs",
   "module": "./dist/index.js",