opendataloader-pdf 1.1.0__py3-none-any.whl → 1.1.2__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of opendataloader-pdf might be problematic.

@@ -2,6 +2,7 @@ import argparse
  import subprocess
  import sys
  import importlib.resources as resources
+ import locale
  from pathlib import Path
  from typing import List, Optional
 
@@ -136,7 +137,7 @@ def run_jar(args: List[str], quiet: bool = False) -> str:
  capture_output=True,
  text=True,
  check=True,
- encoding="utf-8",
+ encoding=locale.getpreferredencoding(False),
  )
  return result.stdout
 
@@ -146,7 +147,7 @@ def run_jar(args: List[str], quiet: bool = False) -> str:
  stdout=subprocess.PIPE,
  stderr=subprocess.STDOUT,
  text=True,
- encoding="utf-8",
+ encoding=locale.getpreferredencoding(False),
  ) as process:
  output_lines: List[str] = []
  for line in process.stdout:
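
The two `run_jar` hunks above replace a hard-coded `"utf-8"` with the locale's preferred encoding when decoding the Java CLI's output. The diff does not state the motivation; presumably it lets output written in the platform's own encoding decode cleanly. A minimal sketch of the pattern follows. The helper `run_and_capture` is hypothetical (it is not part of the package), and the Python interpreter stands in for the JAR as the child process.

```python
import locale
import subprocess
import sys
from typing import List


def run_and_capture(cmd: List[str]) -> str:
    """Hypothetical helper: run cmd and decode stdout with the locale encoding."""
    result = subprocess.run(
        cmd,
        capture_output=True,
        text=True,
        check=True,
        # Passing False asks for the preferred encoding without calling
        # setlocale() as a side effect.
        encoding=locale.getpreferredencoding(False),
    )
    return result.stdout


if __name__ == "__main__":
    # The encoding name depends on the platform and environment.
    print("decoding with:", locale.getpreferredencoding(False))
    print(run_and_capture([sys.executable, "-c", "print('ok')"]).strip())
```

The trade-off is that decoding now depends on the host environment: the same bytes can decode differently on two machines, whereas a fixed `"utf-8"` behaves identically everywhere but can fail when the child process writes in another encoding.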
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: opendataloader-pdf
- Version: 1.1.0
+ Version: 1.1.2
  Summary: A Python wrapper for the opendataloader-pdf Java CLI.
  Home-page: https://github.com/opendataloader-project/opendataloader-pdf
  Author: opendataloader-project
@@ -152,22 +152,18 @@ npm install @opendataloader/pdf
 
  ### Usage
 
- - `inputPath` can be either the path to a single document or the path to a folder.
- - If you don’t specify an `outputFolder`, the output data will be saved in the same directory as the input document.
+ `inputPath` can be either the path to a single document or the path to a folder.
 
  ```typescript
- import { run } from '@opendataloader/pdf';
+ import { convert } from '@opendataloader/pdf';
 
  async function main() {
  try {
- const output = await run('path/to/document.pdf', {
- outputFolder: 'path/to/output',
- generateMarkdown: true,
- generateHtml: true,
- generateAnnotatedPdf: true,
- debug: true,
+ await convert(['path/to/document.pdf', 'path/to/folder'], {
+ outputDir: 'path/to/output',
+ format: ['json', 'html', 'pdf', 'markdown'],
  });
- console.log('PDF processing complete.', output);
+ console.log('convert() complete');
  } catch (error) {
  console.error('Error processing PDF:', error);
  }
@@ -175,54 +171,57 @@ async function main() {
 
  main();
  ```
+ ### Function: convert()
+
+ `convert(inputPaths: string[], options?: ConvertOptions): Promise<string>`
+
+ Multi-input helper matching the Python wrapper.
+
+ | Property | Type | Default | Description |
+ | --------------------------------| ---------- | ----------- | ----------------------------------------------------------------------------------------------------------------------------------- |
+ | `inputPaths` | `string[]` | — | One or more file paths or directories to process. |
+ | `options.outputDir` | `string` | `undefined` | Directory where outputs are written. |
+ | `options.password` | `string` | `undefined` | Password for encrypted PDFs. |
+ | `options.format` | `string[]` | `undefined` | Output formats (any combination of `json`, `text`, `html`, `pdf`, `markdown`, `markdown-with-html`, `markdown-with-images`). |
+ | `options.quiet` | `boolean` | `false` | Suppress CLI logging output and prevent streaming. |
+ | `options.contentSafetyOff` | `string[]` | `undefined` | Disable one or more content safety filters (`all`, `hidden-text`, `off-page`, `tiny`, `hidden-ocg`). |
+ | `options.keepLineBreaks` | `boolean` | `false` | Preserve line breaks in text output. |
+ | `options.replaceInvalidChars` | `string` | `undefined` | Replacement character for invalid or unrecognized characters. |
+
+ ### Function: run()
+
+ Deprecated.
 
- If you want to run it via CLI, you can use the following command:
+ ### CLI
 
  ```bash
- npx @opendataloader/pdf path/to/document.pdf -o path/to/output --markdown --html --pdf
+ npx @opendataloader/pdf path/to/document.pdf path/to/folder -o path/to/output -f json html pdf markdown
  ```
 
- or you can install it globally:
+ Or install globally:
 
  ```bash
  npm install -g @opendataloader/pdf
  ```
 
- then run:
+ Then run:
 
  ```bash
- opendataloader-pdf path/to/document.pdf -o path/to/output --markdown --html --pdf
+ opendataloader-pdf path/to/document.pdf path/to/folder -o path/to/output -f json html pdf markdown
  ```
 
- ### Function: run()
-
- `run(inputPath: string, options?: RunOptions): Promise<string>`
+ #### Available options
 
- The main function to process PDFs.
-
- **Parameters**
-
- | Parameter | Type | Required | Description |
- | ----------- | -------- | -------- | ------------------------------------- |
- | `inputPath` | `string` | ✅ Yes | Path to the input PDF file or folder. |
- | `options` | `RunOptions` | No | Configuration options for the run. |
-
- **RunOptions**
-
- | Property | Type | Default | Description |
- | ----------------------- | --------- | ------------- |-------------------------------------------------------------------------------------------------------------------------------------------------------|
- | `outputFolder` | `string` | `undefined` | Path to the output folder. If not set, output is saved next to the input. |
- | `password` | `string` | `undefined` | Password for the PDF file. |
- | `replaceInvalidChars` | `string` | `" "` | Character to replace invalid or unrecognized characters (e.g., , \u0000). |
- | `contentSafetyOff` | `string` | `undefined` | Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page, tiny, hidden-ocg. |
- | `generateMarkdown` | `boolean` | `false` | If `true`, generates a Markdown output file. |
- | `generateHtml` | `boolean` | `false` | If `true`, generates an HTML output file. |
- | `generateAnnotatedPdf` | `boolean` | `false` | If `true`, generates an annotated PDF output file. |
- | `keepLineBreaks` | `boolean` | `false` | If `true`, keeps line breaks in the output. |
- | `htmlInMarkdown` | `boolean` | `false` | If `true`, uses HTML in the Markdown output. |
- | `addImageToMarkdown` | `boolean` | `false` | If `true`, adds images to the Markdown output. |
- | `noJson` | `boolean` | `false` | If `true`, disables the JSON output. |
- | `debug` | `boolean` | `false` | If `true`, prints CLI messages to the console during execution. |
+ ```
+ -o, --output-dir <path> Directory where outputs are written
+ -p, --password <password> Password for encrypted PDFs
+ -f, --format <value...> Output formats to generate (json, text, html, pdf, markdown, markdown-with-html, markdown-with-images)
+ -q, --quiet Suppress CLI logging output
+ --content-safety-off <mode...> Disable one or more content safety filters (all, hidden-text, off-page, tiny, hidden-ocg)
+ --keep-line-breaks Preserve line breaks in text output
+ --replace-invalid-chars <c> Replacement character for invalid or unrecognized characters
+ -h, --help Show usage information
+ ```
 
  <br/>
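
The README hunk above pairs each `convert()` option with a CLI flag. As a rough illustration of that mapping, the sketch below turns option values into an argument list using the long flags from the "Available options" block. The function `build_cli_args` and its parameter names are hypothetical; the diff does not show how the actual wrappers assemble their arguments, and the flags accepted by the bundled JAR may differ.

```python
from typing import List, Optional, Sequence


def build_cli_args(
    input_paths: Sequence[str],
    output_dir: Optional[str] = None,
    password: Optional[str] = None,
    formats: Optional[Sequence[str]] = None,
    quiet: bool = False,
    content_safety_off: Optional[Sequence[str]] = None,
    keep_line_breaks: bool = False,
    replace_invalid_chars: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: map convert()-style options to CLI arguments."""
    # Input files and folders come first, as in the README's CLI example.
    args: List[str] = list(input_paths)
    if output_dir is not None:
        args += ["--output-dir", output_dir]
    if password is not None:
        args += ["--password", password]
    if formats:
        # Multi-value flag: one flag followed by every requested format.
        args += ["--format", *formats]
    if quiet:
        args.append("--quiet")
    if content_safety_off:
        args += ["--content-safety-off", *content_safety_off]
    if keep_line_breaks:
        args.append("--keep-line-breaks")
    if replace_invalid_chars is not None:
        args += ["--replace-invalid-chars", replace_invalid_chars]
    return args


if __name__ == "__main__":
    print(build_cli_args(
        ["path/to/document.pdf", "path/to/folder"],
        output_dir="path/to/output",
        formats=["json", "html", "pdf", "markdown"],
    ))
```

With the inputs shown, the resulting list mirrors the README's `-o path/to/output -f json html pdf markdown` example, spelled with long flags.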
 
@@ -2,7 +2,7 @@ opendataloader_pdf/LICENSE,sha256=rxdbnZbuk8IaA2FS4bkFsLlTBNSujCySHHYJEAuo334,15
  opendataloader_pdf/NOTICE.md,sha256=Uxc6sEbVz2hfsDinzzSNMtmsjx9HsQUod0yy0cswUwg,562
  opendataloader_pdf/__init__.py,sha256=xkTyVWNu1W2YrI1tPpGnd11DwwcwFDyBp8b4agLdd7A,64
  opendataloader_pdf/__main__.py,sha256=lmla4yz3SaYBfRJXOXnwO_8ID31-Ja20aQmomiz1eEc,84
- opendataloader_pdf/wrapper.py,sha256=0Erld0Cpu5rXhT46WpqNkrg0bBg15iBmaH-AU9CfdNU,8409
+ opendataloader_pdf/wrapper.py,sha256=h7QhPlHKFRL-ppeU2ZO-M7HKMFBJARPOENJhKRnMtFE,8477
  opendataloader_pdf/THIRD_PARTY/THIRD_PARTY_LICENSES.md,sha256=QRYYiXFS2zBDGdmWRo_SrRfGhrdRBwhiRo1SdUKfrQo,11235
  opendataloader_pdf/THIRD_PARTY/THIRD_PARTY_NOTICES.md,sha256=pB2ZitFM1u0x3rIDpMHsLxOe4OFNCZRqkzeR-bfpFzE,8911
  opendataloader_pdf/THIRD_PARTY/licenses/Apache-2.0.txt,sha256=z8d0m5b2O9McPEK1xHG_dWgUBT6EfBDz6wA0F7xSPTA,11358
@@ -14,9 +14,9 @@ opendataloader_pdf/THIRD_PARTY/licenses/LICENSE-JJ2000.txt,sha256=itSesIy3XiNWgJ
  opendataloader_pdf/THIRD_PARTY/licenses/MIT.txt,sha256=JPCdbR3BU0uO_KypOd3sGWnKwlVHGq4l0pmrjoGtop8,1078
  opendataloader_pdf/THIRD_PARTY/licenses/MPL-2.0.txt,sha256=CGF6Fx5WV7DJmRZJ8_6w6JEt2N9bu4p6zDo18fTHHRw,15818
  opendataloader_pdf/THIRD_PARTY/licenses/Plexus Classworlds License.txt,sha256=ZQuKXwVz4FeC34ApB20vYg8kPTwgIUKRzEk5ew74-hU,1937
- opendataloader_pdf/jar/opendataloader-pdf-cli.jar,sha256=0mrZFl8pVTrUIf8Nk09wN9KJPWW1U8ZMvdyB9sEPt-c,20485612
- opendataloader_pdf-1.1.0.dist-info/METADATA,sha256=jk-YV7lX5a9wqL57cLhAhJr_bgXEOBSY2RGL67lDN34,25270
- opendataloader_pdf-1.1.0.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
- opendataloader_pdf-1.1.0.dist-info/entry_points.txt,sha256=Tupa9pVNF6nXD9sqzCLI8PCHbSu0jKkL3SYyTkQy0dc,71
- opendataloader_pdf-1.1.0.dist-info/top_level.txt,sha256=xee0qFQd6HPfS50E2NLICGuR6cq9C9At5SJ81yv5HkY,19
- opendataloader_pdf-1.1.0.dist-info/RECORD,,
+ opendataloader_pdf/jar/opendataloader-pdf-cli.jar,sha256=ahAIoJG7eEEHkJteSfFvRKQpALaebK_fHJfmuiM1DtE,20485426
+ opendataloader_pdf-1.1.2.dist-info/METADATA,sha256=VlzNEdH2QJX7dhG3ZfsBe1l5BPB10d80ptIAO9P33uY,24565
+ opendataloader_pdf-1.1.2.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
+ opendataloader_pdf-1.1.2.dist-info/entry_points.txt,sha256=Tupa9pVNF6nXD9sqzCLI8PCHbSu0jKkL3SYyTkQy0dc,71
+ opendataloader_pdf-1.1.2.dist-info/top_level.txt,sha256=xee0qFQd6HPfS50E2NLICGuR6cq9C9At5SJ81yv5HkY,19
+ opendataloader_pdf-1.1.2.dist-info/RECORD,,