opendataloader-pdf 1.0.1__py3-none-any.whl → 1.0.3__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

This version of opendataloader-pdf might be problematic.

@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: opendataloader-pdf
- Version: 1.0.1
+ Version: 1.0.3
  Summary: A Python wrapper for the opendataloader-pdf Java CLI.
  Home-page: https://github.com/opendataloader-project/opendataloader-pdf
  Author: opendataloader-project
@@ -105,20 +105,20 @@ opendataloader_pdf.run(
 
  The main function to process PDFs.
 
- | Parameter | Type | Required | Default | Description |
- |--------------------------| ------ | -------- |--------------|-------------------------------------------------------------------------------------------------------------------------------------------|
- | `input_path` | `str` | ✅ Yes | — | Path to the input PDF file or folder. |
- | `output_folder` | `str` | No | input folder | Path to the output folder. |
- | `password` | `str` | No | `None` | Password for the PDF file. |
- | `replace_invalid_chars` | `str` | No | `" "` | Character to replace invalid or unrecognized characters (e.g., �, \u0000) |
- | `content_safety_off` | `str` | No | `None` | Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page, tiny. |
- | `generate_markdown` | `bool` | No | `False` | If `True`, generates a Markdown output file. |
- | `generate_html` | `bool` | No | `False` | If `True`, generates an HTML output file. |
- | `generate_annotated_pdf` | `bool` | No | `False` | If `True`, generates an annotated PDF output file. |
- | `keep_line_breaks` | `bool` | No | `False` | If `True`, keeps line breaks in the output. |
- | `html_in_markdown` | `bool` | No | `False` | If `True`, uses HTML in the Markdown output. |
- | `add_image_to_markdown` | `bool` | No | `False` | If `True`, adds images to the Markdown output. |
- | `debug` | `bool` | No | `False` | If `True`, prints CLI messages to the console during execution. |
+ | Parameter | Type | Required | Default | Description |
+ |--------------------------| ------ | -------- |--------------|---------------------------------------------------------------------------------------------------------------------------------------------|
+ | `input_path` | `str` | ✅ Yes | — | Path to the input PDF file or folder. |
+ | `output_folder` | `str` | No | input folder | Path to the output folder. |
+ | `password` | `str` | No | `None` | Password for the PDF file. |
+ | `replace_invalid_chars` | `str` | No | `" "` | Character to replace invalid or unrecognized characters (e.g., �, \u0000) |
+ | `content_safety_off` | `str` | No | `None` | Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page, tiny, hidden-ocg. |
+ | `generate_markdown` | `bool` | No | `False` | If `True`, generates a Markdown output file. |
+ | `generate_html` | `bool` | No | `False` | If `True`, generates an HTML output file. |
+ | `generate_annotated_pdf` | `bool` | No | `False` | If `True`, generates an annotated PDF output file. |
+ | `keep_line_breaks` | `bool` | No | `False` | If `True`, keeps line breaks in the output. |
+ | `html_in_markdown` | `bool` | No | `False` | If `True`, uses HTML in the Markdown output. |
+ | `add_image_to_markdown` | `bool` | No | `False` | If `True`, adds images to the Markdown output. |
+ | `debug` | `bool` | No | `False` | If `True`, prints CLI messages to the console during execution. |
 
  <br/>
 
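The only substantive change in this hunk is the `content_safety_off` row: 1.0.3 documents `hidden-ocg` as an additional filter name. As a minimal sketch, the helper below (the name `build_content_safety_off` is hypothetical and not part of the package) joins filter names into the comma-separated string the parameter accepts and rejects names that are not listed in the 1.0.3 table. The commented-out call uses only parameters shown in the table; the file path is a placeholder.

```python
# Filter names listed for `content_safety_off` in the 1.0.3 parameter table.
KNOWN_FILTERS = ("all", "hidden-text", "off-page", "tiny", "hidden-ocg")


def build_content_safety_off(filters):
    """Join filter names into a comma-separated string (hypothetical helper)."""
    cleaned = []
    for name in filters:
        name = name.strip()
        if name not in KNOWN_FILTERS:
            raise ValueError(f"unknown content safety filter: {name!r}")
        if name not in cleaned:  # drop duplicates, keep first-seen order
            cleaned.append(name)
    return ",".join(cleaned)


value = build_content_safety_off(["hidden-text", "hidden-ocg"])
print(value)  # hidden-text,hidden-ocg

# Assumed usage, based only on the parameter table (requires the installed package):
# import opendataloader_pdf
# opendataloader_pdf.run(input_path="example.pdf", content_safety_off=value)
```

Validating against a fixed tuple is a design choice for this sketch only; whether the 1.0.1 CLI rejects or ignores `hidden-ocg` is not stated in the diff.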
@@ -177,19 +177,19 @@ The main function to process PDFs.
 
  **RunOptions**
 
- | Property | Type | Default | Description |
- | ----------------------- | --------- | ------------- |-------------------------------------------------------------------------------------------------------------------------------------------|
- | `outputFolder` | `string` | `undefined` | Path to the output folder. If not set, output is saved next to the input. |
- | `password` | `string` | `undefined` | Password for the PDF file. |
- | `replaceInvalidChars` | `string` | `" "` | Character to replace invalid or unrecognized characters (e.g., , \u0000). |
- | `contentSafetyOff` | `string` | `undefined` | Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page, tiny. |
- | `generateMarkdown` | `boolean` | `false` | If `true`, generates a Markdown output file. |
- | `generateHtml` | `boolean` | `false` | If `true`, generates an HTML output file. |
- | `generateAnnotatedPdf` | `boolean` | `false` | If `true`, generates an annotated PDF output file. |
- | `keepLineBreaks` | `boolean` | `false` | If `true`, keeps line breaks in the output. |
- | `htmlInMarkdown` | `boolean` | `false` | If `true`, uses HTML in the Markdown output. |
- | `addImageToMarkdown` | `boolean` | `false` | If `true`, adds images to the Markdown output. |
- | `debug` | `boolean` | `false` | If `true`, prints CLI messages to the console during execution. |
+ | Property | Type | Default | Description |
+ | ----------------------- | --------- | ------------- |-------------------------------------------------------------------------------------------------------------------------------------------------------|
+ | `outputFolder` | `string` | `undefined` | Path to the output folder. If not set, output is saved next to the input. |
+ | `password` | `string` | `undefined` | Password for the PDF file. |
+ | `replaceInvalidChars` | `string` | `" "` | Character to replace invalid or unrecognized characters (e.g., , \u0000). |
+ | `contentSafetyOff` | `string` | `undefined` | Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page, tiny, hidden-ocg. |
+ | `generateMarkdown` | `boolean` | `false` | If `true`, generates a Markdown output file. |
+ | `generateHtml` | `boolean` | `false` | If `true`, generates an HTML output file. |
+ | `generateAnnotatedPdf` | `boolean` | `false` | If `true`, generates an annotated PDF output file. |
+ | `keepLineBreaks` | `boolean` | `false` | If `true`, keeps line breaks in the output. |
+ | `htmlInMarkdown` | `boolean` | `false` | If `true`, uses HTML in the Markdown output. |
+ | `addImageToMarkdown` | `boolean` | `false` | If `true`, adds images to the Markdown output. |
+ | `debug` | `boolean` | `false` | If `true`, prints CLI messages to the console during execution. |
 
  <br/>
 
@@ -331,7 +331,7 @@ The images are extracted from PDF as individual files and stored in a subfolder
  Options:
  -o,--output-dir <arg> Specifies the output directory for generated files
  --keep-line-breaks Preserves original line breaks in the extracted text
- --content-safety-off <arg> Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page, tiny
+ --content-safety-off <arg> Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page, tiny, hidden-ocg
  --markdown-with-html Sets the data extraction output format to Markdown with rendering complex elements like tables as HTML for better structure
  --markdown-with-images Sets the data extraction output format to Markdown with extracting images from the PDF and includes them as links
  --markdown Sets the data extraction output format to Markdown
@@ -13,8 +13,8 @@ opendataloader_pdf/THIRD_PARTY/licenses/LICENSE-JJ2000.txt,sha256=itSesIy3XiNWgJ
  opendataloader_pdf/THIRD_PARTY/licenses/MIT.txt,sha256=JPCdbR3BU0uO_KypOd3sGWnKwlVHGq4l0pmrjoGtop8,1078
  opendataloader_pdf/THIRD_PARTY/licenses/MPL-2.0.txt,sha256=CGF6Fx5WV7DJmRZJ8_6w6JEt2N9bu4p6zDo18fTHHRw,15818
  opendataloader_pdf/THIRD_PARTY/licenses/Plexus Classworlds License.txt,sha256=ZQuKXwVz4FeC34ApB20vYg8kPTwgIUKRzEk5ew74-hU,1937
- opendataloader_pdf/jar/opendataloader-pdf-cli.jar,sha256=pW9pLp40AhKPBn6UalczNCpzN2zesq0q7hhdl-hOSTw,20470360
- opendataloader_pdf-1.0.1.dist-info/METADATA,sha256=43ddIC8BVAzij8L3akH3MUq5Yd-C_80HVsVGb5cODz8,24374
- opendataloader_pdf-1.0.1.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
- opendataloader_pdf-1.0.1.dist-info/top_level.txt,sha256=xee0qFQd6HPfS50E2NLICGuR6cq9C9At5SJ81yv5HkY,19
- opendataloader_pdf-1.0.1.dist-info/RECORD,,
+ opendataloader_pdf/jar/opendataloader-pdf-cli.jar,sha256=Xj3vHN5EyNydUtNheaZT81gvATHhX_q1tYt9yRCt8EA,20472831
+ opendataloader_pdf-1.0.3.dist-info/METADATA,sha256=1nB-I81XSIeqVrFdiODmdJyKRx0yfbyNLPO_jhqRz5Q,24580
+ opendataloader_pdf-1.0.3.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
+ opendataloader_pdf-1.0.3.dist-info/top_level.txt,sha256=xee0qFQd6HPfS50E2NLICGuR6cq9C9At5SJ81yv5HkY,19
+ opendataloader_pdf-1.0.3.dist-info/RECORD,,
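
Each RECORD entry in this hunk has the form `path,sha256=<digest>,<size>`. In the wheel format the digest is the URL-safe base64 encoding of the raw SHA-256 hash with trailing `=` padding removed, and the RECORD file's own entry leaves the hash and size empty. A minimal sketch of parsing an entry and checking file contents against it, using only the standard library; the function names `parse_record_line`, `record_digest`, and `matches_record` are illustrative and do not come from the package.

```python
import base64
import csv
import hashlib


def parse_record_line(line):
    """Split one RECORD row into (path, algorithm, digest, size)."""
    path, hash_field, size_field = next(csv.reader([line]))
    if not hash_field:  # e.g. the RECORD file's own entry ends with ",,"
        return path, None, None, None
    algorithm, _, digest = hash_field.partition("=")
    return path, algorithm, digest, int(size_field)


def record_digest(data):
    """URL-safe base64 SHA-256 digest without '=' padding, as used in RECORD."""
    raw = hashlib.sha256(data).digest()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def matches_record(line, data):
    """Return True if `data` has the size and sha256 digest recorded in `line`."""
    _, algorithm, digest, size = parse_record_line(line)
    return algorithm == "sha256" and size == len(data) and digest == record_digest(data)


entry = ("opendataloader_pdf-1.0.3.dist-info/WHEEL,"
         "sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91")
print(parse_record_line(entry))
```

To check an extracted wheel, one would read each listed file in binary mode and pass its bytes to `matches_record` together with the corresponding RECORD line.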