kreuzberg 3.1.7__tar.gz → 3.3.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (187)
  1. kreuzberg-3.3.0/.commitlintrc +1 -0
  2. kreuzberg-3.3.0/.github/benchmarks/README.md +15 -0
  3. kreuzberg-3.3.0/.github/dependabot.yaml +6 -0
  4. kreuzberg-3.3.0/.github/workflows/ci.yaml +124 -0
  5. kreuzberg-3.3.0/.github/workflows/pr-title.yaml +20 -0
  6. kreuzberg-3.3.0/.github/workflows/release.yaml +31 -0
  7. kreuzberg-3.3.0/.gitignore +33 -0
  8. kreuzberg-3.3.0/.markdownlint.yaml +17 -0
  9. kreuzberg-3.3.0/.pre-commit-config.yaml +86 -0
  10. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/PKG-INFO +95 -34
  11. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/README.md +65 -8
  12. kreuzberg-3.3.0/ai-rulez.yaml +166 -0
  13. kreuzberg-3.3.0/benchmarks/README.md +152 -0
  14. kreuzberg-3.3.0/benchmarks/benchmark_baseline.py +117 -0
  15. kreuzberg-3.3.0/benchmarks/end_to_end_benchmark.py +239 -0
  16. kreuzberg-3.3.0/benchmarks/final_benchmark.py +147 -0
  17. kreuzberg-3.3.0/benchmarks/pyproject.toml +28 -0
  18. kreuzberg-3.3.0/benchmarks/results/baseline_results.json +35 -0
  19. kreuzberg-3.3.0/benchmarks/results/benchmark_msgpack_20250702_003800.json +50 -0
  20. kreuzberg-3.3.0/benchmarks/results/comprehensive_caching_results.json +55 -0
  21. kreuzberg-3.3.0/benchmarks/results/final_benchmark_results.json +12 -0
  22. kreuzberg-3.3.0/benchmarks/results/mime_caching_results.json +18 -0
  23. kreuzberg-3.3.0/benchmarks/results/msgspec_caching_results.json +10 -0
  24. kreuzberg-3.3.0/benchmarks/results/ocr_caching_results.json +17 -0
  25. kreuzberg-3.3.0/benchmarks/results/serialization_benchmark_results.json +42 -0
  26. kreuzberg-3.3.0/benchmarks/results/statistical_benchmark_results.json +26 -0
  27. kreuzberg-3.3.0/benchmarks/results/table_caching_results.json +17 -0
  28. kreuzberg-3.3.0/benchmarks/serialization_benchmark.py +167 -0
  29. kreuzberg-3.3.0/benchmarks/src/kreuzberg_benchmarks/__init__.py +3 -0
  30. kreuzberg-3.3.0/benchmarks/src/kreuzberg_benchmarks/__main__.py +6 -0
  31. kreuzberg-3.3.0/benchmarks/src/kreuzberg_benchmarks/benchmarks.py +274 -0
  32. kreuzberg-3.3.0/benchmarks/src/kreuzberg_benchmarks/cli.py +247 -0
  33. kreuzberg-3.3.0/benchmarks/src/kreuzberg_benchmarks/models.py +145 -0
  34. kreuzberg-3.3.0/benchmarks/src/kreuzberg_benchmarks/profiler.py +184 -0
  35. kreuzberg-3.3.0/benchmarks/src/kreuzberg_benchmarks/runner.py +278 -0
  36. kreuzberg-3.3.0/benchmarks/statistical_benchmark.py +220 -0
  37. kreuzberg-3.3.0/docs/advanced/custom-extractors.md +203 -0
  38. kreuzberg-3.3.0/docs/advanced/custom-hooks.md +148 -0
  39. kreuzberg-3.3.0/docs/advanced/error-handling.md +181 -0
  40. kreuzberg-3.3.0/docs/advanced/index.md +41 -0
  41. kreuzberg-3.3.0/docs/advanced/performance.md +224 -0
  42. kreuzberg-3.3.0/docs/api-reference/exceptions.md +33 -0
  43. kreuzberg-3.3.0/docs/api-reference/extraction-functions.md +59 -0
  44. kreuzberg-3.3.0/docs/api-reference/extractor-registry.md +5 -0
  45. kreuzberg-3.3.0/docs/api-reference/index.md +51 -0
  46. kreuzberg-3.3.0/docs/api-reference/ocr-configuration.md +27 -0
  47. kreuzberg-3.3.0/docs/api-reference/types.md +51 -0
  48. kreuzberg-3.3.0/docs/assets/favicon.png +0 -0
  49. kreuzberg-3.3.0/docs/assets/logo.png +0 -0
  50. kreuzberg-3.3.0/docs/changelog.md +30 -0
  51. kreuzberg-3.3.0/docs/cli.md +190 -0
  52. kreuzberg-3.3.0/docs/contributing.md +78 -0
  53. kreuzberg-3.3.0/docs/css/extra.css +56 -0
  54. kreuzberg-3.3.0/docs/examples/extraction-examples.md +195 -0
  55. kreuzberg-3.3.0/docs/examples/index.md +48 -0
  56. kreuzberg-3.3.0/docs/getting-started/index.md +20 -0
  57. kreuzberg-3.3.0/docs/getting-started/installation.md +117 -0
  58. kreuzberg-3.3.0/docs/getting-started/quick-start.md +111 -0
  59. kreuzberg-3.3.0/docs/index.md +15 -0
  60. kreuzberg-3.3.0/docs/user-guide/basic-usage.md +133 -0
  61. kreuzberg-3.3.0/docs/user-guide/chunking.md +124 -0
  62. kreuzberg-3.3.0/docs/user-guide/extraction-configuration.md +162 -0
  63. kreuzberg-3.3.0/docs/user-guide/index.md +40 -0
  64. kreuzberg-3.3.0/docs/user-guide/metadata-extraction.md +74 -0
  65. kreuzberg-3.3.0/docs/user-guide/ocr-backends.md +238 -0
  66. kreuzberg-3.3.0/docs/user-guide/ocr-configuration.md +161 -0
  67. kreuzberg-3.3.0/docs/user-guide/supported-formats.md +48 -0
  68. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/kreuzberg/__init__.py +3 -0
  69. kreuzberg-3.3.0/kreuzberg/__main__.py +8 -0
  70. kreuzberg-3.3.0/kreuzberg/_cli_config.py +175 -0
  71. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/kreuzberg/_extractors/_image.py +39 -4
  72. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/kreuzberg/_extractors/_pandoc.py +158 -18
  73. kreuzberg-3.3.0/kreuzberg/_extractors/_pdf.py +351 -0
  74. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/kreuzberg/_extractors/_presentation.py +1 -1
  75. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/kreuzberg/_extractors/_spread_sheet.py +65 -7
  76. kreuzberg-3.3.0/kreuzberg/_gmft.py +380 -0
  77. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/kreuzberg/_mime_types.py +62 -16
  78. kreuzberg-3.3.0/kreuzberg/_multiprocessing/__init__.py +6 -0
  79. kreuzberg-3.3.0/kreuzberg/_multiprocessing/gmft_isolated.py +332 -0
  80. kreuzberg-3.3.0/kreuzberg/_multiprocessing/process_manager.py +188 -0
  81. kreuzberg-3.3.0/kreuzberg/_multiprocessing/sync_tesseract.py +261 -0
  82. kreuzberg-3.3.0/kreuzberg/_multiprocessing/tesseract_pool.py +359 -0
  83. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/kreuzberg/_ocr/_easyocr.py +66 -10
  84. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/kreuzberg/_ocr/_paddleocr.py +86 -7
  85. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/kreuzberg/_ocr/_tesseract.py +136 -46
  86. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/kreuzberg/_playa.py +43 -0
  87. kreuzberg-3.3.0/kreuzberg/_utils/_cache.py +372 -0
  88. kreuzberg-3.3.0/kreuzberg/_utils/_device.py +356 -0
  89. kreuzberg-3.3.0/kreuzberg/_utils/_document_cache.py +220 -0
  90. kreuzberg-3.3.0/kreuzberg/_utils/_errors.py +232 -0
  91. kreuzberg-3.3.0/kreuzberg/_utils/_pdf_lock.py +72 -0
  92. kreuzberg-3.3.0/kreuzberg/_utils/_process_pool.py +100 -0
  93. kreuzberg-3.3.0/kreuzberg/_utils/_serialization.py +82 -0
  94. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/kreuzberg/_utils/_string.py +1 -1
  95. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/kreuzberg/_utils/_sync.py +21 -0
  96. kreuzberg-3.3.0/kreuzberg/cli.py +338 -0
  97. kreuzberg-3.3.0/kreuzberg/extraction.py +462 -0
  98. kreuzberg-3.3.0/mkdocs.yaml +155 -0
  99. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/pyproject.toml +79 -39
  100. kreuzberg-3.3.0/run_benchmarks.py +195 -0
  101. kreuzberg-3.3.0/scripts/__init__.py +1 -0
  102. kreuzberg-3.3.0/scripts/compare_benchmarks.py +100 -0
  103. kreuzberg-3.3.0/tests/__init__.py +0 -0
  104. kreuzberg-3.3.0/tests/chunker_test.py +102 -0
  105. kreuzberg-3.3.0/tests/cli_integration_test.py +523 -0
  106. kreuzberg-3.3.0/tests/cli_test.py +335 -0
  107. kreuzberg-3.3.0/tests/conftest.py +117 -0
  108. kreuzberg-3.3.0/tests/exceptions_test.py +101 -0
  109. kreuzberg-3.3.0/tests/extraction_batch_test.py +278 -0
  110. kreuzberg-3.3.0/tests/extraction_test.py +373 -0
  111. kreuzberg-3.3.0/tests/extractors/__init__.py +0 -0
  112. kreuzberg-3.3.0/tests/extractors/html_test.py +54 -0
  113. kreuzberg-3.3.0/tests/extractors/image_test.py +240 -0
  114. kreuzberg-3.3.0/tests/extractors/pandoc_metadata_test.py +323 -0
  115. kreuzberg-3.3.0/tests/extractors/pandoc_test.py +458 -0
  116. kreuzberg-3.3.0/tests/extractors/pdf_test.py +385 -0
  117. kreuzberg-3.3.0/tests/extractors/presentation_test.py +410 -0
  118. kreuzberg-3.3.0/tests/extractors/spreed_sheet_test.py +325 -0
  119. kreuzberg-3.3.0/tests/gmft_extended_test.py +163 -0
  120. kreuzberg-3.3.0/tests/gmft_test.py +383 -0
  121. kreuzberg-3.3.0/tests/hooks_test.py +205 -0
  122. kreuzberg-3.3.0/tests/mime_types_test.py +199 -0
  123. kreuzberg-3.3.0/tests/multiprocessing/__init__.py +1 -0
  124. kreuzberg-3.3.0/tests/multiprocessing/gmft_integration_test.py +104 -0
  125. kreuzberg-3.3.0/tests/multiprocessing/process_manager_test.py +282 -0
  126. kreuzberg-3.3.0/tests/multiprocessing/sync_tesseract_test.py +367 -0
  127. kreuzberg-3.3.0/tests/multiprocessing/tesseract_pool_test.py +349 -0
  128. kreuzberg-3.3.0/tests/ocr/__init__.py +0 -0
  129. kreuzberg-3.3.0/tests/ocr/base_test.py +79 -0
  130. kreuzberg-3.3.0/tests/ocr/device_integration_test.py +270 -0
  131. kreuzberg-3.3.0/tests/ocr/easyocr_test.py +462 -0
  132. kreuzberg-3.3.0/tests/ocr/init_test.py +41 -0
  133. kreuzberg-3.3.0/tests/ocr/paddleocr_test.py +857 -0
  134. kreuzberg-3.3.0/tests/ocr/tesseract_test.py +431 -0
  135. kreuzberg-3.3.0/tests/playa_test.py +111 -0
  136. kreuzberg-3.3.0/tests/registry_test.py +190 -0
  137. kreuzberg-3.3.0/tests/test_source_files/document.docx +0 -0
  138. kreuzberg-3.3.0/tests/test_source_files/excel-multi-sheet.xlsx +0 -0
  139. kreuzberg-3.3.0/tests/test_source_files/excel.xlsx +0 -0
  140. kreuzberg-3.3.0/tests/test_source_files/html.html +10 -0
  141. kreuzberg-3.3.0/tests/test_source_files/markdown.md +1 -0
  142. kreuzberg-3.3.0/tests/test_source_files/non-ascii-text.pdf +0 -0
  143. kreuzberg-3.3.0/tests/test_source_files/non-searchable.pdf +0 -0
  144. kreuzberg-3.3.0/tests/test_source_files/ocr-image.jpg +0 -0
  145. kreuzberg-3.3.0/tests/test_source_files/pdfs_with_tables/large.pdf +0 -0
  146. kreuzberg-3.3.0/tests/test_source_files/pdfs_with_tables/medium.pdf +0 -0
  147. kreuzberg-3.3.0/tests/test_source_files/pdfs_with_tables/tiny.pdf +0 -0
  148. kreuzberg-3.3.0/tests/test_source_files/pitch-deck-presentation.pptx +0 -0
  149. kreuzberg-3.3.0/tests/test_source_files/sample-contract.pdf +0 -0
  150. kreuzberg-3.3.0/tests/test_source_files/scanned.pdf +0 -0
  151. kreuzberg-3.3.0/tests/test_source_files/searchable.pdf +0 -0
  152. kreuzberg-3.3.0/tests/test_source_files/test-article.pdf +0 -0
  153. kreuzberg-3.3.0/tests/types_test.py +132 -0
  154. kreuzberg-3.3.0/tests/utils/__init__.py +0 -0
  155. kreuzberg-3.3.0/tests/utils/cache_test.py +473 -0
  156. kreuzberg-3.3.0/tests/utils/device_test.py +349 -0
  157. kreuzberg-3.3.0/tests/utils/errors_test.py +309 -0
  158. kreuzberg-3.3.0/tests/utils/pdf_lock_test.py +233 -0
  159. kreuzberg-3.3.0/tests/utils/process_pool_test.py +246 -0
  160. kreuzberg-3.3.0/tests/utils/serialization_test.py +336 -0
  161. kreuzberg-3.3.0/tests/utils/string_test.py +85 -0
  162. kreuzberg-3.3.0/tests/utils/sync_test.py +309 -0
  163. kreuzberg-3.3.0/tests/utils/tmp_test.py +50 -0
  164. kreuzberg-3.3.0/uv.lock +3395 -0
  165. kreuzberg-3.1.7/kreuzberg/_extractors/_pdf.py +0 -171
  166. kreuzberg-3.1.7/kreuzberg/_gmft.py +0 -174
  167. kreuzberg-3.1.7/kreuzberg/extraction.py +0 -251
  168. kreuzberg-3.1.7/kreuzberg.egg-info/PKG-INFO +0 -174
  169. kreuzberg-3.1.7/kreuzberg.egg-info/SOURCES.txt +0 -36
  170. kreuzberg-3.1.7/kreuzberg.egg-info/dependency_links.txt +0 -1
  171. kreuzberg-3.1.7/kreuzberg.egg-info/requires.txt +0 -35
  172. kreuzberg-3.1.7/kreuzberg.egg-info/top_level.txt +0 -1
  173. kreuzberg-3.1.7/setup.cfg +0 -4
  174. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/LICENSE +0 -0
  175. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/kreuzberg/_chunker.py +0 -0
  176. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/kreuzberg/_constants.py +0 -0
  177. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/kreuzberg/_extractors/__init__.py +0 -0
  178. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/kreuzberg/_extractors/_base.py +0 -0
  179. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/kreuzberg/_extractors/_html.py +0 -0
  180. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/kreuzberg/_ocr/__init__.py +0 -0
  181. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/kreuzberg/_ocr/_base.py +0 -0
  182. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/kreuzberg/_registry.py +0 -0
  183. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/kreuzberg/_types.py +0 -0
  184. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/kreuzberg/_utils/__init__.py +0 -0
  185. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/kreuzberg/_utils/_tmp.py +0 -0
  186. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/kreuzberg/exceptions.py +0 -0
  187. {kreuzberg-3.1.7 → kreuzberg-3.3.0}/kreuzberg/py.typed +0 -0
@@ -0,0 +1 @@
+ { "extends": ["@commitlint/config-conventional"] }
@@ -0,0 +1,15 @@
+ # Performance Baseline
+
+ This directory contains baseline performance metrics for the Kreuzberg library.
+
+ ## Files
+
+ - `baseline.json` - Performance baseline automatically updated from main branch CI
+ - This file is used for performance regression detection in PRs
+
+ ## How it works
+
+ 1. When code is pushed to `main`, CI runs benchmarks and stores results as `baseline.json`
+ 1. When PRs are opened, CI compares current performance against this baseline
+ 1. If performance degrades beyond threshold (20%), the CI check fails
+ 1. The baseline is automatically updated when new changes are merged to main
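The comparison described in steps 2 and 3 of the README hunk above can be sketched roughly as follows. This is a minimal illustration and not the project's own `scripts/compare_benchmarks.py`, whose contents are not shown here. The JSON layout (benchmark name mapped to a mean time in seconds), the function name `find_regressions`, and the sample numbers are all assumptions; only the 20% threshold comes from the README.

```python
THRESHOLD = 0.20  # allowed relative slowdown (20%), taken from the README above


def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    threshold: float = THRESHOLD,
) -> dict[str, float]:
    """Return {benchmark name: relative slowdown} for entries above the threshold."""
    regressions = {}
    for name, base_time in baseline.items():
        if name not in current or base_time <= 0:
            continue  # skip benchmarks missing from the current run or with unusable baselines
        slowdown = (current[name] - base_time) / base_time
        if slowdown > threshold:
            regressions[name] = slowdown
    return regressions


# Hypothetical timings in seconds, for illustration only.
baseline = {"pdf_extract": 1.00, "html_extract": 0.010}
current = {"pdf_extract": 1.30, "html_extract": 0.011}

regressions = find_regressions(baseline, current)
for name, slowdown in regressions.items():
    print(f"{name}: {slowdown:.0%} slower than baseline")
```

A CI step built on this idea would exit with a non-zero status whenever `regressions` is non-empty, which is what makes the check fail.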
@@ -0,0 +1,6 @@
+ version: 2
+ updates:
+   - package-ecosystem: "github-actions"
+     directory: "/"
+     schedule:
+       interval: "daily"
@@ -0,0 +1,124 @@
+ name: CI
+
+ on:
+   pull_request:
+     branches:
+       - main
+   push:
+     branches:
+       - main
+       - feat/smart-multiprocessing
+
+ jobs:
+   validate:
+     runs-on: ubuntu-latest
+     timeout-minutes: 10
+     steps:
+       - name: Checkout
+         uses: actions/checkout@v4
+
+       - name: Install uv
+         uses: astral-sh/setup-uv@v6
+         with:
+           enable-cache: true
+
+       - name: Set up Python
+         uses: actions/setup-python@v5
+         with:
+           python-version-file: "pyproject.toml"
+
+       - name: Install Dependencies
+         shell: bash
+         run: |
+           if [[ "${{ runner.os }}" == "Windows" ]] && [[ -d ".venv" ]]; then
+             echo "Removing existing .venv directory on Windows"
+             rm -rf .venv
+           fi
+           uv sync --all-packages --all-extras --dev
+
+       - name: Load Cached Pre-Commit Dependencies
+         id: cached-pre-commit-dependencies
+         uses: actions/cache@v4
+         with:
+           path: ~/.cache/pre-commit/
+           key: pre-commit|${{ env.pythonLocation }}|${{ hashFiles('.pre-commit-config.yaml') }}
+
+       - name: Execute Pre-Commit
+         run: uv run pre-commit run --show-diff-on-failure --color=always --all-files
+
+   test:
+     strategy:
+       matrix:
+         os: [ ubuntu-latest, macOS-latest, windows-latest ]
+         python: ${{ github.event_name == 'pull_request' && fromJSON('["3.13"]') || fromJSON('["3.9", "3.10", "3.11", "3.12", "3.13"]') }}
+     runs-on: ${{ matrix.os }}
+     timeout-minutes: 30
+     steps:
+       - name: Checkout
+         uses: actions/checkout@v4
+
+       - name: Install uv
+         uses: astral-sh/setup-uv@v6
+         with:
+           enable-cache: true
+
+       - name: Install Python
+         uses: actions/setup-python@v5
+         id: setup-python
+         with:
+           python-version: ${{ matrix.python }}
+
+       - name: Cache Python Dependencies
+         id: python-cache
+         uses: actions/cache@v4
+         with:
+           path: |
+             ~/.cache/uv
+             .venv
+           key: python-dependencies-${{ matrix.os }}-${{ steps.setup-python.outputs.python-version }}-${{ hashFiles('uv.lock') }}
+           restore-keys: |
+             python-dependencies-${{ matrix.os }}-${{ matrix.python }}-
+
+       - name: Install Dependencies
+         shell: bash
+         run: |
+           if [[ "${{ runner.os }}" == "Windows" ]] && [[ -d ".venv" ]]; then
+             echo "Removing existing .venv directory on Windows"
+             rm -rf .venv
+           fi
+           uv sync --all-packages --all-extras --dev
+
+       - name: Cache Test Artifacts
+         uses: actions/cache@v4
+         with:
+           path: .pytest_cache/
+           key: pytest-cache-${{ matrix.os }}-${{ matrix.python }}
+
+       - name: Cache and Install Homebrew (macOS)
+         if: runner.os == 'macOS'
+         uses: tecolicom/actions-use-homebrew-tools@v1
+         with:
+           tools: 'tesseract tesseract-lang pandoc'
+           key: 'homebrew-tools-${{ runner.os }}'
+           cache: yes
+           verbose: false
+
+       - name: Cache and Install APT Packages (Linux)
+         if: runner.os == 'Linux'
+         uses: awalsh128/cache-apt-pkgs-action@latest
+         with:
+           packages: tesseract-ocr tesseract-ocr-deu pandoc
+           version: 1.0
+
+       - name: Install System Dependencies (Windows)
+         if: runner.os == 'Windows'
+         run: |
+           choco install -y tesseract pandoc
+           Write-Output "C:\Program Files\Tesseract-OCR" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
+           Write-Output "C:\Program Files\Pandoc" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
+           $env:PATH = "C:\Program Files\Tesseract-OCR;C:\Program Files\Pandoc;" + $env:PATH
+           tesseract --version
+           pandoc --version
+
+       - name: Run Tests
+         run: uv run pytest -s -vvv
@@ -0,0 +1,20 @@
+ name: "Check PR Title"
+
+ on:
+   pull_request_target:
+     types:
+       - opened
+       - edited
+       - synchronize
+
+ permissions:
+   pull-requests: read
+
+ jobs:
+   main:
+     name: Validate PR title
+     runs-on: ubuntu-latest
+     steps:
+       - uses: amannn/action-semantic-pull-request@v5
+         env:
+           GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
@@ -0,0 +1,31 @@
+ name: Release
+
+ on:
+   release:
+     types: [published]
+
+ jobs:
+   release:
+     runs-on: ubuntu-latest
+     environment: pypi
+     permissions:
+       id-token: write
+     steps:
+       - name: Checkout
+         uses: actions/checkout@v4
+
+       - name: Install uv
+         uses: astral-sh/setup-uv@v6
+         with:
+           enable-cache: true
+
+       - name: Set up Python
+         uses: actions/setup-python@v5
+         with:
+           python-version-file: "pyproject.toml"
+
+       - name: Install Dependencies
+         run: uv build
+
+       - name: Publish
+         uses: pypa/gh-action-pypi-publish@release/v1
@@ -0,0 +1,33 @@
+ *$py.class
+ *.Cache
+ *.cscfg
+ *.egg-info/
+ *.log
+ *.py[cod]
+ *.suo
+ *.user
+ .DS_store
+ .coverage
+ .coverage*
+ .dist/
+ .env
+ .idea/
+ .mypy_cache/
+ .pytest_cache/
+ .python-version
+ .ruff_cache/
+ .run/
+ .venv/
+ .vscode/
+ .windsurfrules
+ .cursorrules
+ CLAUDE.md
+ GEMINI.md
+ __pycache__/
+ coverage.xml
+ prompt_template.egg-info/
+ requirements.txt
+ Dockerfile
+ docker-compose.yaml
+ benchmark_results.json
+ .kreuzberg/
@@ -0,0 +1,17 @@
+ default: true
+
+ MD007:
+   indent: 4
+
+ MD033: false
+
+ MD041: false
+
+ MD013: false
+
+ MD014: false
+
+ MD024:
+   siblings_only: true
+
+ MD046: false
@@ -0,0 +1,86 @@
+ repos:
+   - repo: https://github.com/alessandrojcm/commitlint-pre-commit-hook
+     rev: "v9.22.0"
+     hooks:
+       - id: commitlint
+         stages: [commit-msg]
+         additional_dependencies: ["@commitlint/config-conventional"]
+   - repo: https://github.com/Goldziher/ai-rulez
+     rev: v1.1.2
+     hooks:
+       - id: ai-rulez-validate
+       - id: ai-rulez-generate
+   - repo: https://github.com/pre-commit/pre-commit-hooks
+     rev: v5.0.0
+     hooks:
+       - id: name-tests-test
+         args:
+           - --pytest
+         exclude: factories|test_utils|completion.py|test_data
+       - id: trailing-whitespace
+       - id: end-of-file-fixer
+       - id: check-toml
+       - id: check-case-conflict
+       - id: detect-private-key
+   - repo: https://github.com/abravalheri/validate-pyproject
+     rev: v0.24.1
+     hooks:
+       - id: validate-pyproject
+   - repo: https://github.com/executablebooks/mdformat
+     rev: 0.7.22
+     hooks:
+       - id: mdformat
+         additional_dependencies:
+           - mdformat-mkdocs==4.0.0
+   - repo: https://github.com/igorshubovych/markdownlint-cli
+     rev: v0.45.0
+     hooks:
+       - id: markdownlint-fix
+   - repo: https://github.com/adamchainz/blacken-docs
+     rev: 1.19.1
+     hooks:
+       - id: blacken-docs
+         args: ["--pyi", "--line-length", "130"]
+         additional_dependencies:
+           - black==25.1.0
+   - repo: https://github.com/rbubley/mirrors-prettier
+     rev: "v3.6.2"
+     hooks:
+       - id: prettier
+         exclude: ^tests|^.idea|^migrations|^.git|README.md|^docs
+   - repo: https://github.com/tox-dev/pyproject-fmt
+     rev: "v2.6.0"
+     hooks:
+       - id: pyproject-fmt
+   - repo: https://github.com/astral-sh/ruff-pre-commit
+     rev: v0.12.1
+     hooks:
+       - id: ruff
+         args: ["--fix", "--unsafe-fixes"]
+       - id: ruff-format
+   - repo: https://github.com/codespell-project/codespell
+     rev: v2.4.1
+     hooks:
+       - id: codespell
+         exclude: ^tests|^scripts|^kreuzberg/_tesseract|^kreuzberg/_mime_types
+         additional_dependencies:
+           - tomli
+   - repo: https://github.com/jsh9/pydoclint
+     rev: 0.6.7
+     hooks:
+       - id: pydoclint
+         args:
+           [
+             --style=google,
+             --check-return-types=False,
+             --arg-type-hints-in-docstring=False,
+           ]
+         exclude: ^benchmarks/|^kreuzberg/_|^tests/|^scripts/|^run_benchmarks\.py
+   - repo: local
+     hooks:
+       - id: mypy
+         name: mypy
+         entry: uv run mypy
+         require_serial: true
+         language: system
+         types: [python]
@@ -1,56 +1,60 @@
  Metadata-Version: 2.4
  Name: kreuzberg
- Version: 3.1.7
+ Version: 3.3.0
  Summary: A text extraction library supporting PDFs, images, office documents and more
+ Project-URL: homepage, https://github.com/Goldziher/kreuzberg
  Author-email: Na'aman Hirschfeld <nhirschfed@gmail.com>
  License: MIT
- Project-URL: homepage, https://github.com/Goldziher/kreuzberg
+ License-File: LICENSE
  Keywords: document-processing,image-to-text,ocr,pandoc,pdf-extraction,rag,table-extraction,tesseract,text-extraction,text-processing
  Classifier: Development Status :: 4 - Beta
  Classifier: Intended Audience :: Developers
  Classifier: License :: OSI Approved :: MIT License
  Classifier: Operating System :: OS Independent
  Classifier: Programming Language :: Python :: 3 :: Only
- Classifier: Programming Language :: Python :: 3.9
- Classifier: Programming Language :: Python :: 3.10
- Classifier: Programming Language :: Python :: 3.11
- Classifier: Programming Language :: Python :: 3.12
  Classifier: Programming Language :: Python :: 3.13
  Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
  Classifier: Topic :: Software Development :: Libraries :: Python Modules
  Classifier: Topic :: Text Processing :: General
  Classifier: Topic :: Utilities
  Classifier: Typing :: Typed
- Requires-Python: >=3.9
- Description-Content-Type: text/markdown
- License-File: LICENSE
+ Requires-Python: >=3.13
  Requires-Dist: anyio>=4.9.0
  Requires-Dist: charset-normalizer>=3.4.2
- Requires-Dist: exceptiongroup>=1.2.2; python_version < "3.11"
- Requires-Dist: html-to-markdown>=1.3.3
- Requires-Dist: playa-pdf>=0.5.1
+ Requires-Dist: exceptiongroup>=1.2.2; python_version < '3.11'
+ Requires-Dist: html-to-markdown>=1.4.0
+ Requires-Dist: msgspec>=0.18.0
+ Requires-Dist: playa-pdf>=0.6.1
+ Requires-Dist: psutil>=7.0.0
  Requires-Dist: pypdfium2==4.30.0
  Requires-Dist: python-calamine>=0.3.2
  Requires-Dist: python-pptx>=1.0.2
- Requires-Dist: typing-extensions>=4.14.0; python_version < "3.12"
+ Requires-Dist: typing-extensions>=4.14.0; python_version < '3.12'
  Provides-Extra: all
- Requires-Dist: easyocr>=1.7.2; extra == "all"
- Requires-Dist: gmft>=0.4.1; extra == "all"
- Requires-Dist: paddleocr>=3.0.1; extra == "all"
- Requires-Dist: paddlepaddle>=3.0.0; extra == "all"
- Requires-Dist: semantic-text-splitter>=0.27.0; extra == "all"
- Requires-Dist: setuptools>=80.9.0; extra == "all"
+ Requires-Dist: click>=8.2.1; extra == 'all'
+ Requires-Dist: easyocr>=1.7.2; extra == 'all'
+ Requires-Dist: gmft>=0.4.2; extra == 'all'
+ Requires-Dist: paddleocr>=3.1.0; extra == 'all'
+ Requires-Dist: paddlepaddle>=3.1.0; extra == 'all'
+ Requires-Dist: rich>=14.0.0; extra == 'all'
+ Requires-Dist: semantic-text-splitter>=0.27.0; extra == 'all'
+ Requires-Dist: setuptools>=80.9.0; extra == 'all'
+ Requires-Dist: tomli>=2.0.0; (python_version < '3.11') and extra == 'all'
  Provides-Extra: chunking
- Requires-Dist: semantic-text-splitter>=0.27.0; extra == "chunking"
+ Requires-Dist: semantic-text-splitter>=0.27.0; extra == 'chunking'
+ Provides-Extra: cli
+ Requires-Dist: click>=8.2.1; extra == 'cli'
+ Requires-Dist: rich>=14.0.0; extra == 'cli'
+ Requires-Dist: tomli>=2.0.0; (python_version < '3.11') and extra == 'cli'
  Provides-Extra: easyocr
- Requires-Dist: easyocr>=1.7.2; extra == "easyocr"
+ Requires-Dist: easyocr>=1.7.2; extra == 'easyocr'
  Provides-Extra: gmft
- Requires-Dist: gmft>=0.4.1; extra == "gmft"
+ Requires-Dist: gmft>=0.4.2; extra == 'gmft'
  Provides-Extra: paddleocr
- Requires-Dist: paddleocr>=3.0.1; extra == "paddleocr"
- Requires-Dist: paddlepaddle>=3.0.0; extra == "paddleocr"
- Requires-Dist: setuptools>=80.9.0; extra == "paddleocr"
- Dynamic: license-file
+ Requires-Dist: paddleocr>=3.1.0; extra == 'paddleocr'
+ Requires-Dist: paddlepaddle>=3.1.0; extra == 'paddleocr'
+ Requires-Dist: setuptools>=80.9.0; extra == 'paddleocr'
+ Description-Content-Type: text/markdown
 
  # Kreuzberg
 
@@ -68,6 +72,7 @@ Kreuzberg is a Python library for text extraction from documents. It provides a
  - **Resource Efficient**: Lightweight processing without GPU requirements
  - **Format Support**: Comprehensive support for documents, images, and text formats
  - **Multiple OCR Engines**: Support for Tesseract, EasyOCR, and PaddleOCR
+ - **Command Line Interface**: Powerful CLI for batch processing and automation
  - **Metadata Extraction**: Get document metadata alongside text content
  - **Table Extraction**: Extract tables from documents using the excellent GMFT library
  - **Modern Python**: Built with async/await, type hints, and a functional-first approach
@@ -77,6 +82,9 @@ Kreuzberg is a Python library for text extraction from documents. It provides a
 
  ```bash
  pip install kreuzberg
+
+ # Or install with CLI support
+ pip install "kreuzberg[cli]"
  ```
 
  Install pandoc:
@@ -126,12 +134,53 @@ async def main():
  asyncio.run(main())
  ```
 
+ ## Command Line Interface
+
+ Kreuzberg includes a powerful CLI for processing documents from the command line:
+
+ ```bash
+ # Extract text from a file
+ kreuzberg extract document.pdf
+
+ # Extract with JSON output and metadata
+ kreuzberg extract document.pdf --output-format json --show-metadata
+
+ # Extract from stdin
+ cat document.html | kreuzberg extract
+
+ # Use specific OCR backend
+ kreuzberg extract image.png --ocr-backend easyocr --easyocr-languages en,de
+
+ # Extract with configuration file
+ kreuzberg extract document.pdf --config config.toml
+ ```
+
+ ### CLI Configuration
+
+ Configure via `pyproject.toml`:
+
+ ```toml
+ [tool.kreuzberg]
+ force_ocr = true
+ chunk_content = false
+ extract_tables = true
+ max_chars = 4000
+ ocr_backend = "tesseract"
+
+ [tool.kreuzberg.tesseract]
+ language = "eng+deu"
+ psm = 3
+ ```
+
+ For full CLI documentation, see the [CLI Guide](https://goldziher.github.io/kreuzberg/cli/).
+
  ## Documentation
 
  For comprehensive documentation, visit our [GitHub Pages](https://goldziher.github.io/kreuzberg/):
 
  - [Getting Started](https://goldziher.github.io/kreuzberg/getting-started/) - Installation and basic usage
  - [User Guide](https://goldziher.github.io/kreuzberg/user-guide/) - In-depth usage information
+ - [CLI Guide](https://goldziher.github.io/kreuzberg/cli/) - Command-line interface documentation
  - [API Reference](https://goldziher.github.io/kreuzberg/api-reference/) - Detailed API documentation
  - [Examples](https://goldziher.github.io/kreuzberg/examples/) - Code examples for common use cases
  - [OCR Configuration](https://goldziher.github.io/kreuzberg/user-guide/ocr-configuration/) - Configure OCR engines
@@ -157,17 +206,29 @@ Kreuzberg supports multiple OCR engines:
 
  For comparison and selection guidance, see the [OCR Backends](https://goldziher.github.io/kreuzberg/user-guide/ocr-backends/) documentation.
 
- ## Contribution
+ ## Performance
+
+ Kreuzberg offers both sync and async APIs. Choose the right one based on your use case:
+
+ | Operation | Sync Time | Async Time | Async Advantage |
+ | ---------------------- | --------- | ---------- | ------------------ |
+ | Simple text (Markdown) | 0.4ms | 17.5ms | **❌ 41x slower** |
+ | HTML documents | 1.6ms | 1.1ms | **✅ 1.5x faster** |
+ | Complex PDFs | 39.0s | 8.5s | **✅ 4.6x faster** |
+ | OCR processing | 0.4s | 0.7s | **✅ 1.7x faster** |
+ | Batch operations | 38.6s | 8.5s | **✅ 4.5x faster** |
+
+ **Rule of thumb:**
+
+ - Use **sync** for simple documents and CLI applications
+ - Use **async** for complex PDFs, OCR, and batch processing
+ - Use **batch operations** for multiple files
 
- This library is open to contribution. Feel free to open issues or submit PRs. It's better to discuss issues before submitting PRs to avoid disappointment.
+ For detailed benchmarks and methodology, see our [Performance Documentation](https://goldziher.github.io/kreuzberg/advanced/performance/).
 
- ### Local Development
+ ## Contributing
 
- - Clone the repo
- - Install the system dependencies
- - Install the full dependencies with `uv sync`
- - Install the pre-commit hooks with: `pre-commit install && pre-commit install --hook-type commit-msg`
- - Make your changes and submit a PR
+ We welcome contributions! Please see our [Contributing Guide](docs/contributing.md) for details on setting up your development environment and submitting pull requests.
 
  ## License
 
@@ -14,6 +14,7 @@ Kreuzberg is a Python library for text extraction from documents. It provides a
  - **Resource Efficient**: Lightweight processing without GPU requirements
  - **Format Support**: Comprehensive support for documents, images, and text formats
  - **Multiple OCR Engines**: Support for Tesseract, EasyOCR, and PaddleOCR
+ - **Command Line Interface**: Powerful CLI for batch processing and automation
  - **Metadata Extraction**: Get document metadata alongside text content
  - **Table Extraction**: Extract tables from documents using the excellent GMFT library
  - **Modern Python**: Built with async/await, type hints, and a functional-first approach
@@ -23,6 +24,9 @@ Kreuzberg is a Python library for text extraction from documents. It provides a
 
  ```bash
  pip install kreuzberg
+
+ # Or install with CLI support
+ pip install "kreuzberg[cli]"
  ```
 
  Install pandoc:
@@ -72,12 +76,53 @@ async def main():
  asyncio.run(main())
  ```
 
+ ## Command Line Interface
+
+ Kreuzberg includes a powerful CLI for processing documents from the command line:
+
+ ```bash
+ # Extract text from a file
+ kreuzberg extract document.pdf
+
+ # Extract with JSON output and metadata
+ kreuzberg extract document.pdf --output-format json --show-metadata
+
+ # Extract from stdin
+ cat document.html | kreuzberg extract
+
+ # Use specific OCR backend
+ kreuzberg extract image.png --ocr-backend easyocr --easyocr-languages en,de
+
+ # Extract with configuration file
+ kreuzberg extract document.pdf --config config.toml
+ ```
+
+ ### CLI Configuration
+
+ Configure via `pyproject.toml`:
+
+ ```toml
+ [tool.kreuzberg]
+ force_ocr = true
+ chunk_content = false
+ extract_tables = true
+ max_chars = 4000
+ ocr_backend = "tesseract"
+
+ [tool.kreuzberg.tesseract]
+ language = "eng+deu"
+ psm = 3
+ ```
+
+ For full CLI documentation, see the [CLI Guide](https://goldziher.github.io/kreuzberg/cli/).
+
  ## Documentation
 
  For comprehensive documentation, visit our [GitHub Pages](https://goldziher.github.io/kreuzberg/):
 
  - [Getting Started](https://goldziher.github.io/kreuzberg/getting-started/) - Installation and basic usage
  - [User Guide](https://goldziher.github.io/kreuzberg/user-guide/) - In-depth usage information
+ - [CLI Guide](https://goldziher.github.io/kreuzberg/cli/) - Command-line interface documentation
  - [API Reference](https://goldziher.github.io/kreuzberg/api-reference/) - Detailed API documentation
  - [Examples](https://goldziher.github.io/kreuzberg/examples/) - Code examples for common use cases
  - [OCR Configuration](https://goldziher.github.io/kreuzberg/user-guide/ocr-configuration/) - Configure OCR engines
@@ -103,17 +148,29 @@ Kreuzberg supports multiple OCR engines:
 
  For comparison and selection guidance, see the [OCR Backends](https://goldziher.github.io/kreuzberg/user-guide/ocr-backends/) documentation.
 
- ## Contribution
+ ## Performance
+
+ Kreuzberg offers both sync and async APIs. Choose the right one based on your use case:
+
+ | Operation | Sync Time | Async Time | Async Advantage |
+ | ---------------------- | --------- | ---------- | ------------------ |
+ | Simple text (Markdown) | 0.4ms | 17.5ms | **❌ 41x slower** |
+ | HTML documents | 1.6ms | 1.1ms | **✅ 1.5x faster** |
+ | Complex PDFs | 39.0s | 8.5s | **✅ 4.6x faster** |
+ | OCR processing | 0.4s | 0.7s | **✅ 1.7x faster** |
+ | Batch operations | 38.6s | 8.5s | **✅ 4.5x faster** |
+
+ **Rule of thumb:**
+
+ - Use **sync** for simple documents and CLI applications
+ - Use **async** for complex PDFs, OCR, and batch processing
+ - Use **batch operations** for multiple files
 
- This library is open to contribution. Feel free to open issues or submit PRs. It's better to discuss issues before submitting PRs to avoid disappointment.
+ For detailed benchmarks and methodology, see our [Performance Documentation](https://goldziher.github.io/kreuzberg/advanced/performance/).
 
- ### Local Development
+ ## Contributing
 
- - Clone the repo
- - Install the system dependencies
- - Install the full dependencies with `uv sync`
- - Install the pre-commit hooks with: `pre-commit install && pre-commit install --hook-type commit-msg`
- - Make your changes and submit a PR
+ We welcome contributions! Please see our [Contributing Guide](docs/contributing.md) for details on setting up your development environment and submitting pull requests.
 
  ## License
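As a sanity check on the Performance table added in this hunk, the sync/async ratios can be recomputed from the displayed times. The sketch below uses only the rounded figures shown in the table; the published multipliers were presumably derived from unrounded measurements, so small differences (for example in the Markdown row) are to be expected. For the OCR row, however, the displayed times (0.4s sync, 0.7s async) imply that async is slower, which does not match that row's "1.7x faster" label. The helper name `describe` is illustrative only.

```python
# (operation, sync seconds, async seconds) as displayed in the table above
ROWS = [
    ("Simple text (Markdown)", 0.0004, 0.0175),
    ("HTML documents", 0.0016, 0.0011),
    ("Complex PDFs", 39.0, 8.5),
    ("OCR processing", 0.4, 0.7),
    ("Batch operations", 38.6, 8.5),
]


def describe(sync_s: float, async_s: float) -> str:
    """Express the async time relative to the sync time as 'N.Nx faster' or 'N.Nx slower'."""
    if async_s <= sync_s:
        return f"{sync_s / async_s:.1f}x faster"
    return f"{async_s / sync_s:.1f}x slower"


summary = {name: describe(sync_s, async_s) for name, sync_s, async_s in ROWS}
for name, verdict in summary.items():
    print(f"{name}: async is {verdict}")
```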