markitdown-plus 0.2.0__py3-none-any.whl
- markitdown_plus/__about__.py +7 -0
- markitdown_plus/__init__.py +12 -0
- markitdown_plus/assets.py +154 -0
- markitdown_plus/batch.py +387 -0
- markitdown_plus/chunker.py +433 -0
- markitdown_plus/cleaner.py +158 -0
- markitdown_plus/cli.py +205 -0
- markitdown_plus/converter.py +58 -0
- markitdown_plus/errors.py +13 -0
- markitdown_plus/manifest.py +164 -0
- markitdown_plus/metadata.py +97 -0
- markitdown_plus/utils.py +52 -0
- markitdown_plus-0.2.0.dist-info/METADATA +292 -0
- markitdown_plus-0.2.0.dist-info/RECORD +17 -0
- markitdown_plus-0.2.0.dist-info/WHEEL +4 -0
- markitdown_plus-0.2.0.dist-info/entry_points.txt +2 -0
- markitdown_plus-0.2.0.dist-info/licenses/LICENSE +21 -0
@@ -0,0 +1,292 @@
Metadata-Version: 2.4
Name: markitdown-plus
Version: 0.2.0
Summary: Batch conversion, asset extraction, and RAG-ready output toolkit for Microsoft MarkItDown.
Project-URL: Homepage, https://github.com/lamguo/markitdown-plus
Project-URL: Repository, https://github.com/lamguo/markitdown-plus
Project-URL: Issues, https://github.com/lamguo/markitdown-plus/issues
Project-URL: Funding, https://www.paypal.me/lamguo
Author-email: Lam Guo <lamguo111@gmail.com>
License: MIT
License-File: LICENSE
Keywords: asset-extraction,batch-conversion,document-conversion,docx-to-markdown,jsonl,llm,markdown,markitdown,microsoft-markitdown,pdf-to-markdown,rag
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Markup :: Markdown
Requires-Python: >=3.10
Requires-Dist: markitdown[all]>=0.1.0
Provides-Extra: dev
Requires-Dist: hypothesis>=6.100; extra == 'dev'
Requires-Dist: pytest-benchmark>=4.0; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.6.0; extra == 'dev'
Requires-Dist: tqdm>=4.66; extra == 'dev'
Provides-Extra: progress
Requires-Dist: tqdm>=4.66; extra == 'progress'
Provides-Extra: quality
Requires-Dist: hypothesis>=6.100; extra == 'quality'
Requires-Dist: pytest-benchmark>=4.0; extra == 'quality'
Provides-Extra: test
Requires-Dist: pytest-cov>=5.0; extra == 'test'
Requires-Dist: pytest>=8.0; extra == 'test'
Description-Content-Type: text/markdown
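The header block above uses the RFC 822-style layout common to Python package metadata, so the standard library's `email` parser can read it. The following is a minimal sketch that groups `Requires-Dist` entries by extra. The function name `requirements_by_extra` and the regular expression are illustrative assumptions; the regex only recognizes simple `extra == '...'` markers like the ones listed here, not compound environment markers.

```python
import re
from email.parser import Parser

# A few header lines copied from the METADATA shown above.
METADATA_TEXT = """\
Metadata-Version: 2.4
Name: markitdown-plus
Version: 0.2.0
Requires-Python: >=3.10
Requires-Dist: markitdown[all]>=0.1.0
Provides-Extra: progress
Requires-Dist: tqdm>=4.66; extra == 'progress'
Provides-Extra: test
Requires-Dist: pytest-cov>=5.0; extra == 'test'
Requires-Dist: pytest>=8.0; extra == 'test'
"""

# Assumption: markers have the simple form  extra == 'name'.
EXTRA_MARKER = re.compile(r"""extra\s*==\s*['"]([^'"]+)['"]""")


def requirements_by_extra(metadata_text: str) -> dict[str, list[str]]:
    """Group Requires-Dist entries by extra; the key "" holds base requirements."""
    headers = Parser().parsestr(metadata_text, headersonly=True)
    grouped: dict[str, list[str]] = {}
    for entry in headers.get_all("Requires-Dist", []):
        # Split "name>=x; marker" into the requirement and its marker part.
        requirement, _, marker = entry.partition(";")
        match = EXTRA_MARKER.search(marker)
        extra = match.group(1) if match else ""
        grouped.setdefault(extra, []).append(requirement.strip())
    return grouped


if __name__ == "__main__":
    for extra, requirements in requirements_by_extra(METADATA_TEXT).items():
        print(extra or "(base)", requirements)
```

Grouping this way makes it easy to see that `tqdm` is pulled in by both the `dev` and `progress` extras, while the only unconditional dependency is `markitdown[all]`.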

# MarkItDown Plus

Batch conversion, asset extraction, RAG-ready Markdown, JSONL chunks, and cleaner AI document pipelines for **Microsoft MarkItDown**.

MarkItDown Plus is an enhancement toolkit built on top of Microsoft MarkItDown. It adds folder conversion, recursive processing, optional parallel workers, Markdown cleanup, multiple chunking strategies, lightweight asset extraction, conversion manifests, and JSONL output for RAG workflows.

> This project is independent and is not affiliated with Microsoft. It is designed as a companion CLI for the Microsoft MarkItDown ecosystem.

## Why MarkItDown Plus?

Microsoft MarkItDown is excellent for converting individual files to Markdown. MarkItDown Plus focuses on the next step: turning many documents into clean, AI-ready project output.

Key features:

- Batch convert files and folders
- Recursive directory conversion
- Parallel conversion with `--workers`
- Optional tqdm progress with `--progress`
- RAG-ready JSONL chunk export
- Chunk strategies: `heading`, `fixed`, `semantic-lite`
- Markdown cleanup for common PDF/document artifacts
- Basic asset extraction for DOCX / PPTX / XLSX / HTML
- `manifest.json`, `failed.json`, and large-run JSONL manifest streaming
- Unicode-safe output filenames
- PayPal funding link included through GitHub Sponsors/Funding

## Installation

```bash
pip install markitdown-plus
```

For progress bars:

```bash
pip install "markitdown-plus[progress]"
```

For development tests and coverage:

```bash
pip install -e ".[dev]"
pytest
```

## Quick Start

Convert a folder:

```bash
markitdown-plus convert ./docs --output ./out
```

Convert recursively:

```bash
markitdown-plus convert ./docs --output ./out --recursive
```

Convert only specific file types:

```bash
markitdown-plus convert ./docs --output ./out --types pdf,docx,pptx,xlsx,html,csv
```

Clean Markdown and export RAG chunks:

```bash
markitdown-plus convert ./docs --output ./out --clean --rag
```

Use parallel workers:

```bash
markitdown-plus convert ./docs --output ./out --recursive --workers 4 --progress
```

Use auto worker count:

```bash
markitdown-plus convert ./docs --output ./out --workers 0
```

Extract assets when supported:

```bash
markitdown-plus convert ./docs --output ./out --extract-assets
```

Use a specific chunking strategy:

```bash
markitdown-plus convert ./docs --output ./out --rag --chunk-strategy semantic-lite
```

## Output Structure

A normal batch run creates:

```text
out/
  markdown/
    report.md
  metadata/
    report.json
  manifest.json
```

With RAG enabled:

```text
out/
  markdown/
    report.md
  chunks/
    report.jsonl
  metadata/
    report.json
  manifest.json
```

With asset extraction enabled:

```text
out/
  markdown/
    report.md
  assets/
    report_img_001.png
    report_img_002.jpg
  metadata/
    report.json
  manifest.json
```

For very large jobs, MarkItDown Plus avoids huge `manifest.json` files by streaming records:

```text
out/
  manifest.json
  manifest-records.jsonl
  failed.jsonl
```
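Because the streamed files are JSONL (one JSON object per line), a consumer can read them record by record instead of loading everything into memory. The sketch below shows one way to do that with the standard library. The helper name `iter_jsonl` is hypothetical, and the record fields are not documented in this README, so each record is treated as an opaque dictionary.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line, without loading the whole file."""
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


if __name__ == "__main__":
    # Path taken from the output tree above; the record schema is an unknown here.
    records_path = Path("out/manifest-records.jsonl")
    if records_path.exists():
        total = sum(1 for _ in iter_jsonl(records_path))
        print(f"{total} manifest records")
```

The same helper would apply to `failed.jsonl` or to per-document chunk files such as `chunks/report.jsonl`, since they share the line-delimited layout.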

## Chunk Strategies

### `heading`

Default. Preserves Markdown heading paths and is best for most structured documents.

```bash
markitdown-plus convert ./docs -o ./out --rag --chunk-strategy heading
```
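To illustrate what "preserving heading paths" can mean, here is a minimal sketch of heading-based chunking. It is not the package's implementation: the function name `chunk_by_heading` and the output keys `heading_path` and `text` are assumptions made for this example. It also ignores fenced code blocks, so a `#` comment inside a fence would be misread as a heading.

```python
import re

# ATX headings only: one to six '#' characters followed by a title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split Markdown at headings, tagging each chunk with its heading trail."""
    chunks: list[dict] = []
    path: list[str] = []  # current heading trail, e.g. ["Guide", "Install"]
    body: list[str] = []  # lines collected since the last heading

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"heading_path": list(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level, then push this one.
            del path[level - 1:]
            path.append(match.group(2))
        else:
            body.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    sample = "# Guide\nintro\n## Install\npip\n## Use\nrun\n"
    for chunk in chunk_by_heading(sample):
        print(" > ".join(chunk["heading_path"]), "|", chunk["text"])
```

Carrying the trail with each chunk lets a retrieval pipeline show where a passage came from even after the document has been split.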

### `fixed`

Creates stable chunk sizes and ignores heading boundaries. Useful for embedding pipelines that prefer consistent lengths.

```bash
markitdown-plus convert ./docs -o ./out --rag --chunk-strategy fixed
```
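As a rough illustration of fixed-size chunking, the sketch below cuts text into windows of a set length with an optional overlap. The README does not state the unit or default size the tool uses, so measuring in characters and the `size` and `overlap` parameters are assumptions of this example, as is the name `chunk_fixed`.

```python
def chunk_fixed(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into windows of `size` characters, each sharing `overlap` characters with the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        # Stop once a window reaches the end, so no trailing chunk is a pure repeat.
        if start + size >= len(text):
            break
        start += step
    return chunks


if __name__ == "__main__":
    print(chunk_fixed("abcdefghij", size=4, overlap=2))
```

Every chunk except possibly the last has exactly `size` characters, which is the property embedding pipelines with fixed input budgets tend to want.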

### `semantic-lite`

Dependency-free rule-based topical splitting. It starts new chunks at obvious semantic cues such as headings, summary, conclusion, recommendations, and other section-like paragraphs.

```bash
markitdown-plus convert ./docs -o ./out --rag --chunk-strategy semantic-lite
```
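A rule-based splitter of this kind can be approximated in a few lines. The sketch below starts a new chunk whenever a paragraph begins with a heading marker or one of the cue words named above. The exact rules the tool applies are not documented here, so the `CUE` pattern and the name `chunk_semantic_lite` are assumptions for illustration only.

```python
import re

# Assumption: a paragraph opens a new chunk if it starts with a Markdown
# heading or with one of the cue words mentioned in the description.
CUE = re.compile(
    r"^(#{1,6}\s+\S|(summary|conclusion|recommendations)\b)",
    re.IGNORECASE,
)


def chunk_semantic_lite(markdown: str) -> list[str]:
    """Group paragraphs into chunks, breaking at section-like cue paragraphs."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", markdown) if p.strip()]
    groups: list[list[str]] = []
    for paragraph in paragraphs:
        if not groups or CUE.match(paragraph):
            groups.append([paragraph])      # cue found: start a new chunk
        else:
            groups[-1].append(paragraph)    # otherwise extend the current chunk
    return ["\n\n".join(group) for group in groups]


if __name__ == "__main__":
    sample = "Intro text.\n\nMore intro.\n\nSummary: key points.\n\nDetails.\n\n## Next\n\nBody."
    for index, chunk in enumerate(chunk_semantic_lite(sample), start=1):
        print(index, repr(chunk))
```

Because the rules are plain pattern matches, this approach needs no model or third-party dependency, which matches the "dependency-free" framing of the strategy.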

## Asset Extraction

`--extract-assets` currently supports lightweight extraction for:

- `.docx`
- `.pptx`
- `.xlsx`
- `.html` / `.htm` local image references

PDF image extraction is intentionally left for a later version because reliable PDF asset extraction requires heavier format-specific dependencies.

When assets are extracted, MarkItDown Plus appends an `Extracted Assets` section to the generated Markdown and records asset metadata in the file-level metadata JSON.
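DOCX, PPTX, and XLSX files are ZIP archives that store embedded media under `word/media/`, `ppt/media/`, and `xl/media/` respectively, which is why lightweight extraction is possible without heavy dependencies. The sketch below shows the general idea with the standard library; it is not the package's code. The function name `extract_media` is hypothetical, and the output naming is only modelled on the `report_img_001.png` example in the output tree.

```python
import zipfile
from pathlib import Path, PurePosixPath

# Media folders used by the Office Open XML formats.
MEDIA_PREFIXES = ("word/media/", "ppt/media/", "xl/media/")


def extract_media(source: Path, assets_dir: Path) -> list[Path]:
    """Copy embedded media out of a DOCX/PPTX/XLSX file and return the written paths."""
    assets_dir.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    with zipfile.ZipFile(source) as archive:
        members = sorted(
            name
            for name in archive.namelist()
            if name.startswith(MEDIA_PREFIXES) and not name.endswith("/")
        )
        for index, name in enumerate(members, start=1):
            suffix = PurePosixPath(name).suffix
            # Hypothetical naming scheme: <document stem>_img_<NNN><extension>.
            target = assets_dir / f"{source.stem}_img_{index:03d}{suffix}"
            target.write_bytes(archive.read(name))
            written.append(target)
    return written


if __name__ == "__main__":
    document = Path("report.docx")
    if document.exists():
        for path in extract_media(document, Path("out/assets")):
            print(path)
```

PDF files have no comparable archive layout, which is consistent with the note above that PDF extraction needs format-specific tooling.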

## Single File Commands

Convert one file directly:

```bash
markitdown-plus single report.pdf -o report.md
```

Clean an existing Markdown file:

```bash
markitdown-plus clean dirty.md -o clean.md
```

Chunk an existing Markdown file:

```bash
markitdown-plus chunk clean.md -o chunks.jsonl --chunk-strategy fixed
```

## Development

```bash
git clone https://github.com/lamguo/markitdown-plus.git
cd markitdown-plus
pip install -e ".[dev]"
pytest
```

The test configuration includes a coverage gate:

```bash
pytest --cov=markitdown_plus --cov-fail-under=85
```

Optional property and benchmark tests are included. They are skipped automatically if `hypothesis` or `pytest-benchmark` is not installed.

## GitHub Topics

Suggested topics for the repository:

```text
markitdown
microsoft-markitdown
markdown
rag
llm
document-conversion
pdf-to-markdown
docx-to-markdown
batch-conversion
jsonl
asset-extraction
ai-tools
```

## Support This Project

If MarkItDown Plus helps you save time or build better AI document pipelines, you can support development here:

- Star this repository
- Support via PayPal: https://www.paypal.me/lamguo

Thank you for supporting open-source development.

## License

MIT License.
@@ -0,0 +1,17 @@
markitdown_plus/__about__.py,sha256=R1j3rv8sKRRZBWgf-R2yfl9mpn86EWL2nU-4qoNxWDg,173
markitdown_plus/__init__.py,sha256=Es4TxE9wFkHDSZOQ7W72iPyjOMvZZkY6vYBGcukKRvs,373
markitdown_plus/assets.py,sha256=f64lizi3BsJXJ6R8nOFz_zYn8MH-OJ-y-vi7dSKgPnM,5601
markitdown_plus/batch.py,sha256=np85srYs9iqHJ4lQdZxvOqtn6doiF-zyEar8-Sfa8jA,13992
markitdown_plus/chunker.py,sha256=mC-gNqNRecXMNu3wCt21nEhmKeU78Vn-GiXvaj5Uqww,14327
markitdown_plus/cleaner.py,sha256=a_g1psLKZCuZyyC588kXTC0ilPqUgL3oulVT8076LNM,5185
markitdown_plus/cli.py,sha256=nN78PwM2Fq0bhFN3EjQ0SaHH4mMgRhydGs8qvUgj8jY,8508
markitdown_plus/converter.py,sha256=4i2Ya-uLGxX2PXo3Y32pI2e4s0yD54aFjN-41CSNCeQ,2127
markitdown_plus/errors.py,sha256=tr__j5b6qMHmSr89oT1RfWHCg-PzM8mZSdMuld4CheA,346
markitdown_plus/manifest.py,sha256=AXe9SC5Jg4bB9iXJGssz5BAdLJl1UOw5UDN7hMm2Q6M,5655
markitdown_plus/metadata.py,sha256=pt1efOZJYyCjJrH8E7WlIjgrXBTfRxopGI1ckSxRs2A,2996
markitdown_plus/utils.py,sha256=A64niqGBmnCEYI7FVAV0jlY8aX6qI-HPZ_sDQxWM0DI,1762
markitdown_plus-0.2.0.dist-info/METADATA,sha256=Dv5pPOICuwzBi3ZYfX9PBi-xd_qXLvCpCJDNhZ1891w,7315
markitdown_plus-0.2.0.dist-info/WHEEL,sha256=mffPy8wBnZQn2VnJUU5jE99KsxaSfiyMHV9Yt0aLVxs,87
markitdown_plus-0.2.0.dist-info/entry_points.txt,sha256=_N3MJXJsEr2q_QflyP0nB0DLWjN5EOg4xkJ0LxuZLTk,61
markitdown_plus-0.2.0.dist-info/licenses/LICENSE,sha256=ApuTtJxtA5_UygDm5JxFYs-v-NJ_hjSep0JkJQG6tIk,1064
markitdown_plus-0.2.0.dist-info/RECORD,,
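Each RECORD row has the form `path,hash,size`, where the hash is the SHA-256 digest encoded as URL-safe base64 with the trailing `=` padding removed, and the row for RECORD itself leaves hash and size empty. The following sketch shows how a file's bytes could be checked against a row. The function names are illustrative assumptions, not part of any packaging tool's API.

```python
import base64
import csv
import hashlib
from typing import Optional


def record_hash(data: bytes) -> str:
    """Return the RECORD-style digest: sha256=<urlsafe base64 without '=' padding>."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"sha256={encoded}"


def parse_record(text: str) -> list[tuple[str, str, Optional[int]]]:
    """Parse RECORD rows (path, hash, size); RECORD's own row has empty hash and size."""
    rows = []
    for row in csv.reader(text.splitlines()):
        if not row:
            continue  # skip blank lines
        path, digest, size = row
        rows.append((path, digest, int(size) if size else None))
    return rows


def matches_record(data: bytes, digest: str, size: Optional[int]) -> bool:
    """Check file bytes against one row; rows without a hash are accepted as-is."""
    if not digest:
        return True
    return record_hash(data) == digest and (size is None or len(data) == size)


if __name__ == "__main__":
    sample = "pkg/a.py," + record_hash(b"print('hi')\n") + ",12\npkg-0.1.dist-info/RECORD,,\n"
    for path, digest, size in parse_record(sample):
        print(path, digest or "(no hash)", size)
```

To audit a downloaded wheel, one could open it with `zipfile`, read each listed member, and pass its bytes together with the row's hash and size to `matches_record`.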
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Lam Guo

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.