anymd 0.0.9 → 0.0.10
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +26 -102
- package/package.json +1 -1
package/README.md
CHANGED
|
@@ -2,125 +2,49 @@
|
|
|
2
2
|
|
|
3
3
|
Convert any document (PDF, DOC, DOCX) to clean Markdown for RAG. macOS Apple Silicon only.
|
|
4
4
|
|
|
5
|
-
## Install
|
|
6
|
-
|
|
7
5
|
```bash
|
|
8
6
|
bunx anymd --input-dir ./my-documents
|
|
9
7
|
```
|
|
10
8
|
|
|
11
|
-
|
|
12
|
-
|
|
13
|
-
## Usage
|
|
14
|
-
|
|
15
|
-
```bash
|
|
16
|
-
bunx anymd --input-dir <path> [--output-dir <path>] [--config <path>]
|
|
17
|
-
```
|
|
18
|
-
|
|
19
|
-
| Flag | Required | Default | Description |
|
|
20
|
-
|------|----------|---------|-------------|
|
|
21
|
-
| `--input-dir` | Yes | — | Directory containing documents (any nested structure) |
|
|
22
|
-
| `--output-dir` | No | `./output` | Where to write all output files |
|
|
23
|
-
| `--config` | No | `./config.json` | Path to configuration file |
|
|
24
|
-
|
|
25
|
-
Runs a 3-step pipeline with verbose per-file progress printed to stdout. A progress summary prints every 5 seconds during long-running steps. On completion: rings terminal bell and sends macOS notification. Safe to Ctrl+C — progress is saved, re-run to resume.
|
|
26
|
-
|
|
27
|
-
## Pipeline Steps
|
|
9
|
+
## Pipeline
|
|
28
10
|
|
|
29
|
-
| Step |
|
|
30
|
-
|
|
31
|
-
|
|
|
32
|
-
|
|
|
33
|
-
|
|
|
11
|
+
| Step | Description |
|
|
12
|
+
|------|-------------|
|
|
13
|
+
| Classify | Detect native/scanned/mixed PDFs via `pdftotext` |
|
|
14
|
+
| Convert + OCR + Enhance | Parallel conversion (marker-pdf, markitdown) and OCR (mlx-vlm chandra-8bit), with incremental enhancement |
|
|
15
|
+
| Dataset | Deduplicated JSONL from enhanced markdown |
|
|
34
16
|
|
|
35
|
-
Convert and OCR run in parallel. Enhancement runs incrementally as files
|
|
17
|
+
Convert and OCR run in parallel. Enhancement runs incrementally as files land.
|
|
36
18
|
|
|
37
|
-
##
|
|
19
|
+
## Requirements
|
|
38
20
|
|
|
39
|
-
- macOS
|
|
40
|
-
- [Bun](https://bun.sh)
|
|
41
|
-
- [
|
|
42
|
-
- [poppler](https://poppler.freedesktop.org/) (`pdftotext`) — `brew install poppler`
|
|
43
|
-
- [LibreOffice](https://www.libreoffice.org/) (`soffice`) — optional, only for `.doc` files
|
|
21
|
+
- macOS Apple Silicon (64GB recommended for OCR)
|
|
22
|
+
- [Bun](https://bun.sh), [uv](https://docs.astral.sh/uv/), [poppler](https://poppler.freedesktop.org/) (`brew install poppler`)
|
|
23
|
+
- [LibreOffice](https://www.libreoffice.org/) — optional, for `.doc` files
|
|
44
24
|
|
|
45
|
-
|
|
25
|
+
On first run, a Python 3.13 venv is auto-created at `~/.cache/anymd/` with all ML dependencies (~2 min).
|
|
46
26
|
|
|
47
|
-
##
|
|
27
|
+
## Options
|
|
48
28
|
|
|
49
|
-
|
|
50
|
-
-
|
|
51
|
-
-
|
|
52
|
-
|
|
53
|
-
|
|
54
|
-
- `torchvision` — image processing for OCR
|
|
55
|
-
|
|
56
|
-
This takes ~2 minutes. Progress is printed to stdout. Subsequent runs detect the existing venv and skip setup.
|
|
57
|
-
|
|
58
|
-
## Configuration
|
|
59
|
-
|
|
60
|
-
Create a `config.json` in your working directory (or pass `--config <path>`):
|
|
61
|
-
|
|
62
|
-
| Key | Default | Description |
|
|
63
|
-
|-----|---------|-------------|
|
|
64
|
-
| `classifyBatchSize` | 20 | PDFs classified per batch |
|
|
65
|
-
| `datasetConcurrency` | 50 | Parallel file reads during dataset build |
|
|
66
|
-
| `enhanceConcurrency` | 10 | Parallel markdown enhancement workers |
|
|
67
|
-
| `markerWorkers` | 3 | Parallel marker-pdf processes for PDF → markdown |
|
|
68
|
-
| `minTextLength` | 50 | Minimum characters for a document to be included in dataset |
|
|
69
|
-
| `nativeThreshold` | 200 | Alpha chars above which a PDF is classified as native |
|
|
70
|
-
| `scannedThreshold` | 50 | Alpha chars below which a PDF is classified as scanned |
|
|
71
|
-
|
|
72
|
-
All fields are optional — omitted fields use defaults. No config file = all defaults.
|
|
29
|
+
```
|
|
30
|
+
--input-dir <path> Input directory (required)
|
|
31
|
+
--output-dir <path> Output directory (default: ./output)
|
|
32
|
+
--config <path> Config file (default: ./config.json)
|
|
33
|
+
```
|
|
73
34
|
|
|
74
|
-
## Output
|
|
35
|
+
## Output
|
|
75
36
|
|
|
76
37
|
```
|
|
77
38
|
<output-dir>/
|
|
78
|
-
├──
|
|
79
|
-
├──
|
|
80
|
-
├──
|
|
81
|
-
├──
|
|
82
|
-
├──
|
|
83
|
-
|
|
84
|
-
├── ocr-log.txt OCR processing log
|
|
85
|
-
└── errors.log Timestamped error log (all steps)
|
|
39
|
+
├── markdown/ Enhanced markdown
|
|
40
|
+
├── dataset/dataset.jsonl JSONL dataset for RAG
|
|
41
|
+
├── classification.json PDF classification
|
|
42
|
+
├── raw-md/ Raw converted markdown
|
|
43
|
+
├── ocr-raw/ OCR markdown
|
|
44
|
+
└── errors.log Error log
|
|
86
45
|
```
|
|
87
46
|
|
|
88
|
-
|
|
89
|
-
|
|
90
|
-
- Put documents anywhere inside `--input-dir` in any folder structure
|
|
91
|
-
- Supports `.doc`, `.docx`, `.pdf` (native, scanned, mixed)
|
|
92
|
-
- Output files use flat naming with `--` separator: `docs/foo/bar/doc.pdf` → `foo--bar--doc.md`
|
|
93
|
-
|
|
94
|
-
## PDF Fallback
|
|
95
|
-
|
|
96
|
-
When marker-pdf fails on a PDF (e.g. index out of bounds), `anymd` automatically falls back to markitdown (pdfminer-six) for text extraction.
|
|
97
|
-
|
|
98
|
-
## Dataset Deduplication
|
|
99
|
-
|
|
100
|
-
Step 3 deduplicates entries by content hash. If two source files produce identical markdown, only one entry appears in the JSONL. The completion summary shows the dedup count.
|
|
101
|
-
|
|
102
|
-
## Resume Support
|
|
103
|
-
|
|
104
|
-
All steps support resume:
|
|
105
|
-
|
|
106
|
-
- Classify: re-runs if `classification.json` missing
|
|
107
|
-
- Convert: skips files already in `raw-md/`
|
|
108
|
-
- OCR: skips files already in `ocr-raw/`
|
|
109
|
-
- Enhance: skips files already in `markdown/`
|
|
110
|
-
- Dataset: always regenerates from `markdown/`
|
|
111
|
-
|
|
112
|
-
## OCR Details
|
|
113
|
-
|
|
114
|
-
Native PDFs use marker-pdf for structured markdown extraction (headings, bold, lists). Scanned/mixed PDFs use `mlx-community/chandra-8bit` via mlx-vlm at 150 DPI (~32s per page on Apple Silicon). Chandra was chosen for superior Vietnamese diacritical accuracy over marker's surya OCR. Mixed PDFs use native text for text-heavy pages and OCR for scanned pages.
|
|
115
|
-
|
|
116
|
-
## Development
|
|
117
|
-
|
|
118
|
-
```bash
|
|
119
|
-
bun run doc # Run locally (uses ./data as input)
|
|
120
|
-
bun test # 73 unit tests
|
|
121
|
-
bun fix # TypeScript linting (oxlint + eslint + biome + tsc)
|
|
122
|
-
ruff format && ruff check --fix # Python linting
|
|
123
|
-
```
|
|
47
|
+
Safe to Ctrl+C — re-run to resume. When marker-pdf fails, falls back to markitdown automatically.
|
|
124
48
|
|
|
125
49
|
## License
|
|
126
50
|
|