anymd 0.0.9 → 0.0.11

package/README.md CHANGED
@@ -2,125 +2,49 @@

  Convert any document (PDF, DOC, DOCX) to clean Markdown for RAG. macOS Apple Silicon only.

- ## Install
-
  ```bash
  bunx anymd --input-dir ./my-documents
  ```

- On first run, `anymd` automatically sets up a Python virtual environment in `~/.cache/anymd/` and installs all required Python packages (marker-pdf, mlx-vlm, pypdfium2). Subsequent runs start instantly.
-
- ## Usage
-
- ```bash
- bunx anymd --input-dir <path> [--output-dir <path>] [--config <path>]
- ```
-
- | Flag | Required | Default | Description |
- |------|----------|---------|-------------|
- | `--input-dir` | Yes | — | Directory containing documents (any nested structure) |
- | `--output-dir` | No | `./output` | Where to write all output files |
- | `--config` | No | `./config.json` | Path to configuration file |
-
- Runs a 3-step pipeline with verbose per-file progress printed to stdout. A progress summary prints every 5 seconds during long-running steps. On completion: rings terminal bell and sends macOS notification. Safe to Ctrl+C — progress is saved, re-run to resume.
-
- ## Pipeline Steps
+ ## Pipeline

- | Step | What | Tool | Output |
- |------|------|------|--------|
- | 1. Classify | Detect native/scanned/mixed PDFs | pdftotext (TypeScript) | `<output-dir>/classification.json` |
- | 2. Convert + OCR + Enhance | Doc/docx/native PDF → markdown, scanned/mixed PDF OCR markdown, heading detection + cleanup | marker + markitdown + mlx-vlm + TypeScript | `<output-dir>/markdown/` |
- | 3. Dataset | Collect all markdown → JSONL | TypeScript | `<output-dir>/dataset/dataset.jsonl` |
+ | Step | Description |
+ |------|-------------|
+ | Classify | Detect native/scanned/mixed PDFs via `pdftotext` |
+ | Convert + OCR + Enhance | Parallel conversion (marker-pdf, markitdown) and OCR (mlx-vlm chandra-8bit), with incremental enhancement |
+ | Dataset | Deduplicated JSONL from enhanced markdown |

- Convert and OCR run in parallel. Enhancement runs incrementally as files are ready — no separate step needed.
+ Convert and OCR run in parallel. Enhancement runs incrementally as files land.

- ## System Requirements
+ ## Requirements

- - macOS with Apple Silicon (64GB recommended for OCR)
- - [Bun](https://bun.sh) runtime
- - [uv](https://docs.astral.sh/uv/) — `curl -LsSf https://astral.sh/uv/install.sh | sh`
- - [poppler](https://poppler.freedesktop.org/) (`pdftotext`) — `brew install poppler`
- - [LibreOffice](https://www.libreoffice.org/) (`soffice`) — optional, only for `.doc` files
+ - macOS Apple Silicon (64GB recommended for OCR)
+ - [Bun](https://bun.sh), [uv](https://docs.astral.sh/uv/), [poppler](https://poppler.freedesktop.org/) (`brew install poppler`)
+ - [LibreOffice](https://www.libreoffice.org/) — optional, for `.doc` files

- A preflight check runs at startup and tells you exactly what's missing.
+ On first run, a Python 3.13 venv is auto-created at `~/.cache/anymd/` with all ML dependencies (~2 min).

- ## Auto-Bootstrap
+ ## Options

- On first run, `anymd` uses `uv` to create `~/.cache/anymd/.venv` (Python 3.13) and installs:
- - `marker-pdf` PDF to markdown conversion
- - `markitdown[docx,pdf]` DOCX and PDF fallback conversion
- - `mlx-vlm` Apple Silicon MLX inference for OCR
- - `pypdfium2` — PDF page rendering
- - `torchvision` — image processing for OCR
-
- This takes ~2 minutes. Progress is printed to stdout. Subsequent runs detect the existing venv and skip setup.
-
- ## Configuration
-
- Create a `config.json` in your working directory (or pass `--config <path>`):
-
- | Key | Default | Description |
- |-----|---------|-------------|
- | `classifyBatchSize` | 20 | PDFs classified per batch |
- | `datasetConcurrency` | 50 | Parallel file reads during dataset build |
- | `enhanceConcurrency` | 10 | Parallel markdown enhancement workers |
- | `markerWorkers` | 3 | Parallel marker-pdf processes for PDF → markdown |
- | `minTextLength` | 50 | Minimum characters for a document to be included in dataset |
- | `nativeThreshold` | 200 | Alpha chars above which a PDF is classified as native |
- | `scannedThreshold` | 50 | Alpha chars below which a PDF is classified as scanned |
-
- All fields are optional — omitted fields use defaults. No config file = all defaults.
+ ```
+ --input-dir <path> Input directory (required)
+ --output-dir <path> Output directory (default: ./output)
+ --config <path> Config file (default: ./config.json)
+ ```

- ## Output Structure
+ ## Output

  ```
  <output-dir>/
- ├── raw-md/ Raw markdown from conversion (step 2)
- ├── ocr-raw/ OCR markdown from scanned PDFs (step 3)
- ├── markdown/ Final enhanced markdown (step 4)
- ├── dataset/dataset.jsonl JSONL dataset for RAG (step 5)
- ├── classification.json PDF classification results (step 1)
- ├── pipeline-log.txt Pipeline conversion log
- ├── ocr-log.txt OCR processing log
- └── errors.log Timestamped error log (all steps)
+ ├── markdown/ Enhanced markdown
+ ├── dataset/dataset.jsonl JSONL dataset for RAG
+ ├── classification.json PDF classification
+ ├── raw-md/ Raw converted markdown
+ ├── ocr-raw/ OCR markdown
+ └── errors.log Error log
  ```

- ## Input Requirements
-
- - Put documents anywhere inside `--input-dir` in any folder structure
- - Supports `.doc`, `.docx`, `.pdf` (native, scanned, mixed)
- - Output files use flat naming with `--` separator: `docs/foo/bar/doc.pdf` → `foo--bar--doc.md`
-
- ## PDF Fallback
-
- When marker-pdf fails on a PDF (e.g. index out of bounds), `anymd` automatically falls back to markitdown (pdfminer-six) for text extraction.
-
- ## Dataset Deduplication
-
- Step 3 deduplicates entries by content hash. If two source files produce identical markdown, only one entry appears in the JSONL. The completion summary shows the dedup count.
-
- ## Resume Support
-
- All steps support resume:
-
- - Classify: re-runs if `classification.json` missing
- - Convert: skips files already in `raw-md/`
- - OCR: skips files already in `ocr-raw/`
- - Enhance: skips files already in `markdown/`
- - Dataset: always regenerates from `markdown/`
-
- ## OCR Details
-
- Native PDFs use marker-pdf for structured markdown extraction (headings, bold, lists). Scanned/mixed PDFs use `mlx-community/chandra-8bit` via mlx-vlm at 150 DPI (~32s per page on Apple Silicon). Chandra was chosen for superior Vietnamese diacritical accuracy over marker's surya OCR. Mixed PDFs use native text for text-heavy pages and OCR for scanned pages.
-
- ## Development
-
- ```bash
- bun run doc # Run locally (uses ./data as input)
- bun test # 73 unit tests
- bun fix # TypeScript linting (oxlint + eslint + biome + tsc)
- ruff format && ruff check --fix # Python linting
- ```
+ Safe to Ctrl+C — re-run to resume. When marker-pdf fails, falls back to markitdown automatically.

  ## License

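Reviewer sketch (not part of the package): the removed configuration table gives `nativeThreshold` (200) and `scannedThreshold` (50) as alphabetic-character cutoffs for the classify step. A minimal illustration of how such cutoffs could partition PDFs is shown here. The names `countAlpha` and `classifyByAlphaCount` are hypothetical. The sketch also assumes that counts between the two thresholds mean "mixed" and that letters are counted with a Unicode letter class; the README states neither, nor whether the count is taken per page or per document.

```typescript
type PdfKind = 'native' | 'scanned' | 'mixed'

interface Thresholds {
  nativeThreshold: number
  scannedThreshold: number
}

// Defaults taken from the removed README configuration table.
const DEFAULTS: Thresholds = { nativeThreshold: 200, scannedThreshold: 50 }

// Hypothetical helper: count Unicode letters, so accented letters
// (for example Vietnamese diacritics) are included.
const countAlpha = (text: string): number => (text.match(/\p{L}/gu) ?? []).length

// Hypothetical helper: strictly above nativeThreshold is "native",
// strictly below scannedThreshold is "scanned", anything else is
// assumed to be "mixed".
const classifyByAlphaCount = (text: string, t: Thresholds = DEFAULTS): PdfKind => {
  const alpha = countAlpha(text)
  if (alpha > t.nativeThreshold) return 'native'
  if (alpha < t.scannedThreshold) return 'scanned'
  return 'mixed'
}
```

Under these assumptions, text extracted by `pdftotext` with 201 or more letters is treated as native, fewer than 50 as scanned, and the band from 50 to 200 as mixed.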
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "anymd",
- "version": "0.0.9",
+ "version": "0.0.11",
  "description": "Convert any document (PDF, DOC, DOCX) to clean Markdown for RAG",
  "keywords": [
  "markdown",
@@ -46,9 +46,13 @@
  "dependencies": {
  "markdownlint": "^0.40.0",
  "p-map": "^7.0.4",
+ "turndown": "^7.2.2",
  "yoctocolors": "^2.1.2",
  "zod": "^4.3.6"
  },
+ "devDependencies": {
+ "@types/turndown": "^5.0.6"
+ },
  "engines": {
  "bun": ">=1.0.0"
  },
@@ -24,7 +24,7 @@ OUTPUT_BASE = Path(_get_arg('--output-base', 'output/ocr-raw'))
  STATUS_FILE = Path(_get_arg('--status-file', 'output/ocr-progress.json'))
  LOG_FILE = Path(_get_arg('--log-file', 'output/ocr-log.txt'))

- MODEL_ID = 'mlx-community/chandra-8bit'
+ MODEL_ID = 'mlx-community/chandra-4bit'
  IMAGE_DPI = 150
  MAX_TOKENS = 8192

@@ -2,12 +2,21 @@
  import { readFile, writeFile } from 'node:fs/promises'
  import { basename, join } from 'node:path'
  import pMap from 'p-map'
+ import TurndownService from 'turndown'

  import type { CleanResult } from '~/types'

  import { ensureDir, loadExistingMdFiles, logger } from '~/utils'

- const BOLD_LINE_REGEX = /^\*\*(?<content>.+)\*\*$/u,
+ const td = new TurndownService({ bulletListMarker: '-', emDelimiter: '*', headingStyle: 'atx' })
+ td.remove('style')
+ const turndown = td,
+ HTML_DETECT_REGEX = /<\/?[a-z][a-z0-9]*[^>]*>/iu,
+ stripHtml = (text: string): string => {
+ if (!HTML_DETECT_REGEX.test(text)) return text
+ return turndown.turndown(text)
+ },
+ BOLD_LINE_REGEX = /^\*\*(?<content>.+)\*\*$/u,
  PHAN_REGEX = /^(?:Phần|PHẦN)\s+/u,
  CHUONG_REGEX = /^(?:Chương|CHƯƠNG)\s+/u,
  MUC_REGEX = /^(?:Mục|MỤC)\s+\d/u,
@@ -18,6 +27,7 @@ const BOLD_LINE_REGEX = /^\*\*(?<content>.+)\*\*$/u,
  HEADER_TABLE_LINE_REGEX = /^\|.*\|$/u,
  TABLE_SEP_REGEX = /^\|\s*-+/u,
  MULTIPLE_BLANKS_REGEX = /\n{3,}/gu,
+ MULTIPLE_SPACES_REGEX = / {2,}/gu,
  // oxlint-disable-next-line no-control-regex
  // eslint-disable-next-line no-control-regex
  CONTROL_CHARS_REGEX = /[\u0000-\u0008\u000B\u000C\u000E-\u001F]/gu,
@@ -85,9 +95,11 @@ const BOLD_LINE_REGEX = /^\*\*(?<content>.+)\*\*$/u,
  return enhanced
  },
  enhanceMarkdown = (text: string): string => {
- const enhanced = processLines(text.split('\n'))
+ const cleaned = stripHtml(text)
+ const enhanced = processLines(cleaned.split('\n'))
  let output = enhanced.join('\n')
  output = output.replace(MULTIPLE_BLANKS_REGEX, '\n\n')
+ output = output.replace(MULTIPLE_SPACES_REGEX, ' ')
  output = output.trim()
  return output
  },
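Reviewer sketch (not part of the package): the hunks above gate Turndown conversion on `HTML_DETECT_REGEX` and add a space-collapsing pass to `enhanceMarkdown`. The standalone snippet below copies the three regexes from the diff so their behavior can be probed without installing `turndown`. The helpers `looksLikeHtml` and `normalizeWhitespace` are hypothetical names introduced only for illustration; `normalizeWhitespace` mirrors just the final three steps of `enhanceMarkdown` and leaves out `processLines` and the Turndown call.

```typescript
// Regexes copied verbatim from the diff above.
const HTML_DETECT_REGEX = /<\/?[a-z][a-z0-9]*[^>]*>/iu
const MULTIPLE_BLANKS_REGEX = /\n{3,}/gu
const MULTIPLE_SPACES_REGEX = / {2,}/gu

// Hypothetical helper: true when stripHtml would pass the text to Turndown.
// The regex has no `g` flag, so repeated test() calls carry no state.
const looksLikeHtml = (text: string): boolean => HTML_DETECT_REGEX.test(text)

// Hypothetical helper: the whitespace steps at the end of enhanceMarkdown,
// applied in the same order (blank lines, then spaces, then trim).
const normalizeWhitespace = (text: string): string =>
  text.replace(MULTIPLE_BLANKS_REGEX, '\n\n').replace(MULTIPLE_SPACES_REGEX, ' ').trim()

// Real tags match; a spaced comparison does not.
console.log(looksLikeHtml('<br>'), looksLikeHtml('a < b'))

// Runs of spaces and blank lines are collapsed.
console.log(JSON.stringify(normalizeWhitespace('a    b\n\n\n\nc')))
```

Two edge cases may be worth checking in review. The detection regex also matches Markdown autolinks such as `<https://example.com>` and unspaced comparisons such as `a<b and c>d`, so those documents would be routed through Turndown as well. The space-collapsing pass also applies to leading indentation, so a nested list item indented by four spaces comes out indented by one.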