anymd 0.0.9 → 0.0.11 (npm)

This diff shows the content of publicly available package versions released to one of the supported registries. It is provided for informational purposes only and reflects the changes between versions as they appear in the public registry.

Files changed (4)

package/README.md CHANGED

@@ -2,125 +2,49 @@
 Convert any document (PDF, DOC, DOCX) to clean Markdown for RAG. macOS Apple Silicon only.
-## Install
 ```bash
 bunx anymd --input-dir ./my-documents
 ```
-On first run, `anymd` automatically sets up a Python virtual environment in `~/.cache/anymd/` and installs all required Python packages (marker-pdf, mlx-vlm, pypdfium2). Subsequent runs start instantly.
-## Usage
-```bash
-bunx anymd --input-dir <path> [--output-dir <path>] [--config <path>]
-```
-| Flag | Required | Default | Description |
-|------|----------|---------|-------------|
-| `--input-dir` | Yes | — | Directory containing documents (any nested structure) |
-| `--output-dir` | No | `./output` | Where to write all output files |
-| `--config` | No | `./config.json` | Path to configuration file |
-Runs a 3-step pipeline with verbose per-file progress printed to stdout. A progress summary prints every 5 seconds during long-running steps. On completion: rings terminal bell and sends macOS notification. Safe to Ctrl+C — progress is saved, re-run to resume.
-## Pipeline Steps
+## Pipeline
-| Step | What | Tool | Output |
-|------|------|------|--------|
-| 1. Classify | Detect native/scanned/mixed PDFs | pdftotext (TypeScript) | `<output-dir>/classification.json` |
-| 2. Convert + OCR + Enhance | Doc/docx/native PDF → markdown, scanned/mixed PDF → OCR markdown, heading detection + cleanup | marker + markitdown + mlx-vlm + TypeScript | `<output-dir>/markdown/` |
-| 3. Dataset | Collect all markdown → JSONL | TypeScript | `<output-dir>/dataset/dataset.jsonl` |
+| Step | Description |
+|------|-------------|
+| Classify | Detect native/scanned/mixed PDFs via `pdftotext` |
+| Convert + OCR + Enhance | Parallel conversion (marker-pdf, markitdown) and OCR (mlx-vlm chandra-8bit), with incremental enhancement |
+| Dataset | Deduplicated JSONL from enhanced markdown |
-Convert and OCR run in parallel. Enhancement runs incrementally as files are ready — no separate step needed.
+Convert and OCR run in parallel. Enhancement runs incrementally as files land.
-## System Requirements
+## Requirements
-- macOS with Apple Silicon (64GB recommended for OCR)
-- [Bun](https://bun.sh) runtime
-- [uv](https://docs.astral.sh/uv/) — `curl -LsSf https://astral.sh/uv/install.sh | sh`
-- [poppler](https://poppler.freedesktop.org/) (`pdftotext`) — `brew install poppler`
-- [LibreOffice](https://www.libreoffice.org/) (`soffice`) — optional, only for `.doc` files
+- macOS Apple Silicon (64GB recommended for OCR)
+- [Bun](https://bun.sh), [uv](https://docs.astral.sh/uv/), [poppler](https://poppler.freedesktop.org/) (`brew install poppler`)
+- [LibreOffice](https://www.libreoffice.org/) — optional, for `.doc` files
-A preflight check runs at startup and tells you exactly what's missing.
+On first run, a Python 3.13 venv is auto-created at `~/.cache/anymd/` with all ML dependencies (~2 min).
-## Auto-Bootstrap
+## Options
-On first run, `anymd` uses `uv` to create `~/.cache/anymd/.venv` (Python 3.13) and installs:
-- `marker-pdf` — PDF to markdown conversion
-- `markitdown[docx,pdf]` — DOCX and PDF fallback conversion
-- `mlx-vlm` — Apple Silicon MLX inference for OCR
-- `pypdfium2` — PDF page rendering
-- `torchvision` — image processing for OCR
-This takes ~2 minutes. Progress is printed to stdout. Subsequent runs detect the existing venv and skip setup.
-## Configuration
-Create a `config.json` in your working directory (or pass `--config <path>`):
-| Key | Default | Description |
-|-----|---------|-------------|
-| `classifyBatchSize` | 20 | PDFs classified per batch |
-| `datasetConcurrency` | 50 | Parallel file reads during dataset build |
-| `enhanceConcurrency` | 10 | Parallel markdown enhancement workers |
-| `markerWorkers` | 3 | Parallel marker-pdf processes for PDF → markdown |
-| `minTextLength` | 50 | Minimum characters for a document to be included in dataset |
-| `nativeThreshold` | 200 | Alpha chars above which a PDF is classified as native |
-| `scannedThreshold` | 50 | Alpha chars below which a PDF is classified as scanned |
-All fields are optional — omitted fields use defaults. No config file = all defaults.
+```
+--input-dir <path>   Input directory (required)
+--output-dir <path>  Output directory (default: ./output)
+--config <path>      Config file (default: ./config.json)
+```
-## Output Structure
+## Output
 ```
 <output-dir>/
-├── raw-md/                      Raw markdown from conversion (step 2)
-├── ocr-raw/                     OCR markdown from scanned PDFs (step 3)
-├── markdown/                    Final enhanced markdown (step 4)
-├── dataset/dataset.jsonl        JSONL dataset for RAG (step 5)
-├── classification.json          PDF classification results (step 1)
-├── pipeline-log.txt             Pipeline conversion log
-├── ocr-log.txt                  OCR processing log
-└── errors.log                   Timestamped error log (all steps)
+├── markdown/              Enhanced markdown
+├── dataset/dataset.jsonl  JSONL dataset for RAG
+├── classification.json    PDF classification
+├── raw-md/                Raw converted markdown
+├── ocr-raw/               OCR markdown
+└── errors.log             Error log
 ```
-## Input Requirements
-- Put documents anywhere inside `--input-dir` in any folder structure
-- Supports `.doc`, `.docx`, `.pdf` (native, scanned, mixed)
-- Output files use flat naming with `--` separator: `docs/foo/bar/doc.pdf` → `foo--bar--doc.md`
-## PDF Fallback
-When marker-pdf fails on a PDF (e.g. index out of bounds), `anymd` automatically falls back to markitdown (pdfminer-six) for text extraction.
-## Dataset Deduplication
-Step 3 deduplicates entries by content hash. If two source files produce identical markdown, only one entry appears in the JSONL. The completion summary shows the dedup count.
-## Resume Support
-All steps support resume:
-- Classify: re-runs if `classification.json` missing
-- Convert: skips files already in `raw-md/`
-- OCR: skips files already in `ocr-raw/`
-- Enhance: skips files already in `markdown/`
-- Dataset: always regenerates from `markdown/`
-## OCR Details
-Native PDFs use marker-pdf for structured markdown extraction (headings, bold, lists). Scanned/mixed PDFs use `mlx-community/chandra-8bit` via mlx-vlm at 150 DPI (~32s per page on Apple Silicon). Chandra was chosen for superior Vietnamese diacritical accuracy over marker's surya OCR. Mixed PDFs use native text for text-heavy pages and OCR for scanned pages.
-## Development
-```bash
-bun run doc                    # Run locally (uses ./data as input)
-bun test                       # 73 unit tests
-bun fix                        # TypeScript linting (oxlint + eslint + biome + tsc)
-ruff format && ruff check --fix  # Python linting
-```
+Safe to Ctrl+C — re-run to resume. When marker-pdf fails, falls back to markitdown automatically.
 ## License
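
Note on the removed Configuration table: the 0.0.9 README documented seven optional keys with defaults, and two of them (`nativeThreshold`, `scannedThreshold`) drive PDF classification. The sketch below shows one way those rules could fit together. It is an illustration only: `resolveConfig`, `classifyByAlphaChars`, and the `AnymdConfig` type are hypothetical names, and treating every count between the two thresholds as `mixed` is an assumption the README does not confirm.

```typescript
// Hypothetical sketch: key names and default values come from the 0.0.9 README,
// but the merging and classification logic are assumptions, not anymd's code.
interface AnymdConfig {
  classifyBatchSize: number
  datasetConcurrency: number
  enhanceConcurrency: number
  markerWorkers: number
  minTextLength: number
  nativeThreshold: number
  scannedThreshold: number
}

// Defaults as listed in the removed Configuration table.
const DEFAULTS: AnymdConfig = {
  classifyBatchSize: 20,
  datasetConcurrency: 50,
  enhanceConcurrency: 10,
  markerWorkers: 3,
  minTextLength: 50,
  nativeThreshold: 200,
  scannedThreshold: 50
}

// Omitted fields fall back to defaults; no argument means all defaults.
const resolveConfig = (partial: Partial<AnymdConfig> = {}): AnymdConfig => ({
  ...DEFAULTS,
  ...partial
})

type PdfKind = 'native' | 'scanned' | 'mixed'

// Assumption: counts between the two thresholds (inclusive) are 'mixed'.
const classifyByAlphaChars = (alphaChars: number, config: AnymdConfig): PdfKind => {
  if (alphaChars > config.nativeThreshold) return 'native'
  if (alphaChars < config.scannedThreshold) return 'scanned'
  return 'mixed'
}
```

With all defaults, a PDF yielding 10 alphabetic characters would be classified as scanned, 100 as mixed, and 500 as native under these assumptions.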

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "anymd",
-  "version": "0.0.9",
+  "version": "0.0.11",
   "description": "Convert any document (PDF, DOC, DOCX) to clean Markdown for RAG",
   "keywords": [
     "markdown",
@@ -46,9 +46,13 @@
   "dependencies": {
     "markdownlint": "^0.40.0",
     "p-map": "^7.0.4",
+    "turndown": "^7.2.2",
     "yoctocolors": "^2.1.2",
     "zod": "^4.3.6"
   },
+  "devDependencies": {
+    "@types/turndown": "^5.0.6"
+  },
   "engines": {
     "bun": ">=1.0.0"
   },

package/scripts/batch-ocr.py CHANGED

@@ -24,7 +24,7 @@ OUTPUT_BASE = Path(_get_arg('--output-base', 'output/ocr-raw'))
 STATUS_FILE = Path(_get_arg('--status-file', 'output/ocr-progress.json'))
 LOG_FILE = Path(_get_arg('--log-file', 'output/ocr-log.txt'))
-MODEL_ID = 'mlx-community/chandra-8bit'
+MODEL_ID = 'mlx-community/chandra-4bit'
 IMAGE_DPI = 150
 MAX_TOKENS = 8192

package/src/md-enhancer.ts CHANGED

@@ -2,12 +2,21 @@
 import { readFile, writeFile } from 'node:fs/promises'
 import { basename, join } from 'node:path'
 import pMap from 'p-map'
+import TurndownService from 'turndown'
 import type { CleanResult } from '~/types'
 import { ensureDir, loadExistingMdFiles, logger } from '~/utils'
-const BOLD_LINE_REGEX = /^\*\*(?<content>.+)\*\*$/u,
+const td = new TurndownService({ bulletListMarker: '-', emDelimiter: '*', headingStyle: 'atx' })
+td.remove('style')
+const turndown = td,
+  HTML_DETECT_REGEX = /<\/?[a-z][a-z0-9]*[^>]*>/iu,
+  stripHtml = (text: string): string => {
+    if (!HTML_DETECT_REGEX.test(text)) return text
+    return turndown.turndown(text)
+  },
+  BOLD_LINE_REGEX = /^\*\*(?<content>.+)\*\*$/u,
   PHAN_REGEX = /^(?:Phần|PHẦN)\s+/u,
   CHUONG_REGEX = /^(?:Chương|CHƯƠNG)\s+/u,
   MUC_REGEX = /^(?:Mục|MỤC)\s+\d/u,
@@ -18,6 +27,7 @@ const BOLD_LINE_REGEX = /^\*\*(?<content>.+)\*\*$/u,
   HEADER_TABLE_LINE_REGEX = /^\|.*\|$/u,
   TABLE_SEP_REGEX = /^\|\s*-+/u,
   MULTIPLE_BLANKS_REGEX = /\n{3,}/gu,
+  MULTIPLE_SPACES_REGEX = / {2,}/gu,
   // oxlint-disable-next-line no-control-regex
   // eslint-disable-next-line no-control-regex
   CONTROL_CHARS_REGEX = /[\u0000-\u0008\u000B\u000C\u000E-\u001F]/gu,
@@ -85,9 +95,11 @@ const BOLD_LINE_REGEX = /^\*\*(?<content>.+)\*\*$/u,
     return enhanced
   },
   enhanceMarkdown = (text: string): string => {
-    const enhanced = processLines(text.split('\n'))
+    const cleaned = stripHtml(text)
+    const enhanced = processLines(cleaned.split('\n'))
     let output = enhanced.join('\n')
     output = output.replace(MULTIPLE_BLANKS_REGEX, '\n\n')
+    output = output.replace(MULTIPLE_SPACES_REGEX, ' ')
     output = output.trim()
     return output
   },
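
Note on the `md-enhancer.ts` change: the hunk adds an HTML guard in front of `processLines` and a space-collapsing pass after the blank-line pass. The sketch below reproduces that flow so the behaviour can be checked in isolation. The three regexes are copied from the diff; `naiveHtmlToText`, `stripHtmlSketch`, and `normalizeWhitespace` are hypothetical names, and the tag-dropping stand-in is not equivalent to Turndown, which converts HTML to Markdown rather than discarding markup.

```typescript
// Regexes copied from the md-enhancer.ts hunk.
const HTML_DETECT_REGEX = /<\/?[a-z][a-z0-9]*[^>]*>/iu
const MULTIPLE_BLANKS_REGEX = /\n{3,}/gu
const MULTIPLE_SPACES_REGEX = / {2,}/gu

// Hypothetical stand-in for turndown.turndown(): it only drops tags.
const naiveHtmlToText = (html: string): string => html.replace(/<[^>]*>/gu, '')

// Same guard shape as stripHtml in the diff: convert only when a tag is detected,
// so plain Markdown passes through untouched.
const stripHtmlSketch = (
  text: string,
  convert: (html: string) => string = naiveHtmlToText
): string => (HTML_DETECT_REGEX.test(text) ? convert(text) : text)

// Post-processing order from enhanceMarkdown, without processLines:
// collapse blank-line runs, collapse space runs, then trim.
const normalizeWhitespace = (text: string): string =>
  text.replace(MULTIPLE_BLANKS_REGEX, '\n\n').replace(MULTIPLE_SPACES_REGEX, ' ').trim()
```

One consequence worth checking in review: `MULTIPLE_SPACES_REGEX` matches any run of two or more spaces, including leading indentation, so an indented nested list item such as `    - b` comes out as ` - b`.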