macos-vision 1.3.0 → 1.3.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (2)
  1. package/README.md +10 -3
  2. package/package.json +1 -1
package/README.md CHANGED
@@ -146,7 +146,7 @@ for (const block of layout) {
  
  ## API — Markdown pipeline (VisionScribe)
  
- `VisionScribe` converts an image or PDF to Markdown by combining Apple Vision OCR with a local LLM (via Ollama). The LLM never sees the image — it only formats text that Vision already extracted, which keeps the pipeline deterministic and prevents the hallucinations typical of cloud vision APIs.
+ `VisionScribe` converts an image or PDF to Markdown by combining Apple Vision OCR with a local LLM (via Ollama). The LLM never sees the image — it only formats text that Vision already extracted. This keeps image processing local and reduces the risk of vision-model hallucinations, but Markdown reconstruction is still best-effort and depends on the local model and document complexity.
  
  ### Prerequisites
  
@@ -178,7 +178,7 @@ import { VisionScribe } from 'macos-vision/markdown';
  Image / PDF
  
  
- Apple Vision OCR ← macOS native, deterministic, zero hallucination
+ Apple Vision OCR ← macOS native text extraction
  │ VisionBlock[] per page
  
  Per-page layout inference ← each page processed independently (page-local coords)
@@ -193,7 +193,7 @@ Ollama /api/chat ← system prompt as role:"system", OCR text as role:"
  Markdown string ← chunk results joined with blank lines
  ```
  
- The LLM never sees the raw image; it only formats text that Apple Vision has already extracted. The system prompt instructs the model to act as a high-fidelity document parser and explicitly forbids summarising, paraphrasing, or adding content. OCR text is wrapped in `<ocr_source>` tags so the model cannot mistake it for a user asking a question. Per-page processing keeps paragraph coordinates from different pages from being mixed.
+ The LLM never sees the raw image; it only formats text that Apple Vision has already extracted. The system prompt asks the model to preserve the source text, avoid summarising, and avoid adding content. OCR text is wrapped in `<ocr_source>` tags so the model is less likely to treat document text as user instructions. Per-page processing keeps paragraph coordinates from different pages from being mixed.
  
  ### `new VisionScribe(options?)`
  
@@ -236,6 +236,7 @@ for (const file of files) {
  - **Local model fidelity**: small models (`mistral-nemo`, `gemma`) may occasionally summarise or paraphrase long, dense documents. Larger models (`llama3.1:70b`, `qwen2.5:32b`) produce significantly better fidelity.
  - **Tables**: multi-column table layouts are partially supported. OCR reads cells in reading order but the LLM may not always reconstruct correct Markdown table syntax.
  - **Images / charts**: non-textual content (photos, diagrams, charts) is ignored — only text blocks extracted by Apple Vision are processed.
+ - **Markdown fidelity**: the prompt strongly asks for faithful reconstruction, but LLM output is not a cryptographic or deterministic guarantee. Review important legal, financial, or compliance documents before relying on the generated Markdown.
  
  ---
  
@@ -302,6 +303,12 @@ See `src/index.ts` for full type declarations.
  
  Apple Vision is the same engine used by macOS Spotlight, Live Text, and Shortcuts — highly optimized and accurate.
  
+ ### OCR evaluation notes
+
+ In internal tests on anonymized scanned contracts, forms, declarations, and UI screenshots, Apple Vision OCR produced fewer OCR artifacts than Tesseract in most cases. The strongest gains were on multi-column contract-style scans, where Apple Vision preserved substantially more usable text with far fewer artifacts. On simpler UI screenshots, both engines performed similarly.
+
+ These results are directional rather than a public benchmark suite. The corpus is not included in this repository, and future benchmark fixtures should use synthetic or public-domain documents only.
+
  ## License
  
  MIT
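
The updated README text above describes the formatting step as a call to Ollama's `/api/chat` endpoint, with the system prompt sent as `role:"system"` and the OCR text, wrapped in `<ocr_source>` tags, sent as the user message, one page at a time. For orientation, the sketch below shows what a request of that shape could look like; the prompt wording, model choice, and function name are illustrative assumptions, not the package's actual source.

```typescript
// Illustrative sketch only — not the package's source. It mirrors the payload
// shape the README describes: a role:"system" formatting prompt plus the OCR
// text wrapped in <ocr_source> tags, sent to a local Ollama server.

// Hypothetical system prompt standing in for the one VisionScribe actually uses.
const SYSTEM_PROMPT =
  'Reconstruct the OCR text inside <ocr_source> as Markdown. ' +
  'Preserve the source text; do not summarise, paraphrase, or add content.';

async function formatChunk(ocrText: string, model = 'mistral-nemo'): Promise<string> {
  const res = await fetch('http://localhost:11434/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model,
      stream: false,
      messages: [
        { role: 'system', content: SYSTEM_PROMPT },
        // Wrapping the OCR text in <ocr_source> makes it less likely the model
        // treats document text as instructions from the user.
        { role: 'user', content: `<ocr_source>\n${ocrText}\n</ocr_source>` },
      ],
    }),
  });
  if (!res.ok) throw new Error(`Ollama /api/chat failed: ${res.status}`);
  const data = (await res.json()) as { message: { content: string } };
  return data.message.content;
}

// Per the README, pages are processed independently and the per-chunk results
// are joined with blank lines to form the final Markdown string, e.g.:
//   const markdown = (await Promise.all(pages.map((p) => formatChunk(p)))).join('\n\n');
```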
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "macos-vision",
- "version": "1.3.0",
+ "version": "1.3.1",
  "description": "Apple Vision OCR + image/PDF analysis for Node.js, with optional Ollama-driven Markdown pipeline — native, fast, offline",
  "author": "Adrian Wolczuk",
  "license": "MIT",