macos-vision 1.3.0 → 1.3.1
- package/README.md +10 -3
- package/package.json +1 -1
package/README.md
CHANGED
@@ -146,7 +146,7 @@ for (const block of layout) {
 
 ## API — Markdown pipeline (VisionScribe)
 
-`VisionScribe` converts an image or PDF to Markdown by combining Apple Vision OCR with a local LLM (via Ollama). The LLM never sees the image — it only formats text that Vision already extracted
+`VisionScribe` converts an image or PDF to Markdown by combining Apple Vision OCR with a local LLM (via Ollama). The LLM never sees the image — it only formats text that Vision already extracted. This keeps image processing local and reduces the risk of vision-model hallucinations, but Markdown reconstruction is still best-effort and depends on the local model and document complexity.
 
 ### Prerequisites
 
@@ -178,7 +178,7 @@ import { VisionScribe } from 'macos-vision/markdown';
 Image / PDF
    │
    ▼
-Apple Vision OCR            ← macOS native
+Apple Vision OCR            ← macOS native text extraction
    │ VisionBlock[] per page
    ▼
 Per-page layout inference   ← each page processed independently (page-local coords)
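The "per-page layout inference" step in the diagram above can be sketched as a simple grouping pass. The `VisionBlock` field names used below (`page`, `y`) are assumptions for illustration only; the real type is declared in `src/index.ts`.

```typescript
// Hypothetical shape of an OCR block; the real VisionBlock type lives in
// src/index.ts and may use different field names.
interface VisionBlock {
  page: number; // page index (assumed field name)
  text: string;
  y: number;    // page-local vertical coordinate, 0 = top of page (assumed)
}

// Group blocks by page and sort within each page, so layout inference only
// ever compares coordinates from the same page (page-local coords).
function groupByPage(blocks: VisionBlock[]): Map<number, VisionBlock[]> {
  const pages = new Map<number, VisionBlock[]>();
  for (const block of blocks) {
    const list = pages.get(block.page) ?? [];
    list.push(block);
    pages.set(block.page, list);
  }
  for (const list of pages.values()) list.sort((a, b) => a.y - b.y);
  return pages;
}
```

Because each page is processed independently, a block near the top of page 2 can never be interleaved with a block near the bottom of page 1, which is what the "page-local coords" note in the diagram refers to.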
@@ -193,7 +193,7 @@ Ollama /api/chat ← system prompt as role:"system", OCR text as role:"
 Markdown string             ← chunk results joined with blank lines
 ```
 
-The LLM never sees the raw image; it only formats text that Apple Vision has already extracted. The system prompt
+The LLM never sees the raw image; it only formats text that Apple Vision has already extracted. The system prompt asks the model to preserve the source text, avoid summarising, and avoid adding content. OCR text is wrapped in `<ocr_source>` tags so the model is less likely to treat document text as user instructions. Per-page processing keeps paragraph coordinates from different pages from being mixed.
 
 ### `new VisionScribe(options?)`
 
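The message structure this hunk describes (system prompt as `role:"system"`, OCR text wrapped in `<ocr_source>` tags as the user message, per-chunk results joined with blank lines) can be sketched as follows. The system prompt string is a placeholder, not the package's actual prompt, and the function names are invented for illustration.

```typescript
// Minimal sketch of the message list sent to Ollama's /api/chat endpoint.
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Placeholder: VisionScribe's real system prompt is internal to the package.
const SYSTEM_PROMPT =
  "Reformat the OCR text as Markdown. Preserve the source text; do not summarise or add content.";

// Wrap OCR output in <ocr_source> tags so document text is treated as data
// to be reformatted, not as instructions to the model.
function buildMessages(ocrText: string): ChatMessage[] {
  return [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "user", content: `<ocr_source>\n${ocrText}\n</ocr_source>` },
  ];
}

// Per the diagram, per-chunk Markdown results are joined with blank lines.
function joinChunks(chunks: string[]): string {
  return chunks.join("\n\n");
}
```

The tag wrapping is a prompt-injection mitigation rather than a guarantee: it lowers the chance that text inside a scanned document ("ignore previous instructions…") is obeyed, but it cannot rule it out.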
@@ -236,6 +236,7 @@ for (const file of files) {
 - **Local model fidelity**: small models (`mistral-nemo`, `gemma`) may occasionally summarise or paraphrase long, dense documents. Larger models (`llama3.1:70b`, `qwen2.5:32b`) produce significantly better fidelity.
 - **Tables**: multi-column table layouts are partially supported. OCR reads cells in reading order but the LLM may not always reconstruct correct Markdown table syntax.
 - **Images / charts**: non-textual content (photos, diagrams, charts) is ignored — only text blocks extracted by Apple Vision are processed.
+- **Markdown fidelity**: the prompt strongly asks for faithful reconstruction, but LLM output is not a cryptographic or deterministic guarantee. Review important legal, financial, or compliance documents before relying on the generated Markdown.
 
 ---
 
@@ -302,6 +303,12 @@ See `src/index.ts` for full type declarations.
 
 Apple Vision is the same engine used by macOS Spotlight, Live Text, and Shortcuts — highly optimized and accurate.
 
+### OCR evaluation notes
+
+In internal tests on anonymized scanned contracts, forms, declarations, and UI screenshots, Apple Vision OCR produced fewer OCR artifacts than Tesseract in most cases. The strongest gains were on multi-column contract-style scans, where Apple Vision preserved substantially more usable text with far fewer artifacts. On simpler UI screenshots, both engines performed similarly.
+
+These results are directional rather than a public benchmark suite. The corpus is not included in this repository, and future benchmark fixtures should use synthetic or public-domain documents only.
+
 ## License
 
 MIT
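The "OCR artifacts" comparison added in the hunk above implies some countable notion of OCR noise. As a purely illustrative, hypothetical heuristic (not the metric used in those internal tests), one could score output by the share of tokens containing unexpected characters:

```typescript
// Crude, hypothetical artifact heuristic: the fraction of whitespace-separated
// tokens containing characters outside letters, digits, and common punctuation.
// This is NOT the metric behind the internal tests above; it only illustrates
// how a comparison could be made reproducible on synthetic fixtures.
function artifactRate(ocrOutput: string): number {
  const tokens = ocrOutput.split(/\s+/).filter((t) => t.length > 0);
  if (tokens.length === 0) return 0;
  const clean = /^[\p{L}\p{N}.,;:!?'"()\-]+$/u;
  const noisy = tokens.filter((t) => !clean.test(t));
  return noisy.length / tokens.length;
}
```

A heuristic like this rewards engines that drop unreadable regions entirely, so any real benchmark would also need a recall measure (how much of the ground-truth text survives), as the note about future fixtures suggests.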
package/package.json
CHANGED