@dragon708/docmind-node 1.13.6 → 1.14.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +206 -0
- package/package.json +6 -6
package/README.md
ADDED
@@ -0,0 +1,206 @@
+# `@dragon708/docmind-node`
+
+**Official DocMind client for Node.js** (high-level public API): a single module for `analyzeFile`, text/metadata/HTML extraction, OCR, **structured documents**, **hybrid Markdown** (DOCX, PDF, HTML, CSV, and Excel via the server-side built-in Markdown conversion engine), LLM-ready text, and chunks with Markdown. It resolves disk paths and `file:` URLs with `fs`.
+
+**Requirement:** Node **≥ 18** (per the package's `package.json`).
+
+---
+
+## Installation
+
+```bash
+npm install @dragon708/docmind-node
+```
+
+This transitively pulls in `@dragon708/docmind-pdf`, `@dragon708/docmind-docx`, `@dragon708/docmind-ocr`, `@dragon708/docmind-markdown`, `@dragon708/docmind-shared`, and heavy dependencies (pdfjs, Tesseract, the Markdown conversion runtime bundled with `docmind-markdown`, etc.). It is not suitable for browser-only bundlers unless you replace it with `@dragon708/docmind-browser`.
+
+---
+
+## What this runtime covers
+
+| Format | Text / HTML | Structured | Markdown (`extractMarkdown`) | Notes |
+|---------|----------------|------------|------------------------------|--------|
+| PDF | Yes (native + optional OCR) | Yes | Yes (Markdown conversion in Node) | `pdf.ocr`, `ocrStrategy`, `maxPages` |
+| DOCX | Yes (Mammoth) | Yes | Yes (Mammoth → Turndown) | `docx.include` (headings, tables, blocks, …) |
+| Raster image | OCR (Tesseract) | Yes | Typically via the structured fallback | PNG/JPEG/WebP; multipage TIFF |
+| UTF-8 text | Yes | Normalization | Depends on content | — |
+| HTML / CSV / Excel (bytes) | — | — | Yes | Binary routes in `@dragon708/docmind-markdown` (Node only) |
+
+---
+
+## Quick start
+
+### Analyze a file from disk
+
+```ts
+import { readFile } from "node:fs/promises";
+import { analyzeFile } from "@dragon708/docmind-node";
+
+const data = await readFile("informe.pdf");
+const result = await analyzeFile({
+  data,
+  name: "informe.pdf",
+  mimeType: "application/pdf",
+});
+// result.text, result.html, result.warnings, result.fileKind, …
+```
+
+### Unified input (`NodeAnalyzeInput`)
+
+```ts
+import { readFileToInput, analyzeFile } from "@dragon708/docmind-node";
+
+const input = await readFileToInput("/path/to/file.docx");
+const result = await analyzeFile(input);
+```
+
+### Structured document
+
+```ts
+import { extractStructuredData } from "@dragon708/docmind-node";
+
+const structured = await extractStructuredData({
+  data: buffer,
+  name: "doc.docx",
+});
+// structured.blocks, structured.tables, structured.pages, …
+```
+
+### Hybrid Markdown (PDF, DOCX, HTML, CSV, XLSX…)
+
+```ts
+import { extractMarkdown } from "@dragon708/docmind-node";
+
+const md = await extractMarkdown(
+  { data: buffer, name: "libro.xlsx" },
+  {
+    onMarkdownExtract: (info) => {
+      console.log(info.strategy, info.warnings, info.routing);
+    },
+    markdownSpreadsheet: { maxRowsPerSheet: 500 },
+  },
+);
+// md is a string (the Node client returns ready-to-use Markdown; see the types)
+```
+
+> The exact signature and return type follow the exported types (`NodeExtractMarkdownOptions`, etc.); check your IDE or `dist/*.d.ts`.
+
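The `markdownSpreadsheet: { maxRowsPerSheet: 500 }` option above suggests that spreadsheet output is capped per sheet. As a rough illustration of that idea only, the sketch below renders rows as a Markdown table and stops after a row limit. `rowsToMarkdownTable` and its omission note are hypothetical; this is not the package's implementation or API.

```ts
// Hypothetical helper for illustration; not exported by @dragon708/docmind-node.
function rowsToMarkdownTable(rows: string[][], maxRows: number): string {
  if (rows.length === 0) return "";
  const [header, ...body] = rows;

  // Pipes would break the table layout and newlines would break the row.
  const escapeCell = (cell: string): string =>
    cell.replace(/\|/g, "\\|").replace(/\r?\n/g, " ");
  const renderRow = (cells: string[]): string =>
    `| ${cells.map(escapeCell).join(" | ")} |`;

  const lines = [
    renderRow(header),
    `| ${header.map(() => "---").join(" | ")} |`,
    ...body.slice(0, maxRows).map(renderRow),
  ];

  // Make the truncation visible instead of silently dropping rows.
  if (body.length > maxRows) {
    lines.push(`\n_rows omitted: ${body.length - maxRows}_`);
  }
  return lines.join("\n");
}
```

The first row is treated as the header, so `maxRows` counts data rows only.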
+### LLM text and chunks
+
+```ts
+import { extractLlmContent, extractStructuredChunks } from "@dragon708/docmind-node";
+
+const llmText = await extractLlmContent(input, { /* llm render options */ });
+const chunks = await extractStructuredChunks(input, { chunks: { includeMarkdown: true } });
+```
+
+### Introspection
+
+```ts
+import { getCapabilities, explainAnalysisPlan } from "@dragon708/docmind-node";
+
+const caps = await getCapabilities({ /* … */ });
+const plan = await explainAnalysisPlan({ /* … */ });
+```
+
+### File kind classification
+
+```ts
+import { detectFileKind } from "@dragon708/docmind-node";
+
+// Re-exported from @dragon708/docmind-shared; useful after resolving paths
+```
+
+### Direct HTML / CSV / Excel converters (Node)
+
+```ts
+import {
+  convertHtmlToMarkdown,
+  convertCsvToMarkdown,
+  convertSpreadsheetToMarkdown,
+  detectBinaryFormat,
+} from "@dragon708/docmind-node";
+```
+
+### Embedded images in DOCX (Node)
+
+```ts
+import {
+  extractImagesFromDocx,
+  convertDocxImagesForWeb,
+  docxImageToDataUri,
+} from "@dragon708/docmind-node";
+```
+
+### Per-format structured extractors
+
+```ts
+import {
+  extractStructuredDataFromPdf,
+  extractStructuredDataFromDocx,
+  extractStructuredDataFromImage,
+} from "@dragon708/docmind-node";
+```
+
+---
+
+## Key options (`NodeAnalyzeOptions`)
+
+Summary; the details are in `nodeAnalyzeOptions.ts` and the JSDoc:
+
+- **`ocr`**: PDF policy / pipeline metadata (`off` | `auto` | `force`).
+- **`language`**: OCR languages (BCP-47 / Tesseract).
+- **`maxPages`**, **`ocrStrategy`**: PDF (raster pipeline strategy).
+- **`docx`**: `include` (`headings`, `tables`, `blocks`, `pagesApprox`, `embeddedImages`, …).
+- **`signal`**: `AbortSignal`.
+- **`structuredOutput` / `output`**: opt-in for `structured` in `analyzeFile` (see `analyzeFileRequestsStructured` in shared).
+
+For **Markdown / LLM / chunks**:
+
+- **`NodeExtractMarkdownOptions`**: `markdown`, `markdownDocx`, `markdownPdf`, `markdownHtml`, `markdownCsv`, `markdownSpreadsheet`, `onMarkdownExtract`, structured routing, etc.
+- **`NodeExtractLlmContentOptions`**, **`NodeExtractStructuredChunksOptions`**.
+
+---
+
+## Input helpers
+
+- **`readFileToInput`**, **`bufferToInput`**, **`resolveNodeAnalyzeInput`**: convert a path / buffer / URL into a `NodeAnalyzeInput`.
+
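A minimal sketch of what resolving a path or `file:` URL into an input object could look like, using only Node's standard library. `SketchInput`, `toLocalPath`, and `loadSketchInput` are hypothetical names chosen for illustration; the real `NodeAnalyzeInput` shape and the helpers listed above may differ.

```ts
import { readFileSync } from "node:fs";
import { basename } from "node:path";
import { fileURLToPath } from "node:url";

// Hypothetical shape for illustration; the package's NodeAnalyzeInput may differ.
interface SketchInput {
  data: Uint8Array;
  name: string;
}

// Turn a plain path, a "file:" URL string, or a URL object into a local path.
function toLocalPath(source: string | URL): string {
  if (source instanceof URL || source.startsWith("file:")) {
    // fileURLToPath also decodes percent-encoded characters such as %20.
    return fileURLToPath(source);
  }
  return source;
}

// Read the bytes and keep the file name so a format can be inferred later.
function loadSketchInput(source: string | URL): SketchInput {
  const path = toLocalPath(source);
  return { data: readFileSync(path), name: basename(path) };
}
```

Keeping the path resolution separate from the read makes the URL handling easy to check without touching the disk.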
+---
+
+## Constants and capabilities
+
+- **`DOCX_EMBEDDED_IMAGE_CAPABILITIES`**, **`DOCX_STRUCTURE_CAPABILITIES`**, **`docxIncludeRequested`**: capability matrices for reports.
+
+---
+
+## Errors and warnings
+
+This API propagates **`DocMindError`** (`@dragon708/docmind-shared`) when the input is invalid or the format is unsupported. Corrupt PDF/DOCX files may either throw or return payloads with `warnings`, depending on the route.
+
+---
+
+## Relationship to other packages
+
+| Package | Use from Node |
+|---------|----------------|
+| `@dragon708/docmind-shared` | Types and `detectFileKind` |
+| `@dragon708/docmind-markdown` | `extractMarkdown`, LLM rendering, chunks, binary-to-Markdown conversion in Node |
+| `@dragon708/docmind-pdf` | Native PDF, PDF.js, forms, OCR pipeline |
+| `@dragon708/docmind-docx` | Mammoth + OOXML extractors |
+| `@dragon708/docmind-ocr` | Tesseract, TIFF, preprocessing |
+
+In the **browser**, use `@dragon708/docmind-browser` (no full PDF support and no heavy binary-to-Markdown conversion on the client).
+
+---
+
+## Security
+
+- Do not run `convertToHtml` / Markdown on **untrusted documents** without sanitizing the output before inserting it into the DOM.
+- PDF and DOCX files may contain remote resources, depending on the conversion options.
+
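When the extracted output is plain text, escaping it before building HTML is usually enough; extracted HTML needs a dedicated sanitizer instead. The sketch below covers only the escaping case. `escapeHtml` is a hypothetical helper, not something this package is known to export.

```ts
// Characters that can open tags, entities, or attribute values in HTML.
const HTML_ESCAPES: Record<string, string> = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

// Escape text in a single pass so already-escaped output is not escaped twice
// within the same call.
function escapeHtml(text: string): string {
  return text.replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch] ?? ch);
}
```

Escaping is not a substitute for sanitizing: it makes markup inert rather than filtering it, so it only suits values that should be displayed as text.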
+---
+
+## License
+
+MIT (DocMind monorepo).
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "@dragon708/docmind-node",
-  "version": "1.
+  "version": "1.14.0",
   "description": "Official DocMind Node facade: analyzeFile, intent APIs, PDF/DOCX/OCR, and fs helpers.",
   "type": "module",
   "main": "./dist/index.js",
@@ -32,11 +32,11 @@
   ],
   "license": "MIT",
   "dependencies": {
-    "@dragon708/docmind-docx": "^1.
-    "@dragon708/docmind-markdown": "^1.
-    "@dragon708/docmind-ocr": "^1.
-    "@dragon708/docmind-pdf": "^2.
-    "@dragon708/docmind-shared": "^1.
+    "@dragon708/docmind-docx": "^1.9.0",
+    "@dragon708/docmind-markdown": "^1.3.0",
+    "@dragon708/docmind-ocr": "^1.2.0",
+    "@dragon708/docmind-pdf": "^2.3.0",
+    "@dragon708/docmind-shared": "^1.3.0"
   },
   "devDependencies": {
     "@types/node": "^20.19.37",
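The caret ranges in the `dependencies` block above (for example `^1.9.0`) accept any later release that keeps the same left-most non-zero version component. Below is a simplified sketch of that rule, ignoring prerelease tags and build metadata. `satisfiesCaret` is a hypothetical helper written for this note; real projects would normally rely on the `semver` package instead.

```ts
type Version = [number, number, number];

// Parse "MAJOR.MINOR.PATCH"; prerelease and build suffixes are not handled.
function parseVersion(text: string): Version {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(text);
  if (!match) throw new Error(`unsupported version: ${text}`);
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// Negative when a < b, zero when equal, positive when a > b.
function compareVersions(a: Version, b: Version): number {
  return a[0] - b[0] || a[1] - b[1] || a[2] - b[2];
}

function satisfiesCaret(version: string, range: string): boolean {
  if (!range.startsWith("^")) throw new Error("only caret ranges are handled");
  const min = parseVersion(range.slice(1));
  const candidate = parseVersion(version);

  // Exclusive upper bound: bump the left-most non-zero component.
  let max: Version;
  if (min[0] > 0) max = [min[0] + 1, 0, 0];
  else if (min[1] > 0) max = [0, min[1] + 1, 0];
  else max = [0, 0, min[2] + 1];

  return compareVersions(candidate, min) >= 0 && compareVersions(candidate, max) < 0;
}
```

Under this rule `^2.3.0` admits later `2.x` releases but not `3.0.0`.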