npm - @dragon708/docmind-markdown - Versions diffs - 1.2.9 → 1.3.0 - Mend

@dragon708/docmind-markdown 1.2.9 → 1.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (3)

package/README.md ADDED

@@ -0,0 +1,167 @@
+# `@dragon708/docmind-markdown`
+Converts a **`StructuredDocumentResult`** (DocMind's shared model) into **Markdown** (GFM), **LLM-oriented text**, and **sections/chunks** with optional per-chunk Markdown. It exposes **`extractMarkdown`**: a unified router over **structured** input, **bytes**, or a **path** (Node) that delegates to **DOCX** (Mammoth → Turndown), **PDF**, and **HTML / CSV / Excel** conversion through the **built-in Markdown conversion engine** (Node only), with a **fallback** to structured serialization when the specialized route does not apply or fails.
+---
+## Installation
+```bash
+npm install @dragon708/docmind-markdown
+```
+Notable dependencies: `@dragon708/docmind-shared`, the Markdown conversion chain bundled with the package, `mammoth`, `turndown`, `turndown-plugin-gfm`. The logic is **Node-first**; in a pure browser environment the heavy binary routes and part of the DOCX bytes path are unavailable (see the `unsupported-runtime` messages / warnings).
+---
+## Conceptual modules
+| Area | Main inputs | Output |
+|------|-------------|--------|
+| Structured → MD | `StructuredDocumentResult` | `convertStructuredToMarkdown` / `renderMarkdown` |
+| Structured → LLM text | Same envelope | `convertStructuredToLlmText` / `renderLlmText` / alias `extractLlmContent` |
+| Chunks | Structured + options | `splitStructuredIntoChunks` / `renderMarkdownSections` / `extractStructuredChunks` |
+| DOCX bytes → MD | Buffer/path | `convertDocxToMarkdown` (Mammoth + Turndown + GFM) |
+| PDF bytes → MD | Buffer/path | `convertPdfToMarkdown` (built-in engine on Node) |
+| HTML / CSV / XLSX | Bytes/path/text | `convertHtmlToMarkdown`, `convertCsvToMarkdown`, `convertSpreadsheetToMarkdown` |
+| Single router | structured \| bytes \| path | `extractMarkdown` |
+---
+## Quick start
+### From a structured document
+```ts
+import { convertStructuredToMarkdown, renderMarkdown } from "@dragon708/docmind-markdown";
+const md = convertStructuredToMarkdown(structured, {
+  appendUnreferencedTables: true,
+  pageTransitionMarkers: true,
+});
+// renderMarkdown is an alias of the same symbol
+```
+### Text for LLMs (not Markdown)
+```ts
+import { renderLlmText } from "@dragon708/docmind-markdown";
+const plain = renderLlmText(structured, { compact: true });
+```
+### Chunks with per-section Markdown
+```ts
+import { renderMarkdownSections } from "@dragon708/docmind-markdown";
+const sections = await renderMarkdownSections(structured, {
+  maxChars: 4000,
+  chunks: { includeMarkdown: true, preserveTables: true },
+});
+```
+### Unified `extractMarkdown` (Node with bytes)
+```ts
+import { extractMarkdown } from "@dragon708/docmind-markdown";
+const result = await extractMarkdown(
+  { data: buffer, filename: "informe.pdf" },
+  {
+    structuredFallback: structured,
+    pdf: { /* PDF → Markdown converter options */ },
+  },
+);
+console.log(result.markdown);
+console.log(result.strategy, result.warnings, result.routing);
+```
+Common options (see the `ExtractMarkdownOptions` types):
+- **`structuredFallback`**: envelope used if the specialized route fails or returns empty output.
+- **`markdown`**: `convertStructuredToMarkdown` options applied in the fallback.
+- **`docx` / `pdf` / `html` / `csv` / `spreadsheet`**: per-format knobs for binary inputs (e.g. `markdownSpreadsheet.maxRowsPerSheet`, `includeSheetNames`, `compactMode`).
+- Excel post-processing: GFM tables "jammed" onto a single line are **normalized** inside the package (rows separated by `\n`) before the string is returned.
+### Direct per-format converters
+```ts
+import {
+  convertDocxToMarkdown,
+  convertPdfToMarkdown,
+  convertHtmlToMarkdown,
+  convertCsvToMarkdown,
+  convertSpreadsheetToMarkdown,
+} from "@dragon708/docmind-markdown";
+// Each one returns markdown, warnings, and the source (e.g. specialized route vs. structured fallback)
+```
+### Binary format detection
+```ts
+import { detectBinaryFormat } from "@dragon708/docmind-markdown";
+const fmt = detectBinaryFormat(bytes, "archivo.bin", mimeType);
+```
+### HTML: string vs. path
+```ts
+import { looksLikeHtmlString } from "@dragon708/docmind-markdown";
+if (looksLikeHtmlString(input)) {
+  // Treat as markup, not as a path
+}
+```
+---
+## Strategies and routing (`extractMarkdown`)
+The result may include:
+- **`strategy`**: which route the router took (identifiers such as `pdf-*-specialized`, `spreadsheet-structured-fallback`, etc.; see the exported types).
+- **`warnings`**: human-readable traces (`[docmind-markdown:extractMarkdown]`, etc.).
+- **`routing`**: optional summary (`routingSummary`, `detectedFormat`, `specializedPipeline`, `usedStructuredFallback`, `mediaHint`).
+Useful for telemetry and debugging UIs (e.g. a playground).
+---
+## CSV and spreadsheets
+- **CSV preparation** (tabular module): synthetic header, `maxRows`, etc., applied before conversion.
+- **`convertSpreadsheetToMarkdown`**: after the spreadsheet conversion, the options `includeSheetNames`, `compactMode`, `maxRowsPerSheet` via `limitSpreadsheetMarkdownRowsPerSheet`.
+- **`splitJammedSpreadsheetPipeTableLines`**: fixes GFM tables collapsed onto a single line before the Markdown is returned.
+---
+## Packaging and `prepack`
+The `package.json` defines **`prepack`** scripts that validate and copy the **vendored** dependencies of the conversion runtime so that the npm tarball is self-contained. If you develop inside the monorepo, run `npm run build` before `npm pack` / publishing.
+---
+## Security
+- The Markdown and the intermediate HTML (DOCX) may contain hostile content. **Sanitize** it before passing it to `dangerouslySetInnerHTML` or equivalents.
+- Do not trust arbitrary documents for conversion without server-side quarantine.
+---
+## Related packages
+| Package | Role |
+|---------|------|
+| `@dragon708/docmind-node` | Client that calls `extractMarkdown` with `structuredFallback` wired in |
+| `@dragon708/docmind-browser` | Same signatures; no heavy binary conversion on the client |
+| `@dragon708/docmind-shared` | `StructuredDocumentResult` |
+---
+## License
+MIT (DocMind monorepo).
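
Reviewer note on the README's claim that jammed spreadsheet tables are normalized: the behavior can be sketched as below. The three functions are copied from the `src/tabular-markdown-postprocess.ts` section of the `package/dist/index.js` diff, with TypeScript annotations added for illustration. The sample input `jammed` is hypothetical and not taken from the package or from real converter output.

```ts
// True when a line contains a GFM separator fragment such as "| --- | --- |".
function lineHasGfmPipeSeparatorFragment(line: string): boolean {
  return /\|(?:\s*:?-+\s*\|){2,}/.test(line);
}

// Moves a table that starts on the same line as a "## Sheet" heading onto its own line.
function newlineBeforePipeTableAfterSheetHeading(markdown: string): string {
  return markdown.replace(/^##([^\n|]+)\|/gm, "##$1\n|");
}

function splitJammedSpreadsheetPipeTableLines(markdown: string): string {
  // Strip a leading BOM, then detach headings from tables.
  const md = newlineBeforePipeTableAfterSheetHeading(markdown.replace(/^\uFEFF/, ""));
  const out: string[] = [];
  for (const line of md.split(/\r?\n/)) {
    // Only lines that start with a pipe AND contain a separator fragment are rewritten.
    if (!line.trimStart().startsWith("|") || !lineHasGfmPipeSeparatorFragment(line)) {
      out.push(line);
      continue;
    }
    // Break "||---" boundaries, then "| |" boundaries, into separate rows.
    let split = line.replace(/\|\|(?=\s*:?-{2,})/g, "|\n|");
    split = split.replace(/\|\s+(?=\|)/g, "|\n");
    out.push(...split.split("\n"));
  }
  return out.join("\n");
}

// Hypothetical jammed input: heading, header row, separator and data row on one line.
const jammed = "## Sheet1| a | b | | --- | --- | | 1 | 2 |";
const fixed = splitJammedSpreadsheetPipeTableLines(jammed);
console.log(fixed); // heading and each table row now sit on their own lines
```

A table that is already well-formed passes through unchanged, because its separator row contains no `| |` or `||` boundary to split on. Judging from the second regular expression alone, an empty cell rendered as `| |` on a line that also contains a separator fragment would be split as well, so this reads as a heuristic for jammed output rather than a general table parser.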

package/dist/index.js CHANGED

@@ -1197,6 +1197,131 @@ async function convertPdfBufferToMarkdown(input, options) {
   return { markdown: r.markdown };
 }
+// src/tabular-markdown-postprocess.ts
+function compactMarkdownOutput(markdown) {
+  return markdown.replace(/\n{3,}/g, "\n\n").split("\n").map((l) => l.trimEnd()).join("\n").trim();
+}
+function countCsvColumns(firstLine) {
+  let n = 1;
+  let inQuotes = false;
+  for (let i = 0; i < firstLine.length; i++) {
+    const c = firstLine[i];
+    if (c === '"') {
+      inQuotes = !inQuotes;
+    } else if (c === "," && !inQuotes) {
+      n++;
+    }
+  }
+  return n;
+}
+function prepareCsvTextForCognipeer(text, options) {
+  const warnings = [];
+  const includeHeader = options?.includeHeader !== false;
+  const maxRows = options?.maxRows;
+  let lines = text.split(/\r?\n/).filter((l) => l.length > 0);
+  if (lines.length === 0) {
+    return { text, warnings };
+  }
+  if (!includeHeader) {
+    const colCount = Math.max(1, countCsvColumns(lines[0]));
+    const synth = Array.from({ length: colCount }, (_, i) => `Column ${i + 1}`).join(",");
+    lines = [synth, ...lines];
+    warnings.push(
+      "[docmind-markdown:csv] includeHeader:false: prepended synthetic header row so the first CSV row appears as table data."
+    );
+  }
+  if (maxRows != null && maxRows >= 0) {
+    const header = lines[0];
+    const rest = lines.slice(1);
+    const data = rest.slice(0, maxRows);
+    if (rest.length > maxRows) {
+      warnings.push(
+        `[docmind-markdown:csv] maxRows:${maxRows}: truncated data rows before conversion (line-based; quoted newlines inside fields may skew counts).`
+      );
+    }
+    lines = [header, ...data];
+  }
+  return { text: lines.join("\n"), warnings };
+}
+function lineHasGfmPipeSeparatorFragment(line) {
+  return /\|(?:\s*:?-+\s*\|){2,}/.test(line);
+}
+function newlineBeforePipeTableAfterSheetHeading(markdown) {
+  return markdown.replace(/^##([^\n|]+)\|/gm, "##$1\n|");
+}
+function splitJammedSpreadsheetPipeTableLines(markdown) {
+  const md = newlineBeforePipeTableAfterSheetHeading(markdown.replace(/^\uFEFF/, ""));
+  const lines = md.split(/\r?\n/);
+  const out = [];
+  for (const line of lines) {
+    const trimmedStart = line.trimStart();
+    if (!trimmedStart.startsWith("|") || !lineHasGfmPipeSeparatorFragment(line)) {
+      out.push(line);
+      continue;
+    }
+    let split = line.replace(/\|\|(?=\s*:?-{2,})/g, "|\n|");
+    split = split.replace(/\|\s+(?=\|)/g, "|\n");
+    out.push(...split.split("\n"));
+  }
+  return out.join("\n");
+}
+function stripSpreadsheetSheetHeadings(markdown) {
+  return markdown.replace(/^##[^\n]+\n+/gm, "");
+}
+function limitSpreadsheetMarkdownRowsPerSheet(markdown, maxRowsPerSheet) {
+  const warnings = [];
+  if (maxRowsPerSheet < 0) return { markdown, warnings };
+  const lines = markdown.split("\n");
+  const out = [];
+  let i = 0;
+  let truncatedAny = false;
+  const emitLimitedTable = (tableLines) => {
+    if (tableLines.length >= 3) {
+      const header = tableLines[0];
+      const sep = tableLines[1];
+      const body = tableLines.slice(2, 2 + maxRowsPerSheet);
+      if (tableLines.length - 2 > maxRowsPerSheet) truncatedAny = true;
+      out.push(header, sep, ...body);
+    } else {
+      out.push(...tableLines);
+    }
+  };
+  while (i < lines.length) {
+    const line = lines[i];
+    const isSheetTitle = /^##\s+.+$/.test(line);
+    if (isSheetTitle) {
+      out.push(line);
+      i++;
+      while (i < lines.length && lines[i].trim() === "") {
+        out.push(lines[i]);
+        i++;
+      }
+      const tableStart = i;
+      while (i < lines.length && lines[i].trim().startsWith("|")) {
+        i++;
+      }
+      emitLimitedTable(lines.slice(tableStart, i));
+      continue;
+    }
+    if (line.trim().startsWith("|")) {
+      const tableStart = i;
+      while (i < lines.length && lines[i].trim().startsWith("|")) {
+        i++;
+      }
+      emitLimitedTable(lines.slice(tableStart, i));
+      continue;
+    }
+    out.push(line);
+    i++;
+  }
+  if (truncatedAny) {
+    warnings.push(
+      `[docmind-markdown:spreadsheet] maxRowsPerSheet:${maxRowsPerSheet}: truncated data rows in one or more sheet tables.`
+    );
+  }
+  return { markdown: out.join("\n"), warnings };
+}
 // src/cognipeer-file-markdown.ts
 var BROWSER = (label) => `@dragon708/docmind-markdown: ${label} \u2192 Markdown via @cognipeer/to-markdown requires Node.js. In the browser, use a server-side conversion or supply structured input / structuredFallback.`;
 function cognipeerOpts(options) {
@@ -1359,6 +1484,9 @@ async function convertCognipeerFileToMarkdown(format, defaultTempFile, input, op
         fallbackReason: "empty"
       };
     }
+    if (format === "spreadsheet") {
+      markdown = splitJammedSpreadsheetPipeTableLines(markdown);
+    }
     return { markdown, warnings, source: "cognipeer" };
   } finally {
     if (cleanup) {
@@ -1428,109 +1556,6 @@ async function convertHtmlToMarkdown(input, options) {
   return convertCognipeerFileToMarkdown("html", "document.html", input, cognipeerOptions);
 }
-// src/tabular-markdown-postprocess.ts
-function compactMarkdownOutput(markdown) {
-  return markdown.replace(/\n{3,}/g, "\n\n").split("\n").map((l) => l.trimEnd()).join("\n").trim();
-}
-function countCsvColumns(firstLine) {
-  let n = 1;
-  let inQuotes = false;
-  for (let i = 0; i < firstLine.length; i++) {
-    const c = firstLine[i];
-    if (c === '"') {
-      inQuotes = !inQuotes;
-    } else if (c === "," && !inQuotes) {
-      n++;
-    }
-  }
-  return n;
-}
-function prepareCsvTextForCognipeer(text, options) {
-  const warnings = [];
-  const includeHeader = options?.includeHeader !== false;
-  const maxRows = options?.maxRows;
-  let lines = text.split(/\r?\n/).filter((l) => l.length > 0);
-  if (lines.length === 0) {
-    return { text, warnings };
-  }
-  if (!includeHeader) {
-    const colCount = Math.max(1, countCsvColumns(lines[0]));
-    const synth = Array.from({ length: colCount }, (_, i) => `Column ${i + 1}`).join(",");
-    lines = [synth, ...lines];
-    warnings.push(
-      "[docmind-markdown:csv] includeHeader:false: prepended synthetic header row so the first CSV row appears as table data."
-    );
-  }
-  if (maxRows != null && maxRows >= 0) {
-    const header = lines[0];
-    const rest = lines.slice(1);
-    const data = rest.slice(0, maxRows);
-    if (rest.length > maxRows) {
-      warnings.push(
-        `[docmind-markdown:csv] maxRows:${maxRows}: truncated data rows before conversion (line-based; quoted newlines inside fields may skew counts).`
-      );
-    }
-    lines = [header, ...data];
-  }
-  return { text: lines.join("\n"), warnings };
-}
-function stripSpreadsheetSheetHeadings(markdown) {
-  return markdown.replace(/^##[^\n]+\n+/gm, "");
-}
-function limitSpreadsheetMarkdownRowsPerSheet(markdown, maxRowsPerSheet) {
-  const warnings = [];
-  if (maxRowsPerSheet < 0) return { markdown, warnings };
-  const lines = markdown.split("\n");
-  const out = [];
-  let i = 0;
-  let truncatedAny = false;
-  const emitLimitedTable = (tableLines) => {
-    if (tableLines.length >= 3) {
-      const header = tableLines[0];
-      const sep = tableLines[1];
-      const body = tableLines.slice(2, 2 + maxRowsPerSheet);
-      if (tableLines.length - 2 > maxRowsPerSheet) truncatedAny = true;
-      out.push(header, sep, ...body);
-    } else {
-      out.push(...tableLines);
-    }
-  };
-  while (i < lines.length) {
-    const line = lines[i];
-    const isSheetTitle = /^##\s+.+$/.test(line);
-    if (isSheetTitle) {
-      out.push(line);
-      i++;
-      while (i < lines.length && lines[i].trim() === "") {
-        out.push(lines[i]);
-        i++;
-      }
-      const tableStart = i;
-      while (i < lines.length && lines[i].trim().startsWith("|")) {
-        i++;
-      }
-      emitLimitedTable(lines.slice(tableStart, i));
-      continue;
-    }
-    if (line.trim().startsWith("|")) {
-      const tableStart = i;
-      while (i < lines.length && lines[i].trim().startsWith("|")) {
-        i++;
-      }
-      emitLimitedTable(lines.slice(tableStart, i));
-      continue;
-    }
-    out.push(line);
-    i++;
-  }
-  if (truncatedAny) {
-    warnings.push(
-      `[docmind-markdown:spreadsheet] maxRowsPerSheet:${maxRowsPerSheet}: truncated data rows in one or more sheet tables.`
-    );
-  }
-  return { markdown: out.join("\n"), warnings };
-}
 // src/csv-markdown.ts
 function looksLikeCsvContent(s) {
   return s.includes(",") && /[\r\n]/.test(s);
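
Reviewer note on the CSV preparation step that this diff relocates: a minimal sketch of how `prepareCsvTextForCognipeer` behaves, copied from the hunk above with TypeScript annotations added. The `CsvPrepareOptions` interface is an assumption inferred from how `options` is read in the code, and the sample CSV is hypothetical.

```ts
// Assumed option shape, inferred from the fields the function reads.
interface CsvPrepareOptions {
  includeHeader?: boolean;
  maxRows?: number;
}

// Counts comma-separated columns on one line, ignoring commas inside double quotes.
function countCsvColumns(firstLine: string): number {
  let n = 1;
  let inQuotes = false;
  for (let i = 0; i < firstLine.length; i++) {
    const c = firstLine[i];
    if (c === '"') {
      inQuotes = !inQuotes;
    } else if (c === "," && !inQuotes) {
      n++;
    }
  }
  return n;
}

function prepareCsvTextForCognipeer(
  text: string,
  options?: CsvPrepareOptions,
): { text: string; warnings: string[] } {
  const warnings: string[] = [];
  const includeHeader = options?.includeHeader !== false;
  const maxRows = options?.maxRows;
  let lines = text.split(/\r?\n/).filter((l) => l.length > 0);
  if (lines.length === 0) return { text, warnings };
  if (!includeHeader) {
    // No real header: prepend "Column 1,Column 2,..." so row 1 is kept as data.
    const colCount = Math.max(1, countCsvColumns(lines[0]));
    const synth = Array.from({ length: colCount }, (_, i) => `Column ${i + 1}`).join(",");
    lines = [synth, ...lines];
    warnings.push("[docmind-markdown:csv] includeHeader:false: prepended synthetic header row so the first CSV row appears as table data.");
  }
  if (maxRows != null && maxRows >= 0) {
    // Keep the header plus at most maxRows data lines.
    const header = lines[0];
    const rest = lines.slice(1);
    if (rest.length > maxRows) {
      warnings.push(`[docmind-markdown:csv] maxRows:${maxRows}: truncated data rows before conversion (line-based; quoted newlines inside fields may skew counts).`);
    }
    lines = [header, ...rest.slice(0, maxRows)];
  }
  return { text: lines.join("\n"), warnings };
}

// Hypothetical input: three data rows, no header, one quoted comma.
const csv = '1,"a,b"\n2,c\n3,d';
const prepared = prepareCsvTextForCognipeer(csv, { includeHeader: false, maxRows: 2 });
console.log(prepared.text);
console.log(prepared.warnings.length);
```

Truncation is line-based, as the warning text itself states, so a quoted field containing a newline would be counted as more than one row.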

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@dragon708/docmind-markdown",
-  "version": "1.2.9",
+  "version": "1.3.0",
   "description": "StructuredDocumentResult → Markdown and LLM-oriented plain text for DocMind.",
   "type": "module",
   "sideEffects": false,
@@ -39,7 +39,7 @@
   "license": "MIT",
   "dependencies": {
     "@cognipeer/to-markdown": "^2.0.1",
-    "@dragon708/docmind-shared": "^1.2.0",
+    "@dragon708/docmind-shared": "^1.3.0",
     "mammoth": "^1.6.0",
     "turndown": "^7.0.0",
     "turndown-plugin-gfm": "^1.0.2"