npm - xlsx-for-ai - Versions diffs - 1.4.3 → 1.5.0 - Mend

xlsx-for-ai 1.4.3 → 1.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (7)

package/README.md CHANGED

@@ -293,10 +293,84 @@ curl -o .cursor/rules/read-xlsx.mdc https://raw.githubusercontent.com/senoff/xls
 The same rule works for Claude Code (`.claude/rules/`), Copilot (`.github/copilot-instructions.md`), or any other agent — just adjust the path.
+## Embedding xlsx-for-ai as a library dependency
+The CLI install (`npm install -g xlsx-for-ai`) is clean — no deprecation warnings, modern transitive deps via npm `overrides`. If you embed xlsx-for-ai as a library dependency in another project, the picture is slightly different.
+**Why:** npm's `overrides` field only takes effect when xlsx-for-ai is the top-level project. When xlsx-for-ai is installed as a *transitive* dependency in another project, npm uses the original ExcelJS dep tree (unmodified), and you'll see the upstream ExcelJS deprecation warnings on install. The warnings come from ExcelJS's stale transitive deps (`glob@7`, `rimraf@2`, `lodash.isequal`, `fstream`, `inflight`) and are upstream noise — they don't affect functionality.
+**To get clean output in a project that depends on xlsx-for-ai**, copy the same overrides into your own `package.json`:
+```json
+{
+  "overrides": {
+    "glob": "^13.0.0",
+    "rimraf": "^5.0.10",
+    "unzipper": "^0.12.3",
+    "fast-csv": "^5.0.2"
+  }
+}
+```
+Run `rm -rf node_modules package-lock.json && npm install` and the warnings will clear. xlsx-for-ai's tests pass against these versions, so the upgrade is safe.
+A future release may apply these dep upgrades via `patch-package` so they travel through the dep graph automatically. The infrastructure is in place; the patches haven't been needed urgently because most installs are CLI-direct.
+## Reporting bugs
+**The privacy contract: we never auto-send your data.** xlsx-for-ai has no telemetry endpoint and no consent dialog to maintain — there's nothing to opt out of, because nothing leaves your machine unless you choose to attach it to a GitHub issue.
+When something breaks on a real workbook, two flags help us reproduce locally without asking you to share the original file:
+```bash
+# Required — small JSON describing the workbook's structure (no cell content)
+npx xlsx-for-ai --report-bug your-file.xlsx
+# Optional — full workbook with every cell value replaced by a typed placeholder
+npx xlsx-for-ai --export-redacted-workbook your-file.xlsx
+```
+### `--report-bug`
+Writes `xlsx-for-ai-bugreport-<ISO-timestamp>.json` to the current directory. The report contains:
+- File size, sheet count, per-sheet shape (rows × cols), per-sheet merge counts
+- Feature inventory detected via OOXML part inspection — pivot tables, charts, threaded comments, sensitivity labels, linked data types, sparklines, Power Query, slicers, timelines, dynamic arrays, conditional formatting, VBA, and more
+- Defined-name *labels* (e.g. `Totals`) — but NOT their target ranges or formulas
+- Tool version, Node version, OS + arch
+What the report **never** contains: cell values, formulas, shared strings, named-range targets, comment text, or your absolute file path. You can `cat` it before attaching to verify.
+### `--export-redacted-workbook`
+Writes `<input>-redacted.xlsx` next to the input. Every cell value is replaced by a typed placeholder:
+| Original cell type | Placeholder |
+|--------------------|-------------|
+| Number             | `0`         |
+| String             | `"x"`       |
+| Boolean            | `false`     |
+| ISO date           | `1899-12-30`|
+| Error              | preserved   |
+Formulas, sheet names, merges, named ranges (formulas), styles, conditional formatting, pivots, charts, queries, and macros are passed through byte-for-byte at the ZIP/XML level (no lossy ExcelJS round-trip). Shared strings and comment payloads are also rewritten to `"x"` for defense-in-depth. Open the redacted file in Excel to confirm it still triggers the bug, then attach it.
+### Filing the issue
+Open https://github.com/senoff/xlsx-for-ai/issues — the bug template asks you to drag-drop the JSON (and optionally the redacted workbook). That's the whole workflow. No accounts to create, no SDK to integrate, no consent screen to click through.
 ## Why This Exists
 Spreadsheets are everywhere in real projects — financial models, data exports, config files, tax estimates. AI coding agents choke on binary formats. This tool makes spreadsheets legible to AI with zero information loss, including the tricky bits like shared formulas, named ranges, and merged cells that other tools drop.
+## Security
+`xlsx-for-ai` parses untrusted `.xlsx` files on your machine. The
+project's security policy, supported-versions table, and reporting inbox
+are in [SECURITY.md](SECURITY.md). The supply-chain hardening that goes
+with it lives in [docs/INTEGRITY_PINNING.md](docs/INTEGRITY_PINNING.md)
+and [FORK_READINESS.md](FORK_READINESS.md).
 ## License
 MIT
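
The typed-placeholder table in the README hunk above can be illustrated with a few lines of JavaScript. This is only a sketch: `placeholderFor` and `ERROR_PATTERN` are hypothetical names, not the code in `lib/redactWorkbook.js` (whose diff is not shown here), and representing error cells as strings such as `#DIV/0!` is an assumption. The date placeholder follows the README table (`1899-12-30`), although the CLI help text in `index.js` lists `1900-01-01`.

```javascript
// Hypothetical sketch of the README's placeholder table.
// NOT the real implementation in lib/redactWorkbook.js.
// Assumption: error cells arrive as strings like "#DIV/0!".
const ERROR_PATTERN = /^#(?:NULL!|DIV\/0!|VALUE!|REF!|NAME\?|NUM!|N\/A)$/;

function placeholderFor(value) {
  if (typeof value === 'number') return 0;
  if (typeof value === 'boolean') return false;
  // Per the README table; the CLI help text says 1900-01-01 instead.
  if (value instanceof Date) return '1899-12-30';
  if (typeof value === 'string') {
    // Error values are preserved; every other string becomes "x".
    return ERROR_PATTERN.test(value) ? value : 'x';
  }
  return value; // null / undefined (empty cells) pass through unchanged
}

const redacted = [1234.5, 'Acme Corp', true, new Date(), '#DIV/0!'].map(placeholderFor);
console.log(redacted); // [ 0, 'x', false, '1899-12-30', '#DIV/0!' ]
```

The point of keeping the type while dropping the value is that type-dependent bugs (for example, a date rendered as a serial number) should still reproduce on the redacted file.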

package/SECURITY.md ADDED

@@ -0,0 +1,96 @@
+# Security policy
+`xlsx-for-ai` is a developer CLI that parses untrusted `.xlsx` files on
+end users' machines and emits text or JSON for AI coding agents. The
+project's security posture is documented across three files; this one is
+the entry point.
+## Reporting a vulnerability
+Please do **not** open a public GitHub issue for security reports.
+Email the maintainer at `bobsenoff@gmail.com` with:
+- a description of the issue and its impact;
+- a minimal reproducer (a workbook, command, or version pinning is ideal);
+- whether you intend to disclose, and on what timeline.
+You should expect an acknowledgement within 72 hours. If you do not hear
+back, follow up — the inbox occasionally eats things.
+This project has no embargo program and no CVE-issuing budget. Coordinate
+disclosure expectations in your first message.
+## Supported versions
+The latest published `1.x` minor on npm receives security fixes. Older
+minors do not. Today that is `1.4.x`. If a fix requires a breaking change,
+it is shipped as a `2.x` and the prior minor is deprecated on npm.
+| Version | Status      | Security fixes |
+|---------|-------------|----------------|
+| 1.4.x   | current     | yes            |
+| 1.3.x   | superseded  | no             |
+| ≤ 1.2.x | superseded  | no             |
+## What this project considers a security issue
+In scope:
+- A maliciously crafted `.xlsx` that causes `xlsx-for-ai` to execute
+  arbitrary code, exfiltrate data outside the workbook, write outside the
+  current working directory, or hang indefinitely on input that should
+  parse or fail in bounded time.
+- A dependency in the production tree (`exceljs` and its parser stack,
+  `xlsx`, `papaparse`, `@formulajs/formulajs`, `gpt-tokenizer`) shipping
+  a known-bad version through `xlsx-for-ai`'s lockfile.
+- An npm-publish vector — a re-published version of any production dep
+  with bytes that differ from the lockfile's pinned integrity hash.
+Out of scope:
+- Bugs in the AI agent that *consumes* the output. We dump bytes; we do
+  not vouch for what an LLM does with them.
+- Performance issues on legitimate workbooks that happen to be very
+  large. File a normal issue.
+- Vulnerabilities in dev-only dependencies that cannot be reached from
+  the published package surface (`files` in `package.json` controls
+  what ships).
+## How this is enforced
+Three documents and two CI workflows do the work:
+- `docs/INTEGRITY_PINNING.md` — the integrity-pinning contract: lockfile
+  is source of truth, `npm ci --ignore-scripts` everywhere in CI, SRI
+  hashes verified on every install, signature verification required on
+  every dep-touching PR, daily drift sweep, audit allowlist policy.
+- `FORK_READINESS.md` — the runbook for an upstream npm-account
+  compromise (specifically, `@protobi/exceljs`, the soft fork we may
+  adopt for pivot-table support). Covers triggers, pre-positioning, and
+  the freeze/diagnose/decide/fork response.
+- `.github/audit-allowlist.json` — the enumerated set of triaged
+  high-or-critical advisories the audit gate intentionally suppresses,
+  with rationale and reassess dates. Adding an entry is a security-policy
+  change.
+- `.github/workflows/audit.yml` — `npm audit` on every PR + a daily
+  cron, gated against the allowlist.
+- `.github/workflows/upgrade-verify.yml` — `npm audit signatures` plus a
+  registry re-resolve check on every PR that touches `package.json` or
+  `package-lock.json`. Catches the silent-republish vector.
+If you are reporting a finding, naming which of these failed (or which
+should have caught it) is helpful but not required.
+## Threat model in one paragraph
+The high-value attack against `xlsx-for-ai` is supply chain: an attacker
+who compromises the npm publish credentials of `exceljs`, `@protobi/exceljs`,
+or any package in the `exceljs-family` group can ship arbitrary code that
+runs on every `npm install`. The next-highest is a malicious workbook
+that leverages a parser bug in that same stack. We do not try to defend
+against the OS being compromised, nor against the user's AI agent acting
+on the output. Everything in `INTEGRITY_PINNING.md` and `FORK_READINESS.md`
+exists to detect or recover from supply-chain compromise; everything in
+the audit workflows exists to catch parser CVEs the moment they are
+disclosed.
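
To make the "pinned integrity hash" idea in SECURITY.md concrete, here is a minimal sketch of comparing a tarball's bytes against a lockfile-style `integrity` string of the form `<algorithm>-<base64 digest>`. npm performs this check itself during install; `matchesIntegrity` is a hypothetical helper for illustration, and it assumes a single hash per string (real lockfile entries may list several, separated by spaces).

```javascript
const crypto = require('crypto');

// Hypothetical helper: does `tarballBytes` hash to the pinned value?
// Assumption: `integrity` holds exactly one "<algorithm>-<base64>" entry.
function matchesIntegrity(tarballBytes, integrity) {
  const dash = integrity.indexOf('-');
  if (dash < 0) return false;
  const algorithm = integrity.slice(0, dash); // e.g. "sha512"
  const expected = integrity.slice(dash + 1);
  let actual;
  try {
    actual = crypto.createHash(algorithm).update(tarballBytes).digest('base64');
  } catch (_) {
    return false; // algorithm name not supported by this Node build
  }
  return actual === expected;
}

// Made-up bytes standing in for a package tarball.
const bytes = Buffer.from('example tarball bytes');
const pinned = 'sha512-' + crypto.createHash('sha512').update(bytes).digest('base64');

console.log(matchesIntegrity(bytes, pinned));                      // true
console.log(matchesIntegrity(Buffer.from('republished'), pinned)); // false
```

A re-published version with different bytes fails this comparison even when its version number is unchanged, which is the silent-republish vector the policy describes.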

package/index.js CHANGED

@@ -21,7 +21,12 @@ if (!process.env.XLSX_FOR_AI_RESPAWNED) {
 const path = require('path');
 const fs   = require('fs');
-const ExcelJS = require('exceljs');
+// All xlsx-engine access goes through the engine abstraction in lib/engine.js
+// — never require the underlying engine directly. To swap engines (fork,
+// different library, server-side service), replace lib/engine.js. Nothing
+// else changes. Current engine: @protobi/exceljs (drop-in fork of exceljs
+// with active maintenance + preservation patches; see ROADMAP for rationale).
+const engine = require('./lib/engine');
 // Lazy-load heavy deps only when their feature is used (keeps cold start fast
 // for the common --stdout / --json / --md path that needs none of them).
@@ -53,6 +58,8 @@ function parseArgs(argv) {
     maxRows: null,
     maxCols: null,
     maxTokens: null,
+    reportBug: null,
+    exportRedactedWorkbook: null,
     help: false,
   };
   let i = 0;
@@ -73,6 +80,8 @@ function parseArgs(argv) {
     else if (arg === '--max-rows')    { opts.maxRows = parseInt(argv[++i], 10); }
     else if (arg === '--max-cols')    { opts.maxCols = parseInt(argv[++i], 10); }
     else if (arg === '--max-tokens')  { opts.maxTokens = parseInt(argv[++i], 10); }
+    else if (arg === '--report-bug')              { opts.reportBug = argv[++i]; }
+    else if (arg === '--export-redacted-workbook'){ opts.exportRedactedWorkbook = argv[++i]; }
     else if (arg === '-h' || arg === '--help') opts.help = true;
     else                                opts.positional.push(arg);
     i++;
@@ -119,6 +128,19 @@ Other modes:
   --stream          Streaming reader for huge .xlsx files (>100MB);
                     emits row-by-row, drops some sheet metadata
+Bug reporting (privacy-by-design — no data leaves your machine):
+  --report-bug <input.xlsx>
+                    Generate xlsx-for-ai-bugreport-<ISO>.json describing
+                    the workbook's structure (sheet count/shape, feature
+                    inventory, env). Contains zero cell values, formulas,
+                    or named-range targets. Attach to a GitHub issue.
+  --export-redacted-workbook <input.xlsx>
+                    Produce <input>-redacted.xlsx with every cell value
+                    replaced by a typed placeholder (numbers→0,
+                    strings→"x", bools→false, dates→1900-01-01). Formulas,
+                    structure, styles, named ranges preserved. Optional
+                    attachment for hard-to-repro bugs.
 Misc:
   -h, --help        Show this help
@@ -353,7 +375,7 @@ function dumpSheet(ws, wb, opts = {}) {
     lines.push(`(${ws.columnCount - endCol} more columns truncated)`);
   }
-  const merges = Object.keys(ws._merges || {});
+  const merges = (ws.model && Array.isArray(ws.model.merges)) ? ws.model.merges : [];
   if (merges.length) lines.push(`Merged: ${merges.join(', ')}`);
   if (ws.autoFilter) {
@@ -413,7 +435,7 @@ function dumpSheet(ws, wb, opts = {}) {
       if (raw == null || raw === '') continue;
       const ref = `${colLetter(c)}${r}`;
       const tags = [];
-      if (cell.type === ExcelJS.ValueType.Formula && typeof raw === 'object') {
+      if (cell.type === engine.ValueType.Formula && typeof raw === 'object') {
         if (raw.formula) tags.push(`formula: =${raw.formula}`);
         else if (raw.sharedFormula) tags.push(`shared formula ref: ${raw.sharedFormula}`);
       }
@@ -480,7 +502,7 @@ function dumpSheetMarkdown(ws, wb, opts = {}) {
   meta.push(`Total: ${ws.rowCount} rows × ${ws.columnCount} cols`);
   const frozen = (ws.views || []).find(v => v.state === 'frozen');
   if (frozen) meta.push(`Frozen: row ${frozen.ySplit ?? 0}, col ${frozen.xSplit ?? 0}`);
-  const merges = Object.keys(ws._merges || {});
+  const merges = (ws.model && Array.isArray(ws.model.merges)) ? ws.model.merges : [];
   if (merges.length) meta.push(`Merged: ${merges.slice(0, 6).join(', ')}${merges.length > 6 ? ', ...' : ''}`);
   const namedRanges = getNamedRanges(wb, ws.name);
   if (namedRanges.length) meta.push(`Named ranges: ${namedRanges.map(n => n.name).join(', ')}`);
@@ -581,7 +603,8 @@ function dumpSheetJSON(ws, wb, opts = {}) {
     frozen: null,
     columns: [],
     hiddenColumns: [],
-    merges: Object.keys(ws._merges || {}),
+    hiddenRows: [],
+    merges: (ws.model && Array.isArray(ws.model.merges)) ? ws.model.merges.slice() : [],
     autoFilter: null,
     printArea: null,
     namedRanges: getNamedRanges(wb, ws.name),
@@ -652,6 +675,7 @@ function dumpSheetJSON(ws, wb, opts = {}) {
   for (let r = startRow; r <= endRow; r++) {
     const row = ws.getRow(r);
+    if (row.hidden) out.hiddenRows.push(r);
     for (let c = startCol; c <= endCol; c++) {
       const cell = row.getCell(c);
       const raw = cell.value;
@@ -902,12 +926,10 @@ function applyTokenBudget(text, maxTokens) {
 async function loadAnyWorkbook(filePath) {
   const ext = path.extname(filePath).toLowerCase();
   if (ext === '.xlsx') {
-    const wb = new ExcelJS.Workbook();
-    await wb.xlsx.readFile(filePath);
-    return wb;
+    return engine.loadWorkbook(filePath);
   }
   if (ext === '.csv' || ext === '.tsv') {
-    const wb = new ExcelJS.Workbook();
+    const wb = engine.createWorkbook();
     const ws = wb.addWorksheet(path.basename(filePath, ext));
     const text = fs.readFileSync(filePath, 'utf8');
     const papa = lazyPapa();
@@ -922,13 +944,13 @@ async function loadAnyWorkbook(filePath) {
   throw new Error(`Unsupported extension: ${ext}. Supported: .xlsx .xls .xlsb .ods .csv .tsv`);
 }
-// Read a non-xlsx spreadsheet via SheetJS, materialize into an ExcelJS
-// Workbook so the rest of the code (dump/markdown/json/sql/schema) works
-// unchanged. Loses some formatting; preserves values + formulas.
+// Read a non-xlsx spreadsheet via SheetJS, materialize into the engine's
+// workbook representation so the rest of the code (dump/markdown/json/sql/
+// schema) works unchanged. Loses some formatting; preserves values + formulas.
 function loadViaSheetJS(filePath) {
   const XLSX = lazyXlsx();
   const sheetJsWb = XLSX.readFile(filePath, { cellFormula: true, cellDates: true });
-  const wb = new ExcelJS.Workbook();
+  const wb = engine.createWorkbook();
   for (const name of sheetJsWb.SheetNames) {
     const sjsSheet = sheetJsWb.Sheets[name];
     const ws = wb.addWorksheet(name);
@@ -959,7 +981,7 @@ function loadViaSheetJS(filePath) {
 // ---------------------------------------------------------------------------
 async function streamDump(filePath, opts) {
-  const wb = new ExcelJS.stream.xlsx.WorkbookReader(filePath, {
+  const wb = engine.streamReader(filePath, {
     sharedStrings: 'cache',
     hyperlinks: 'ignore',
     worksheets: 'emit',
@@ -1287,7 +1309,7 @@ function applyNumberFormat(ws, ref, fmt) {
 }
 function buildWorkbook(spec) {
-  const wb = new ExcelJS.Workbook();
+  const wb = engine.createWorkbook();
   const warnings = []; // [{type, sheet, ref}, ...]
   function track(sheetName, ref, lossy) {
@@ -1382,6 +1404,17 @@ function buildWorkbook(spec) {
       }
     }
+    // Restore hidden-row state from --json round-trip. Without this, the
+    // `hiddenRows: [...]` field emitted on read is silently dropped on write,
+    // breaking the round-trip claim for fixtures like annotations.xlsx.
+    if (Array.isArray(sheet.hiddenRows)) {
+      for (const n of sheet.hiddenRows) {
+        if (Number.isInteger(n) && n >= 1) {
+          try { ws.getRow(n).hidden = true; } catch (_) {}
+        }
+      }
+    }
     if (Array.isArray(sheet.merges)) {
       for (const m of sheet.merges) {
         try { ws.mergeCells(m); } catch (_) {}
@@ -1644,7 +1677,7 @@ async function mainWrite(argv) {
   outPath = path.resolve(outPath);
   try {
-    await wb.xlsx.writeFile(outPath);
+    await engine.writeWorkbook(wb, outPath);
   } catch (e) {
     console.error(`Write error: ${e.message}`);
     process.exit(1);
@@ -1675,6 +1708,28 @@ async function main() {
   const opts = parseArgs(argv);
   if (opts.help) { printHelp(); process.exit(0); }
+  // Bug-report and redacted-workbook modes consume their input via the
+  // flag itself, so they bypass the normal positional / loader path.
+  if (opts.reportBug) {
+    const { generateBugReport, writeBugReport } = require('./lib/bugReport');
+    const inputPath = path.resolve(opts.reportBug);
+    const report = await generateBugReport(inputPath);
+    const outPath = writeBugReport(report, process.cwd());
+    console.log(outPath);
+    return;
+  }
+  if (opts.exportRedactedWorkbook) {
+    const { exportRedactedWorkbook } = require('./lib/redactWorkbook');
+    const inputPath = path.resolve(opts.exportRedactedWorkbook);
+    const ext = path.extname(inputPath);
+    const base = path.basename(inputPath, ext);
+    const outPath = path.join(path.dirname(inputPath), `${base}-redacted${ext}`);
+    await exportRedactedWorkbook(inputPath, outPath);
+    console.log(outPath);
+    return;
+  }
   if (opts.positional.length < 1) { printHelp(); process.exit(1); }
   const filePath = path.resolve(opts.positional[0]);
@@ -1864,11 +1919,50 @@ async function main() {
   }
 }
-main().catch((err) => {
-  const msg = err && err.message ? err.message : String(err);
-  console.error(msg);
-  if (/Invalid string length/i.test(msg)) {
-    console.error('Hint: this sheet renders to a text dump larger than V8\'s 512MB string limit. Try --max-rows N, --max-cols N, --max-tokens N, --range A1:..., or --stream.');
-  }
-  process.exit(1);
-});
+// Run as CLI when invoked directly. Skip when imported so tests can require
+// this module and exercise its internals without triggering main().
+if (require.main === module) {
+  main().catch((err) => {
+    const msg = err && err.message ? err.message : String(err);
+    console.error(msg);
+    if (/Invalid string length/i.test(msg)) {
+      console.error('Hint: this sheet renders to a text dump larger than V8\'s 512MB string limit. Try --max-rows N, --max-cols N, --max-tokens N, --range A1:..., or --stream.');
+    }
+    process.exit(1);
+  });
+}
+// Export internals for unit tests. Production CLI use never touches these
+// exports — this is only for `require('./index.js')` in test files.
+module.exports = {
+  // arg parsing
+  parseArgs,
+  parseWriteArgs,
+  // pure utilities
+  colLetter,
+  colNum,
+  parseRange,
+  isDefaultTextColor,
+  describeFill,
+  describeFont,
+  formatValue,
+  plainValue,
+  jsonValue,
+  describeNote,
+  escapeMd,
+  coerceMaybeDate,
+  coerceMarkdownValue,
+  // schema/format
+  inferType,
+  sqlIdent,
+  sqlVal,
+  // spec parsing
+  parseMarkdownSpec,
+  validateSpec,
+  buildCellValue,
+  // workbook builders
+  buildWorkbook,
+  trySimpleEval,
+  // budget
+  applyTokenBudget,
+};
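
The `hiddenRows` round-trip added in this file (collected in `dumpSheetJSON`, restored in `buildWorkbook`) can be sketched without any spreadsheet engine. Everything below is illustrative: `makeSheet`, `collectHiddenRows`, and `restoreHiddenRows` are hypothetical names, and the fake sheet models only the per-row `hidden` flag that the diff reads and writes via `getRow(n).hidden`.

```javascript
// Hypothetical stand-in for an engine worksheet: only `getRow(n).hidden`
// is modeled, which is all the round-trip needs.
function makeSheet() {
  const rows = new Map();
  return {
    getRow(n) {
      if (!rows.has(n)) rows.set(n, { hidden: false });
      return rows.get(n);
    },
  };
}

// Read side: collect 1-based indexes of hidden rows in [startRow, endRow].
function collectHiddenRows(ws, startRow, endRow) {
  const hiddenRows = [];
  for (let r = startRow; r <= endRow; r++) {
    if (ws.getRow(r).hidden) hiddenRows.push(r);
  }
  return hiddenRows;
}

// Write side: reapply the flags, skipping entries that are not row numbers.
function restoreHiddenRows(ws, hiddenRows) {
  if (!Array.isArray(hiddenRows)) return;
  for (const n of hiddenRows) {
    if (typeof n === 'number' && n >= 1) ws.getRow(n).hidden = true;
  }
}

const source = makeSheet();
source.getRow(3).hidden = true;
source.getRow(7).hidden = true;
const emitted = collectHiddenRows(source, 1, 10);

const rebuilt = makeSheet();
restoreHiddenRows(rebuilt, [...emitted, 'bogus', 0]); // invalid entries are ignored
console.log(collectHiddenRows(rebuilt, 1, 10)); // [ 3, 7 ]
```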

package/lib/bugReport.js ADDED

@@ -0,0 +1,251 @@
+// Bug-report generator for xlsx-for-ai.
+//
+// Produces a JSON blob describing the *structure* of an .xlsx workbook
+// (sheet count + shape, used-features inventory, env) with ZERO user
+// content (no cell values, no formulas, no shared strings, no
+// named-range formulas, no comment text). Designed to be safe for a
+// reporter to attach to a public GitHub issue.
+//
+// Implementation:
+//   1. Read the .xlsx as a ZIP via JSZip (already a transitive dep
+//      of exceljs). Walk the OOXML parts to detect features by
+//      filename pattern + targeted ContentType / relationship lookups.
+//   2. Use ExcelJS only for sheet shape (rowCount, columnCount),
+//      merge counts, and named-range *names* (not their refs/formulas).
+//
+// We deliberately avoid emitting anything sourced from cell text,
+// shared strings, or formula expressions. The bug-report consumer
+// should be able to grep the output for any user content and find none.
+const fs = require('fs');
+const path = require('path');
+const os = require('os');
+const JSZip = require('jszip');
+const ExcelJS = require('@protobi/exceljs');
+const PKG_VERSION = require('../package.json').version;
+// OOXML feature detectors. Each entry maps a feature key to a predicate
+// over the list of zip entry filenames. We choose names + content-type
+// matches that are stable across Excel versions.
+//
+// References:
+//   ECMA-376 part-1 (OOXML) section 18.x for sheet parts
+//   MS-OE376 for vendor extension parts
+const FEATURE_PATTERNS = [
+  // Pivot tables: xl/pivotTables/pivotTable*.xml + xl/pivotCache/*
+  { key: 'pivotTables',      test: (n) => /^xl\/pivotTables\/pivotTable\d+\.xml$/i.test(n) },
+  { key: 'pivotCaches',      test: (n) => /^xl\/pivotCache\/pivotCacheDefinition\d+\.xml$/i.test(n) },
+  // Charts (drawing-based + chartsheets)
+  { key: 'charts',           test: (n) => /^xl\/charts\/chart\d+\.xml$/i.test(n) },
+  { key: 'chartsheets',      test: (n) => /^xl\/chartsheets\/sheet\d+\.xml$/i.test(n) },
+  { key: 'drawings',         test: (n) => /^xl\/drawings\/drawing\d+\.xml$/i.test(n) },
+  // Threaded comments (modern; Office 365). Plain comments are detected separately.
+  { key: 'threadedComments', test: (n) => /^xl\/threadedComments\/threadedComment\d+\.xml$/i.test(n) },
+  { key: 'comments',         test: (n) => /^xl\/comments\d+\.xml$/i.test(n) },
+  { key: 'persons',          test: (n) => /^xl\/persons\/person\.xml$/i.test(n) },
+  // Sensitivity labels (MIP). docMetadata folder + LabelInfo part.
+  { key: 'sensitivityLabel', test: (n) => /^docMetadata\/LabelInfo\.xml$/i.test(n) },
+  // Linked / rich data types (the "Stocks", "Geography" data types).
+  { key: 'richValueData',    test: (n) => /^xl\/richData\/rdRichValues\.xml$/i.test(n) },
+  { key: 'richValueRel',     test: (n) => /^xl\/richData\/richValueRel\.xml$/i.test(n) },
+  // Power Query / Data Model. Only query-table parts count here; customXml
+  // parts are inventoried under their own `customXml` key further down.
+  { key: 'powerQuery',       test: (n) => /^xl\/queryTables\/queryTable\d+\.xml$/i.test(n) },
+  { key: 'dataModel',        test: (n) => /^xl\/model\/item\.data$/i.test(n) },
+  { key: 'connections',      test: (n) => /^xl\/connections\.xml$/i.test(n) },
+  // Slicers / Timelines (modern PivotTable controls)
+  { key: 'slicers',          test: (n) => /^xl\/slicers\/slicer\d+\.xml$/i.test(n) },
+  { key: 'slicerCaches',     test: (n) => /^xl\/slicerCaches\/slicerCache\d+\.xml$/i.test(n) },
+  { key: 'timelines',        test: (n) => /^xl\/timelines\/timeline\d+\.xml$/i.test(n) },
+  { key: 'timelineCaches',   test: (n) => /^xl\/timelineCaches\/timelineCache\d+\.xml$/i.test(n) },
+  // Tables (Excel ListObjects)
+  { key: 'tables',           test: (n) => /^xl\/tables\/table\d+\.xml$/i.test(n) },
+  // External links / workbook references
+  { key: 'externalLinks',    test: (n) => /^xl\/externalLinks\/externalLink\d+\.xml$/i.test(n) },
+  // Macros / VBA
+  { key: 'vbaProject',       test: (n) => /^xl\/vbaProject\.bin$/i.test(n) },
+  // Custom XML parts (often used by enterprise add-ins / SharePoint)
+  { key: 'customXml',        test: (n) => /^customXml\/item\d+\.xml$/i.test(n) },
+  // Embedded objects (OLE)
+  { key: 'embeddings',       test: (n) => /^xl\/embeddings\/.+/i.test(n) },
+  // Theme + custom properties (low signal but cheap)
+  { key: 'customProps',      test: (n) => /^docProps\/custom\.xml$/i.test(n) },
+];
+// Detect dynamic arrays + sparklines from sheet XML. These don't have
+// dedicated parts — they're attributes on cell / extLst inside sheetN.xml.
+// We do a coarse string scan (no value extraction) just to flag presence.
+//
+// Dynamic arrays: <f t="array" ...> with <ext> CT_ExtensionList for
+//   x14ac:cm, or modern: presence of <ext> with namespace x17 + cm attr.
+// Sparklines: <ext><x14:sparklineGroups> inside <extLst>.
+async function detectInSheetFeatures(zip, sheetNames) {
+  const flags = { dynamicArrays: false, sparklines: false, conditionalFormatting: false };
+  for (const name of sheetNames) {
+    const file = zip.file(name);
+    if (!file) continue;
+    const xml = await file.async('string');
+    // Coarse but conservative: we match markup (tag and attribute patterns) only and never extract values.
+    if (!flags.dynamicArrays && /\bcm="\d+"/.test(xml))           flags.dynamicArrays = true;
+    if (!flags.dynamicArrays && /<f[^>]*\bt="array"/.test(xml))   flags.dynamicArrays = true;
+    if (!flags.sparklines    && /sparklineGroup/.test(xml))       flags.sparklines    = true;
+    if (!flags.conditionalFormatting && /<conditionalFormatting/.test(xml))
+      flags.conditionalFormatting = true;
+  }
+  return flags;
+}
+function inventoryFeatures(filenames) {
+  const out = {};
+  for (const { key, test } of FEATURE_PATTERNS) {
+    const count = filenames.filter(test).length;
+    if (count > 0) out[key] = count;
+  }
+  return out;
+}
+// Placeholder: intended to resolve worksheet parts via the workbook
+// relationships (xl/_rels/workbook.xml.rels) without reading any user
+// content. Not implemented yet: the rels content cannot be read
+// synchronously here, so this stub always returns an empty list.
+// generateBugReport derives the sheet parts from filename patterns instead.
+function listSheetPartNames(zip) {
+  const out = [];
+  const relsFile = zip.file('xl/_rels/workbook.xml.rels');
+  if (!relsFile) return out;
+  // Each Relationship looks like:
+  //   <Relationship Id="rId1" Type="..." Target="worksheets/sheet1.xml"/>
+  // A lightweight, structural-only regex over the Targets would go here
+  // once this helper is made async.
+  return out;
+}
+async function generateBugReport(filePath) {
+  if (!fs.existsSync(filePath)) {
+    throw new Error(`File not found: ${filePath}`);
+  }
+  const ext = path.extname(filePath).toLowerCase();
+  if (ext !== '.xlsx' && ext !== '.xlsm') {
+    throw new Error(`--report-bug only supports .xlsx / .xlsm (got ${ext})`);
+  }
+  const stat = fs.statSync(filePath);
+  const buf = fs.readFileSync(filePath);
+  const zip = await JSZip.loadAsync(buf);
+  const filenames = Object.keys(zip.files).filter((n) => !zip.files[n].dir);
+  const features = inventoryFeatures(filenames);
+  // Sheet parts list — derived from filename pattern, not content.
+  const sheetParts = filenames.filter((n) => /^xl\/worksheets\/sheet\d+\.xml$/i.test(n));
+  // In-sheet feature flags (string scan, no extraction).
+  const inSheet = await detectInSheetFeatures(zip, sheetParts);
+  if (inSheet.dynamicArrays)         features.dynamicArrays         = true;
+  if (inSheet.sparklines)            features.sparklines            = true;
+  if (inSheet.conditionalFormatting) features.conditionalFormatting = true;
+  // Use ExcelJS for sheet shape, merges, and *names* of named ranges.
+  // We never read cell values or named-range formulas — only enumerate.
+  let sheetCount = 0;
+  let mergedTotal = 0;
+  let namedRangesCount = 0;
+  let definedNames = [];
+  const perSheet = [];
+  let exceljsError = null;
+  try {
+    const wb = new ExcelJS.Workbook();
+    await wb.xlsx.readFile(filePath);
+    sheetCount = wb.worksheets.length;
+    for (const ws of wb.worksheets) {
+      const merges = ws.model && ws.model.merges ? ws.model.merges.length : 0;
+      mergedTotal += merges;
+      perSheet.push({
+        index: ws.id,
+        rows: ws.rowCount || 0,
+        cols: ws.columnCount || 0,
+        merges,
+        hidden: ws.state && ws.state !== 'visible' ? ws.state : null,
+      });
+    }
+    // Defined names — names ONLY (deliberately drop ranges/formulas).
+    const dnModel = wb.definedNames && wb.definedNames.model;
+    if (Array.isArray(dnModel)) {
+      namedRangesCount = dnModel.length;
+      definedNames = dnModel
+        .map((d) => (d && typeof d.name === 'string' ? d.name : null))
+        .filter(Boolean);
+    }
+  } catch (err) {
+    // ExcelJS may fail on edge-case files; report the error class but
+    // don't include the message verbatim (could leak a path inside the
+    // workbook). Sheet count falls back to part count.
+    exceljsError = err && err.name ? err.name : 'Error';
+    sheetCount = sheetParts.length;
+  }
+  const report = {
+    schema: 'xlsx-for-ai/bug-report/v1',
+    generatedAt: new Date().toISOString(),
+    tool: {
+      name: 'xlsx-for-ai',
+      version: PKG_VERSION,
+    },
+    runtime: {
+      node: process.version,
+      platform: process.platform, // e.g. 'darwin', 'linux', 'win32'
+      arch: process.arch,         // e.g. 'arm64', 'x64'
+      osRelease: os.release(),
+    },
+    file: {
+      // ONLY the basename + size — never the absolute path (could leak
+      // user/dir names). The reporter knows what file they ran it on.
+      basename: path.basename(filePath),
+      ext,
+      sizeBytes: stat.size,
+    },
+    workbook: {
+      sheetCount,
+      mergedRangeCountTotal: mergedTotal,
+      namedRangesCount,
+      // Names only — Excel defined-name *names* are user-chosen labels
+      // ("Totals", "TaxRate"). We emit them because they're often the
+      // hint a maintainer needs. If a reporter considers their names
+      // sensitive, they should sanitize before attaching.
+      definedNames,
+      perSheet,
+      featuresPresent: features,
+    },
+    notes: [
+      'This report contains zero cell values, formulas, shared strings, named-range formulas, or comment text.',
+      'Defined-name *labels* are included (e.g. "Totals") but their target ranges are not.',
+      'Generated with --report-bug. Attach to a GitHub issue at https://github.com/senoff/xlsx-for-ai/issues',
+    ],
+  };
+  if (exceljsError) {
+    report.workbook.exceljsLoadError = exceljsError;
+  }
+  return report;
+}
+function writeBugReport(report, cwd) {
+  const ts = report.generatedAt.replace(/[:.]/g, '-');
+  const outPath = path.join(cwd, `xlsx-for-ai-bugreport-${ts}.json`);
+  fs.writeFileSync(outPath, JSON.stringify(report, null, 2), 'utf8');
+  return outPath;
+}
+module.exports = { generateBugReport, writeBugReport };
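
As a quick illustration of how `inventoryFeatures` and `writeBugReport` behave, the sketch below reuses three of the detectors from `FEATURE_PATTERNS` and the same timestamp sanitizing, then runs them on made-up ZIP entry names. `PATTERNS`, `inventory`, `reportFileName`, and the sample entries are illustrative stand-ins, not part of the package.

```javascript
// A small subset of the detectors, copied from FEATURE_PATTERNS above.
const PATTERNS = [
  { key: 'pivotTables', test: (n) => /^xl\/pivotTables\/pivotTable\d+\.xml$/i.test(n) },
  { key: 'charts',      test: (n) => /^xl\/charts\/chart\d+\.xml$/i.test(n) },
  { key: 'vbaProject',  test: (n) => /^xl\/vbaProject\.bin$/i.test(n) },
];

// Same counting logic as inventoryFeatures: keys with zero matches are omitted.
function inventory(filenames) {
  const out = {};
  for (const { key, test } of PATTERNS) {
    const count = filenames.filter(test).length;
    if (count > 0) out[key] = count;
  }
  return out;
}

// Same sanitizing as writeBugReport: ':' and '.' in the ISO timestamp become '-'.
function reportFileName(isoTimestamp) {
  return `xlsx-for-ai-bugreport-${isoTimestamp.replace(/[:.]/g, '-')}.json`;
}

// Made-up ZIP entry names, for illustration only.
const entries = [
  'xl/workbook.xml',
  'xl/worksheets/sheet1.xml',
  'xl/pivotTables/pivotTable1.xml',
  'xl/pivotTables/pivotTable2.xml',
  'xl/vbaProject.bin',
];

console.log(inventory(entries));
// { pivotTables: 2, vbaProject: 1 }
console.log(reportFileName('2024-05-01T12:30:45.123Z'));
// xlsx-for-ai-bugreport-2024-05-01T12-30-45-123Z.json
```

Only entry names are inspected, so the inventory never touches cell content; the sanitized timestamp keeps the report name valid on filesystems that reject `:`.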

package/lib/engine.js ADDED

@@ -0,0 +1,65 @@
+// Engine abstraction layer.
+//
+// xlsx-for-ai's logic shouldn't depend directly on ExcelJS. This module is
+// the *seam* between xlsx-for-ai's code and the underlying xlsx engine:
+// today @protobi/exceljs (a fork of ExcelJS), tomorrow possibly another
+// fork, a from-scratch JS port, xlsx-populate, or SheetJS Pro server-side.
+//
+// The exposed surface is intentionally narrow: file I/O entry points
+// (load, stream, write), workbook construction, and the small set of
+// ExcelJS constants the rest of the codebase uses. The in-memory workbook
+// representation flows through this layer unchanged — at this stage the
+// goal is to centralize *which engine produces the workbook objects*, not
+// to define a fully-engine-agnostic in-memory model.
+//
+// To swap engines, replace this file. xlsx-for-ai's other modules import
+// only from here; nothing else has a direct require('@protobi/exceljs').
+'use strict';
+const ExcelJS = require('@protobi/exceljs');
+class ExcelJSEngine {
+  /** Engine identifier — useful for diagnostics. */
+  get name() { return 'exceljs'; }
+  get version() {
+    // Read the version of the package that is actually required above.
+    try { return require('@protobi/exceljs/package.json').version; } catch (_) { return 'unknown'; }
+  }
+  /**
+   * Load a workbook from a file path. Returns the engine's workbook object
+   * (currently an ExcelJS Workbook).
+   */
+  async loadWorkbook(filePath) {
+    const wb = new ExcelJS.Workbook();
+    await wb.xlsx.readFile(filePath);
+    return wb;
+  }
+  /** Construct an empty workbook (used by write mode and CSV/TSV/legacy load paths). */
+  createWorkbook() {
+    return new ExcelJS.Workbook();
+  }
+  /** Write a workbook to disk. */
+  async writeWorkbook(wb, filePath) {
+    return wb.xlsx.writeFile(filePath);
+  }
+  /** Streaming reader for huge files. Returns an ExcelJS WorkbookReader, which is async-iterable over worksheet readers. */
+  streamReader(filePath, opts) {
+    return new ExcelJS.stream.xlsx.WorkbookReader(filePath, opts);
+  }
+  /**
+   * Constants the rest of the codebase needs. Keeping these here means
+   * the rest of xlsx-for-ai never imports ExcelJS directly — only from
+   * the engine.
+   */
+  get ValueType() { return ExcelJS.ValueType; }
+}
+// Singleton: the rest of the codebase imports this module and gets the
+// active engine. To swap engines, replace `module.exports` with a different
+// engine instance that implements the same surface.
+module.exports = new ExcelJSEngine();
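
Since swapping engines means replacing the exported instance with one that "implements the same surface", a small check of that surface can catch an incomplete replacement early. The sketch below is an assumption-laden illustration: the member names come from `ExcelJSEngine` above, while `missingEngineMembers`, `StubEngine`, and the placeholder `ValueType` values are hypothetical and not part of the package.

```javascript
// Member names taken from ExcelJSEngine; everything else here is hypothetical.
const REQUIRED_METHODS = ['loadWorkbook', 'createWorkbook', 'writeWorkbook', 'streamReader'];
const REQUIRED_GETTERS = ['name', 'version', 'ValueType'];

// Returns the names of surface members that the candidate engine lacks.
function missingEngineMembers(engine) {
  const missing = [];
  for (const m of REQUIRED_METHODS) {
    if (typeof engine[m] !== 'function') missing.push(m);
  }
  for (const g of REQUIRED_GETTERS) {
    if (engine[g] === undefined) missing.push(g);
  }
  return missing;
}

// A stand-in engine with no real xlsx support, used only to exercise the check.
// It deliberately omits streamReader.
class StubEngine {
  get name() { return 'stub'; }
  get version() { return '0.0.0'; }
  get ValueType() { return { Number: 'number', String: 'string' }; } // placeholder values
  async loadWorkbook(filePath) { return { source: filePath, worksheets: [] }; }
  createWorkbook() { return { worksheets: [] }; }
  async writeWorkbook(wb, filePath) { return undefined; }
}

console.log(missingEngineMembers(new StubEngine()));
// [ 'streamReader' ]
```

A check like this only verifies that the members exist. It says nothing about whether the workbook objects a replacement engine returns behave like ExcelJS workbooks, which the header comment notes is still assumed by the rest of the codebase.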

package/lib/redactWorkbook.js ADDED

@@ -0,0 +1,183 @@
+// Redacted-workbook exporter.
+//
+// Reads an .xlsx as a ZIP, mutates only the *value* portions of each
+// cell (and the shared-string + comment payloads) to typed placeholders,
+// then repacks. Everything else — formulas, styles, sheet names, named
+// ranges, feature parts (pivots / charts / queries / vba) — is passed
+// through byte-for-byte where possible.
+//
+// Why ZIP-passthrough rather than ExcelJS round-trip:
+//   ExcelJS write() is lossy for many features (pivots, slicers,
+//   queries, conditional formatting, sparklines, threaded comments).
+//   For a bug-repro artifact we want maximum structural fidelity, so
+//   we operate at the XML-fragment level inside the existing ZIP.
+//
+// Placeholders:
+//   numbers   → 0
+//   strings   → "x"
+//   booleans  → false (0)
+//   dates     → 1900-01-01 (numeric date cells render to default date
+//                under their existing format; t="d" cells get the
+//                literal ISO string)
+//   errors    → preserved as-is
+//
+// Comments and shared strings are also rewritten to "x" because they
+// contain user text. Defined-name formulas are preserved (per spec).
+const fs = require('fs');
+const path = require('path');
+const JSZip = require('jszip');
+// Match each <c ...>...</c> or self-closing <c .../> element.
+// We deliberately restrict to a single regex pass per sheet — this is
+// fragile only if a cell contains a nested <c> in user-supplied XML,
+// which OOXML cells do not.
+const CELL_RE = /<c\b([^>]*?)(\/>|>([\s\S]*?)<\/c>)/g;
+// Cell type attribute extractor.
+function getAttr(attrs, name) {
+  const m = new RegExp(`\\b${name}="([^"]*)"`).exec(attrs);
+  return m ? m[1] : null;
+}
+function setAttr(attrs, name, value) {
+  if (new RegExp(`\\b${name}="`).test(attrs)) {
+    return attrs.replace(new RegExp(`\\b${name}="[^"]*"`), `${name}="${value}"`);
+  }
+  return `${attrs} ${name}="${value}"`;
+}
+function removeAttr(attrs, name) {
+  return attrs.replace(new RegExp(`\\s*\\b${name}="[^"]*"`), '');
+}
+// Extract first <f ...>...</f> or <f .../> from a cell body. Preserve verbatim.
+const F_RE = /<f\b[^>]*(?:\/>|>[\s\S]*?<\/f>)/;
+function redactCell(match, attrs, selfOrBody, body) {
+  // Self-closing <c r="A1"/> — empty cell, nothing to redact.
+  if (selfOrBody === '/>') return match;
+  const t = getAttr(attrs, 't');
+  const fMatch = body.match(F_RE);
+  const formulaXml = fMatch ? fMatch[0] : '';
+  // Errors: preserve the value as-is. Cell type is "e".
+  if (t === 'e') {
+    return match;
+  }
+  // Inline string: rebuild as <is><t>x</t></is>.
+  if (t === 'inlineStr') {
+    return `<c${attrs}>${formulaXml}<is><t>x</t></is></c>`;
+  }
+  // Shared string: convert to inline string so we don't depend on the
+  // sst index meaning anything. (We also rewrite sst payloads to "x"
+  // for defense-in-depth, but this avoids index-collision worries.)
+  if (t === 's') {
+    let newAttrs = setAttr(attrs, 't', 'inlineStr');
+    return `<c${newAttrs}>${formulaXml}<is><t>x</t></is></c>`;
+  }
+  // Formula returning a literal string.
+  if (t === 'str') {
+    return `<c${attrs}>${formulaXml}<v>x</v></c>`;
+  }
+  // Boolean → false (0).
+  if (t === 'b') {
+    return `<c${attrs}>${formulaXml}<v>0</v></c>`;
+  }
+  // ISO-date typed cell.
+  if (t === 'd') {
+    return `<c${attrs}>${formulaXml}<v>1900-01-01</v></c>`;
+  }
+  // Default = number (no t attribute, or t="n"). Whether it's a date
+  // is encoded in the *style* (numFmt), not the cell type. Replacing
+  // the numeric value with 0 makes a date-styled cell render as serial
+  // 0 (1900-01-00 in the 1900 date system, 1904-01-01 in the 1904
+  // system), which is the documented placeholder.
+  return `<c${attrs}>${formulaXml}<v>0</v></c>`;
+}
+function redactSheetXml(xml) {
+  return xml.replace(CELL_RE, redactCell);
+}
+// Shared strings: every <t>...</t> payload becomes "x". Preserves the
+// number of unique strings + their indices so cells that happen to
+// reference sst still resolve to a valid (redacted) string.
+function redactSharedStringsXml(xml) {
+  // Replace inner text of every <t> element (handles <t>x</t> and
+  // <t xml:space="preserve">x</t>). Empty payloads stay empty, and a
+  // self-closing <t/> is skipped so it cannot swallow the next element.
+  return xml.replace(/(<t\b[^>\/]*>)([\s\S]*?)(<\/t>)/g, (m, open, payload, close) => {
+    return open + (payload === '' ? '' : 'x') + close;
+  });
+}
+// Comments: <comment><text><r>...<t>USER TEXT</t></r></text></comment>
+// Replace every <t> payload with "x"; a self-closing <t/> is skipped.
+function redactCommentsXml(xml) {
+  return xml.replace(/(<t\b[^>\/]*>)([\s\S]*?)(<\/t>)/g, (m, open, payload, close) => {
+    return open + (payload === '' ? '' : 'x') + close;
+  });
+}
+// Threaded comments: <threadedComment ... text="USER TEXT" .../>
+// Excel encodes the body as an attribute — must redact in place.
+function redactThreadedCommentsXml(xml) {
+  return xml.replace(/\btext="[^"]*"/g, 'text="x"');
+}
+async function exportRedactedWorkbook(inputPath, outputPath) {
+  if (!fs.existsSync(inputPath)) {
+    throw new Error(`File not found: ${inputPath}`);
+  }
+  const ext = path.extname(inputPath).toLowerCase();
+  if (ext !== '.xlsx' && ext !== '.xlsm') {
+    throw new Error(`--export-redacted-workbook only supports .xlsx / .xlsm (got ${ext})`);
+  }
+  const buf = fs.readFileSync(inputPath);
+  const zip = await JSZip.loadAsync(buf);
+  const filenames = Object.keys(zip.files).filter((n) => !zip.files[n].dir);
+  for (const name of filenames) {
+    const file = zip.file(name);
+    if (!file || file.dir) continue;
+    if (/^xl\/worksheets\/sheet\d+\.xml$/i.test(name)) {
+      const xml = await file.async('string');
+      zip.file(name, redactSheetXml(xml));
+    } else if (/^xl\/sharedStrings\.xml$/i.test(name)) {
+      const xml = await file.async('string');
+      zip.file(name, redactSharedStringsXml(xml));
+    } else if (/^xl\/comments\d+\.xml$/i.test(name)) {
+      const xml = await file.async('string');
+      zip.file(name, redactCommentsXml(xml));
+    } else if (/^xl\/threadedComments\/threadedComment\d+\.xml$/i.test(name)) {
+      const xml = await file.async('string');
+      zip.file(name, redactThreadedCommentsXml(xml));
+    }
+    // All other parts pass through untouched.
+  }
+  // Recompress every part with DEFLATE at level 6.
+  const out = await zip.generateAsync({
+    type: 'nodebuffer',
+    compression: 'DEFLATE',
+    compressionOptions: { level: 6 },
+    mimeType: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
+  });
+  fs.writeFileSync(outputPath, out);
+  return outputPath;
+}
+module.exports = {
+  exportRedactedWorkbook,
+  // exported for unit testing
+  _redactSheetXml: redactSheetXml,
+  _redactSharedStringsXml: redactSharedStringsXml,
+};
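
To make the shared-string redaction concrete, here is a standalone sketch in the spirit of `redactSharedStringsXml`, run against a small hand-written `sharedStrings.xml` fragment. The sample XML and the local function name `redactTextPayloads` are made up for illustration. The pattern used here excludes `/` from the opening tag, so a self-closing `<t/>` is left alone rather than treated as the start of an element.

```javascript
// Standalone sketch: replace the text payload of every <t>...</t> with "x".
// Empty payloads stay empty; self-closing <t/> elements are not matched.
function redactTextPayloads(xml) {
  return xml.replace(/(<t\b[^>\/]*>)([\s\S]*?)(<\/t>)/g, (m, open, payload, close) => {
    return open + (payload === '' ? '' : 'x') + close;
  });
}

// Hypothetical input fragment with three shared strings.
const sst =
  '<sst>' +
  '<si><t>Acme Corp</t></si>' +
  '<si><t xml:space="preserve"> Q3 revenue </t></si>' +
  '<si><t/></si>' +
  '</sst>';

const redacted = redactTextPayloads(sst);
console.log(redacted);
// <sst><si><t>x</t></si><si><t xml:space="preserve">x</t></si><si><t/></si></sst>
```

The number and order of `<si>` entries is unchanged, so any cell that still refers to a shared-string index resolves to a valid, redacted entry.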

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "xlsx-for-ai",
-  "version": "1.4.3",
+  "version": "1.5.0",
   "description": "CLI that converts .xlsx files into rich text or JSON dumps that AI coding agents (Claude, Cursor, Copilot, ChatGPT, etc.) can read — preserving values, formulas, formatting, colors, column widths, frozen panes, named ranges, tables, and more.",
   "main": "index.js",
   "bin": {
@@ -9,11 +9,16 @@
   },
   "files": [
     "index.js",
+    "lib",
     "cursor-rule-template",
     "README.md",
     "WHY.md",
+    "SECURITY.md",
     "LICENSE"
   ],
+  "scripts": {
+    "test": "node --test test/round-trip.test.js test/output-matrix.test.js test/unit/*.test.js"
+  },
   "keywords": [
     "xlsx",
     "excel",
@@ -43,7 +48,7 @@
   },
   "dependencies": {
     "@formulajs/formulajs": "^4.6.0",
-    "exceljs": "^4.4.0",
+    "@protobi/exceljs": "^4.4.0-protobi.9",
     "gpt-tokenizer": "^3.4.0",
     "papaparse": "^5.5.3",
     "xlsx": "^0.18.5"
@@ -51,9 +56,6 @@
   "devDependencies": {
     "patch-package": "^8.0.1"
   },
-  "scripts": {
-    "postinstall": "patch-package"
-  },
   "overrides": {
     "glob": "^13.0.0",
     "rimraf": "^5.0.10",