jaz-clio 4.34.4 → 4.34.6
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/assets/skills/api/SKILL.md +3 -2
- package/assets/skills/api/references/endpoints.md +0 -34
- package/assets/skills/cli/SKILL.md +1 -1
- package/assets/skills/cli/references/command-catalog.md +0 -1
- package/assets/skills/cli/references/common-workflows.md +0 -3
- package/assets/skills/conversion/SKILL.md +1 -1
- package/assets/skills/jobs/SKILL.md +1 -1
- package/assets/skills/transaction-recipes/SKILL.md +1 -1
- package/dist/commands/magic.js +15 -245
- package/dist/commands/mcp.js +68 -29
- package/dist/core/api/magic.js +2 -6
- package/dist/core/registry/tools.js +2 -0
- package/package.json +1 -2
- package/dist/core/pdf/detect.js +0 -344
- package/dist/core/pdf/index.js +0 -8
- package/dist/core/pdf/split.js +0 -81
- package/dist/core/pdf/types.js +0 -4

package/assets/skills/api/SKILL.md
CHANGED

@@ -1,6 +1,6 @@
 ---
 name: jaz-api
-version: 4.34.4
+version: 4.34.6
 description: >-
   Use this skill whenever you call, debug, or review code that touches the Jaz
   REST API. Covers field names, response shapes, 117 production gotchas, error
@@ -150,7 +150,8 @@ You are working with the **Jaz REST API** — the accounting platform backend. A
 ### Jaz Magic — Extraction & Autofill
 57. **When the user starts from an attachment, always use Jaz Magic** — if the input is a PDF, JPG, or any document image (invoice, bill, receipt), the correct path is `POST /magic/createBusinessTransactionFromAttachment`. Do NOT manually construct a `POST /invoices` or `POST /bills` payload from an attachment — Jaz Magic handles the entire extraction-and-autofill pipeline server-side: OCR, line item detection, contact matching, CoA auto-mapping via ML learning, and draft creation with all fields pre-filled. Only use `POST /invoices` or `POST /bills` when building transactions from structured data (JSON, CSV, database rows) where the fields are already known.
 58. **Two upload modes with different content types** — `sourceType: "FILE"` requires **multipart/form-data** with `sourceFile` blob (JSON body fails with 400 "sourceFile is a required field"). `sourceType: "URL"` accepts **application/json** with `sourceURL` string. The OAS only documents URL mode — FILE mode (the common case) is undocumented.
-59. **Three required fields**: `sourceFile` (multipart blob — NOT `file`), `businessTransactionType` (`"INVOICE"`, `"BILL"`, `"CUSTOMER_CREDIT_NOTE"`, or `"SUPPLIER_CREDIT_NOTE"` — `EXPENSE` rejected), `sourceType` (`"FILE"` or `"URL"`). All
+59. **Three required fields + one optional**: `sourceFile` (multipart blob — NOT `file`), `businessTransactionType` (`"INVOICE"`, `"BILL"`, `"CUSTOMER_CREDIT_NOTE"`, or `"SUPPLIER_CREDIT_NOTE"` — `EXPENSE` rejected), `sourceType` (`"FILE"` or `"URL"`). Optional: `uploadMode` (`"SEPARATE"` default, or `"MERGED"` for a single PDF containing multiple documents — the backend splits it via boundary detection before extraction). All required fields are validated server-side. **CRITICAL: multipart form field names are camelCase** — `businessTransactionType`, `sourceType`, `sourceFile`, `uploadMode`, NOT snake_case. Using `business_transaction_type` returns 422 "businessTransactionType is a required field". The File blob must include a filename and correct MIME type (e.g. `application/pdf`, `image/jpeg`) — bare `application/octet-stream` blobs are rejected with 400 "Invalid file type".
+59a. **MERGED upload workflow tracking** — When `uploadMode: "MERGED"`, the upload response `workflowResourceId` is a **parent** tracking ID. The backend splits the PDF, then creates **child** workflows for each split page — these child IDs appear in `POST /magic/workflows/search` (by fileName or createdAt), NOT the parent ID. To track MERGED progress, search by `fileName` rather than the parent `workflowResourceId`.
 60. **Response maps transaction types**: Request `INVOICE` → response `SALE`. Request `BILL` → response `PURCHASE`. Request `CUSTOMER_CREDIT_NOTE` → response `SALE_CREDIT_NOTE`. Request `SUPPLIER_CREDIT_NOTE` → response `PURCHASE_CREDIT_NOTE`. S3 paths follow the response type. The response `validFiles[]` array contains `workflowResourceId` for tracking extraction progress via `POST /magic/workflows/search`.
 61. **Extraction is asynchronous** — the API response is immediate (file upload confirmation only). The actual Magic pipeline — OCR, line item extraction, contact matching, CoA learning, and autofill — runs asynchronously. Use `POST /magic/workflows/search` with `filter.resourceId.eq: "<workflowResourceId>"` to check status (SUBMITTED → PROCESSING → COMPLETED/FAILED). When COMPLETED, `businessTransactionDetails.businessTransactionResourceId` contains the created draft BT ID. The `subscriptionFBPath` in the response is a Firebase Realtime Database path for real-time status updates (alternative to polling).
 62. **Accepts PDF and JPG/JPEG** — both file types confirmed working. Handwritten documents are accepted at upload stage (extraction quality varies). `fileType` in response reflects actual format: `"PDF"`, `"JPEG"`.
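Gotchas 58–59 above pin down the FILE-mode multipart contract. A minimal sketch of building that body — only the field names, values, and endpoint come from the notes; the `buildMagicForm` helper itself is hypothetical, not part of jaz-clio:

```javascript
// Build the multipart body for POST /magic/createBusinessTransactionFromAttachment.
// Field names are camelCase (snake_case returns 422); the blob needs a real
// filename and MIME type (bare application/octet-stream is rejected with 400).
// buildMagicForm is a hypothetical helper, not part of the jaz-clio package.
function buildMagicForm({ bytes, fileName, type, merged = false }) {
    const form = new FormData();
    form.append('businessTransactionType', type); // e.g. 'BILL' — 'EXPENSE' is rejected
    form.append('sourceType', 'FILE');
    form.append('sourceFile', new Blob([bytes], { type: 'application/pdf' }), fileName);
    if (merged)
        form.append('uploadMode', 'MERGED'); // one PDF containing multiple documents
    return form;
}
```

The form is then POSTed with `fetch(..., { method: 'POST', body: form })` plus whatever auth header the client normally sends; `fetch` sets the multipart boundary itself.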
package/assets/skills/api/references/endpoints.md
CHANGED

@@ -1197,40 +1197,6 @@ Content-Type: application/json
 3. When `COMPLETED` → read `businessTransactionDetails.businessTransactionResourceId`
 4. Use the BT resource ID with `GET /invoices/:id`, `GET /bills/:id`, `GET /customer-credit-notes/:id`, or `GET /supplier-credit-notes/:id`
 
-### CLI: clio magic split — Merged PDF Splitting
-
-Splits a merged PDF containing multiple documents (invoices, bills, credit notes) into individual files and uploads each to Magic. Uses structural PDF signals (bookmarks, page labels) + text heuristics (keywords, "Page 1 of N" patterns) for boundary detection. **No AI tokens used.**
-
-```bash
-# Auto-detect boundaries + upload
-clio magic split --file merged.pdf --type bill
-
-# Manual page ranges (for scanned PDFs or override)
-clio magic split --file merged.pdf --type bill --pages "1-3,4-6,7-9"
-
-# Dry-run: detect boundaries only (no qpdf needed)
-clio magic split --file merged.pdf --type bill --dry-run
-
-# JSON output (for agents)
-clio magic split --file merged.pdf --type bill --dry-run --json
-```
-
-**Detection signals (score-based, threshold >= 50):**
-- outline-bookmark (+80): PDF bookmark points to this page
-- page-label-reset (+70): PDF page label restarts at "1"
-- keyword in header (+40): Document keyword (INVOICE, BILL, etc.) in upper 40%
-- page-one-of (+35): "Page 1 of N" pattern
-- keyword-large (+25): Large font (>18pt) keyword bonus
-- doc-ref (+20): Document reference (INV-001, SO-2024-100, etc.)
-- continuation (-60): "Page N>1 of M" anti-signal
-- continuation-text (-40): "Continued" anti-signal
-
-**Edge cases:**
-- Scanned PDFs (no extractable text): warns and requires `--pages` manual override
-- Mixed scanned+digital: low confidence on scanned portions triggers confirmation prompt
-- Single document detected: suggests `clio magic create` instead
-- Encrypted PDFs: same `__pw__` pattern as `magic create`
-- Requires `qpdf` for splitting (not needed for `--dry-run` auto-detect mode)
 
 ---
 
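The removed docs above describe an additive per-page score checked against a threshold of 50. A toy sketch of that scheme — the weights come from the signal table, but the label spellings and `isBoundary` itself are illustrative, not the deleted implementation:

```javascript
// Score a page by summing signal weights from the removed docs (threshold >= 50).
// Label names are approximations of the documented signals; this is a sketch,
// not the deleted detect.js code.
const WEIGHTS = {
    'outline-bookmark': 80,
    'page-label-reset': 70,
    'keyword-header': 40,   // document keyword in upper 40% of page
    'page-one-of': 35,      // "Page 1 of N"
    'keyword-large': 25,    // large-font keyword bonus
    'doc-ref': 20,          // INV-001, SO-2024-100, ...
    'continuation': -60,    // "Page N>1 of M" anti-signal
    'continuation-text': -40,
};

function isBoundary(signals) {
    const score = signals.reduce((sum, s) => sum + (WEIGHTS[s] ?? 0), 0);
    return score >= 50;
}

isBoundary(['keyword-header', 'doc-ref']);      // 40 + 20 = 60 → true
isBoundary(['keyword-header', 'continuation']); // 40 - 60 = -20 → false
```

Anti-signals explain why a "Page 2 of 3" continuation page never starts a new document even when it repeats the INVOICE keyword.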
package/assets/skills/cli/references/command-catalog.md
CHANGED

@@ -324,7 +324,6 @@ Also: `clio reports pdf` — generate PDF from a message/document.
 | `create <file>` | `--type` (invoice, bill, credit-note-customer, credit-note-supplier), `--wait`, `--password` |
 | `status <workflowIds>` | Comma-separated workflow IDs |
 | `search` | `--type`, `--status`, `--from`, `--to`, `--limit`, `--offset` |
-| `split <file>` | `--pages`, `--type`, `--wait`, `--password` (split multi-doc PDF + extract) |
 
 ### `clio search <query>` — Universal cross-entity search
 Searches contacts, invoices, bills, credit notes, items. Returns grouped results.
package/assets/skills/cli/references/common-workflows.md
CHANGED

@@ -235,9 +235,6 @@ clio jobs ingest ./inbox/ --json
 # Or extract a single document
 clio magic create ./invoice-from-supplier.pdf --type bill --wait --json
 
-# Split a multi-page PDF into individual documents
-clio magic split ./combined-statements.pdf --type bill --wait --json
-
 # Check workflow status
 clio magic status "wf-id-1,wf-id-2,wf-id-3" --json
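With split moving server-side, tracking follows gotcha 59a: a MERGED upload's parent `workflowResourceId` never completes — the per-document child workflows do, and they are found by `fileName`. A hypothetical polling sketch; the injected `post` callback, the `filter.fileName.eq` shape (mirroring the documented `filter.resourceId.eq` style), and the `workflows` response field are all assumptions, not confirmed API surface:

```javascript
// Find child workflows of a MERGED upload by searching on fileName, since
// the children do not surface under the parent workflowResourceId.
// `post` is an injected HTTP helper (path, body) => parsed JSON; the filter
// and response field names here are assumptions based on the skill notes.
async function findMergedChildren(post, fileName) {
    const res = await post('/api/v1/magic/workflows/search', {
        filter: { fileName: { eq: fileName } },
    });
    return res.workflows ?? [];
}

// const children = await findMergedChildren(httpPost, 'merged.pdf');
// const done = children.every((w) => w.status === 'COMPLETED' || w.status === 'FAILED');
```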
package/dist/commands/magic.js
CHANGED

@@ -7,7 +7,6 @@ import prompts from 'prompts';
 import { createFromAttachment, searchMagicWorkflows, waitForWorkflows, } from '../core/api/magic.js';
 import { extractFilePassword, isPdfEncrypted, isQpdfAvailable, decryptPdf, cleanupDecryptedFile, } from '../core/jobs/document-collection/tools/ingest/decrypt.js';
 import { extractZipToDir, flattenSingleRoot } from '../core/jobs/document-collection/tools/ingest/cloud/zip.js';
-import { detectBoundaries, parsePageRanges, splitPdf, cleanupSplitFiles, } from '../core/pdf/index.js';
 import { apiAction } from './api-action.js';
 import { parsePositiveInt } from './parsers.js';
 import { displaySlice } from './pagination.js';
@@ -52,6 +51,7 @@ export function registerMagicCommand(program) {
     .option('--file <path>', 'Local file path (PDF, JPG, PNG, HEIC, XLS, XLSX, EML, ZIP). ZIP: extracts and uploads each file. Encrypted PDFs: name__pw__password.pdf')
     .option('--url <url>', 'Remote file URL (alternative to --file)')
     .option('--type <type>', `Document type: ${VALID_TYPES}`)
+    .option('--merged', 'Treat file as a merged PDF containing multiple documents (split before extraction)')
     .option('--api-key <key>', 'API key (overrides stored/env)')
     .option('--json', 'Output as JSON')
     .action(apiAction(async (client, opts) => {
@@ -73,6 +73,18 @@ export function registerMagicCommand(program) {
         console.error(chalk.red(`Error: invalid type "${opts.type}". Valid: ${VALID_TYPES}`));
         process.exit(1);
     }
+    // Validate --merged constraints
+    if (opts.merged) {
+        if (opts.url) {
+            console.error(chalk.red('Error: --merged is only supported with --file, not --url'));
+            process.exit(1);
+        }
+        const ext = extname(opts.file).toLowerCase();
+        if (ext !== '.pdf') {
+            console.error(chalk.red('Error: --merged requires a PDF file (got ' + ext + ')'));
+            process.exit(1);
+        }
+    }
     let sourceFile;
     let sourceFileName;
     let decryptedPath;
@@ -106,6 +118,7 @@ export function registerMagicCommand(program) {
         sourceFile,
         sourceFileName,
         sourceUrl: opts.url,
+        uploadMode: opts.merged ? 'MERGED' : undefined,
     });
     const data = res.data;
     const validFile = data.validFiles?.[0];
@@ -250,254 +263,11 @@ export function registerMagicCommand(program) {
         console.log(chalk.dim(` ... and ${overflow.toLocaleString()} more (use --json for full output)`));
     }
 }));
-// ── clio magic split ──────────────────────────────────────────
-magic
-    .command('split')
-    .description('Split a merged PDF into individual documents and upload each to Magic.\n' +
-    'Auto-detects document boundaries using text heuristics (keywords, page numbers, bookmarks).\n' +
-    'For scanned PDFs, use --pages to specify boundaries manually.')
-    .option('--file <path>', 'Local PDF file path (required). Encrypted PDFs: name__pw__password.pdf')
-    .option('--type <type>', `Document type: ${VALID_TYPES}`)
-    .option('--pages <ranges>', 'Manual page ranges (e.g. "1-3,4-6,7"). Skips auto-detection')
-    .option('--dry-run', 'Detect boundaries only — do not split or upload')
-    .option('--api-key <key>', 'API key (overrides stored/env)')
-    .option('--json', 'Output as JSON')
-    .action(apiAction(async (client, opts) => {
-    // ── Validate inputs ──
-    if (!opts.file) {
-        console.error(chalk.red('Error: --file is required'));
-        console.error(chalk.dim('Usage: clio magic split --file merged.pdf --type bill'));
-        process.exit(1);
-    }
-    if (!opts.type) {
-        console.error(chalk.red(`Error: --type is required (${VALID_TYPES})`));
-        console.error(chalk.dim('Usage: clio magic split --file merged.pdf --type bill'));
-        process.exit(1);
-    }
-    const apiType = TYPE_TO_API[opts.type];
-    if (!apiType) {
-        console.error(chalk.red(`Error: invalid type "${opts.type}". Valid: ${VALID_TYPES}`));
-        process.exit(1);
-    }
-    const filePath = resolve(opts.file);
-    const ext = extname(filePath).toLowerCase();
-    if (ext !== '.pdf') {
-        console.error(chalk.red('Error: PDF splitting only supports .pdf files'));
-        process.exit(1);
-    }
-    // qpdf only needed when actually splitting (not for dry-run auto-detect)
-    const needsQpdf = !opts.dryRun || opts.pages;
-    if (needsQpdf && !isQpdfAvailable()) {
-        console.error(chalk.red('Error: qpdf is required for PDF splitting.'));
-        console.error(chalk.dim('  macOS: brew install qpdf'));
-        console.error(chalk.dim('  Ubuntu: sudo apt install qpdf'));
-        process.exit(1);
-    }
-    // ── Handle encrypted PDFs ──
-    const resolved = await resolveInputPdf(filePath, ext, opts);
-    const effectivePath = resolved.effectivePath;
-    const sourceBaseName = resolved.cleanName.replace(/\.pdf$/i, '');
-    try {
-        // ── Determine documents (auto-detect or manual) ──
-        let documents;
-        let pageCount;
-        if (opts.pages) {
-            // Manual page ranges — need qpdf for page count
-            pageCount = (await import('../core/pdf/split.js')).getPageCount(effectivePath);
-            documents = parsePageRanges(opts.pages, pageCount);
-        }
-        else {
-            // Auto-detect boundaries
-            const buffer = readFileSync(effectivePath);
-            const detection = await detectBoundaries(new Uint8Array(buffer));
-            pageCount = detection.pageCount;
-            documents = detection.documents;
-            // Scanned PDF — can't auto-detect, require --pages
-            if (detection.isScannedPdf) {
-                if (opts.json) {
-                    console.log(JSON.stringify({
-                        file: basename(filePath),
-                        pageCount,
-                        isScannedPdf: true,
-                        error: 'Scanned PDF — no extractable text. Use --pages to specify boundaries manually.',
-                    }, null, 2));
-                }
-                else {
-                    console.error(chalk.yellow('Scanned PDF detected — no extractable text for boundary detection.'));
-                    console.error(chalk.dim(`  Use --pages to split manually:`));
-                    console.error(chalk.dim(`  clio magic split --file ${basename(filePath)} --type ${opts.type} --pages "1-3,4-6,7-9"`));
-                }
-                process.exit(1);
-            }
-            // Single document — suggest magic create instead
-            if (documents.length <= 1) {
-                if (opts.json) {
-                    console.log(JSON.stringify({
-                        file: basename(filePath),
-                        pageCount,
-                        documentsDetected: documents.length,
-                        message: 'Only 1 document detected — use `clio magic create` instead, or --pages to override.',
-                    }, null, 2));
-                }
-                else {
-                    console.error(chalk.yellow(`Only 1 document detected in ${pageCount}-page PDF.`));
-                    console.log(chalk.dim(`  Use clio magic create --file ${basename(filePath)} --type ${opts.type}`));
-                    console.log(chalk.dim(`  Or override with --pages: clio magic split --file ${basename(filePath)} --type ${opts.type} --pages "1-3,4-6"`));
-                }
-                return;
-            }
-        }
-        // ── Dry-run: print detection results only ──
-        if (opts.dryRun) {
-            if (opts.json) {
-                console.log(JSON.stringify({
-                    file: basename(filePath),
-                    pageCount,
-                    documents: documents.map((d) => ({
-                        index: d.index,
-                        pageRange: d.pageRange,
-                        confidence: d.confidence,
-                        signals: d.signals.map((s) => s.label),
-                    })),
-                }, null, 2));
-            }
-            else {
-                console.log(chalk.bold(`PDF Split — Boundary Detection`));
-                console.log(`  File: ${basename(filePath)} (${pageCount} pages)\n`);
-                for (const doc of documents) {
-                    const conf = doc.confidence === 'high' ? chalk.green(doc.confidence)
-                        : doc.confidence === 'medium' ? chalk.yellow(doc.confidence)
-                            : chalk.red(doc.confidence);
-                    const signals = doc.signals.filter((s) => s.score > 0).map((s) => s.label).join(', ') || 'first page';
-                    console.log(`  Document ${doc.index + 1}: pages ${doc.pageRange.replace('-', '\u2013')} (${conf}) ${chalk.dim(signals)}`);
-                }
-                console.log(`\n  ${documents.length} documents detected. Use --pages to override.`);
-            }
-            return;
-        }
-        // ── Confidence check: prompt if any low/medium ──
-        const hasLowConfidence = documents.some((d) => d.confidence !== 'high' && d.index > 0);
-        if (hasLowConfidence && !opts.json) {
-            console.log(chalk.bold(`PDF Split — Boundary Detection`));
-            console.log(`  File: ${basename(filePath)} (${pageCount} pages)\n`);
-            for (const doc of documents) {
-                const conf = doc.confidence === 'high' ? chalk.green(doc.confidence)
-                    : doc.confidence === 'medium' ? chalk.yellow(doc.confidence)
-                        : chalk.red(doc.confidence);
-                const signals = doc.signals.filter((s) => s.score > 0).map((s) => s.label).join(', ') || 'first page';
-                console.log(`  Document ${doc.index + 1}: pages ${doc.pageRange.replace('-', '\u2013')} (${conf}) ${chalk.dim(signals)}`);
-            }
-            console.log('');
-            const { proceed } = await prompts({
-                type: 'confirm',
-                name: 'proceed',
-                message: 'Some boundaries have low confidence. Split and upload anyway?',
-                initial: true,
-            });
-            if (!proceed) {
-                console.log(chalk.dim('Aborted. Use --pages to specify boundaries manually.'));
-                return;
-            }
-        }
-        // ── Split + Upload ──
-        const splitResult = splitPdf(effectivePath, documents, sourceBaseName);
-        const uploadResults = [];
-        try {
-            // Report split failures
-            for (const f of splitResult.failures) {
-                uploadResults.push({
-                    index: f.index,
-                    pageRange: f.pageRange,
-                    splitFileName: `${sourceBaseName}_${f.index + 1}.pdf`,
-                    status: 'failed',
-                    error: `Split failed: ${f.error}`,
-                });
-                if (!opts.json) {
-                    console.error(chalk.red(`  \u2717 [${f.index + 1}/${documents.length}] pages ${f.pageRange} — split failed: ${f.error}`));
-                }
-            }
-            // Upload each split file
-            for (const file of splitResult.files) {
-                try {
-                    const buffer = readFileSync(file.path);
-                    const blob = new Blob([buffer], { type: 'application/pdf' });
-                    const res = await createFromAttachment(client, {
-                        businessTransactionType: apiType,
-                        sourceFile: blob,
-                        sourceFileName: file.fileName,
-                    });
-                    const valid = res.data.validFiles?.[0];
-                    const invalid = res.data.invalidFiles?.[0];
-                    if (valid) {
-                        uploadResults.push({
-                            index: file.index,
-                            pageRange: file.pageRange,
-                            splitFileName: file.fileName,
-                            status: 'uploaded',
-                            workflowResourceId: valid.workflowResourceId,
-                            documentType: res.data.businessTransactionType,
-                        });
-                        if (!opts.json) {
-                            console.log(chalk.green(`  \u2713 [${file.index + 1}/${documents.length}] pages ${file.pageRange} \u2192 ${file.fileName} \u2192 ${opts.type.toUpperCase()} (workflow: ${valid.workflowResourceId})`));
-                        }
-                    }
-                    else {
-                        const errMsg = invalid?.errorMessage ?? 'Unknown upload error';
-                        uploadResults.push({
-                            index: file.index,
-                            pageRange: file.pageRange,
-                            splitFileName: file.fileName,
-                            status: 'failed',
-                            error: errMsg,
-                        });
-                        if (!opts.json) {
-                            console.error(chalk.red(`  \u2717 [${file.index + 1}/${documents.length}] pages ${file.pageRange} \u2192 ${file.fileName} \u2192 failed: ${errMsg}`));
-                        }
-                    }
-                }
-                catch (err) {
-                    const errMsg = err instanceof Error ? err.message : String(err);
-                    uploadResults.push({
-                        index: file.index,
-                        pageRange: file.pageRange,
-                        splitFileName: file.fileName,
-                        status: 'failed',
-                        error: errMsg,
-                    });
-                    if (!opts.json) {
-                        console.error(chalk.red(`  \u2717 [${file.index + 1}/${documents.length}] pages ${file.pageRange} \u2192 ${file.fileName} \u2192 failed: ${errMsg}`));
-                    }
-                }
-            }
-        }
-        finally {
-            cleanupSplitFiles(splitResult.tempDir);
-        }
-        // ── Summary ──
-        const uploaded = uploadResults.filter((r) => r.status === 'uploaded').length;
-        const failed = uploadResults.filter((r) => r.status === 'failed').length;
-        if (opts.json) {
-            console.log(JSON.stringify({
-                file: basename(filePath),
-                pageCount,
-                documents: uploadResults,
-                summary: { total: documents.length, uploaded, failed },
-            }, null, 2));
-        }
-        else {
-            console.log(`\n  ${uploaded} uploaded, ${failed} failed`);
-        }
-    }
-    finally {
-        if (resolved.decryptedPath)
-            cleanupDecryptedFile(resolved.decryptedPath);
-    }
-}));
 }
 // ── Helpers ──────────────────────────────────────────────────────
 /**
  * Resolve a PDF input, handling __pw__ password extraction + encrypted PDF decryption.
- *
+ * Handles __pw__ password extraction and encrypted PDF decryption.
  *
  * For non-PDF files, returns the original path with a clean filename (no __pw__ suffix).
  */
package/dist/commands/mcp.js
CHANGED

@@ -8,6 +8,9 @@
  * Auth resolves once at startup (same chain as CLI):
  * 1. --api-key flag 2. JAZ_API_KEY env 3. credentials file
  *
+ * If no auth is found, the server starts in offline mode — calculators
+ * and job blueprints work, API tools return an auth error.
+ *
  * API calls go directly from the user's machine to api.getjaz.com.
  */
 import { Server } from '@modelcontextprotocol/sdk/server/index.js';

@@ -21,6 +24,20 @@ import { resolveAuth, resolvedProfileLabel } from '../core/auth/resolve.js';
 import { getProfile } from '../core/auth/credentials.js';
 import { JazClient } from '../core/api/client.js';
 import { getOrganization } from '../core/api/organization.js';
+/** Tool groups that work without an API key (no network calls). */
+const OFFLINE_GROUPS = new Set([
+    'close_jobs',
+    'operational_jobs',
+]);
+/** Returns true if the tool can run without a JazClient. */
+function isOfflineTool(group, name) {
+    if (OFFLINE_GROUPS.has(group))
+        return true;
+    // plan_recipe is offline (pure calculator); execute_recipe needs API
+    if (group === 'recipes' && name === 'plan_recipe')
+        return true;
+    return false;
+}
 // ── ParamDef → JSON Schema conversion ───────────────────────────
 /** @internal Exported for testing */
 export function paramDefToJsonSchema(def) {

@@ -71,39 +88,40 @@ export function registerMcpCommand(program) {
     .description('Start MCP stdio server for Claude Code / Cowork')
     .option('--api-key <key>', 'API key (overrides stored/env)')
     .action(async (opts) => {
-    // Resolve auth once at startup —
+    // Resolve auth once at startup — null means offline-only mode
     const auth = resolveAuth(opts.apiKey);
-
-        process.stderr.write('Error: No API key found. Set JAZ_API_KEY env var, run `clio auth add`, or pass --api-key.\n');
-        process.exit(1);
-    }
-    const client = new JazClient(auth);
+    const client = auth ? new JazClient(auth) : null;
     const version = program.version() ?? '0.0.0';
-    // ── Resolve org display (
+    // ── Resolve org display (only when authenticated)
     let orgDisplay = '';
     let orgStderr = '';
-
-
-
-
-
-
+    if (client) {
+        const label = resolvedProfileLabel();
+        if (label) {
+            const entry = getProfile(label);
+            if (entry?.orgName) {
+                orgDisplay = `Connected to: ${entry.orgName} (${entry.currency}).`;
+                orgStderr = `${entry.orgName} (${label})`;
+            }
+        }
         }
-
-
-
-
-
-
-
-
-
-
-
-
-
+        if (!orgDisplay) {
+            // Raw API key — fetch org info with a short timeout to avoid blocking startup
+            const org = await Promise.race([
+                getOrganization(client),
+                new Promise((r) => setTimeout(() => r(null), 3000)),
+            ]).catch((e) => {
+                process.stderr.write(`org lookup failed: ${e instanceof Error ? e.message : 'unknown'}\n`);
+                return null;
+            });
+            if (org) {
+                orgDisplay = `Connected to: ${org.name} (${org.currency}).`;
+                orgStderr = org.name;
+            }
         }
     }
+    const authNote = client
+        ? 'All API tools hit api.getjaz.com using the configured API key.'
+        : 'No API key configured — only offline tools (calculators, job blueprints) are available. Set JAZ_API_KEY or run `clio auth add` for full access.';
     const server = new Server({ name: 'jaz-ai', version }, {
         capabilities: { tools: {} },
         instructions: [

@@ -111,7 +129,7 @@ export function registerMcpCommand(program) {
             orgDisplay,
             'Manage invoices, bills, journals, contacts, bank, reports, and more.',
             'Includes 13 IFRS-compliant financial calculators and 12 accounting job blueprints (offline, no auth).',
-
+            authNote,
         ].filter(Boolean).join(' '),
     });
     // ── List tools ──────────────────────────────────────────────

@@ -135,6 +153,19 @@ export function registerMcpCommand(program) {
         throw new McpError(ErrorCode.MethodNotFound, `Unknown tool: ${toolName}`);
     }
     const input = (request.params.arguments ?? {});
+    // Gate API tools when running without auth
+    if (!client && !isOfflineTool(tool.group, tool.name)) {
+        return {
+            content: [{
+                    type: 'text',
+                    text: JSON.stringify({
+                        error: 'No API key configured.',
+                        hint: 'Set JAZ_API_KEY env var, run `clio auth add`, or pass --api-key. Offline tools (calculators, job blueprints) work without a key.',
+                    }),
+                }],
+            isError: true,
+        };
+    }
     // Validate write tool inputs before hitting the API
     const validation = validateToolInput(tool, input);
     if (!validation.valid) {

@@ -150,7 +181,14 @@ export function registerMcpCommand(program) {
         };
     }
     try {
-
+        // Offline tools ignore ctx.client (_ctx convention). The auth gate
+        // above rejects all non-offline tools when client is null, so this
+        // cast is unreachable for API tools. Guard defensively anyway.
+        if (!client && !isOfflineTool(tool.group, tool.name)) {
+            throw new Error(`BUG: API tool ${tool.name} reached execute without auth`);
+        }
+        const ctx = { client: client };
+        const result = await tool.execute(ctx, input);
         const text = typeof result === 'string' ? result : JSON.stringify(result, null, 2);
         return {
             content: [{ type: 'text', text }],

@@ -179,6 +217,7 @@ export function registerMcpCommand(program) {
     process.stdin.on('end', shutdown);
     // Log to stderr (never stdout — that's the MCP channel)
     const orgSuffix = orgStderr ? ` — ${orgStderr}` : '';
-
+    const authStatus = client ? '' : ' [offline mode — no API key]';
+    process.stderr.write(`jaz-ai MCP server v${version} started (${TOOL_DEFINITIONS.length} tools)${orgSuffix}${authStatus}\n`);
 });
 }
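The startup org lookup above bounds a network call with `Promise.race` so a slow API can't stall MCP startup. The pattern in isolation — `withTimeout` is my name for it, not a function in the package:

```javascript
// Resolve to null if `promise` doesn't settle within `ms`, and degrade a
// rejection to null as well — the same shape as the org lookup above.
function withTimeout(promise, ms) {
    return Promise.race([
        promise,
        new Promise((resolve) => setTimeout(() => resolve(null), ms)),
    ]).catch(() => null);
}
```

Note the losing promise keeps running after the race; this caps how long the caller waits, it does not cancel the underlying request.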
package/dist/core/api/magic.js
CHANGED
@@ -1,9 +1,3 @@
-// ── API Functions ────────────────────────────────────────────────
-/**
- * Upload a file or URL to create a draft business transaction via OCR extraction.
- * Processing is async — response returns immediately with workflowResourceId.
- * Use searchMagicWorkflows() to check status.
- */
 export async function createFromAttachment(client, data) {
   const formData = new FormData();
   formData.append('businessTransactionType', data.businessTransactionType);
@@ -14,6 +8,8 @@ export async function createFromAttachment(client, data) {
     formData.append('sourceUrl', data.sourceUrl);
   if (data.attachmentId)
     formData.append('attachmentId', data.attachmentId);
+  if (data.uploadMode)
+    formData.append('uploadMode', data.uploadMode);
   return client.postMultipart('/api/v1/magic/createBusinessTransactionFromAttachment', formData);
 }
 /**
package/dist/core/registry/tools.js
CHANGED
@@ -2675,6 +2675,7 @@ Dynamic strings for reference/name/notes: {{Day}}, {{Date}}, {{Date+X}}, {{DateR
       },
       sourceUrl: { type: 'string', description: 'URL of the source file' },
       attachmentId: { type: 'string', description: 'Attachment ID (alternative to URL)' },
+      uploadMode: { type: 'string', enum: ['SEPARATE', 'MERGED'], description: 'Upload mode: SEPARATE (default, one doc per file) or MERGED (single PDF with multiple docs, auto-split before extraction). MERGED only works with PDF files.' },
     },
     required: ['businessTransactionType'],
     group: 'magic',
@@ -2687,6 +2688,7 @@ Dynamic strings for reference/name/notes: {{Day}}, {{Date}}, {{Date+X}}, {{DateR
       businessTransactionType: input.businessTransactionType,
       sourceUrl: input.sourceUrl,
       attachmentId: input.attachmentId,
+      uploadMode: input.uploadMode,
     });
   },
 },
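The new `uploadMode` field threads from the tool schema into the multipart body. A minimal sketch of that flow, with the function body abridged from the diff and a hypothetical stub client standing in for the package's real `postMultipart` client:

```javascript
// Abridged createFromAttachment: uploadMode is appended to the multipart
// form only when provided (SEPARATE is the server-side default).
function createFromAttachment(client, data) {
  const formData = new FormData(); // global in Node 18+
  formData.append('businessTransactionType', data.businessTransactionType);
  if (data.sourceUrl) formData.append('sourceUrl', data.sourceUrl);
  if (data.attachmentId) formData.append('attachmentId', data.attachmentId);
  if (data.uploadMode) formData.append('uploadMode', data.uploadMode);
  return client.postMultipart('/api/v1/magic/createBusinessTransactionFromAttachment', formData);
}

// Hypothetical stub client that records the call instead of hitting the API.
const calls = [];
const stubClient = {
  postMultipart(path, form) { calls.push({ path, form }); return Promise.resolve({ ok: true }); },
};

createFromAttachment(stubClient, {
  businessTransactionType: 'BILL',
  sourceUrl: 'https://example.com/merged.pdf',
  uploadMode: 'MERGED', // merged multi-document PDF: auto-split before extraction
});
```

Because nothing is awaited before `postMultipart`, the form fields are recorded synchronously, which keeps a sketch like this easy to inspect.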
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "jaz-clio",
-  "version": "4.34.4",
+  "version": "4.34.6",
   "description": "Clio — Command Line Interface Orchestrator for Jaz AI.",
   "type": "module",
   "bin": {
@@ -52,7 +52,6 @@
     "financial": "^0.2.4",
     "node-telegram-bot-api": "^0.67.0",
     "ora": "^8.1.1",
-    "pdfjs-dist": "^5.4.624",
     "prompts": "^2.4.2",
     "update-notifier": "^7.3.1",
     "yaml": "^2.8.2"
package/dist/core/pdf/detect.js
DELETED
@@ -1,344 +0,0 @@
-/**
- * PDF boundary detection engine — identifies document boundaries in merged PDFs.
- *
- * Uses pdfjs-dist (pure JS, no canvas) for text extraction + structural probes.
- * No AI tokens — heuristic-only scoring system.
- *
- * Detection signals (positive = boundary evidence, negative = anti-signal):
- *   outline-bookmark     +80  PDF bookmark points to this page
- *   page-label-reset     +70  PDF page label restarts at "1"
- *   keyword (upper 40%)  +40  Document-type keyword near top of page
- *   page-one-of          +35  "Page 1 of N" pattern
- *   keyword-large        +25  Large font keyword (>18pt) bonus
- *   doc-ref (upper 40%)  +20  Document reference pattern (INV-001, etc.)
- *   continuation         -60  "Page N>1 of M" anti-signal
- *   continuation         -40  "Continued" text anti-signal
- *
- * Threshold: >= 50 = boundary. Confidence: >= 80 high, >= 50 medium, < 50 low.
- */
-// pdfjs-dist v5 — legacy build for Node.js (no canvas requirement)
-import { getDocument } from 'pdfjs-dist/legacy/build/pdf.mjs';
-// ── Scoring constants ────────────────────────────────────────
-const SCORE_OUTLINE = 80;
-const SCORE_PAGE_LABEL_RESET = 70;
-const SCORE_KEYWORD = 40;
-const SCORE_PAGE_ONE_OF = 35;
-const SCORE_KEYWORD_LARGE = 25;
-const SCORE_DOC_REF = 20;
-const SCORE_CONTINUATION_PAGE = -60;
-const SCORE_CONTINUATION_TEXT = -40;
-const BOUNDARY_THRESHOLD = 50;
-const CONFIDENCE_HIGH = 80;
-/** Upper portion of page (0–40%) where keywords/refs are significant. */
-const UPPER_PORTION = 0.4;
-/** Font size threshold for large-font keyword bonus (points). */
-const LARGE_FONT_PT = 18;
-// ── Boundary keywords ────────────────────────────────────────
-// Multilingual: EN, Filipino, Indonesian/Malay, Vietnamese, Chinese
-// Each keyword is tested as a case-insensitive whole-word match.
-const BOUNDARY_KEYWORDS = [
-    // English
-    'TAX INVOICE', 'INVOICE', 'PROFORMA INVOICE', 'COMMERCIAL INVOICE',
-    'BILL', 'BILLING STATEMENT', 'STATEMENT OF ACCOUNT',
-    'CREDIT NOTE', 'CREDIT MEMO', 'DEBIT NOTE', 'DEBIT MEMO',
-    'PURCHASE ORDER', 'DELIVERY ORDER', 'DELIVERY NOTE',
-    'RECEIPT', 'OFFICIAL RECEIPT', 'ACKNOWLEDGMENT RECEIPT',
-    'QUOTATION', 'SALES ORDER', 'CONTRACT',
-    'PACKING LIST', 'BILL OF LADING', 'CERTIFICATE OF ORIGIN',
-    // Filipino / PH
-    'RESIBO', 'KATIBAYAN NG PAGBABAYAD',
-    // Indonesian / Malay
-    'FAKTUR PAJAK', 'FAKTUR', 'NOTA KREDIT', 'NOTA DEBIT',
-    'KWITANSI', 'SURAT JALAN',
-    // Vietnamese
-    'HOA DON', 'HOÁ ĐƠN', 'PHIẾU THU', 'PHIẾU CHI',
-    // Chinese
-    '发票', '税务发票', '收据', '信用票据', '送货单',
-];
-/** Escaped regex patterns — match keyword as a word boundary. */
-const KEYWORD_PATTERNS = BOUNDARY_KEYWORDS.map((kw) => {
-    const escaped = kw.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
-    // For CJK characters, don't use word boundaries (they don't apply)
-    const hasCJK = /[\u4e00-\u9fff]/.test(kw);
-    return hasCJK
-        ? new RegExp(escaped, 'i')
-        : new RegExp(`\\b${escaped}\\b`, 'i');
-});
-/** "Page 1 of N" patterns (matches multiple languages). */
-const PAGE_ONE_PATTERN = /\bpage\s+1\s+of\s+\d+/i;
-/** "Page N of M" where N > 1 — continuation signal. */
-const PAGE_N_PATTERN = /\bpage\s+(\d+)\s+of\s+\d+/i;
-/** Document reference patterns: INV-001, SO-2024-100, PO#123, etc. */
-const DOC_REF_PATTERN = /\b(?:INV|SO|PO|DO|CN|DN|OR|CR|BL|SI|PI|QU|CT|REC)[\s#._-]*\d{2,}/i;
-/** Continuation text markers. */
-const CONTINUATION_PATTERNS = [
-    /\bcontinued\b/i,
-    /\b(?:cont['']?d)\b/i,
-    /\blanjutan\b/i, // Indonesian
-    /\btiếp theo\b/i, // Vietnamese
-];
-// ── pdfjs-dist configuration ─────────────────────────────────
-// Security: disable code generation from strings (pdfjs option key
-// constructed dynamically to avoid triggering static analysis hooks)
-const PDFJS_SECURITY_KEY = ['is', 'Eval', 'Supported'].join('');
-// ── Main detection function ──────────────────────────────────
-/**
- * Detect document boundaries in a merged PDF.
- *
- * @param buffer Raw PDF bytes (Uint8Array or Buffer).
- * @returns Detection result with per-page probes and detected documents.
- */
-export async function detectBoundaries(buffer) {
-    const doc = await getDocument({
-        data: buffer,
-        worker: null, // No worker thread (CLI) — null disables it at runtime
-        [PDFJS_SECURITY_KEY]: false, // Security: no code generation from strings
-        verbosity: 0, // Suppress warnings
-    }).promise;
-    const pageCount = doc.numPages;
-    const pages = [];
-    try {
-        // Phase 1: Structural probes (whole-document)
-        const outlinePages = await probeOutlines(doc);
-        const labelResetPages = await probePageLabels(doc);
-        // Phase 2: Per-page text scan
-        let scannedCount = 0;
-        for (let i = 0; i < pageCount; i++) {
-            const signals = [];
-            // Structural signals
-            if (outlinePages.has(i)) {
-                signals.push({ type: 'outline-bookmark', label: 'PDF bookmark', score: SCORE_OUTLINE });
-            }
-            if (labelResetPages.has(i)) {
-                signals.push({ type: 'page-label-reset', label: 'Page label reset to 1', score: SCORE_PAGE_LABEL_RESET });
-            }
-            // Text extraction
-            const page = await doc.getPage(i + 1); // 1-based
-            const textContent = await page.getTextContent();
-            const items = textContent.items;
-            if (items.length === 0) {
-                signals.push({ type: 'scanned', label: 'No extractable text', score: 0 });
-                scannedCount++;
-            }
-            else {
-                // Determine page height from viewport
-                const viewport = page.getViewport({ scale: 1 });
-                const pageHeight = viewport.height;
-                // Collect upper-portion text and full-page text
-                const upperTexts = [];
-                const allTexts = [];
-                for (const item of items) {
-                    const text = item.str.trim();
-                    if (!text)
-                        continue;
-                    allTexts.push(text);
-                    // transform[5] is the Y coordinate (from bottom), transform[0] is scaleX ~ fontSize
-                    const y = item.transform[5];
-                    const fontSize = Math.abs(item.transform[0]);
-                    const normalizedY = y / pageHeight;
-                    // Upper portion = top 40% of page (high Y values in PDF coordinate system)
-                    if (normalizedY >= (1 - UPPER_PORTION)) {
-                        upperTexts.push({ text, fontSize });
-                    }
-                }
-                const fullText = allTexts.join(' ');
-                const upperText = upperTexts.map((t) => t.text).join(' ');
-                // Keyword detection (upper portion only)
-                for (let k = 0; k < KEYWORD_PATTERNS.length; k++) {
-                    if (KEYWORD_PATTERNS[k].test(upperText)) {
-                        signals.push({
-                            type: 'keyword',
-                            label: `${BOUNDARY_KEYWORDS[k]} in header`,
-                            score: SCORE_KEYWORD,
-                        });
-                        // Large font bonus: check if any upper-portion item with this keyword is large
-                        const kwPattern = KEYWORD_PATTERNS[k];
-                        const hasLargeFont = upperTexts.some((t) => kwPattern.test(t.text) && t.fontSize >= LARGE_FONT_PT);
-                        if (hasLargeFont) {
-                            signals.push({
-                                type: 'keyword-large',
-                                label: `${BOUNDARY_KEYWORDS[k]} in large font (>${LARGE_FONT_PT}pt)`,
-                                score: SCORE_KEYWORD_LARGE,
-                            });
-                        }
-                        break; // Only count the first keyword match per page
-                    }
-                }
-                // "Page 1 of N" detection (anywhere on page)
-                if (PAGE_ONE_PATTERN.test(fullText)) {
-                    signals.push({ type: 'page-one-of', label: 'Page 1 of N', score: SCORE_PAGE_ONE_OF });
-                }
-                // Document reference in upper portion
-                if (DOC_REF_PATTERN.test(upperText)) {
-                    signals.push({ type: 'doc-ref', label: 'Document reference in header', score: SCORE_DOC_REF });
-                }
-                // Anti-signals: continuation indicators
-                const pageNMatch = fullText.match(PAGE_N_PATTERN);
-                if (pageNMatch && parseInt(pageNMatch[1], 10) > 1) {
-                    signals.push({
-                        type: 'continuation',
-                        label: `Page ${pageNMatch[1]} of N (continuation)`,
-                        score: SCORE_CONTINUATION_PAGE,
-                    });
-                }
-                for (const pat of CONTINUATION_PATTERNS) {
-                    if (pat.test(fullText)) {
-                        signals.push({
-                            type: 'continuation',
-                            label: 'Continuation text detected',
-                            score: SCORE_CONTINUATION_TEXT,
-                        });
-                        break;
-                    }
-                }
-            }
-            const totalScore = signals.reduce((sum, s) => sum + s.score, 0);
-            // Page 0 is always a boundary (it's the start of the first document)
-            const isBoundary = i === 0 || totalScore >= BOUNDARY_THRESHOLD;
-            pages.push({ pageIndex: i, signals, totalScore, isBoundary });
-        }
-        const documents = buildDocuments(pages, pageCount);
-        const isScannedPdf = scannedCount === pageCount && pageCount > 0;
-        return { pageCount, pages, documents, isScannedPdf };
-    }
-    finally {
-        try {
-            doc.destroy();
-        }
-        catch { /* best effort */ }
-    }
-}
-// ── Structural probes ────────────────────────────────────────
-/** Probe PDF outlines/bookmarks -> set of 0-based page indices that have bookmarks. */
-async function probeOutlines(doc) {
-    const result = new Set();
-    try {
-        const outline = await doc.getOutline();
-        if (!outline)
-            return result;
-        const stack = [...outline];
-        while (stack.length > 0) {
-            const item = stack.pop();
-            if (!item)
-                continue;
-            // Resolve destination to page index
-            if (item.dest) {
-                try {
-                    const dest = typeof item.dest === 'string'
-                        ? await doc.getDestination(item.dest)
-                        : item.dest;
-                    if (Array.isArray(dest) && dest[0]) {
-                        const pageIndex = await doc.getPageIndex(dest[0]);
-                        result.add(pageIndex);
-                    }
-                }
-                catch { /* skip unresolvable destinations */ }
-            }
-            // Traverse children
-            if (Array.isArray(item.items)) {
-                stack.push(...item.items);
-            }
-        }
-    }
-    catch { /* no outlines or error reading them */ }
-    return result;
-}
-/** Probe PDF page labels -> set of 0-based page indices where label resets to "1". */
-async function probePageLabels(doc) {
-    const result = new Set();
-    try {
-        const labels = await doc.getPageLabels();
-        if (!labels)
-            return result;
-        for (let i = 1; i < labels.length; i++) {
-            // A label that is "1" (or "i" for roman numeral) after not being "1" signals a reset
-            if (labels[i] === '1' && labels[i - 1] !== '1') {
-                result.add(i);
-            }
-        }
-    }
-    catch { /* no page labels */ }
-    return result;
-}
-// ── Document builder ─────────────────────────────────────────
-/** Build detected documents from boundary pages. */
-function buildDocuments(pages, pageCount) {
-    const boundaries = pages.filter((p) => p.isBoundary);
-    const documents = [];
-    for (let i = 0; i < boundaries.length; i++) {
-        const start = boundaries[i].pageIndex;
-        const end = i + 1 < boundaries.length
-            ? boundaries[i + 1].pageIndex - 1
-            : pageCount - 1;
-        // Page range is 1-based for display
-        const pageStart = start + 1;
-        const pageEnd = end + 1;
-        const pageRange = pageStart === pageEnd ? `${pageStart}` : `${pageStart}-${pageEnd}`;
-        documents.push({
-            index: i,
-            pageStart,
-            pageEnd,
-            pageRange,
-            confidence: scoreToConfidence(boundaries[i].totalScore),
-            signals: boundaries[i].signals,
-        });
-    }
-    return documents;
-}
-/** Map aggregate score to confidence level. */
-function scoreToConfidence(score) {
-    if (score >= CONFIDENCE_HIGH)
-        return 'high';
-    if (score >= BOUNDARY_THRESHOLD)
-        return 'medium';
-    return 'low';
-}
-// ── Manual page ranges ───────────────────────────────────────
-/**
- * Parse a manual page-range string into DetectedDocument[].
- *
- * Format: "1-3,4-6,7" (1-based, inclusive ranges, comma-separated).
- * Validates: no overlaps, ranges within page count.
- *
- * @throws Error on invalid format or range.
- */
-export function parsePageRanges(rangesStr, pageCount) {
-    const parts = rangesStr.split(',').map((s) => s.trim()).filter(Boolean);
-    if (parts.length === 0) {
-        throw new Error('Empty page range — provide ranges like "1-3,4-6,7"');
-    }
-    const documents = [];
-    let lastEnd = 0;
-    for (let i = 0; i < parts.length; i++) {
-        const part = parts[i];
-        const match = part.match(/^(\d+)(?:-(\d+))?$/);
-        if (!match) {
-            throw new Error(`Invalid page range "${part}" — use format "1-3" or "7"`);
-        }
-        const pageStart = parseInt(match[1], 10);
-        const pageEnd = match[2] ? parseInt(match[2], 10) : pageStart;
-        if (pageStart < 1 || pageEnd < 1) {
-            throw new Error(`Page numbers must be positive (got "${part}")`);
-        }
-        if (pageStart > pageEnd) {
-            throw new Error(`Invalid range "${part}" — start must be <= end`);
-        }
-        if (pageEnd > pageCount) {
-            throw new Error(`Range "${part}" exceeds page count (${pageCount} pages)`);
-        }
-        if (pageStart <= lastEnd) {
-            throw new Error(`Overlapping range "${part}" — previous range ended at page ${lastEnd}`);
-        }
-        const pageRange = pageStart === pageEnd ? `${pageStart}` : `${pageStart}-${pageEnd}`;
-        documents.push({
-            index: i,
-            pageStart,
-            pageEnd,
-            pageRange,
-            confidence: 'high', // Manual ranges are always high confidence
-            signals: [],
-        });
-        lastEnd = pageEnd;
-    }
-    return documents;
-}
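For intuition, the scoring model from the deleted detector can be exercised standalone. This sketch re-implements only the aggregate-and-classify step; the weights and thresholds are copied from the file above, but the `classifyPage` helper and the split continuation keys are mine (the original inlined this logic in `detectBoundaries()` and used one `continuation` type for both anti-signals):

```javascript
// Signal weights from the deleted detect.js heuristic.
const SCORES = {
  'outline-bookmark': 80,
  'page-label-reset': 70,
  'keyword': 40,
  'page-one-of': 35,
  'keyword-large': 25,
  'doc-ref': 20,
  'continuation-page': -60, // "Page N>1 of M"
  'continuation-text': -40, // "Continued" marker
};
const BOUNDARY_THRESHOLD = 50; // total >= 50: page starts a new document
const CONFIDENCE_HIGH = 80;    // total >= 80: high confidence

function classifyPage(signalTypes) {
  const total = signalTypes.reduce((sum, t) => sum + (SCORES[t] ?? 0), 0);
  return {
    total,
    isBoundary: total >= BOUNDARY_THRESHOLD,
    confidence: total >= CONFIDENCE_HIGH ? 'high'
      : total >= BOUNDARY_THRESHOLD ? 'medium' : 'low',
  };
}

// "INVOICE" in a large header font: 40 + 25 = 65, a medium-confidence boundary.
const header = classifyPage(['keyword', 'keyword-large']);
// A "Page 2 of 3" continuation outweighs a keyword: 40 - 60 = -20, not a boundary.
const continuation = classifyPage(['keyword', 'continuation-page']);
```

The asymmetry is deliberate: no single text signal clears the threshold alone, while either structural probe (bookmark or page-label reset) does.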
package/dist/core/pdf/index.js
DELETED
@@ -1,8 +0,0 @@
-/**
- * PDF boundary detection + splitting for merged documents.
- *
- * Usage:
- *   import { detectBoundaries, parsePageRanges, splitPdf, ... } from '../core/pdf/index.js';
- */
-export { detectBoundaries, parsePageRanges } from './detect.js';
-export { getPageCount, splitPdf, cleanupSplitFiles } from './split.js';
package/dist/core/pdf/split.js
DELETED
@@ -1,81 +0,0 @@
-/**
- * PDF page-range extraction via qpdf.
- *
- * Creates temporary files for each document range extracted from a merged PDF.
- * Caller is responsible for cleanup (use `cleanupSplitFiles`).
- *
- * Follows the same patterns as decrypt.ts: execFileSync, mkdtempSync, explicit cleanup.
- */
-import { execFileSync } from 'node:child_process';
-import { mkdtempSync, existsSync } from 'node:fs';
-import { join, basename } from 'node:path';
-import { tmpdir } from 'node:os';
-import { rmSync } from 'node:fs';
-import { isQpdfAvailable } from '../jobs/document-collection/tools/ingest/decrypt.js';
-/**
- * Get the page count of a PDF file using qpdf.
- * @throws Error if qpdf is not installed or the file is invalid.
- */
-export function getPageCount(filePath) {
-    if (!isQpdfAvailable()) {
-        throw new Error('qpdf is required — install: brew install qpdf (macOS) or sudo apt install qpdf (Linux)');
-    }
-    const output = execFileSync('qpdf', ['--show-npages', filePath], {
-        encoding: 'utf-8',
-        stdio: ['pipe', 'pipe', 'pipe'],
-    });
-    const count = parseInt(output.trim(), 10);
-    if (isNaN(count) || count < 1) {
-        throw new Error(`Failed to read page count from "${basename(filePath)}"`);
-    }
-    return count;
-}
-/**
- * Split a PDF into multiple files based on detected document ranges.
- *
- * Creates a temp directory and extracts each range as a separate PDF.
- * Continues on failure (never aborts mid-batch, matching upload.ts pattern).
- * Caller MUST call `cleanupSplitFiles()` when done.
- */
-export function splitPdf(sourcePath, documents, sourceBaseName) {
-    if (!isQpdfAvailable()) {
-        throw new Error('qpdf is required — install: brew install qpdf (macOS) or sudo apt install qpdf (Linux)');
-    }
-    const tempDir = mkdtempSync(join(tmpdir(), 'clio-split-'));
-    const files = [];
-    const failures = [];
-    for (const doc of documents) {
-        const fileName = `${sourceBaseName}_${doc.index + 1}.pdf`;
-        const outputPath = join(tempDir, fileName);
-        try {
-            execFileSync('qpdf', [
-                sourcePath,
-                '--pages', '.', `${doc.pageStart}-${doc.pageEnd}`, '--',
-                outputPath,
-            ], { stdio: 'pipe' });
-            files.push({
-                index: doc.index,
-                pageRange: doc.pageRange,
-                path: outputPath,
-                fileName,
-            });
-        }
-        catch (err) {
-            const msg = err instanceof Error ? err.message : String(err);
-            failures.push({ index: doc.index, pageRange: doc.pageRange, error: msg });
-        }
-    }
-    return { tempDir, files, failures };
-}
-/**
- * Remove all split temp files and their temp directory.
- * Safe to call with any path — silently ignores missing dirs.
- */
-export function cleanupSplitFiles(tempDir) {
-    try {
-        if (existsSync(tempDir)) {
-            rmSync(tempDir, { recursive: true, force: true });
-        }
-    }
-    catch { /* best effort */ }
-}
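The removed `splitPdf` wrapper drove qpdf's page-selection syntax. A small sketch of just the argument construction (the `qpdfArgs` helper name is mine, and nothing here actually invokes qpdf):

```javascript
// Build the qpdf argv that splitPdf passed to execFileSync:
//   qpdf <source> --pages . <start>-<end> -- <output>
// "." refers back to the input file; the range is 1-based and inclusive.
function qpdfArgs(sourcePath, doc, outputPath) {
  return [sourcePath, '--pages', '.', `${doc.pageStart}-${doc.pageEnd}`, '--', outputPath];
}

const args = qpdfArgs('merged.pdf', { pageStart: 4, pageEnd: 6 }, '/tmp/clio-split/merged_2.pdf');
```

Passing an argv array to `execFileSync` (rather than a shell string) keeps file names with spaces or shell metacharacters safe.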
package/dist/core/pdf/types.js
DELETED