@pyxmate/memory 0.12.1 → 0.12.2
package/package.json
CHANGED
@@ -93,7 +93,16 @@ curl -s -X POST {{ENDPOINT}}/api/memory/ingest/file \
 
 Supported formats: txt, md, csv, tsv, log, pdf, docx, xlsx, pptx, json, jsonl, html, htm, png, jpg, jpeg, webp, gif, bmp, tiff, svg (100MB limit).
 
-Ingestion
+Ingestion memory profile, by format:
+
+- **Plain-text + structured-text** (`csv`, `tsv`, `txt`, `log`, `json`, `jsonl`, `html`, `htm`, `md`): truly streaming — peak memory ≈ a few MB regardless of file size, up to the 100 MB file cap.
+- **`pdf`**: streaming via the poppler `pdftotext` first-class path — peak memory ≈ a few MB. The `pdf-parse` fallback (when poppler is absent) loads the whole buffer; install poppler-utils for streaming.
+- **`xlsx`**: row-by-row streaming via `ExcelJS.stream.xlsx.WorkbookReader`, but **shared strings are cached for the whole workbook** (`sharedStrings: 'cache'`). Peak memory grows with shared-string count, not just file size — workbooks with dense, repeated cell strings can use significantly more memory than the file cap suggests.
+- **`pptx`**: the full ZIP is decompressed in memory (same shape as docx); only one slide's XML is decoded at a time. For a typical 30 MB presentation, peak memory is ~90 MB. Capped at 100 MB file / 200 MB decompressed.
+- **`docx`**: hard-capped at 10 MB. The underlying library (mammoth) has no streaming API and loads the whole document into memory; above 10 MB the server returns a `MemoryError` asking you to pre-extract the text upstream and re-upload as `.txt` or `.md`.
+- **Images** (`png`, `jpg`, `jpeg`, `webp`, `gif`, `bmp`, `tiff`, `svg`): the file is held in memory once for embedding/storage; size scales with the file, not with content complexity.
+
+If you need deterministic memory or timing for production UX (e.g. RAG over large user-supplied workbooks), prefer pre-extracting `.pptx` and large `.xlsx` upstream and uploading the text as `.txt`. See `patterns/file-uploads.md` for the consumer-side pattern.
 
 **Images require a `description`** — this is how the content gets embedded and becomes searchable. Without it, the image is stored but not findable. Use your vision capabilities to generate the description when the user doesn't provide one.
 
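
The `pdf` entry in the hunk above makes streaming conditional on poppler's `pdftotext` being installed, with `pdf-parse` as the buffered fallback. A deployment might want to verify that at startup. The following is a minimal sketch under stated assumptions, not code from pyx-memory: the helper names `isBinaryAvailable` and `pdfIngestMode` are hypothetical, and the check only tests whether the binary can be spawned from `PATH`.

```typescript
import { spawnSync } from "node:child_process";

// Returns true when `binary` can be spawned from PATH.
// Only spawn failures (e.g. ENOENT, timeout) count as "unavailable";
// the exit code is ignored because version flags differ between tools.
function isBinaryAvailable(binary: string): boolean {
  const result = spawnSync(binary, ["-v"], { stdio: "ignore", timeout: 5000 });
  return result.error === undefined;
}

// Hypothetical startup check: which PDF path should the server image expect?
// "streaming" assumes pdftotext is used whenever it is present, as the
// changed documentation describes.
function pdfIngestMode(): "streaming" | "buffered-fallback" {
  return isBinaryAvailable("pdftotext") ? "streaming" : "buffered-fallback";
}

console.log(`PDF ingest mode: ${pdfIngestMode()}`);
```

Logging the result when the container starts makes a missing `poppler-utils` package visible before the first large PDF arrives.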
@@ -0,0 +1,78 @@
+# Pattern: File Uploads
+
+This is consumer-side guidance: how to decide whether to forward a file
+straight to `ingestFile()` or to pre-extract its text upstream first.
+
+## TL;DR
+
+| Format | Default action |
+|---|---|
+| `txt`, `md`, `csv`, `tsv`, `log`, `json`, `jsonl`, `html`, `htm` | Forward raw — pyx-memory streams. |
+| `pdf` | Forward raw — pyx-memory streams via poppler. Install `poppler-utils` on the server image. |
+| Images (`png`, `jpg`, `jpeg`, `webp`, `gif`, `bmp`, `tiff`, `svg`) | Forward raw with a `description` (use vision capability). |
+| `docx` ≤ 10 MB | Forward raw. |
+| `docx` > 10 MB | **Pre-extract upstream** as `.txt` or `.md`. The server returns a `MemoryError` if you don't. |
+| `xlsx` (large or shared-string-heavy) | **Pre-extract upstream** as `.xxx.xlsx.txt` for deterministic UX. |
+| `pptx` (production UX) | **Pre-extract upstream** as `.xxx.pptx.txt` for deterministic UX. |
+
+"Production UX" here means: you can't afford a single hung upload to wedge
+the user-facing layer for 30+ seconds, you need actionable error messages
+on every failure, and you have your own copy of the original file
+(separate from pyx-memory's internal storage).
+
+## Why pre-extract pptx and large xlsx
+
+The server's pptx parser decompresses the full ZIP in memory (~3× file
+size peak). The xlsx parser streams rows but caches shared strings for
+the entire workbook (`ExcelJS.WorkbookReader { sharedStrings: 'cache' }`).
+Both are bounded by the 100 MB file / 200 MB decompressed caps, but
+"bounded" is not "constant" — pathological files (huge shared-string
+tables, dense cell formulas, embedded media) can push peak memory and
+parse time well past what naive callers expect.
+
+If you control the upload boundary (e.g. you operate a runtime/proxy
+service that fronts pyx-memory), upstream pre-extraction lets you:
+
+1. **Catch parse failures at your boundary**, where you can return an
+actionable error to the user (`"Excel formula evaluation failed at
+sheet 'Q3 Revenue', row 412"`) instead of a generic upstream 5xx.
+2. **Bound the wire payload to pyx-memory** — text/plain only — so the
+memory server's parser is never the bottleneck.
+3. **Keep the original binary in your own storage**, so users can still
+download the file. pyx-memory's catalog only holds the indexed text.
+
+## Reference implementation
+
+[ai-rag-hub](https://github.com/fysoul17/one-query-v1) (a consumer of
+pyx-memory) implements this pattern in its runtime:
+
+- `packages/server/src/text-extractors.ts` — local extractors for `pptx`
+and `xlsx`, both wrapped in the same OOXML safety envelope (zip-bomb
+defense, path-traversal check, macro reject, decompressed-size cap,
+char limit).
+- `packages/server/src/routes/memory.ts` — `prepareFileForIngest`
+dispatches via `getTextExtractor(mimeType)`; matched formats are
+re-uploaded as `<original>.txt` with `text/plain`. Catalog metadata
+flags `downloadableFromMemory: false` so the consumer's own
+`/api/team/documents/[id]/download` route serves the original
+binary instead.
+
+## When NOT to pre-extract
+
+Single-tenant lab usage, internal tools, batch jobs where a 30-second
+parse latency is acceptable, or any case where you don't have your own
+copy of the file and need pyx-memory's `GET /api/memory/files/download/:filename`
+to return the original binary. In those cases, the native pyx-memory
+parsers are exactly what you want.
+
+## What about other formats
+
+- **HTML**: pyx-memory strips `<script>` and `<style>` during parse
+(`parsers/html.ts`). No upstream sanitizer needed for indexing.
+- **PDF with images**: use the SDK's two-phase enrichment via
+`EnrichmentCallbacks` — see `reference/sdk-guide.md`. Don't pre-extract
+the PDF as text and lose image enrichment.
+- **SVG**: currently classified as an image but pyx-memory's image
+parser only stores it as a placeholder; if you need SVG text indexed,
+pre-extract the `<text>` and `<desc>` content yourself or convert to
+raster + describeImage.
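
The TL;DR table in the new file above amounts to a routing decision on file extension and size. As a rough illustration, that decision could be sketched as a pure function. Everything named here (`chooseUploadAction`, `RoutingOptions`, `productionUx`, `largeXlsxBytes`) is hypothetical and not part of the pyx-memory SDK. The sketch also assumes 1 MB = 1024 × 1024 bytes, and it leaves the "large xlsx" threshold to the caller because the documentation does not define one.

```typescript
type UploadAction = "forward-raw" | "forward-with-description" | "pre-extract";

const MB = 1024 * 1024; // assumption: binary megabytes
const DOCX_CAP_BYTES = 10 * MB; // docx hard cap stated in the documentation

// Formats the documentation describes as streamed by the server.
const STREAMED = new Set(["txt", "md", "csv", "tsv", "log", "json", "jsonl", "html", "htm", "pdf"]);
// Formats the documentation classifies as images (a description is required).
const IMAGES = new Set(["png", "jpg", "jpeg", "webp", "gif", "bmp", "tiff", "svg"]);

interface RoutingOptions {
  productionUx: boolean; // hypothetical flag: caller needs deterministic memory/timing
  largeXlsxBytes: number; // hypothetical threshold for treating an xlsx as "large"
}

function chooseUploadAction(filename: string, sizeBytes: number, opts: RoutingOptions): UploadAction {
  const ext = filename.split(".").pop()?.toLowerCase() ?? "";
  if (STREAMED.has(ext)) return "forward-raw";
  if (IMAGES.has(ext)) return "forward-with-description";
  if (ext === "docx") {
    // Above the cap the server rejects the file, so extraction is mandatory.
    return sizeBytes > DOCX_CAP_BYTES ? "pre-extract" : "forward-raw";
  }
  if (ext === "xlsx") {
    // Size is only a proxy: shared-string density cannot be read off the byte count.
    return opts.productionUx && sizeBytes >= opts.largeXlsxBytes ? "pre-extract" : "forward-raw";
  }
  if (ext === "pptx") {
    return opts.productionUx ? "pre-extract" : "forward-raw";
  }
  throw new Error(`Unsupported format: .${ext}`);
}
```

A caller that matches the "When NOT to pre-extract" cases (lab usage, batch jobs) would pass `productionUx: false` and forward everything except oversized `docx` files.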
@@ -71,7 +71,15 @@ const client = new MemoryClient('http://localhost:7822', process.env.MEMORY_API_
 
 **Supported formats**: `.txt`, `.md`, `.csv`, `.tsv`, `.log`, `.pdf`, `.docx`, `.xlsx`, `.pptx`, `.json`, `.jsonl`, `.html`, `.htm`, `.png`, `.jpg`, `.jpeg`, `.webp`, `.gif`, `.bmp`, `.tiff`, `.svg`
 
-**Memory behavior
+**Memory behavior, by format**:
+
+- **Plain-text + structured-text** (`csv`, `tsv`, `txt`, `log`, `json`, `jsonl`, `html`, `htm`, `md`) — fully streaming; peak server memory ≈ a few MB regardless of file size up to the 100 MB cap.
+- **`pdf`** — streaming via `pdftotext` (poppler-utils); fall-back path (`pdf-parse`) buffers the file, so install poppler-utils on the server for streaming behavior.
+- **`xlsx`** — `ExcelJS.stream.xlsx.WorkbookReader` yields rows as it parses, but the shared-string table is cached for the whole workbook. Peak memory grows with shared-string count; dense workbooks can exceed "a few MB" significantly.
+- **`pptx`** — the full ZIP is decompressed in memory (~3× file size peak for typical decks). One slide's XML is decoded at a time. Bounded by the 100 MB file cap and a 200 MB decompressed cap.
+- **`docx`** — hard-capped at 10 MB. mammoth has no streaming API; above 10 MB the server returns a `MemoryError` asking you to pre-extract the text upstream and re-upload as `.txt` or `.md`.
+
+For deterministic peak memory or timing (production UX), consumers should pre-extract `.pptx` and large `.xlsx` upstream and upload the text as `.txt` — see `patterns/file-uploads.md`.
 
 **What happens on upload**:
 1. Original file is saved to `{DATA_DIR}/files/{filename}` (persistent across restarts)