node-es-transformer 1.1.0 → 1.2.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +76 -12
- package/dist/node-es-transformer.cjs.js +567 -127
- package/dist/node-es-transformer.cjs.js.map +1 -1
- package/dist/node-es-transformer.esm.js +548 -127
- package/dist/node-es-transformer.esm.js.map +1 -1
- package/index.d.ts +24 -2
- package/package.json +12 -7
package/README.md
CHANGED
|
@@ -8,7 +8,7 @@
|
|
|
8
8
|
|
|
9
9
|
# node-es-transformer
|
|
10
10
|
|
|
11
|
-
Stream-based library for ingesting and transforming large data files (CSV/
|
|
11
|
+
Stream-based library for ingesting and transforming large data files (NDJSON/CSV/Parquet/Arrow IPC) into Elasticsearch indices.
|
|
12
12
|
|
|
13
13
|
## Quick Start
|
|
14
14
|
|
|
@@ -36,7 +36,7 @@ See [Usage](#usage) for more examples.
|
|
|
36
36
|
|
|
37
37
|
## Why Use This?
|
|
38
38
|
|
|
39
|
-
If you need to ingest large CSV/
|
|
39
|
+
If you need to ingest large NDJSON/CSV/Parquet/Arrow IPC files (gigabytes) into Elasticsearch without running out of memory, this is the tool for you. Other solutions often run out of JS heap, hammer ES with too many requests, time out, or try to do everything in a single bulk request.
|
|
40
40
|
|
|
41
41
|
**When to use this:**
|
|
42
42
|
- Large file ingestion (20-30 GB tested)
|
|
@@ -156,6 +156,58 @@ transformer({
|
|
|
156
156
|
});
|
|
157
157
|
```
|
|
158
158
|
|
|
159
|
+
### Read Parquet from a file
|
|
160
|
+
|
|
161
|
+
```javascript
|
|
162
|
+
const transformer = require('node-es-transformer');
|
|
163
|
+
|
|
164
|
+
transformer({
|
|
165
|
+
fileName: 'users.parquet',
|
|
166
|
+
sourceFormat: 'parquet',
|
|
167
|
+
targetIndexName: 'users-index',
|
|
168
|
+
mappings: {
|
|
169
|
+
properties: {
|
|
170
|
+
id: { type: 'integer' },
|
|
171
|
+
first_name: { type: 'keyword' },
|
|
172
|
+
last_name: { type: 'keyword' },
|
|
173
|
+
full_name: { type: 'keyword' },
|
|
174
|
+
},
|
|
175
|
+
},
|
|
176
|
+
transform(row) {
|
|
177
|
+
return {
|
|
178
|
+
...row,
|
|
179
|
+
id: Number(row.id),
|
|
180
|
+
full_name: `${row.first_name} ${row.last_name}`,
|
|
181
|
+
};
|
|
182
|
+
},
|
|
183
|
+
});
|
|
184
|
+
```
|
|
185
|
+
|
|
186
|
+
### Read Arrow IPC from a file
|
|
187
|
+
|
|
188
|
+
```javascript
|
|
189
|
+
const transformer = require('node-es-transformer');
|
|
190
|
+
|
|
191
|
+
transformer({
|
|
192
|
+
fileName: 'users.arrow',
|
|
193
|
+
sourceFormat: 'arrow',
|
|
194
|
+
targetIndexName: 'users-index',
|
|
195
|
+
mappings: {
|
|
196
|
+
properties: {
|
|
197
|
+
id: { type: 'integer' },
|
|
198
|
+
first_name: { type: 'keyword' },
|
|
199
|
+
last_name: { type: 'keyword' },
|
|
200
|
+
},
|
|
201
|
+
},
|
|
202
|
+
transform(row) {
|
|
203
|
+
return {
|
|
204
|
+
...row,
|
|
205
|
+
id: Number(row.id),
|
|
206
|
+
};
|
|
207
|
+
},
|
|
208
|
+
});
|
|
209
|
+
```
|
|
210
|
+
|
|
159
211
|
### Infer mappings from CSV sample
|
|
160
212
|
|
|
161
213
|
```javascript
|
|
@@ -286,10 +338,14 @@ All options are passed to the main `transformer()` function.
|
|
|
286
338
|
|
|
287
339
|
Choose **one** of these sources:
|
|
288
340
|
|
|
289
|
-
- **`fileName`** (string): Source filename to ingest. Supports wildcards (e.g., `logs/*.json`
|
|
341
|
+
- **`fileName`** (string): Source filename to ingest. Supports wildcards (e.g., `logs/*.json`, `data/*.csv`, `data/*.parquet`, `data/*.arrow`).
|
|
290
342
|
- **`sourceIndexName`** (string): Source Elasticsearch index to reindex from.
|
|
291
343
|
- **`stream`** (Readable): Node.js readable stream to ingest from.
|
|
292
|
-
- **`sourceFormat`** (`'ndjson' | 'csv'`): Format for file/stream sources. Default: `'ndjson'`.
|
|
344
|
+
- **`sourceFormat`** (`'ndjson' | 'csv' | 'parquet' | 'arrow'`): Format for file/stream sources. Default: `'ndjson'`.
|
|
345
|
+
- `arrow` expects Arrow IPC file/stream payloads.
|
|
346
|
+
- `parquet` stream sources are currently buffered in memory before row iteration (file sources remain streaming by row cursor).
|
|
347
|
+
- `parquet` supports ZSTD-compressed files when running on Node.js 22+ (uses the built-in zlib zstd implementation).
|
|
348
|
+
- `parquet` INT64 values are normalized for JSON: safe-range values become numbers, larger values become strings.
|
|
293
349
|
- **`csvOptions`** (object): CSV parser options (delimiter, quote, columns, etc.) used when `sourceFormat: 'csv'`.
|
|
294
350
|
|
|
295
351
|
#### Client Configuration
|
|
@@ -305,7 +361,7 @@ Choose **one** of these sources:
|
|
|
305
361
|
|
|
306
362
|
- **`mappings`** (object): Elasticsearch document mappings for target index. If reindexing and not provided, mappings are copied from source index.
|
|
307
363
|
- **`mappingsOverride`** (boolean): When reindexing, apply `mappings` on top of source index mappings. Default: `false`.
|
|
308
|
-
- **`inferMappings`** (boolean): Infer mappings for `fileName` sources via `/_text_structure/find_structure`. Ignored when `mappings` is provided. If inference returns `ingest_pipeline`, it is created as `<targetIndexName>-inferred-pipeline` and applied as the index default pipeline (unless `pipeline` is explicitly set). Default: `false`.
|
|
364
|
+
- **`inferMappings`** (boolean): Infer mappings for `fileName` sources via `/_text_structure/find_structure`. Supported for `sourceFormat: 'ndjson'` and `sourceFormat: 'csv'` only. Ignored when `mappings` is provided. If inference returns `ingest_pipeline`, it is created as `<targetIndexName>-inferred-pipeline` and applied as the index default pipeline (unless `pipeline` is explicitly set). Default: `false`.
|
|
309
365
|
- **`inferMappingsOptions`** (object): Options for `/_text_structure/find_structure` (for example `sampleBytes`, `lines_to_sample`, `delimiter`, `quote`, `has_header_row`, `timeout`).
|
|
310
366
|
- **`deleteIndex`** (boolean): Delete target index if it exists before starting. Default: `false`.
|
|
311
367
|
- **`indexMappingTotalFieldsLimit`** (number): Field limit for target index (`index.mapping.total_fields.limit` setting).
|
|
@@ -330,9 +386,11 @@ When `inferMappings` is enabled, the target cluster must allow `/_text_structure
|
|
|
330
386
|
- **`skipHeader`** (boolean): Header skipping for file/stream sources.
|
|
331
387
|
- NDJSON: skips the first non-empty line
|
|
332
388
|
- CSV: skips the first data line only when `csvOptions.columns` does not consume headers
|
|
389
|
+
- Parquet/Arrow: ignored
|
|
333
390
|
- Default: `false`
|
|
334
391
|
- Applies only to `fileName`/`stream` sources
|
|
335
|
-
- **`verbose`** (boolean): Enable logging and progress bars. Default: `true`.
|
|
392
|
+
- **`verbose`** (boolean): Enable verbose logging and progress bars when using the built-in logger. Default: `true`.
|
|
393
|
+
- **`logger`** (object): Optional custom Pino-compatible logger. If omitted, the library creates an internal Pino logger (`name: node-es-transformer`) and uses `LOG_LEVEL` (if set) or `info`/`error` based on `verbose`.
|
|
336
394
|
|
|
337
395
|
### Return Value
|
|
338
396
|
|
|
@@ -345,16 +403,19 @@ The `transformer()` function returns a Promise that resolves to an object with:
|
|
|
345
403
|
- `'error'`: Error occurred
|
|
346
404
|
|
|
347
405
|
```javascript
|
|
406
|
+
const pino = require('pino');
|
|
407
|
+
const logger = pino({ name: 'my-app', level: process.env.LOG_LEVEL || 'info' });
|
|
408
|
+
|
|
348
409
|
const result = await transformer({
|
|
349
410
|
/* options */
|
|
350
411
|
});
|
|
351
412
|
|
|
352
413
|
result.events.on('complete', () => {
|
|
353
|
-
|
|
414
|
+
logger.info('Ingestion complete');
|
|
354
415
|
});
|
|
355
416
|
|
|
356
417
|
result.events.on('error', err => {
|
|
357
|
-
|
|
418
|
+
logger.error({ err }, 'Ingestion failed');
|
|
358
419
|
});
|
|
359
420
|
```
|
|
360
421
|
|
|
@@ -392,20 +453,23 @@ See [examples/typescript-example.ts](examples/typescript-example.ts) for more ex
|
|
|
392
453
|
Always handle errors when using the library:
|
|
393
454
|
|
|
394
455
|
```javascript
|
|
456
|
+
const pino = require('pino');
|
|
457
|
+
const logger = pino({ name: 'my-app', level: process.env.LOG_LEVEL || 'info' });
|
|
458
|
+
|
|
395
459
|
transformer({
|
|
396
460
|
/* options */
|
|
397
461
|
})
|
|
398
|
-
.then(() =>
|
|
399
|
-
.catch(err =>
|
|
462
|
+
.then(() => logger.info('Success'))
|
|
463
|
+
.catch(err => logger.error({ err }, 'Transformer failed'));
|
|
400
464
|
|
|
401
465
|
// Or with async/await
|
|
402
466
|
try {
|
|
403
467
|
await transformer({
|
|
404
468
|
/* options */
|
|
405
469
|
});
|
|
406
|
-
|
|
470
|
+
logger.info('Success');
|
|
407
471
|
} catch (err) {
|
|
408
|
-
|
|
472
|
+
logger.error({ err }, 'Transformer failed');
|
|
409
473
|
}
|
|
410
474
|
```
|
|
411
475
|
|