node-es-transformer 1.0.0-beta2 → 1.0.0-beta4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -14,23 +14,12 @@ If you're looking for a nodejs based tool which allows you to ingest large CSV/J
14
14
 
15
15
  While I'd generally recommend using [Logstash](https://www.elastic.co/products/logstash), [filebeat](https://www.elastic.co/products/beats/filebeat), [Ingest Nodes](https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest.html), [Elastic Agent](https://www.elastic.co/guide/en/fleet/current/fleet-overview.html) or [Elasticsearch Transforms](https://www.elastic.co/guide/en/elasticsearch/reference/current/transforms.html) for established use cases, this tool may be of help, especially if you feel more at home in the JavaScript/nodejs universe and have use cases with customized ingestion and data transformation needs.
16
16
 
17
- **This is experimental code, use at your own risk. Nonetheless, I encourage you to give it a try so I can gather some feedback.**
18
-
19
- ### So why is this still _alpha_?
20
-
21
- - The API is not quite final and might change from release to release.
22
- - The code needs some more safety measures to avoid some possible accidental data loss scenarios.
23
- - No test coverage yet.
24
-
25
- ---
26
-
27
- Now that we've talked about the caveats, let's have a look what you actually get with this tool:
28
-
29
17
  ## Features
30
18
 
31
19
  - Buffering/Streaming for both reading and indexing. Files are read using streaming and Elasticsearch ingestion is done using buffered bulk indexing. This is tailored towards ingestion of large files. Successfully tested so far with JSON and CSV files in the range of 20-30 GBytes. On a single machine running both `node-es-transformer` and Elasticsearch, ingestion rates of up to 20k documents/second were achieved (2.9 GHz Intel Core i7, 16 GByte RAM, SSD), depending on document size.
32
20
  - Supports wildcards to ingest/transform a range of files in one go.
33
21
  - Supports fetching documents from existing indices using search/scroll. This allows you to reindex with custom data transformations just using JavaScript in the `transform` callback.
22
+ - Supports ingesting docs based on a nodejs stream.
34
23
  - The `transform` callback gives you each source document, but you can split it up into multiple ones and return an array of documents. An example use case for this: each source document is a Tweet and you want to transform it into an entity-centric index based on hashtags.
35
24
 
36
25
  ## Getting started
@@ -110,10 +99,12 @@ transformer({
110
99
 
111
100
  - `deleteIndex`: Setting to automatically delete an existing index, default is `false`.
112
101
  - `sourceClientConfig`/`targetClientConfig`: Optional Elasticsearch client options, defaults to `{ node: 'http://localhost:9200' }`.
113
- - `bufferSize`: The amount of documents inserted with each Elasticsearch bulk insert request, default is `1000`.
114
- - `fileName`: Source filename to ingest, supports wildcards. If this is set, `sourceIndexName` is not allowed.
102
+ - `bufferSize`: The threshold in KBytes at which buffered bulk index requests are flushed, defaults to `5120`.
103
+ - `searchSize`: The number of documents fetched with each search request when reindexing from another source index, defaults to `1000`.
104
+ - `fileName`: Source filename to ingest, supports wildcards. If this is set, `sourceIndexName` and `stream` are not allowed.
105
+ - `stream`: Source nodejs stream to ingest. If this is set, `sourceIndexName` and `fileName` are not allowed.
115
106
  - `splitRegex`: Custom line split regex, defaults to `/\n/`.
116
- - `sourceIndexName`: The source Elasticsearch index to reindex from. If this is set, `fileName` is not allowed.
107
+ - `sourceIndexName`: The source Elasticsearch index to reindex from. If this is set, `fileName` and `stream` are not allowed.
117
108
  - `targetIndexName`: The target Elasticsearch index where documents will be indexed.
118
109
  - `mappings`: Optional Elasticsearch document mappings. If not set and you're reindexing from another index, the mappings from the existing index will be used.
119
110
  - `mappingsOverride`: If you're reindexing and this is set to `true`, `mappings` will be applied on top of the source index's mappings. Defaults to `false`.
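The source options are mutually exclusive: only one of `fileName`, `sourceIndexName`, or `stream` may be set. A minimal guard expressing that rule could look like this (a sketch; `assertExclusiveSource` is a hypothetical name, not part of the package's API):

```javascript
// Sketch of the documented constraint that the source options are
// mutually exclusive; throws when more than one is provided.
function assertExclusiveSource({ fileName, sourceIndexName, stream } = {}) {
  const provided = [fileName, sourceIndexName, stream].filter(
    (v) => typeof v !== 'undefined'
  );
  if (provided.length > 1) {
    throw new Error('Only one of fileName, sourceIndexName, or stream can be specified.');
  }
}
```

For example, `assertExclusiveSource({ fileName: 'data.ndjson' })` passes, while combining `fileName` with `stream` throws.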
@@ -147,10 +138,10 @@ yarn
147
138
 
148
139
  ```bash
149
140
  # Download the docker image
150
- docker pull docker.elastic.co/elasticsearch/elasticsearch:8.10.4
141
+ docker pull docker.elastic.co/elasticsearch/elasticsearch:8.17.0
151
142
 
152
143
  # Run the container
153
- docker run --name es01 --net elastic -p 9200:9200 -it -m 1GB -e "discovery.type=single-node" -e "xpack.security.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:8.10.4
144
+ docker run --name es01 --net elastic -p 9200:9200 -it -m 1GB -e "discovery.type=single-node" -e "xpack.security.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:8.17.0
154
145
  ```
155
146
 
156
147
  To commit, use `cz`. To prepare a release, use e.g. `yarn release -- --release-as 1.0.0-beta2`.
@@ -5,10 +5,20 @@ function _interopDefault (ex) { return (ex && (typeof ex === 'object') && 'defau
5
5
  var fs = _interopDefault(require('fs'));
6
6
  var es = _interopDefault(require('event-stream'));
7
7
  var glob = _interopDefault(require('glob'));
8
+ var split = _interopDefault(require('split2'));
9
+ var stream = require('stream');
8
10
  var cliProgress = _interopDefault(require('cli-progress'));
9
11
  var elasticsearch = _interopDefault(require('@elastic/elasticsearch'));
10
12
 
11
- var DEFAULT_BUFFER_SIZE = 1000;
13
+ // In earlier versions this was used to set the number of docs to index in a
14
+ // single bulk request. Since we switched to using the helpers.bulk() method from
15
+ // the ES client, this now translates to the `flushBytes` option of the helper.
16
+ // However, for rough backwards compatibility with the old values, this uses
17
+ // KBytes instead of Bytes. It will be multiplied by 1024 in the index queue.
18
+ var DEFAULT_BUFFER_SIZE = 5120;
19
+
20
+ // The default number of docs to fetch in a single search request when reindexing.
21
+ var DEFAULT_SEARCH_SIZE = 1000;
12
22
 
13
23
  function createMappingFactory(ref) {
14
24
  var sourceClient = ref.sourceClient;
@@ -19,6 +29,7 @@ function createMappingFactory(ref) {
19
29
  var mappingsOverride = ref.mappingsOverride;
20
30
  var indexMappingTotalFieldsLimit = ref.indexMappingTotalFieldsLimit;
21
31
  var verbose = ref.verbose;
32
+ var deleteIndex = ref.deleteIndex;
22
33
 
23
34
  return async function () {
24
35
  var targetMappings = mappingsOverride ? undefined : mappings;
@@ -28,7 +39,14 @@ function createMappingFactory(ref) {
28
39
  var mapping = await sourceClient.indices.getMapping({
29
40
  index: sourceIndexName,
30
41
  });
31
- targetMappings = mapping[sourceIndexName].mappings;
42
+ if (mapping[sourceIndexName]) {
43
+ targetMappings = mapping[sourceIndexName].mappings;
44
+ } else {
45
+ var allMappings = Object.values(mapping);
46
+ if (allMappings.length > 0) {
47
+ targetMappings = Object.values(mapping)[0].mappings;
48
+ }
49
+ }
32
50
  } catch (err) {
33
51
  console.log('Error reading source mapping', err);
34
52
  return;
@@ -43,18 +61,28 @@ function createMappingFactory(ref) {
43
61
  }
44
62
 
45
63
  try {
46
- var resp = await targetClient.indices.create({
47
- index: targetIndexName,
48
- body: Object.assign({}, {mappings: targetMappings},
49
- (indexMappingTotalFieldsLimit !== undefined
50
- ? {
51
- settings: {
52
- 'index.mapping.total_fields.limit': indexMappingTotalFieldsLimit,
53
- },
54
- }
55
- : {})),
56
- });
57
- if (verbose) { console.log('Created target mapping', resp); }
64
+ var indexExists = await targetClient.indices.exists({ index: targetIndexName });
65
+
66
+ if (indexExists === true && deleteIndex === true) {
67
+ await targetClient.indices.delete({ index: targetIndexName });
68
+ }
69
+
70
+ if (indexExists === false || deleteIndex === true) {
71
+ var resp = await targetClient.indices.create({
72
+ index: targetIndexName,
73
+ body: Object.assign({}, {mappings: targetMappings},
74
+ (indexMappingTotalFieldsLimit !== undefined
75
+ ? {
76
+ settings: {
77
+ 'index.mapping.total_fields.limit': indexMappingTotalFieldsLimit,
78
+ 'index.number_of_shards': 1,
79
+ 'index.number_of_replicas': 0,
80
+ },
81
+ }
82
+ : {})),
83
+ });
84
+ if (verbose) { console.log('Created target mapping', resp); }
85
+ }
58
86
  } catch (err) {
59
87
  console.log('Error creating target mapping', err);
60
88
  }
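The index-creation branch above follows a small decision table: delete only when the index exists and `deleteIndex` is set; create when the index is missing or was just deleted. As a standalone sketch (`planIndexActions` is a hypothetical helper, not the package's code):

```javascript
// Sketch of the create-or-recreate decision: returns the ordered list
// of actions to take against the target index.
function planIndexActions(indexExists, deleteIndex) {
  const actions = [];
  if (indexExists && deleteIndex) actions.push('delete');
  if (!indexExists || deleteIndex) actions.push('create');
  return actions;
}
```

Note that an existing index with `deleteIndex: false` results in no actions, which is what makes repeated runs against the same target safe.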
@@ -62,17 +90,14 @@ function createMappingFactory(ref) {
62
90
  };
63
91
  }
64
92
 
65
- var MAX_QUEUE_SIZE = 15;
66
-
67
93
  function fileReaderFactory(indexer, fileName, transform, splitRegex, verbose) {
68
94
  function startIndex(files) {
69
- var ingestQueueSize = 0;
70
95
  var finished = false;
71
96
 
72
97
  var file = files.shift();
73
98
  var s = fs
74
99
  .createReadStream(file)
75
- .pipe(es.split(splitRegex))
100
+ .pipe(split(splitRegex))
76
101
  .pipe(
77
102
  es
78
103
  .mapSync(function (line) {
@@ -120,20 +145,13 @@ function fileReaderFactory(indexer, fileName, transform, splitRegex, verbose) {
120
145
  })
121
146
  );
122
147
 
123
- indexer.queueEmitter.on('queue-size', async function (size) {
148
+ indexer.queueEmitter.on('pause', function () {
124
149
  if (finished) { return; }
125
- ingestQueueSize = size;
126
-
127
- if (ingestQueueSize < MAX_QUEUE_SIZE) {
128
- s.resume();
129
- } else {
130
- s.pause();
131
- }
150
+ s.pause();
132
151
  });
133
152
 
134
153
  indexer.queueEmitter.on('resume', function () {
135
154
  if (finished) { return; }
136
- ingestQueueSize = 0;
137
155
  s.resume();
138
156
  });
139
157
  }
@@ -149,7 +167,7 @@ var EventEmitter = require('events');
149
167
 
150
168
  var queueEmitter = new EventEmitter();
151
169
 
152
- var parallelCalls = 1;
170
+ var parallelCalls = 5;
153
171
 
154
172
  // a simple helper queue to bulk index documents
155
173
  function indexQueueFactory(ref) {
@@ -159,78 +177,74 @@ function indexQueueFactory(ref) {
159
177
  var skipHeader = ref.skipHeader; if ( skipHeader === void 0 ) skipHeader = false;
160
178
  var verbose = ref.verbose; if ( verbose === void 0 ) verbose = true;
161
179
 
162
- var buffer = [];
163
- var queue = [];
164
- var ingesting = 0;
165
- var ingestTimes = [];
166
- var finished = false;
180
+ var flushBytes = bufferSize * 1024; // Convert KB to Bytes
181
+ var highWaterMark = flushBytes * parallelCalls;
167
182
 
168
- var ingest = function (b) {
169
- if (typeof b !== 'undefined') {
170
- queue.push(b);
171
- queueEmitter.emit('queue-size', queue.length);
172
- }
183
+ // Create a Readable stream
184
+ var stream$$1 = new stream.Readable({
185
+ read: function read() {}, // Implement read but we manage pushing manually
186
+ highWaterMark: highWaterMark, // Buffer size for backpressure management
187
+ });
173
188
 
174
- if (ingestTimes.length > 5) { ingestTimes = ingestTimes.slice(-5); }
189
+ async function* ndjsonStreamIterator(readableStream) {
190
+ var buffer = ''; // To hold the incomplete data
191
+ var skippedHeader = false;
175
192
 
176
- if (ingesting < parallelCalls) {
177
- var docs = queue.shift();
193
+ // Iterate over the stream using async iteration
194
+ for await (var chunk of readableStream) {
195
+ buffer += chunk.toString(); // Accumulate the chunk data in the buffer
178
196
 
179
- queueEmitter.emit('queue-size', queue.length);
180
- if (queue.length <= 5) {
181
- queueEmitter.emit('resume');
182
- }
197
+ // Split the buffer into lines (NDJSON items)
198
+ var lines = buffer.split('\n');
183
199
 
184
- ingesting += 1;
185
-
186
- if (verbose)
187
- { console.log(("bulk ingest docs: " + (docs.length / 2) + ", queue length: " + (queue.length))); }
188
-
189
- var start = Date.now();
190
- client
191
- .bulk({ body: docs })
192
- .then(function () {
193
- var end = Date.now();
194
- var delta = end - start;
195
- ingestTimes.push(delta);
196
- ingesting -= 1;
197
-
198
- var ingestTimesMovingAverage =
199
- ingestTimes.length > 0
200
- ? ingestTimes.reduce(function (p, c) { return p + c; }, 0) / ingestTimes.length
201
- : 0;
202
- var ingestTimesMovingAverageSeconds = Math.floor(ingestTimesMovingAverage / 1000);
203
-
204
- if (
205
- ingestTimes.length > 0 &&
206
- ingestTimesMovingAverageSeconds < 30 &&
207
- parallelCalls < 10
208
- ) {
209
- parallelCalls += 1;
210
- } else if (
211
- ingestTimes.length > 0 &&
212
- ingestTimesMovingAverageSeconds >= 30 &&
213
- parallelCalls > 1
214
- ) {
215
- parallelCalls -= 1;
216
- }
200
+ // The last line might be incomplete, so hold it back in the buffer
201
+ buffer = lines.pop();
217
202
 
218
- if (queue.length > 0) {
219
- ingest();
220
- } else if (queue.length === 0 && finished) {
221
- queueEmitter.emit('finish');
222
- }
223
- })
224
- .catch(function (error) {
225
- console.error(error);
226
- ingesting -= 1;
227
- parallelCalls = 1;
228
- if (queue.length > 0) {
229
- ingest();
203
+ // Yield each complete JSON object
204
+ for (var line of lines) {
205
+ if (line.trim()) {
206
+ try {
207
+ if (!skipHeader || (skipHeader && !skippedHeader)) {
208
+ yield JSON.parse(line); // Parse and yield the JSON object
209
+ skippedHeader = true;
210
+ }
211
+ } catch (err) {
212
+ // Handle JSON parse errors if necessary
213
+ console.error('Failed to parse JSON:', err);
230
214
  }
231
- });
215
+ }
216
+ }
232
217
  }
233
- };
218
+
219
+ // Handle any remaining data in the buffer after the stream ends
220
+ if (buffer.trim()) {
221
+ try {
222
+ yield JSON.parse(buffer);
223
+ } catch (err) {
224
+ console.error('Failed to parse final JSON:', err);
225
+ }
226
+ }
227
+ }
228
+
229
+ var finished = false;
230
+
231
+ // Async IIFE to start bulk indexing
232
+ (async function () {
233
+ await client.helpers.bulk({
234
+ concurrency: parallelCalls,
235
+ flushBytes: flushBytes,
236
+ flushInterval: 1000,
237
+ refreshOnCompletion: true,
238
+ datasource: ndjsonStreamIterator(stream$$1),
239
+ onDocument: function onDocument(doc) {
240
+ return {
241
+ index: { _index: targetIndexName },
242
+ };
243
+ },
244
+ });
245
+
246
+ queueEmitter.emit('finish');
247
+ })();
234
248
 
235
249
  return {
236
250
  add: function (doc) {
@@ -238,37 +252,22 @@ function indexQueueFactory(ref) {
238
252
  throw new Error('Unexpected doc added after indexer should finish.');
239
253
  }
240
254
 
241
- if (!skipHeader) {
242
- var header = { index: { _index: targetIndexName } };
243
- buffer.push(header);
244
- }
245
- buffer.push(doc);
246
-
247
- if (queue.length === 0) {
248
- queueEmitter.emit('resume');
249
- }
250
-
251
- if (buffer.length >= bufferSize * 2) {
252
- ingest(buffer);
253
- buffer = [];
255
+ var canContinue = stream$$1.push(((JSON.stringify(doc)) + "\n"));
256
+ if (!canContinue) {
257
+ queueEmitter.emit('pause');
258
+ stream$$1.once('drain', function () {
259
+ queueEmitter.emit('resume');
260
+ });
254
261
  }
255
262
  },
256
263
  finish: function () {
257
264
  finished = true;
258
-
259
- if (buffer.length > 0) {
260
- ingest(buffer);
261
- buffer = [];
262
- } else if (queue.length === 0 && ingesting === 0) {
263
- queueEmitter.emit('finish');
264
- }
265
+ stream$$1.push(null);
265
266
  },
266
267
  queueEmitter: queueEmitter,
267
268
  };
268
269
  }
269
270
 
270
- var MAX_QUEUE_SIZE$1 = 15;
271
-
272
271
  // create a new progress bar instance and use shades_classic theme
273
272
  var progressBar = new cliProgress.SingleBar({}, cliProgress.Presets.shades_classic);
274
273
 
@@ -278,32 +277,33 @@ function indexReaderFactory(
278
277
  transform,
279
278
  client,
280
279
  query,
281
- bufferSize,
280
+ searchSize,
282
281
  populatedFields
283
282
  ) {
284
- if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
283
+ if ( searchSize === void 0 ) searchSize = DEFAULT_SEARCH_SIZE;
285
284
  if ( populatedFields === void 0 ) populatedFields = false;
286
285
 
287
286
  return async function indexReader() {
288
- var responseQueue = [];
289
287
  var docsNum = 0;
288
+ var scrollId;
289
+ var finished = false;
290
+ var readActive = false;
291
+ var backPressurePause = false;
290
292
 
291
293
  async function fetchPopulatedFields() {
292
294
  try {
293
- var response = await client.search({
294
- index: sourceIndexName,
295
- size: bufferSize,
296
- query: {
297
- function_score: {
298
- query: query,
299
- random_score: {},
300
- },
295
+ // Get all populated fields from the index
296
+ var response = await client.fieldCaps(
297
+ {
298
+ index: sourceIndexName,
299
+ fields: '*',
300
+ include_empty_fields: false,
301
+ filters: '-metadata',
301
302
  },
302
- });
303
+ { maxRetries: 0 }
304
+ );
303
305
 
304
- // Get all field names for each returned doc and flatten it
305
- // to a list of unique field names used across all docs.
306
- return new Set(response.hits.hits.map(function (d) { return Object.keys(d._source); }).flat(1));
306
+ return Object.keys(response.fields);
307
307
  } catch (e) {
308
308
  console.log('error', e);
309
309
  }
@@ -312,7 +312,7 @@ function indexReaderFactory(
312
312
  function search(fields) {
313
313
  return client.search(Object.assign({}, {index: sourceIndexName,
314
314
  scroll: '600s',
315
- size: bufferSize,
315
+ size: searchSize,
316
316
  query: query},
317
317
  (fields ? { _source: fields } : {})));
318
318
  }
@@ -329,21 +329,14 @@ function indexReaderFactory(
329
329
  // identify populated fields
330
330
  if (populatedFields) {
331
331
  fieldsWithData = await fetchPopulatedFields();
332
- console.log('fieldsWithData', fieldsWithData);
333
332
  }
334
333
 
335
- // start things off by searching, setting a scroll timeout, and pushing
336
- // our first response into the queue to be processed
337
- var se = await search(fieldsWithData);
338
- responseQueue.push(se);
339
- progressBar.start(se.hits.total.value, 0);
340
- console.log('se', se.hits.hits[0]);
334
+ await fetchNextResponse();
341
335
 
342
336
  function processHit(hit) {
343
337
  docsNum += 1;
344
338
  try {
345
339
  var doc = typeof transform === 'function' ? transform(hit._source) : hit._source; // eslint-disable-line no-underscore-dangle
346
- // console.log('doc', doc);
347
340
 
348
341
  // if doc is undefined we'll skip indexing it
349
342
  if (typeof doc === 'undefined') {
@@ -363,68 +356,116 @@ function indexReaderFactory(
363
356
  }
364
357
  }
365
358
 
366
- var ingestQueueSize = 0;
367
- var scrollId = se._scroll_id; // eslint-disable-line no-underscore-dangle
368
- var readActive = false;
369
-
370
- async function processResponseQueue() {
371
- while (responseQueue.length) {
372
- readActive = true;
373
- var response = responseQueue.shift();
359
+ async function fetchNextResponse() {
360
+ readActive = true;
374
361
 
375
- // collect the docs from this response
376
- response.hits.hits.forEach(processHit);
362
+ var sc = scrollId ? await scroll(scrollId) : await search(fieldsWithData);
377
363
 
378
- progressBar.update(docsNum);
364
+ if (!scrollId) {
365
+ progressBar.start(sc.hits.total.value, 0);
366
+ }
379
367
 
380
- // check to see if we have collected all of the docs
381
- if (response.hits.total.value === docsNum) {
382
- indexer.finish();
383
- break;
384
- }
368
+ scrollId = sc._scroll_id;
369
+ readActive = false;
385
370
 
386
- if (ingestQueueSize < MAX_QUEUE_SIZE$1) {
387
- // get the next response if there are more docs to fetch
388
- var sc = await scroll(response._scroll_id); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
389
- scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
390
- responseQueue.push(sc);
391
- } else {
392
- readActive = false;
393
- }
394
- }
371
+ processResponse(sc);
395
372
  }
396
373
 
397
- indexer.queueEmitter.on('queue-size', async function (size) {
398
- ingestQueueSize = size;
374
+ async function processResponse(response) {
375
+ // collect the docs from this response
376
+ response.hits.hits.forEach(processHit);
377
+
378
+ progressBar.update(docsNum);
379
+
380
+ // check to see if we have collected all of the docs
381
+ if (response.hits.total.value === docsNum) {
382
+ indexer.finish();
383
+ return;
384
+ }
399
385
 
400
- if (!readActive && ingestQueueSize < MAX_QUEUE_SIZE$1) {
401
- // get the next response if there are more docs to fetch
402
- var sc = await scroll(scrollId); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
403
- scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
404
- responseQueue.push(sc);
405
- processResponseQueue();
386
+ if (!backPressurePause) {
387
+ await fetchNextResponse();
406
388
  }
389
+ }
390
+
391
+ indexer.queueEmitter.on('pause', async function () {
392
+ backPressurePause = true;
407
393
  });
408
394
 
409
395
  indexer.queueEmitter.on('resume', async function () {
410
- ingestQueueSize = 0;
396
+ backPressurePause = false;
411
397
 
412
- if (readActive) {
398
+ if (readActive || finished) {
413
399
  return;
414
400
  }
415
401
 
416
- // get the next response if there are more docs to fetch
417
- var sc = await scroll(scrollId); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
418
- scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
419
- responseQueue.push(sc);
420
- processResponseQueue();
402
+ await fetchNextResponse();
421
403
  });
422
404
 
423
405
  indexer.queueEmitter.on('finish', function () {
406
+ finished = true;
424
407
  progressBar.stop();
425
408
  });
409
+ };
410
+ }
411
+
412
+ function streamReaderFactory(indexer, stream$$1, transform, splitRegex, verbose) {
413
+ function startIndex() {
414
+ var finished = false;
415
+
416
+ var s = stream$$1.pipe(split(splitRegex)).pipe(
417
+ es
418
+ .mapSync(function (line) {
419
+ try {
420
+ // skip empty lines
421
+ if (line === '') {
422
+ return;
423
+ }
424
+
425
+ var doc =
426
+ typeof transform === 'function' ? JSON.stringify(transform(JSON.parse(line))) : line;
426
427
 
427
- processResponseQueue();
428
+ // if doc is undefined we'll skip indexing it
429
+ if (typeof doc === 'undefined') {
430
+ s.resume();
431
+ return;
432
+ }
433
+
434
+ // the transform callback may return an array of docs so we can emit
435
+ // multiple docs from a single line
436
+ if (Array.isArray(doc)) {
437
+ doc.forEach(function (d) { return indexer.add(d); });
438
+ return;
439
+ }
440
+
441
+ indexer.add(doc);
442
+ } catch (e) {
443
+ console.log('error', e);
444
+ }
445
+ })
446
+ .on('error', function (err) {
447
+ console.log('Error while reading stream.', err);
448
+ })
449
+ .on('end', function () {
450
+ if (verbose) { console.log('Read entire stream.'); }
451
+ indexer.finish();
452
+ finished = true;
453
+ })
454
+ );
455
+
456
+ indexer.queueEmitter.on('pause', function () {
457
+ if (finished) { return; }
458
+ s.pause();
459
+ });
460
+
461
+ indexer.queueEmitter.on('resume', function () {
462
+ if (finished) { return; }
463
+ s.resume();
464
+ });
465
+ }
466
+
467
+ return function () {
468
+ startIndex();
428
469
  };
429
470
  }
430
471
 
@@ -433,6 +474,8 @@ async function transformer(ref) {
433
474
  var sourceClientConfig = ref.sourceClientConfig;
434
475
  var targetClientConfig = ref.targetClientConfig;
435
476
  var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
477
+ var searchSize = ref.searchSize; if ( searchSize === void 0 ) searchSize = DEFAULT_SEARCH_SIZE;
478
+ var stream$$1 = ref.stream;
436
479
  var fileName = ref.fileName;
437
480
  var splitRegex = ref.splitRegex; if ( splitRegex === void 0 ) splitRegex = /\n/;
438
481
  var sourceIndexName = ref.sourceIndexName;
@@ -468,6 +511,7 @@ async function transformer(ref) {
468
511
  mappingsOverride: mappingsOverride,
469
512
  indexMappingTotalFieldsLimit: indexMappingTotalFieldsLimit,
470
513
  verbose: verbose,
514
+ deleteIndex: deleteIndex,
471
515
  });
472
516
  var indexer = indexQueueFactory({
473
517
  targetClient: targetClient,
@@ -482,8 +526,12 @@ async function transformer(ref) {
482
526
  throw Error('Only either one of fileName or sourceIndexName can be specified.');
483
527
  }
484
528
 
485
- if (typeof fileName === 'undefined' && typeof sourceIndexName === 'undefined') {
486
- throw Error('Either fileName or sourceIndexName must be specified.');
529
+ if (
530
+ (typeof fileName !== 'undefined' && typeof sourceIndexName !== 'undefined') ||
531
+ (typeof fileName !== 'undefined' && typeof stream$$1 !== 'undefined') ||
532
+ (typeof sourceIndexName !== 'undefined' && typeof stream$$1 !== 'undefined')
533
+ ) {
534
+ throw Error('Only one of fileName, sourceIndexName, or stream can be specified.');
487
535
  }
488
536
 
489
537
  if (typeof fileName !== 'undefined') {
@@ -497,11 +545,15 @@ async function transformer(ref) {
497
545
  transform,
498
546
  sourceClient,
499
547
  query,
500
- bufferSize,
548
+ searchSize,
501
549
  populatedFields
502
550
  );
503
551
  }
504
552
 
553
+ if (typeof stream$$1 !== 'undefined') {
554
+ return streamReaderFactory(indexer, stream$$1, transform, splitRegex, verbose);
555
+ }
556
+
505
557
  return null;
506
558
  }
507
559
 
@@ -1,10 +1,20 @@
1
1
  import fs from 'fs';
2
2
  import es from 'event-stream';
3
3
  import glob from 'glob';
4
+ import split from 'split2';
5
+ import { Readable } from 'stream';
4
6
  import cliProgress from 'cli-progress';
5
7
  import elasticsearch from '@elastic/elasticsearch';
6
8
 
7
- var DEFAULT_BUFFER_SIZE = 1000;
9
+ // In earlier versions this was used to set the number of docs to index in a
10
+ // single bulk request. Since we switched to using the helpers.bulk() method from
11
+ // the ES client, this now translates to the `flushBytes` option of the helper.
12
+ // However, for rough backwards compatibility with the old values, this uses
13
+ // KBytes instead of Bytes. It will be multiplied by 1024 in the index queue.
14
+ var DEFAULT_BUFFER_SIZE = 5120;
15
+
16
+ // The default number of docs to fetch in a single search request when reindexing.
17
+ var DEFAULT_SEARCH_SIZE = 1000;
8
18
 
9
19
  function createMappingFactory(ref) {
10
20
  var sourceClient = ref.sourceClient;
@@ -15,6 +25,7 @@ function createMappingFactory(ref) {
15
25
  var mappingsOverride = ref.mappingsOverride;
16
26
  var indexMappingTotalFieldsLimit = ref.indexMappingTotalFieldsLimit;
17
27
  var verbose = ref.verbose;
28
+ var deleteIndex = ref.deleteIndex;
18
29
 
19
30
  return async function () {
20
31
  var targetMappings = mappingsOverride ? undefined : mappings;
@@ -24,7 +35,14 @@ function createMappingFactory(ref) {
24
35
  var mapping = await sourceClient.indices.getMapping({
25
36
  index: sourceIndexName,
26
37
  });
27
- targetMappings = mapping[sourceIndexName].mappings;
38
+ if (mapping[sourceIndexName]) {
39
+ targetMappings = mapping[sourceIndexName].mappings;
40
+ } else {
41
+ var allMappings = Object.values(mapping);
42
+ if (allMappings.length > 0) {
43
+ targetMappings = Object.values(mapping)[0].mappings;
44
+ }
45
+ }
28
46
  } catch (err) {
29
47
  console.log('Error reading source mapping', err);
30
48
  return;
@@ -39,18 +57,28 @@ function createMappingFactory(ref) {
39
57
  }
40
58
 
41
59
  try {
42
- var resp = await targetClient.indices.create({
43
- index: targetIndexName,
44
- body: Object.assign({}, {mappings: targetMappings},
45
- (indexMappingTotalFieldsLimit !== undefined
46
- ? {
47
- settings: {
48
- 'index.mapping.total_fields.limit': indexMappingTotalFieldsLimit,
49
- },
50
- }
51
- : {})),
52
- });
53
- if (verbose) { console.log('Created target mapping', resp); }
60
+ var indexExists = await targetClient.indices.exists({ index: targetIndexName });
61
+
62
+ if (indexExists === true && deleteIndex === true) {
63
+ await targetClient.indices.delete({ index: targetIndexName });
64
+ }
65
+
66
+ if (indexExists === false || deleteIndex === true) {
67
+ var resp = await targetClient.indices.create({
68
+ index: targetIndexName,
69
+ body: Object.assign({}, {mappings: targetMappings},
70
+ (indexMappingTotalFieldsLimit !== undefined
71
+ ? {
72
+ settings: {
73
+ 'index.mapping.total_fields.limit': indexMappingTotalFieldsLimit,
74
+ 'index.number_of_shards': 1,
75
+ 'index.number_of_replicas': 0,
76
+ },
77
+ }
78
+ : {})),
79
+ });
80
+ if (verbose) { console.log('Created target mapping', resp); }
81
+ }
54
82
  } catch (err) {
55
83
  console.log('Error creating target mapping', err);
56
84
  }
@@ -58,17 +86,14 @@ function createMappingFactory(ref) {
58
86
  };
59
87
  }
60
88
 
61
- var MAX_QUEUE_SIZE = 15;
62
-
63
89
  function fileReaderFactory(indexer, fileName, transform, splitRegex, verbose) {
64
90
  function startIndex(files) {
65
- var ingestQueueSize = 0;
66
91
  var finished = false;
67
92
 
68
93
  var file = files.shift();
69
94
  var s = fs
70
95
  .createReadStream(file)
71
- .pipe(es.split(splitRegex))
96
+ .pipe(split(splitRegex))
72
97
  .pipe(
73
98
  es
74
99
  .mapSync(function (line) {
@@ -116,20 +141,13 @@ function fileReaderFactory(indexer, fileName, transform, splitRegex, verbose) {
116
141
  })
117
142
  );
118
143
 
119
- indexer.queueEmitter.on('queue-size', async function (size) {
144
+ indexer.queueEmitter.on('pause', function () {
120
145
  if (finished) { return; }
121
- ingestQueueSize = size;
122
-
123
- if (ingestQueueSize < MAX_QUEUE_SIZE) {
124
- s.resume();
125
- } else {
126
- s.pause();
127
- }
146
+ s.pause();
128
147
  });
129
148
 
130
149
  indexer.queueEmitter.on('resume', function () {
131
150
  if (finished) { return; }
132
- ingestQueueSize = 0;
133
151
  s.resume();
134
152
  });
135
153
  }
@@ -145,7 +163,7 @@ var EventEmitter = require('events');
145
163
 
146
164
  var queueEmitter = new EventEmitter();
147
165
 
148
- var parallelCalls = 1;
166
+ var parallelCalls = 5;
149
167
 
150
168
  // a simple helper queue to bulk index documents
151
169
  function indexQueueFactory(ref) {
@@ -155,78 +173,74 @@ function indexQueueFactory(ref) {
  var skipHeader = ref.skipHeader; if ( skipHeader === void 0 ) skipHeader = false;
  var verbose = ref.verbose; if ( verbose === void 0 ) verbose = true;

- var buffer = [];
- var queue = [];
- var ingesting = 0;
- var ingestTimes = [];
- var finished = false;
+ var flushBytes = bufferSize * 1024; // Convert KB to Bytes
+ var highWaterMark = flushBytes * parallelCalls;

- var ingest = function (b) {
- if (typeof b !== 'undefined') {
- queue.push(b);
- queueEmitter.emit('queue-size', queue.length);
- }
+ // Create a Readable stream
+ var stream = new Readable({
+ read: function read() {}, // Implement read but we manage pushing manually
+ highWaterMark: highWaterMark, // Buffer size for backpressure management
+ });

- if (ingestTimes.length > 5) { ingestTimes = ingestTimes.slice(-5); }
+ async function* ndjsonStreamIterator(readableStream) {
+ var buffer = ''; // To hold the incomplete data
+ var skippedHeader = false;

- if (ingesting < parallelCalls) {
- var docs = queue.shift();
+ // Iterate over the stream using async iteration
+ for await (var chunk of readableStream) {
+ buffer += chunk.toString(); // Accumulate the chunk data in the buffer

- queueEmitter.emit('queue-size', queue.length);
- if (queue.length <= 5) {
- queueEmitter.emit('resume');
- }
+ // Split the buffer into lines (NDJSON items)
+ var lines = buffer.split('\n');

- ingesting += 1;
-
- if (verbose)
- { console.log(("bulk ingest docs: " + (docs.length / 2) + ", queue length: " + (queue.length))); }
-
- var start = Date.now();
- client
- .bulk({ body: docs })
- .then(function () {
- var end = Date.now();
- var delta = end - start;
- ingestTimes.push(delta);
- ingesting -= 1;
-
- var ingestTimesMovingAverage =
- ingestTimes.length > 0
- ? ingestTimes.reduce(function (p, c) { return p + c; }, 0) / ingestTimes.length
- : 0;
- var ingestTimesMovingAverageSeconds = Math.floor(ingestTimesMovingAverage / 1000);
-
- if (
- ingestTimes.length > 0 &&
- ingestTimesMovingAverageSeconds < 30 &&
- parallelCalls < 10
- ) {
- parallelCalls += 1;
- } else if (
- ingestTimes.length > 0 &&
- ingestTimesMovingAverageSeconds >= 30 &&
- parallelCalls > 1
- ) {
- parallelCalls -= 1;
- }
+ // The last line might be incomplete, so hold it back in the buffer
+ buffer = lines.pop();

- if (queue.length > 0) {
- ingest();
- } else if (queue.length === 0 && finished) {
- queueEmitter.emit('finish');
- }
- })
- .catch(function (error) {
- console.error(error);
- ingesting -= 1;
- parallelCalls = 1;
- if (queue.length > 0) {
- ingest();
+ // Yield each complete JSON object
+ for (var line of lines) {
+ if (line.trim()) {
+ try {
+ if (!skipHeader || (skipHeader && !skippedHeader)) {
+ yield JSON.parse(line); // Parse and yield the JSON object
+ skippedHeader = true;
+ }
+ } catch (err) {
+ // Handle JSON parse errors if necessary
+ console.error('Failed to parse JSON:', err);
  }
- });
+ }
+ }
  }
- };
+
+ // Handle any remaining data in the buffer after the stream ends
+ if (buffer.trim()) {
+ try {
+ yield JSON.parse(buffer);
+ } catch (err) {
+ console.error('Failed to parse final JSON:', err);
+ }
+ }
+ }
+
+ var finished = false;
+
+ // Async IIFE to start bulk indexing
+ (async function () {
+ await client.helpers.bulk({
+ concurrency: parallelCalls,
+ flushBytes: flushBytes,
+ flushInterval: 1000,
+ refreshOnCompletion: true,
+ datasource: ndjsonStreamIterator(stream),
+ onDocument: function onDocument(doc) {
+ return {
+ index: { _index: targetIndexName },
+ };
+ },
+ });
+
+ queueEmitter.emit('finish');
+ })();

  return {
  add: function (doc) {
@@ -234,37 +248,22 @@ function indexQueueFactory(ref) {
  throw new Error('Unexpected doc added after indexer should finish.');
  }

- if (!skipHeader) {
- var header = { index: { _index: targetIndexName } };
- buffer.push(header);
- }
- buffer.push(doc);
-
- if (queue.length === 0) {
- queueEmitter.emit('resume');
- }
-
- if (buffer.length >= bufferSize * 2) {
- ingest(buffer);
- buffer = [];
+ var canContinue = stream.push(((JSON.stringify(doc)) + "\n"));
+ if (!canContinue) {
+ queueEmitter.emit('pause');
+ stream.once('drain', function () {
+ queueEmitter.emit('resume');
+ });
  }
  },
  finish: function () {
  finished = true;
-
- if (buffer.length > 0) {
- ingest(buffer);
- buffer = [];
- } else if (queue.length === 0 && ingesting === 0) {
- queueEmitter.emit('finish');
- }
+ stream.push(null);
  },
  queueEmitter: queueEmitter,
  };
  }

- var MAX_QUEUE_SIZE$1 = 15;
-
  // create a new progress bar instance and use shades_classic theme
  var progressBar = new cliProgress.SingleBar({}, cliProgress.Presets.shades_classic);

@@ -274,32 +273,33 @@ function indexReaderFactory(
  transform,
  client,
  query,
- bufferSize,
+ searchSize,
  populatedFields
  ) {
- if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
+ if ( searchSize === void 0 ) searchSize = DEFAULT_SEARCH_SIZE;
  if ( populatedFields === void 0 ) populatedFields = false;

  return async function indexReader() {
- var responseQueue = [];
  var docsNum = 0;
+ var scrollId;
+ var finished = false;
+ var readActive = false;
+ var backPressurePause = false;

  async function fetchPopulatedFields() {
  try {
- var response = await client.search({
- index: sourceIndexName,
- size: bufferSize,
- query: {
- function_score: {
- query: query,
- random_score: {},
- },
+ // Get all populated fields from the index
+ var response = await client.fieldCaps(
+ {
+ index: sourceIndexName,
+ fields: '*',
+ include_empty_fields: false,
+ filters: '-metadata',
  },
- });
+ { maxRetries: 0 }
+ );

- // Get all field names for each returned doc and flatten it
- // to a list of unique field names used across all docs.
- return new Set(response.hits.hits.map(function (d) { return Object.keys(d._source); }).flat(1));
+ return Object.keys(response.fields);
  } catch (e) {
  console.log('error', e);
  }
@@ -308,7 +308,7 @@ function indexReaderFactory(
  function search(fields) {
  return client.search(Object.assign({}, {index: sourceIndexName,
  scroll: '600s',
- size: bufferSize,
+ size: searchSize,
  query: query},
  (fields ? { _source: fields } : {})));
  }
@@ -325,21 +325,14 @@ function indexReaderFactory(
  // identify populated fields
  if (populatedFields) {
  fieldsWithData = await fetchPopulatedFields();
- console.log('fieldsWithData', fieldsWithData);
  }

- // start things off by searching, setting a scroll timeout, and pushing
- // our first response into the queue to be processed
- var se = await search(fieldsWithData);
- responseQueue.push(se);
- progressBar.start(se.hits.total.value, 0);
- console.log('se', se.hits.hits[0]);
+ await fetchNextResponse();

  function processHit(hit) {
  docsNum += 1;
  try {
  var doc = typeof transform === 'function' ? transform(hit._source) : hit._source; // eslint-disable-line no-underscore-dangle
- // console.log('doc', doc);

  // if doc is undefined we'll skip indexing it
  if (typeof doc === 'undefined') {
@@ -359,68 +352,116 @@ function indexReaderFactory(
  }
  }

- var ingestQueueSize = 0;
- var scrollId = se._scroll_id; // eslint-disable-line no-underscore-dangle
- var readActive = false;
-
- async function processResponseQueue() {
- while (responseQueue.length) {
- readActive = true;
- var response = responseQueue.shift();
+ async function fetchNextResponse() {
+ readActive = true;

- // collect the docs from this response
- response.hits.hits.forEach(processHit);
+ var sc = scrollId ? await scroll(scrollId) : await search(fieldsWithData);

- progressBar.update(docsNum);
+ if (!scrollId) {
+ progressBar.start(sc.hits.total.value, 0);
+ }

- // check to see if we have collected all of the docs
- if (response.hits.total.value === docsNum) {
- indexer.finish();
- break;
- }
+ scrollId = sc._scroll_id;
+ readActive = false;

- if (ingestQueueSize < MAX_QUEUE_SIZE$1) {
- // get the next response if there are more docs to fetch
- var sc = await scroll(response._scroll_id); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
- scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
- responseQueue.push(sc);
- } else {
- readActive = false;
- }
- }
+ processResponse(sc);
  }

- indexer.queueEmitter.on('queue-size', async function (size) {
- ingestQueueSize = size;
+ async function processResponse(response) {
+ // collect the docs from this response
+ response.hits.hits.forEach(processHit);
+
+ progressBar.update(docsNum);
+
+ // check to see if we have collected all of the docs
+ if (response.hits.total.value === docsNum) {
+ indexer.finish();
+ return;
+ }

- if (!readActive && ingestQueueSize < MAX_QUEUE_SIZE$1) {
- // get the next response if there are more docs to fetch
- var sc = await scroll(scrollId); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
- scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
- responseQueue.push(sc);
- processResponseQueue();
+ if (!backPressurePause) {
+ await fetchNextResponse();
  }
+ }
+
+ indexer.queueEmitter.on('pause', async function () {
+ backPressurePause = true;
  });

  indexer.queueEmitter.on('resume', async function () {
- ingestQueueSize = 0;
+ backPressurePause = false;

- if (readActive) {
+ if (readActive || finished) {
  return;
  }

- // get the next response if there are more docs to fetch
- var sc = await scroll(scrollId); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
- scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
- responseQueue.push(sc);
- processResponseQueue();
+ await fetchNextResponse();
  });

  indexer.queueEmitter.on('finish', function () {
+ finished = true;
  progressBar.stop();
  });
+ };
+ }
+
+ function streamReaderFactory(indexer, stream, transform, splitRegex, verbose) {
+ function startIndex() {
+ var finished = false;
+
+ var s = stream.pipe(split(splitRegex)).pipe(
+ es
+ .mapSync(function (line) {
+ try {
+ // skip empty lines
+ if (line === '') {
+ return;
+ }
+
+ var doc =
+ typeof transform === 'function' ? JSON.stringify(transform(JSON.parse(line))) : line;

- processResponseQueue();
+ // if doc is undefined we'll skip indexing it
+ if (typeof doc === 'undefined') {
+ s.resume();
+ return;
+ }
+
+ // the transform callback may return an array of docs so we can emit
+ // multiple docs from a single line
+ if (Array.isArray(doc)) {
+ doc.forEach(function (d) { return indexer.add(d); });
+ return;
+ }
+
+ indexer.add(doc);
+ } catch (e) {
+ console.log('error', e);
+ }
+ })
+ .on('error', function (err) {
+ console.log('Error while reading stream.', err);
+ })
+ .on('end', function () {
+ if (verbose) { console.log('Read entire stream.'); }
+ indexer.finish();
+ finished = true;
+ })
+ );
+
+ indexer.queueEmitter.on('pause', function () {
+ if (finished) { return; }
+ s.pause();
+ });
+
+ indexer.queueEmitter.on('resume', function () {
+ if (finished) { return; }
+ s.resume();
+ });
+ }
+
+ return function () {
+ startIndex();
  };
  }

@@ -429,6 +470,8 @@ async function transformer(ref) {
  var sourceClientConfig = ref.sourceClientConfig;
  var targetClientConfig = ref.targetClientConfig;
  var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
+ var searchSize = ref.searchSize; if ( searchSize === void 0 ) searchSize = DEFAULT_SEARCH_SIZE;
+ var stream = ref.stream;
  var fileName = ref.fileName;
  var splitRegex = ref.splitRegex; if ( splitRegex === void 0 ) splitRegex = /\n/;
  var sourceIndexName = ref.sourceIndexName;
@@ -464,6 +507,7 @@ async function transformer(ref) {
  mappingsOverride: mappingsOverride,
  indexMappingTotalFieldsLimit: indexMappingTotalFieldsLimit,
  verbose: verbose,
+ deleteIndex: deleteIndex,
  });
  var indexer = indexQueueFactory({
  targetClient: targetClient,
@@ -478,8 +522,12 @@ async function transformer(ref) {
  throw Error('Only either one of fileName or sourceIndexName can be specified.');
  }

- if (typeof fileName === 'undefined' && typeof sourceIndexName === 'undefined') {
- throw Error('Either fileName or sourceIndexName must be specified.');
+ if (
+ (typeof fileName !== 'undefined' && typeof sourceIndexName !== 'undefined') ||
+ (typeof fileName !== 'undefined' && typeof stream !== 'undefined') ||
+ (typeof sourceIndexName !== 'undefined' && typeof stream !== 'undefined')
+ ) {
+ throw Error('Only one of fileName, sourceIndexName, or stream can be specified.');
  }

  if (typeof fileName !== 'undefined') {
@@ -493,11 +541,15 @@ async function transformer(ref) {
  transform,
  sourceClient,
  query,
- bufferSize,
+ searchSize,
  populatedFields
  );
  }

+ if (typeof stream !== 'undefined') {
+ return streamReaderFactory(indexer, stream, transform, splitRegex, verbose);
+ }
+
  return null;
  }

package/package.json CHANGED
@@ -14,20 +14,21 @@
  "license": "Apache-2.0",
  "author": "Walter Rafelsberger <walter@rafelsberger.at>",
  "contributors": [],
- "version": "1.0.0-beta2",
+ "version": "1.0.0-beta4",
  "main": "dist/node-es-transformer.cjs.js",
  "module": "dist/node-es-transformer.esm.js",
  "dependencies": {
- "@elastic/elasticsearch": "^8.10.0",
+ "@elastic/elasticsearch": "^8.17.0",
  "cli-progress": "^3.12.0",
  "event-stream": "3.3.4",
- "glob": "7.1.2"
+ "git-cz": "^4.9.0",
+ "glob": "7.1.2",
+ "split2": "^4.2.0"
  },
  "devDependencies": {
  "acorn": "^6.4.2",
  "async-retry": "^1.3.3",
  "commit-and-tag-version": "^11.3.0",
- "cz-conventional-changelog": "^3.3.0",
  "eslint": "^8.51.0",
  "eslint-config-airbnb": "19.0.4",
  "eslint-config-prettier": "^9.0.0",
@@ -57,7 +58,7 @@
  ],
  "config": {
  "commitizen": {
- "path": "./node_modules/cz-conventional-changelog"
+ "path": "git-cz"
  }
  },
  "jest": {