node-es-transformer 1.0.0-alpha9 → 1.0.0-beta2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,6 +1,8 @@
  [![npm](https://img.shields.io/npm/v/node-es-transformer.svg?maxAge=2592000)](https://www.npmjs.com/package/node-es-transformer)
  [![npm](https://img.shields.io/npm/l/node-es-transformer.svg?maxAge=2592000)](https://www.npmjs.com/package/node-es-transformer)
  [![npm](https://img.shields.io/npm/dt/node-es-transformer.svg?maxAge=2592000)](https://www.npmjs.com/package/node-es-transformer)
+ [![Commitizen friendly](https://img.shields.io/badge/commitizen-friendly-brightgreen.svg)](http://commitizen.github.io/cz-cli/)
+ [![CI](https://github.com/walterra/node-es-transformer/actions/workflows/ci.yml/badge.svg)](https://github.com/walterra/node-es-transformer/actions)

  # node-es-transformer

@@ -10,7 +12,7 @@ A nodejs based library to (re)index and transform data from/to Elasticsearch.

  If you're looking for a nodejs based tool which allows you to ingest large CSV/JSON files in the GigaBytes you've come to the right place. Everything else I've tried with larger files runs out of JS heap, hammers ES with too many single requests, times out or tries to do everything with a single bulk request.

- While I'd generally recommend using [Logstash](https://www.elastic.co/products/logstash), [filebeat](https://www.elastic.co/products/beats/filebeat) or [Ingest Nodes](https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest.html) for established use cases, this tool may be of help especially if you feel more at home in the JavaScript/nodejs universe and have use cases with customized ingestion and data transformation needs.
+ While I'd generally recommend using [Logstash](https://www.elastic.co/products/logstash), [filebeat](https://www.elastic.co/products/beats/filebeat), [Ingest Nodes](https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest.html), [Elastic Agent](https://www.elastic.co/guide/en/fleet/current/fleet-overview.html) or [Elasticsearch Transforms](https://www.elastic.co/guide/en/elasticsearch/reference/current/transforms.html) for established use cases, this tool may be of help especially if you feel more at home in the JavaScript/nodejs universe and have use cases with customized ingestion and data transformation needs.

  **This is experimental code, use at your own risk. Nonetheless, I encourage you to give it a try so I can gather some feedback.**

@@ -26,7 +28,7 @@ Now that we've talked about the caveats, let's have a look what you actually get

  ## Features

- - Buffering/Streaming for both reading and indexing. Files are read using streaming and Elasticsearch ingestion is done using buffered bulk indexing. This is tailored towards ingestion of large files. Successfully tested so far with JSON and CSV files in the range of 20-30 GBytes. On a single machine running both `node-es-transformer` and Elasticsearch ingestion rates up to 20k documents/second were achieved (2,9 GHz Intel Core i7, 16GByte RAM, SSD).
+ - Buffering/Streaming for both reading and indexing. Files are read using streaming and Elasticsearch ingestion is done using buffered bulk indexing. This is tailored towards ingestion of large files. Successfully tested so far with JSON and CSV files in the range of 20-30 GBytes. On a single machine running both `node-es-transformer` and Elasticsearch ingestion rates up to 20k documents/second were achieved (2,9 GHz Intel Core i7, 16GByte RAM, SSD), depending on document size.
  - Supports wildcards to ingest/transform a range of files in one go.
  - Supports fetching documents from existing indices using search/scroll. This allows you to reindex with custom data transformations just using JavaScript in the `transform` callback.
  - The `transform` callback gives you each source document, but you can split it up in multiple ones and return an array of documents. An example use case for this: Each source document is a Tweet and you want to transform that into an entity centric index based on Hashtags.
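The Tweet/Hashtag use case from the feature list above can be sketched as a plain `transform` callback. The callback contract (return an array to emit multiple documents, return `undefined` to skip one) is from the README; the tweet field names here are invented for illustration:

```javascript
// Hypothetical transform callback: split one tweet document into
// one entity-centric document per hashtag.
function transform(doc) {
  const tweet = typeof doc === 'string' ? JSON.parse(doc) : doc;

  if (!Array.isArray(tweet.hashtags) || tweet.hashtags.length === 0) {
    // returning undefined skips indexing this document
    return undefined;
  }

  // returning an array emits multiple target documents from one source doc
  return tweet.hashtags.map((hashtag) => ({
    hashtag,
    tweet_id: tweet.id,
    timestamp: tweet.timestamp,
  }));
}
```

Passing this function as the `transform` option would fan each tweet out into one document per hashtag in the target index.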
@@ -113,7 +115,11 @@ transformer({
  - `splitRegex`: Custom line split regex, defaults to `/\n/`.
  - `sourceIndexName`: The source Elasticsearch index to reindex from. If this is set, `fileName` is not allowed.
  - `targetIndexName`: The target Elasticsearch index where documents will be indexed.
- - `mappings`: Elasticsearch document mapping.
+ - `mappings`: Optional Elasticsearch document mappings. If not set and you're reindexing from another index, the mappings from the existing index will be used.
+ - `mappingsOverride`: If you're reindexing and this is set to `true`, `mappings` will be applied on top of the source index's mappings. Defaults to `false`.
+ - `indexMappingTotalFieldsLimit`: Optional field limit for the target index to be created that will be passed on as the `index.mapping.total_fields.limit` setting.
+ - `populatedFields`: If `true`, fetches a set of random documents to identify which fields are actually used by documents. Can be useful for indices with lots of field mappings to increase query/reindex performance. Defaults to `false`.
+ - `query`: Optional Elasticsearch [DSL query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html) to filter documents from the source index.
  - `skipHeader`: If true, skips the first line of the source file. Defaults to `false`.
  - `transform(line)`: A callback function which allows the transformation of a source line into one or several documents.
  - `verbose`: Logging verbosity, defaults to `true`
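In the bundled code further down, the new `mappingsOverride` option boils down to merging the override's field definitions over the `properties` of the mappings fetched from the source index. A minimal standalone sketch of that merge (the index field names here are invented):

```javascript
// Sketch of the mappingsOverride merge: override fields win over the
// source index's properties; untouched fields are kept as-is.
function mergeMappings(sourceMappings, overrideFields) {
  return Object.assign({}, sourceMappings, {
    properties: Object.assign({}, sourceMappings.properties, overrideFields),
  });
}

const source = { properties: { title: { type: 'text' }, views: { type: 'long' } } };
const merged = mergeMappings(source, { views: { type: 'integer' } });
// merged keeps `title` from the source and replaces `views` with the override
```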
@@ -137,7 +143,17 @@ yarn

  `yarn dev` builds the library, then keeps rebuilding it whenever the source files change using [rollup-watch](https://github.com/rollup/rollup-watch).

- `yarn test` builds the library, then tests it.
+ `yarn test` runs the tests. The tests expect that you have an Elasticsearch instance running without security at `http://localhost:9200`. Using docker, you can set this up with:
+
+ ```bash
+ # Download the docker image
+ docker pull docker.elastic.co/elasticsearch/elasticsearch:8.10.4
+
+ # Run the container
+ docker run --name es01 --net elastic -p 9200:9200 -it -m 1GB -e "discovery.type=single-node" -e "xpack.security.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:8.10.4
+ ```
+
+ To commit, use `cz`. To prepare a release, use e.g. `yarn release -- --release-as 1.0.0-beta2`.

  ## License

@@ -8,20 +8,26 @@ var glob = _interopDefault(require('glob'));
  var cliProgress = _interopDefault(require('cli-progress'));
  var elasticsearch = _interopDefault(require('@elastic/elasticsearch'));

+ var DEFAULT_BUFFER_SIZE = 1000;
+
  function createMappingFactory(ref) {
    var sourceClient = ref.sourceClient;
    var sourceIndexName = ref.sourceIndexName;
    var targetClient = ref.targetClient;
    var targetIndexName = ref.targetIndexName;
    var mappings = ref.mappings;
+   var mappingsOverride = ref.mappingsOverride;
+   var indexMappingTotalFieldsLimit = ref.indexMappingTotalFieldsLimit;
    var verbose = ref.verbose;

    return async function () {
-     var targetMappings = mappings;
+     var targetMappings = mappingsOverride ? undefined : mappings;

      if (sourceClient && sourceIndexName && typeof targetMappings === 'undefined') {
        try {
-         var mapping = await sourceClient.indices.getMapping({ index: sourceIndexName });
+         var mapping = await sourceClient.indices.getMapping({
+           index: sourceIndexName,
+         });
          targetMappings = mapping[sourceIndexName].mappings;
        } catch (err) {
          console.log('Error reading source mapping', err);
@@ -30,13 +36,24 @@ function createMappingFactory(ref) {
      }

      if (typeof targetMappings === 'object' && targetMappings !== null) {
+       if (mappingsOverride) {
+         targetMappings = Object.assign({}, targetMappings,
+           {properties: Object.assign({}, targetMappings.properties,
+             mappings)});
+       }
+
        try {
-         var resp = await targetClient.indices.create(
-           {
-             index: targetIndexName,
-             body: { mappings: targetMappings },
-           }
-         );
+         var resp = await targetClient.indices.create({
+           index: targetIndexName,
+           body: Object.assign({}, {mappings: targetMappings},
+             (indexMappingTotalFieldsLimit !== undefined
+               ? {
+                   settings: {
+                     'index.mapping.total_fields.limit': indexMappingTotalFieldsLimit,
+                   },
+                 }
+               : {})),
+         });
          if (verbose) { console.log('Created target mapping', resp); }
        } catch (err) {
          console.log('Error creating target mapping', err);
@@ -45,45 +62,78 @@ function createMappingFactory(ref) {
    };
  }

+ var MAX_QUEUE_SIZE = 15;
+
  function fileReaderFactory(indexer, fileName, transform, splitRegex, verbose) {
    function startIndex(files) {
+     var ingestQueueSize = 0;
+     var finished = false;
+
      var file = files.shift();
-     var s = fs.createReadStream(file)
+     var s = fs
+       .createReadStream(file)
        .pipe(es.split(splitRegex))
-       .pipe(es.mapSync(function (line) {
-         s.pause();
-         try {
-           var doc = (typeof transform === 'function') ? transform(line) : line;
-           // if doc is undefined we'll skip indexing it
-           if (typeof doc === 'undefined') {
-             s.resume();
-             return;
-           }
+       .pipe(
+         es
+           .mapSync(function (line) {
+             try {
+               // skip empty lines
+               if (line === '') {
+                 return;
+               }
+
+               var doc =
+                 typeof transform === 'function'
+                   ? JSON.stringify(transform(JSON.parse(line)))
+                   : line;
+
+               // if doc is undefined we'll skip indexing it
+               if (typeof doc === 'undefined') {
+                 s.resume();
+                 return;
+               }
+
+               // the transform callback may return an array of docs so we can emit
+               // multiple docs from a single line
+               if (Array.isArray(doc)) {
+                 doc.forEach(function (d) { return indexer.add(d); });
+                 return;
+               }
+
+               indexer.add(doc);
+             } catch (e) {
+               console.log('error', e);
+             }
+           })
+           .on('error', function (err) {
+             console.log('Error while reading file.', err);
+           })
+           .on('end', function () {
+             if (verbose) { console.log('Read entire file: ', file); }
+             if (files.length > 0) {
+               startIndex(files);
+               return;
+             }
+
+             indexer.finish();
+             finished = true;
+           })
+       );

-         // the transform callback may return an array of docs so we can emit
-         // multiple docs from a single line
-         if (Array.isArray(doc)) {
-           doc.forEach(function (d) { return indexer.add(d); });
-           return;
-         }
+     indexer.queueEmitter.on('queue-size', async function (size) {
+       if (finished) { return; }
+       ingestQueueSize = size;

-         indexer.add(doc);
-       } catch (e) {
-         console.log('error', e);
-       }
-     })
-     .on('error', function (err) {
-       console.log('Error while reading file.', err);
-     })
-     .on('end', function () {
-       if (verbose) { console.log('Read entire file: ', file); }
-       indexer.finish();
-       if (files.length > 0) {
-         startIndex(files);
-       }
-     }));
+       if (ingestQueueSize < MAX_QUEUE_SIZE) {
+         s.resume();
+       } else {
+         s.pause();
+       }
+     });

      indexer.queueEmitter.on('resume', function () {
+       if (finished) { return; }
+       ingestQueueSize = 0;
        s.resume();
      });
    }
@@ -99,110 +149,202 @@ var EventEmitter = require('events');

  var queueEmitter = new EventEmitter();

+ var parallelCalls = 1;
+
  // a simple helper queue to bulk index documents
  function indexQueueFactory(ref) {
    var client = ref.targetClient;
    var targetIndexName = ref.targetIndexName;
-   var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = 1000;
+   var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
    var skipHeader = ref.skipHeader; if ( skipHeader === void 0 ) skipHeader = false;
    var verbose = ref.verbose; if ( verbose === void 0 ) verbose = true;

    var buffer = [];
    var queue = [];
-   var ingesting = false;
+   var ingesting = 0;
+   var ingestTimes = [];
+   var finished = false;

-   var ingest = async function (b) {
+   var ingest = function (b) {
      if (typeof b !== 'undefined') {
        queue.push(b);
        queueEmitter.emit('queue-size', queue.length);
      }

-     if (ingesting === false) {
+     if (ingestTimes.length > 5) { ingestTimes = ingestTimes.slice(-5); }
+
+     if (ingesting < parallelCalls) {
        var docs = queue.shift();
-       queueEmitter.emit('queue-size', queue.length);
-       ingesting = true;
-       if (verbose) { console.log(("bulk ingest docs: " + (docs.length / 2) + ", queue length: " + (queue.length))); }

-       try {
-         await client.bulk({ body: docs });
-         ingesting = false;
-         if (queue.length > 0) {
-           ingest();
-         }
-       } catch (err) {
-         console.log('bulk index error', err);
+       queueEmitter.emit('queue-size', queue.length);
+       if (queue.length <= 5) {
+         queueEmitter.emit('resume');
        }
-     }

-     // console.log(`ingest: queue.length ${queue.length}`);
-     if (queue.length === 0) {
-       queueEmitter.emit('queue-size', 0);
-       queueEmitter.emit('resume');
+       ingesting += 1;
+
+       if (verbose)
+         { console.log(("bulk ingest docs: " + (docs.length / 2) + ", queue length: " + (queue.length))); }
+
+       var start = Date.now();
+       client
+         .bulk({ body: docs })
+         .then(function () {
+           var end = Date.now();
+           var delta = end - start;
+           ingestTimes.push(delta);
+           ingesting -= 1;
+
+           var ingestTimesMovingAverage =
+             ingestTimes.length > 0
+               ? ingestTimes.reduce(function (p, c) { return p + c; }, 0) / ingestTimes.length
+               : 0;
+           var ingestTimesMovingAverageSeconds = Math.floor(ingestTimesMovingAverage / 1000);
+
+           if (
+             ingestTimes.length > 0 &&
+             ingestTimesMovingAverageSeconds < 30 &&
+             parallelCalls < 10
+           ) {
+             parallelCalls += 1;
+           } else if (
+             ingestTimes.length > 0 &&
+             ingestTimesMovingAverageSeconds >= 30 &&
+             parallelCalls > 1
+           ) {
+             parallelCalls -= 1;
+           }
+
+           if (queue.length > 0) {
+             ingest();
+           } else if (queue.length === 0 && finished) {
+             queueEmitter.emit('finish');
+           }
+         })
+         .catch(function (error) {
+           console.error(error);
+           ingesting -= 1;
+           parallelCalls = 1;
+           if (queue.length > 0) {
+             ingest();
+           }
+         });
      }
    };

    return {
      add: function (doc) {
+       if (finished) {
+         throw new Error('Unexpected doc added after indexer should finish.');
+       }
+
        if (!skipHeader) {
          var header = { index: { _index: targetIndexName } };
          buffer.push(header);
        }
        buffer.push(doc);

-       // console.log(`add: queue.length ${queue.length}`);
        if (queue.length === 0) {
          queueEmitter.emit('resume');
        }

-       if (buffer.length >= (bufferSize * 2)) {
+       if (buffer.length >= bufferSize * 2) {
          ingest(buffer);
          buffer = [];
        }
      },
-     finish: async function () {
-       await ingest(buffer);
-       buffer = [];
-       queueEmitter.emit('finish');
+     finish: function () {
+       finished = true;
+
+       if (buffer.length > 0) {
+         ingest(buffer);
+         buffer = [];
+       } else if (queue.length === 0 && ingesting === 0) {
+         queueEmitter.emit('finish');
+       }
      },
      queueEmitter: queueEmitter,
    };
  }
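The adaptive concurrency added to `indexQueueFactory` above keys off a moving average over the last five bulk-request durations. A standalone sketch of just that adjustment rule, with thresholds mirroring the bundled code (grow up to 10 parallel calls while the average stays under 30s, shrink above that):

```javascript
// Mirror of the bundled adjustment rule: parallelCalls grows while recent
// bulk requests complete quickly and shrinks when they slow down.
function adjustParallelCalls(parallelCalls, ingestTimes) {
  // only consider the five most recent bulk request durations (ms)
  const recent = ingestTimes.slice(-5);
  if (recent.length === 0) return parallelCalls;

  const avgSeconds = Math.floor(
    recent.reduce((p, c) => p + c, 0) / recent.length / 1000
  );

  if (avgSeconds < 30 && parallelCalls < 10) return parallelCalls + 1;
  if (avgSeconds >= 30 && parallelCalls > 1) return parallelCalls - 1;
  return parallelCalls;
}
```

On a bulk error the bundled code additionally resets `parallelCalls` to 1, so concurrency backs off hard and has to be re-earned.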
- var MAX_QUEUE_SIZE = 5;
+ var MAX_QUEUE_SIZE$1 = 15;

  // create a new progress bar instance and use shades_classic theme
  var progressBar = new cliProgress.SingleBar({}, cliProgress.Presets.shades_classic);

- function indexReaderFactory(indexer, sourceIndexName, transform, client) {
+ function indexReaderFactory(
+   indexer,
+   sourceIndexName,
+   transform,
+   client,
+   query,
+   bufferSize,
+   populatedFields
+ ) {
+   if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
+   if ( populatedFields === void 0 ) populatedFields = false;
+
    return async function indexReader() {
      var responseQueue = [];
      var docsNum = 0;

-     function search() {
-       return client.search({
-         index: sourceIndexName,
-         scroll: '30s',
-         size: 10000,
-       });
+     async function fetchPopulatedFields() {
+       try {
+         var response = await client.search({
+           index: sourceIndexName,
+           size: bufferSize,
+           query: {
+             function_score: {
+               query: query,
+               random_score: {},
+             },
+           },
+         });
+
+         // Get all field names for each returned doc and flatten it
+         // to a list of unique field names used across all docs.
+         return new Set(response.hits.hits.map(function (d) { return Object.keys(d._source); }).flat(1));
+       } catch (e) {
+         console.log('error', e);
+       }
+     }
+
+     function search(fields) {
+       return client.search(Object.assign({}, {index: sourceIndexName,
+         scroll: '600s',
+         size: bufferSize,
+         query: query},
+         (fields ? { _source: fields } : {})));
      }

      function scroll(id) {
        return client.scroll({
          scroll_id: id,
-         scroll: '30s',
+         scroll: '600s',
        });
      }

+     var fieldsWithData;
+
+     // identify populated fields
+     if (populatedFields) {
+       fieldsWithData = await fetchPopulatedFields();
+       console.log('fieldsWithData', fieldsWithData);
+     }
+
      // start things off by searching, setting a scroll timeout, and pushing
      // our first response into the queue to be processed
-     var se = await search();
+     var se = await search(fieldsWithData);
      responseQueue.push(se);
      progressBar.start(se.hits.total.value, 0);
+     console.log('se', se.hits.hits[0]);

      function processHit(hit) {
        docsNum += 1;
        try {
-         var doc = (typeof transform === 'function') ? transform(hit._source) : hit._source; // eslint-disable-line no-underscore-dangle
+         var doc = typeof transform === 'function' ? transform(hit._source) : hit._source; // eslint-disable-line no-underscore-dangle
+         // console.log('doc', doc);
+
          // if doc is undefined we'll skip indexing it
          if (typeof doc === 'undefined') {
            return;
@@ -236,15 +378,13 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
        progressBar.update(docsNum);

        // check to see if we have collected all of the docs
-       // console.log('check count', response.hits.total.value, docsNum);
        if (response.hits.total.value === docsNum) {
          indexer.finish();
-         progressBar.stop();
          break;
        }

-       if (ingestQueueSize < MAX_QUEUE_SIZE) {
-         // get the next response if there are more docs to fetch
+       if (ingestQueueSize < MAX_QUEUE_SIZE$1) {
+         // get the next response if there are more docs to fetch
          var sc = await scroll(response._scroll_id); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
          scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
          responseQueue.push(sc);
@@ -257,8 +397,8 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
      indexer.queueEmitter.on('queue-size', async function (size) {
        ingestQueueSize = size;

-       if (!readActive && ingestQueueSize < MAX_QUEUE_SIZE) {
-         // get the next response if there are more docs to fetch
+       if (!readActive && ingestQueueSize < MAX_QUEUE_SIZE$1) {
+         // get the next response if there are more docs to fetch
          var sc = await scroll(scrollId); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
          scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
          responseQueue.push(sc);
@@ -280,6 +420,10 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
        processResponseQueue();
      });

+     indexer.queueEmitter.on('finish', function () {
+       progressBar.stop();
+     });
+
      processResponseQueue();
    };
  }
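The `populatedFields` sampling in `fetchPopulatedFields` above reduces a page of randomly scored hits to the set of top-level field names that actually carry data. The extraction step in isolation, with fabricated hits standing in for a real search response:

```javascript
// Collect the unique top-level field names across a sample of hits,
// the same flatten-to-Set step fetchPopulatedFields applies to its
// random_score sample.
function collectPopulatedFields(hits) {
  return new Set(hits.map((d) => Object.keys(d._source)).flat(1));
}

const fields = collectPopulatedFields([
  { _source: { a: 1, b: 2 } },
  { _source: { b: 3, c: 4 } },
]);
// fields contains 'a', 'b' and 'c'
```

Passing the resulting set as `_source` to the scroll search means only fields seen in the sample are fetched, which is the point of the option for sparsely populated indices with very wide mappings (fields absent from the sample would be dropped, which is the trade-off of sampling).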
@@ -288,12 +432,16 @@ async function transformer(ref) {
    var deleteIndex = ref.deleteIndex; if ( deleteIndex === void 0 ) deleteIndex = false;
    var sourceClientConfig = ref.sourceClientConfig;
    var targetClientConfig = ref.targetClientConfig;
-   var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = 1000;
+   var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
    var fileName = ref.fileName;
    var splitRegex = ref.splitRegex; if ( splitRegex === void 0 ) splitRegex = /\n/;
    var sourceIndexName = ref.sourceIndexName;
    var targetIndexName = ref.targetIndexName;
    var mappings = ref.mappings;
+   var mappingsOverride = ref.mappingsOverride; if ( mappingsOverride === void 0 ) mappingsOverride = false;
+   var indexMappingTotalFieldsLimit = ref.indexMappingTotalFieldsLimit;
+   var populatedFields = ref.populatedFields; if ( populatedFields === void 0 ) populatedFields = false;
+   var query = ref.query;
    var skipHeader = ref.skipHeader; if ( skipHeader === void 0 ) skipHeader = false;
    var transform = ref.transform;
    var verbose = ref.verbose; if ( verbose === void 0 ) verbose = true;
@@ -317,6 +465,8 @@ async function transformer(ref) {
      targetClient: targetClient,
      targetIndexName: targetIndexName,
      mappings: mappings,
+     mappingsOverride: mappingsOverride,
+     indexMappingTotalFieldsLimit: indexMappingTotalFieldsLimit,
      verbose: verbose,
    });
    var indexer = indexQueueFactory({
@@ -328,30 +478,16 @@ async function transformer(ref) {
    });

    function getReader() {
-     if (
-       typeof fileName !== 'undefined'
-       && typeof sourceIndexName !== 'undefined'
-     ) {
-       throw Error(
-         'Only either one of fileName or sourceIndexName can be specified.'
-       );
+     if (typeof fileName !== 'undefined' && typeof sourceIndexName !== 'undefined') {
+       throw Error('Only either one of fileName or sourceIndexName can be specified.');
      }

-     if (
-       typeof fileName === 'undefined'
-       && typeof sourceIndexName === 'undefined'
-     ) {
+     if (typeof fileName === 'undefined' && typeof sourceIndexName === 'undefined') {
        throw Error('Either fileName or sourceIndexName must be specified.');
      }

      if (typeof fileName !== 'undefined') {
-       return fileReaderFactory(
-         indexer,
-         fileName,
-         transform,
-         splitRegex,
-         verbose
-       );
+       return fileReaderFactory(indexer, fileName, transform, splitRegex, verbose);
      }

      if (typeof sourceIndexName !== 'undefined') {
@@ -359,7 +495,10 @@ async function transformer(ref) {
        indexer,
        sourceIndexName,
        transform,
-       sourceClient
+       sourceClient,
+       query,
+       bufferSize,
+       populatedFields
      );
    }

@@ -4,20 +4,26 @@ import glob from 'glob';
  import cliProgress from 'cli-progress';
  import elasticsearch from '@elastic/elasticsearch';

+ var DEFAULT_BUFFER_SIZE = 1000;
+
  function createMappingFactory(ref) {
    var sourceClient = ref.sourceClient;
    var sourceIndexName = ref.sourceIndexName;
    var targetClient = ref.targetClient;
    var targetIndexName = ref.targetIndexName;
    var mappings = ref.mappings;
+   var mappingsOverride = ref.mappingsOverride;
+   var indexMappingTotalFieldsLimit = ref.indexMappingTotalFieldsLimit;
    var verbose = ref.verbose;

    return async function () {
-     var targetMappings = mappings;
+     var targetMappings = mappingsOverride ? undefined : mappings;

      if (sourceClient && sourceIndexName && typeof targetMappings === 'undefined') {
        try {
-         var mapping = await sourceClient.indices.getMapping({ index: sourceIndexName });
+         var mapping = await sourceClient.indices.getMapping({
+           index: sourceIndexName,
+         });
          targetMappings = mapping[sourceIndexName].mappings;
        } catch (err) {
          console.log('Error reading source mapping', err);
@@ -26,13 +32,24 @@ function createMappingFactory(ref) {
      }

      if (typeof targetMappings === 'object' && targetMappings !== null) {
+       if (mappingsOverride) {
+         targetMappings = Object.assign({}, targetMappings,
+           {properties: Object.assign({}, targetMappings.properties,
+             mappings)});
+       }
+
        try {
-         var resp = await targetClient.indices.create(
-           {
-             index: targetIndexName,
-             body: { mappings: targetMappings },
-           }
-         );
+         var resp = await targetClient.indices.create({
+           index: targetIndexName,
+           body: Object.assign({}, {mappings: targetMappings},
+             (indexMappingTotalFieldsLimit !== undefined
+               ? {
+                   settings: {
+                     'index.mapping.total_fields.limit': indexMappingTotalFieldsLimit,
+                   },
+                 }
+               : {})),
+         });
          if (verbose) { console.log('Created target mapping', resp); }
        } catch (err) {
          console.log('Error creating target mapping', err);
@@ -41,45 +58,78 @@ function createMappingFactory(ref) {
    };
  }

+ var MAX_QUEUE_SIZE = 15;
+
  function fileReaderFactory(indexer, fileName, transform, splitRegex, verbose) {
    function startIndex(files) {
+     var ingestQueueSize = 0;
+     var finished = false;
+
      var file = files.shift();
-     var s = fs.createReadStream(file)
+     var s = fs
+       .createReadStream(file)
        .pipe(es.split(splitRegex))
-       .pipe(es.mapSync(function (line) {
-         s.pause();
-         try {
-           var doc = (typeof transform === 'function') ? transform(line) : line;
-           // if doc is undefined we'll skip indexing it
-           if (typeof doc === 'undefined') {
-             s.resume();
-             return;
-           }
+       .pipe(
+         es
+           .mapSync(function (line) {
+             try {
+               // skip empty lines
+               if (line === '') {
+                 return;
+               }
+
+               var doc =
+                 typeof transform === 'function'
+                   ? JSON.stringify(transform(JSON.parse(line)))
+                   : line;
+
+               // if doc is undefined we'll skip indexing it
+               if (typeof doc === 'undefined') {
+                 s.resume();
+                 return;
+               }
+
+               // the transform callback may return an array of docs so we can emit
+               // multiple docs from a single line
+               if (Array.isArray(doc)) {
+                 doc.forEach(function (d) { return indexer.add(d); });
+                 return;
+               }
+
+               indexer.add(doc);
+             } catch (e) {
+               console.log('error', e);
+             }
+           })
+           .on('error', function (err) {
+             console.log('Error while reading file.', err);
+           })
+           .on('end', function () {
+             if (verbose) { console.log('Read entire file: ', file); }
+             if (files.length > 0) {
+               startIndex(files);
+               return;
+             }
+
+             indexer.finish();
+             finished = true;
+           })
+       );

-         // the transform callback may return an array of docs so we can emit
-         // multiple docs from a single line
-         if (Array.isArray(doc)) {
-           doc.forEach(function (d) { return indexer.add(d); });
-           return;
-         }
+     indexer.queueEmitter.on('queue-size', async function (size) {
+       if (finished) { return; }
+       ingestQueueSize = size;

-         indexer.add(doc);
-       } catch (e) {
-         console.log('error', e);
-       }
-     })
-     .on('error', function (err) {
-       console.log('Error while reading file.', err);
-     })
-     .on('end', function () {
-       if (verbose) { console.log('Read entire file: ', file); }
-       indexer.finish();
-       if (files.length > 0) {
-         startIndex(files);
-       }
-     }));
+       if (ingestQueueSize < MAX_QUEUE_SIZE) {
+         s.resume();
+       } else {
+         s.pause();
+       }
+     });

      indexer.queueEmitter.on('resume', function () {
+       if (finished) { return; }
+       ingestQueueSize = 0;
        s.resume();
      });
    }
@@ -95,110 +145,202 @@ var EventEmitter = require('events');

  var queueEmitter = new EventEmitter();

+ var parallelCalls = 1;
+
  // a simple helper queue to bulk index documents
  function indexQueueFactory(ref) {
    var client = ref.targetClient;
    var targetIndexName = ref.targetIndexName;
-   var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = 1000;
+   var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
    var skipHeader = ref.skipHeader; if ( skipHeader === void 0 ) skipHeader = false;
    var verbose = ref.verbose; if ( verbose === void 0 ) verbose = true;

    var buffer = [];
    var queue = [];
-   var ingesting = false;
+   var ingesting = 0;
+   var ingestTimes = [];
+   var finished = false;

-   var ingest = async function (b) {
+   var ingest = function (b) {
      if (typeof b !== 'undefined') {
        queue.push(b);
        queueEmitter.emit('queue-size', queue.length);
      }

-     if (ingesting === false) {
+     if (ingestTimes.length > 5) { ingestTimes = ingestTimes.slice(-5); }
+
+     if (ingesting < parallelCalls) {
        var docs = queue.shift();
-       queueEmitter.emit('queue-size', queue.length);
-       ingesting = true;
-       if (verbose) { console.log(("bulk ingest docs: " + (docs.length / 2) + ", queue length: " + (queue.length))); }

-       try {
-         await client.bulk({ body: docs });
-         ingesting = false;
-         if (queue.length > 0) {
-           ingest();
-         }
-       } catch (err) {
-         console.log('bulk index error', err);
+       queueEmitter.emit('queue-size', queue.length);
+       if (queue.length <= 5) {
+         queueEmitter.emit('resume');
        }
-     }

-     // console.log(`ingest: queue.length ${queue.length}`);
-     if (queue.length === 0) {
-       queueEmitter.emit('queue-size', 0);
-       queueEmitter.emit('resume');
+       ingesting += 1;
+
+       if (verbose)
+         { console.log(("bulk ingest docs: " + (docs.length / 2) + ", queue length: " + (queue.length))); }
+
+       var start = Date.now();
+       client
+         .bulk({ body: docs })
+         .then(function () {
+           var end = Date.now();
+           var delta = end - start;
+           ingestTimes.push(delta);
+           ingesting -= 1;
+
+           var ingestTimesMovingAverage =
+             ingestTimes.length > 0
+               ? ingestTimes.reduce(function (p, c) { return p + c; }, 0) / ingestTimes.length
+               : 0;
+           var ingestTimesMovingAverageSeconds = Math.floor(ingestTimesMovingAverage / 1000);
+
+           if (
+             ingestTimes.length > 0 &&
+             ingestTimesMovingAverageSeconds < 30 &&
+             parallelCalls < 10
+           ) {
+             parallelCalls += 1;
+           } else if (
+             ingestTimes.length > 0 &&
+             ingestTimesMovingAverageSeconds >= 30 &&
+             parallelCalls > 1
+           ) {
+             parallelCalls -= 1;
+           }
+
+           if (queue.length > 0) {
+             ingest();
+           } else if (queue.length === 0 && finished) {
+             queueEmitter.emit('finish');
+           }
+         })
+         .catch(function (error) {
+           console.error(error);
+           ingesting -= 1;
+           parallelCalls = 1;
+           if (queue.length > 0) {
+             ingest();
+           }
+         });
      }
    };

    return {
      add: function (doc) {
+       if (finished) {
+         throw new Error('Unexpected doc added after indexer should finish.');
+       }
+
        if (!skipHeader) {
          var header = { index: { _index: targetIndexName } };
          buffer.push(header);
        }
        buffer.push(doc);

-       // console.log(`add: queue.length ${queue.length}`);
        if (queue.length === 0) {
          queueEmitter.emit('resume');
        }

-       if (buffer.length >= (bufferSize * 2)) {
+       if (buffer.length >= bufferSize * 2) {
          ingest(buffer);
          buffer = [];
        }
      },
-     finish: async function () {
-       await ingest(buffer);
-       buffer = [];
-       queueEmitter.emit('finish');
+     finish: function () {
+       finished = true;
+
+       if (buffer.length > 0) {
+         ingest(buffer);
+         buffer = [];
+       } else if (queue.length === 0 && ingesting === 0) {
+         queueEmitter.emit('finish');
+       }
      },
      queueEmitter: queueEmitter,
    };
  }

- var MAX_QUEUE_SIZE = 5;
+ var MAX_QUEUE_SIZE$1 = 15;

  // create a new progress bar instance and use shades_classic theme
  var progressBar = new cliProgress.SingleBar({}, cliProgress.Presets.shades_classic);

- function indexReaderFactory(indexer, sourceIndexName, transform, client) {
271
+ function indexReaderFactory(
272
+ indexer,
273
+ sourceIndexName,
274
+ transform,
275
+ client,
276
+ query,
277
+ bufferSize,
278
+ populatedFields
279
+ ) {
280
+ if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
281
+ if ( populatedFields === void 0 ) populatedFields = false;
282
+
173
283
  return async function indexReader() {
174
284
  var responseQueue = [];
175
285
  var docsNum = 0;
176
286
 
177
- function search() {
178
- return client.search({
179
- index: sourceIndexName,
180
- scroll: '30s',
181
- size: 10000,
182
- });
287
+ async function fetchPopulatedFields() {
288
+ try {
289
+ var response = await client.search({
290
+ index: sourceIndexName,
291
+ size: bufferSize,
292
+ query: {
293
+ function_score: {
294
+ query: query,
295
+ random_score: {},
296
+ },
297
+ },
298
+ });
299
+
300
+ // Get all field names for each returned doc and flatten it
301
+ // to a list of unique field names used across all docs.
302
+ return new Set(response.hits.hits.map(function (d) { return Object.keys(d._source); }).flat(1));
303
+ } catch (e) {
304
+ console.log('error', e);
305
+ }
306
+ }
307
+
308
+ function search(fields) {
309
+ return client.search(Object.assign({}, {index: sourceIndexName,
310
+ scroll: '600s',
311
+ size: bufferSize,
312
+ query: query},
313
+ (fields ? { _source: fields } : {})));
183
314
  }
184
315
 
185
316
  function scroll(id) {
186
317
  return client.scroll({
187
318
  scroll_id: id,
188
- scroll: '30s',
319
+ scroll: '600s',
189
320
  });
190
321
  }
191
322
 
323
+ var fieldsWithData;
324
+
325
+ // identify populated fields
326
+ if (populatedFields) {
327
+ fieldsWithData = await fetchPopulatedFields();
328
+ console.log('fieldsWithData', fieldsWithData);
329
+ }
330
+
192
331
  // start things off by searching, setting a scroll timeout, and pushing
193
332
  // our first response into the queue to be processed
194
- var se = await search();
333
+ var se = await search(fieldsWithData);
195
334
  responseQueue.push(se);
196
335
  progressBar.start(se.hits.total.value, 0);
336
+ console.log('se', se.hits.hits[0]);
197
337
 
198
338
  function processHit(hit) {
199
339
  docsNum += 1;
200
340
  try {
201
- var doc = (typeof transform === 'function') ? transform(hit._source) : hit._source; // eslint-disable-line no-underscore-dangle
341
+ var doc = typeof transform === 'function' ? transform(hit._source) : hit._source; // eslint-disable-line no-underscore-dangle
342
+ // console.log('doc', doc);
343
+
202
344
  // if doc is undefined we'll skip indexing it
203
345
  if (typeof doc === 'undefined') {
204
346
  return;
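The rewritten `ingest` above replaces the single-flight `ingesting` boolean with adaptive concurrency: a moving average over the last five bulk-call durations raises `parallelCalls` (capped at 10) while calls finish in under 30 seconds and lowers it otherwise, with a reset to 1 on error. A minimal standalone sketch of just that tuning rule — `createTuner` and its `record` method are illustrative helpers, not part of the package API:

```javascript
// Sketch of the adaptive-concurrency rule from the hunk above.
// `ingestTimes` keeps the five most recent bulk durations (ms);
// the moving average decides whether to ramp up or back off.
function createTuner(initialParallelCalls) {
  let parallelCalls = initialParallelCalls;
  let ingestTimes = [];

  function record(deltaMs) {
    ingestTimes.push(deltaMs);
    if (ingestTimes.length > 5) ingestTimes = ingestTimes.slice(-5);

    const avg = ingestTimes.reduce((p, c) => p + c, 0) / ingestTimes.length;
    const avgSeconds = Math.floor(avg / 1000);

    if (avgSeconds < 30 && parallelCalls < 10) {
      parallelCalls += 1; // fast responses: allow more concurrent bulk calls
    } else if (avgSeconds >= 30 && parallelCalls > 1) {
      parallelCalls -= 1; // slow responses: back off
    }
    return parallelCalls;
  }

  return { record };
}

const tuner = createTuner(1);
console.log(tuner.record(500));   // fast call -> 2
console.log(tuner.record(45000)); // average still under 30s -> 3
```

Because the threshold acts on the average rather than a single slow call, one 45-second bulk request does not immediately throttle the pipeline.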
@@ -232,15 +374,13 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
  progressBar.update(docsNum);
 
  // check to see if we have collected all of the docs
- // console.log('check count', response.hits.total.value, docsNum);
  if (response.hits.total.value === docsNum) {
  indexer.finish();
- progressBar.stop();
  break;
  }
 
- if (ingestQueueSize < MAX_QUEUE_SIZE) {
- // get the next response if there are more docs to fetch
+ if (ingestQueueSize < MAX_QUEUE_SIZE$1) {
+ // get the next response if there are more docs to fetch
  var sc = await scroll(response._scroll_id); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
  scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
  responseQueue.push(sc);
@@ -253,8 +393,8 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
  indexer.queueEmitter.on('queue-size', async function (size) {
  ingestQueueSize = size;
 
- if (!readActive && ingestQueueSize < MAX_QUEUE_SIZE) {
- // get the next response if there are more docs to fetch
+ if (!readActive && ingestQueueSize < MAX_QUEUE_SIZE$1) {
+ // get the next response if there are more docs to fetch
  var sc = await scroll(scrollId); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
  scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
  responseQueue.push(sc);
@@ -276,6 +416,10 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
  processResponseQueue();
  });
 
+ indexer.queueEmitter.on('finish', function () {
+ progressBar.stop();
+ });
+
  processResponseQueue();
  };
  }
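The reader hunks above implement simple backpressure: a new scroll page is requested only while fewer than `MAX_QUEUE_SIZE$1` (now 15) batches are pending downstream, and the `queue-size` event resumes reading once the indexer drains. A toy sketch of that loop — `readWithBackpressure` and the stubbed page fetcher are hypothetical stand-ins for the real `client.scroll` plumbing:

```javascript
// Toy backpressure loop: fetch pages only while the downstream
// queue is below the limit; the real code then waits for the
// 'queue-size' event before fetching more.
const MAX_QUEUE_SIZE = 3;

async function readWithBackpressure(fetchPage, queue, totalPages) {
  let fetched = 0;
  while (fetched < totalPages) {
    if (queue.length >= MAX_QUEUE_SIZE) break; // pause until the queue drains
    queue.push(await fetchPage(fetched));
    fetched += 1;
  }
  return fetched;
}

const queue = [];
readWithBackpressure((i) => Promise.resolve(`page-${i}`), queue, 10)
  .then((n) => console.log(n)); // stops at 3 while the queue is full
```

Raising the limit from 5 to 15 lets the reader stay further ahead of the indexer now that bulk calls run concurrently.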
@@ -284,12 +428,16 @@ async function transformer(ref) {
  var deleteIndex = ref.deleteIndex; if ( deleteIndex === void 0 ) deleteIndex = false;
  var sourceClientConfig = ref.sourceClientConfig;
  var targetClientConfig = ref.targetClientConfig;
- var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = 1000;
+ var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
  var fileName = ref.fileName;
  var splitRegex = ref.splitRegex; if ( splitRegex === void 0 ) splitRegex = /\n/;
  var sourceIndexName = ref.sourceIndexName;
  var targetIndexName = ref.targetIndexName;
  var mappings = ref.mappings;
+ var mappingsOverride = ref.mappingsOverride; if ( mappingsOverride === void 0 ) mappingsOverride = false;
+ var indexMappingTotalFieldsLimit = ref.indexMappingTotalFieldsLimit;
+ var populatedFields = ref.populatedFields; if ( populatedFields === void 0 ) populatedFields = false;
+ var query = ref.query;
  var skipHeader = ref.skipHeader; if ( skipHeader === void 0 ) skipHeader = false;
  var transform = ref.transform;
  var verbose = ref.verbose; if ( verbose === void 0 ) verbose = true;
@@ -313,6 +461,8 @@ async function transformer(ref) {
  targetClient: targetClient,
  targetIndexName: targetIndexName,
  mappings: mappings,
+ mappingsOverride: mappingsOverride,
+ indexMappingTotalFieldsLimit: indexMappingTotalFieldsLimit,
  verbose: verbose,
  });
  var indexer = indexQueueFactory({
@@ -324,30 +474,16 @@ async function transformer(ref) {
  });
 
  function getReader() {
- if (
- typeof fileName !== 'undefined'
- && typeof sourceIndexName !== 'undefined'
- ) {
- throw Error(
- 'Only either one of fileName or sourceIndexName can be specified.'
- );
+ if (typeof fileName !== 'undefined' && typeof sourceIndexName !== 'undefined') {
+ throw Error('Only either one of fileName or sourceIndexName can be specified.');
  }
 
- if (
- typeof fileName === 'undefined'
- && typeof sourceIndexName === 'undefined'
- ) {
+ if (typeof fileName === 'undefined' && typeof sourceIndexName === 'undefined') {
  throw Error('Either fileName or sourceIndexName must be specified.');
  }
 
  if (typeof fileName !== 'undefined') {
- return fileReaderFactory(
- indexer,
- fileName,
- transform,
- splitRegex,
- verbose
- );
+ return fileReaderFactory(indexer, fileName, transform, splitRegex, verbose);
  }
 
  if (typeof sourceIndexName !== 'undefined') {
@@ -355,7 +491,10 @@ async function transformer(ref) {
  indexer,
  sourceIndexName,
  transform,
- sourceClient
+ sourceClient,
+ query,
+ bufferSize,
+ populatedFields
  );
  }
 
package/package.json CHANGED
@@ -14,22 +14,30 @@
  "license": "Apache-2.0",
  "author": "Walter Rafelsberger <walter@rafelsberger.at>",
  "contributors": [],
- "version": "1.0.0-alpha9",
+ "version": "1.0.0-beta2",
  "main": "dist/node-es-transformer.cjs.js",
  "module": "dist/node-es-transformer.esm.js",
  "dependencies": {
- "@elastic/elasticsearch": "^8.8.1",
+ "@elastic/elasticsearch": "^8.10.0",
  "cli-progress": "^3.12.0",
  "event-stream": "3.3.4",
  "glob": "7.1.2"
  },
  "devDependencies": {
  "acorn": "^6.4.2",
- "eslint": "8.2.0",
+ "async-retry": "^1.3.3",
+ "commit-and-tag-version": "^11.3.0",
+ "cz-conventional-changelog": "^3.3.0",
+ "eslint": "^8.51.0",
  "eslint-config-airbnb": "19.0.4",
+ "eslint-config-prettier": "^9.0.0",
  "eslint-plugin-import": "2.27.5",
+ "eslint-plugin-jest": "^27.4.2",
  "eslint-plugin-jsx-a11y": "6.7.1",
+ "eslint-plugin-prettier": "^3.3.1",
  "eslint-plugin-react": "7.32.2",
+ "jest": "^29.7.0",
+ "prettier": "^2.2.1",
  "rollup": "0.66.6",
  "rollup-plugin-buble": "0.19.6",
  "rollup-plugin-commonjs": "8.0.2",
@@ -38,10 +46,23 @@
  "scripts": {
  "build": "rollup -c",
  "dev": "rollup -c -w",
- "test": "node test/test.js",
- "pretest": "npm run build"
+ "test": "jest --runInBand --detectOpenHandles --forceExit",
+ "pretest": "npm run build",
+ "release": "commit-and-tag-version",
+ "create-sample-data-10000": "node scripts/create_sample_data_10000",
+ "create-sample-data-100": "node scripts/create_sample_data_100"
  },
  "files": [
  "dist"
- ]
+ ],
+ "config": {
+ "commitizen": {
+ "path": "./node_modules/cz-conventional-changelog"
+ }
+ },
+ "jest": {
+ "testMatch": [
+ "**/__tests__/**/*.test.js"
+ ]
+ }
  }
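Stepping back from the manifest changes: the headline reader feature in this release is the `populatedFields` option, which samples the source index with a `function_score`/`random_score` query and then restricts `_source` on the scroll to fields that actually hold data. The reduction over the sampled hits is plain JavaScript; a self-contained sketch with hypothetical sample hits:

```javascript
// Given sampled hits, collect the unique set of field names that
// appear in any _source — the same map/flat/Set reduction the
// diff's fetchPopulatedFields performs on a search response.
function collectPopulatedFields(hits) {
  return new Set(hits.map((d) => Object.keys(d._source)).flat(1));
}

// Hypothetical sample hits, shaped like Elasticsearch search results.
const sampleHits = [
  { _source: { user: 'a', bytes: 10 } },
  { _source: { user: 'b', geo: { lat: 1 } } },
];
console.log([...collectPopulatedFields(sampleHits)]);
// -> [ 'user', 'bytes', 'geo' ]
```

Note this only inspects top-level `_source` keys of the sampled documents, so sparsely populated fields can be missed if the random sample happens not to contain them.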