node-es-transformer 1.0.0-alpha9 → 1.0.0-beta1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,6 +1,8 @@
  [![npm](https://img.shields.io/npm/v/node-es-transformer.svg?maxAge=2592000)](https://www.npmjs.com/package/node-es-transformer)
  [![npm](https://img.shields.io/npm/l/node-es-transformer.svg?maxAge=2592000)](https://www.npmjs.com/package/node-es-transformer)
  [![npm](https://img.shields.io/npm/dt/node-es-transformer.svg?maxAge=2592000)](https://www.npmjs.com/package/node-es-transformer)
+ [![Commitizen friendly](https://img.shields.io/badge/commitizen-friendly-brightgreen.svg)](http://commitizen.github.io/cz-cli/)
+ [![CI](https://github.com/walterra/node-es-transformer/actions/workflows/ci.yml/badge.svg)](https://github.com/walterra/node-es-transformer/actions)

  # node-es-transformer

@@ -10,7 +12,7 @@ A nodejs based library to (re)index and transform data from/to Elasticsearch.

  If you're looking for a nodejs based tool which allows you to ingest large CSV/JSON files in the GigaBytes you've come to the right place. Everything else I've tried with larger files runs out of JS heap, hammers ES with too many single requests, times out or tries to do everything with a single bulk request.

- While I'd generally recommend using [Logstash](https://www.elastic.co/products/logstash), [filebeat](https://www.elastic.co/products/beats/filebeat) or [Ingest Nodes](https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest.html) for established use cases, this tool may be of help especially if you feel more at home in the JavaScript/nodejs universe and have use cases with customized ingestion and data transformation needs.
+ While I'd generally recommend using [Logstash](https://www.elastic.co/products/logstash), [filebeat](https://www.elastic.co/products/beats/filebeat), [Ingest Nodes](https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest.html), [Elastic Agent](https://www.elastic.co/guide/en/fleet/current/fleet-overview.html) or [Elasticsearch Transforms](https://www.elastic.co/guide/en/elasticsearch/reference/current/transforms.html) for established use cases, this tool may be of help especially if you feel more at home in the JavaScript/nodejs universe and have use cases with customized ingestion and data transformation needs.

  **This is experimental code, use at your own risk. Nonetheless, I encourage you to give it a try so I can gather some feedback.**

@@ -26,7 +28,7 @@ Now that we've talked about the caveats, let's have a look what you actually get

  ## Features

- - Buffering/Streaming for both reading and indexing. Files are read using streaming and Elasticsearch ingestion is done using buffered bulk indexing. This is tailored towards ingestion of large files. Successfully tested so far with JSON and CSV files in the range of 20-30 GBytes. On a single machine running both `node-es-transformer` and Elasticsearch ingestion rates up to 20k documents/second were achieved (2,9 GHz Intel Core i7, 16GByte RAM, SSD).
+ - Buffering/Streaming for both reading and indexing. Files are read using streaming and Elasticsearch ingestion is done using buffered bulk indexing. This is tailored towards ingestion of large files. Successfully tested so far with JSON and CSV files in the range of 20-30 GBytes. On a single machine running both `node-es-transformer` and Elasticsearch ingestion rates up to 20k documents/second were achieved (2,9 GHz Intel Core i7, 16GByte RAM, SSD), depending on document size.
  - Supports wildcards to ingest/transform a range of files in one go.
  - Supports fetching documents from existing indices using search/scroll. This allows you to reindex with custom data transformations just using JavaScript in the `transform` callback.
  - The `transform` callback gives you each source document, but you can split it up in multiple ones and return an array of documents. An example use case for this: Each source document is a Tweet and you want to transform that into an entity centric index based on Hashtags.
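The one-to-many `transform` fan-out described in the feature list can be sketched like this. This is a minimal illustration, not code from the package; the tweet fields `text` and `user` are made-up assumptions for the example:

```javascript
// Sketch of a `transform` callback that fans one source line out into
// several target documents. The tweet shape (`text`, `user`) is assumed
// for illustration; any fields from your own source data work the same.
function transform(line) {
  var tweet = JSON.parse(line);
  var hashtags = tweet.text.match(/#\w+/g);

  // Returning undefined skips indexing this line entirely.
  if (hashtags === null) return undefined;

  // Returning an array emits one document per hashtag.
  return hashtags.map(function (tag) {
    return { hashtag: tag.toLowerCase(), user: tweet.user };
  });
}
```

Passing this function as the `transform` option would turn each tweet into one document per hashtag, the entity centric layout mentioned above.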
@@ -113,7 +115,10 @@ transformer({
  - `splitRegex`: Custom line split regex, defaults to `/\n/`.
  - `sourceIndexName`: The source Elasticsearch index to reindex from. If this is set, `fileName` is not allowed.
  - `targetIndexName`: The target Elasticsearch index where documents will be indexed.
- - `mappings`: Elasticsearch document mapping.
+ - `mappings`: Optional Elasticsearch document mappings. If not set and you're reindexing from another index, the mappings from the existing index will be used.
+ - `mappingsOverride`: If you're reindexing and this is set to `true`, `mappings` will be applied on top of the source index's mappings. Defaults to `false`.
+ - `indexMappingTotalFieldsLimit`: Optional field limit for the target index to be created that will be passed on as the `index.mapping.total_fields.limit` setting.
+ - `query`: Optional Elasticsearch [DSL query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html) to filter documents from the source index.
  - `skipHeader`: If true, skips the first line of the source file. Defaults to `false`.
  - `transform(line)`: A callback function which allows the transformation of a source line into one or several documents.
  - `verbose`: Logging verbosity, defaults to `true`
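The options added in beta1 (`query`, `mappingsOverride`, `indexMappingTotalFieldsLimit`) combine naturally for a filtered reindex. A sketch of such an options object — the index names and field names are placeholders, only the option keys come from the list above:

```javascript
// Sketch of a filtered-reindex configuration using the new beta1 options.
// Index and field names here are placeholders, not from the package docs.
var reindexOptions = {
  sourceIndexName: 'tweets',
  targetIndexName: 'tweets-by-hashtag',
  // Only documents matching this DSL query are fetched from the source.
  query: { exists: { field: 'hashtags' } },
  // With mappingsOverride, these field mappings are layered on top of the
  // source index's mappings instead of replacing them.
  mappings: { hashtag: { type: 'keyword' } },
  mappingsOverride: true,
  // Raises index.mapping.total_fields.limit on the target index.
  indexMappingTotalFieldsLimit: 2000,
};
```

This object would then be passed to `transformer(reindexOptions)` against a running Elasticsearch instance.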
@@ -137,7 +142,15 @@ yarn

  `yarn dev` builds the library, then keeps rebuilding it whenever the source files change using [rollup-watch](https://github.com/rollup/rollup-watch).

- `yarn test` builds the library, then tests it.
+ `yarn test` runs the tests. The tests expect that you have an Elasticsearch instance running without security at `http://localhost:9200`. Using docker, you can set this up with:
+
+ ```bash
+ # Download the docker image
+ docker pull docker.elastic.co/elasticsearch/elasticsearch:8.10.4
+
+ # Run the container
+ docker run --name es01 --net elastic -p 9200:9200 -it -m 1GB -e "discovery.type=single-node" -e "xpack.security.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:8.10.4
+ ```

  ## License

@@ -8,20 +8,26 @@ var glob = _interopDefault(require('glob'));
  var cliProgress = _interopDefault(require('cli-progress'));
  var elasticsearch = _interopDefault(require('@elastic/elasticsearch'));

+ var DEFAULT_BUFFER_SIZE = 1000;
+
  function createMappingFactory(ref) {
    var sourceClient = ref.sourceClient;
    var sourceIndexName = ref.sourceIndexName;
    var targetClient = ref.targetClient;
    var targetIndexName = ref.targetIndexName;
    var mappings = ref.mappings;
+   var mappingsOverride = ref.mappingsOverride;
+   var indexMappingTotalFieldsLimit = ref.indexMappingTotalFieldsLimit;
    var verbose = ref.verbose;

    return async function () {
-     var targetMappings = mappings;
+     var targetMappings = mappingsOverride ? undefined : mappings;

      if (sourceClient && sourceIndexName && typeof targetMappings === 'undefined') {
        try {
-         var mapping = await sourceClient.indices.getMapping({ index: sourceIndexName });
+         var mapping = await sourceClient.indices.getMapping({
+           index: sourceIndexName,
+         });
          targetMappings = mapping[sourceIndexName].mappings;
        } catch (err) {
          console.log('Error reading source mapping', err);
@@ -30,13 +36,24 @@ function createMappingFactory(ref) {
    }

    if (typeof targetMappings === 'object' && targetMappings !== null) {
+     if (mappingsOverride) {
+       targetMappings = Object.assign({}, targetMappings,
+         {properties: Object.assign({}, targetMappings.properties,
+           mappings)});
+     }
+
      try {
-       var resp = await targetClient.indices.create(
-         {
-           index: targetIndexName,
-           body: { mappings: targetMappings },
-         }
-       );
+       var resp = await targetClient.indices.create({
+         index: targetIndexName,
+         body: Object.assign({}, {mappings: targetMappings},
+           (indexMappingTotalFieldsLimit !== undefined
+             ? {
+                 settings: {
+                   'index.mapping.total_fields.limit': indexMappingTotalFieldsLimit,
+                 },
+               }
+             : {})),
+       });
        if (verbose) { console.log('Created target mapping', resp); }
      } catch (err) {
        console.log('Error creating target mapping', err);
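The `mappingsOverride` merge in the hunk above is a shallow merge of user-supplied field definitions over the source index's existing `properties`. As a standalone sketch (the function name `mergeMappings` is ours, not the package's):

```javascript
// Standalone sketch of the mappingsOverride merge: user-supplied field
// mappings are layered over the source index's existing `properties`,
// leaving all other top-level mapping keys intact.
function mergeMappings(sourceMappings, overrideProperties) {
  return Object.assign({}, sourceMappings, {
    properties: Object.assign({}, sourceMappings.properties, overrideProperties),
  });
}
```

Note the merge is shallow per field: overriding a field replaces its whole definition rather than deep-merging sub-options.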
@@ -45,45 +62,69 @@ function createMappingFactory(ref) {
    };
  }

+ var MAX_QUEUE_SIZE = 15;
+
  function fileReaderFactory(indexer, fileName, transform, splitRegex, verbose) {
    function startIndex(files) {
+     var ingestQueueSize = 0;
+     var finished = false;
+
      var file = files.shift();
-     var s = fs.createReadStream(file)
+     var s = fs
+       .createReadStream(file)
        .pipe(es.split(splitRegex))
-       .pipe(es.mapSync(function (line) {
-         s.pause();
-         try {
-           var doc = (typeof transform === 'function') ? transform(line) : line;
-           // if doc is undefined we'll skip indexing it
-           if (typeof doc === 'undefined') {
-             s.resume();
-             return;
-           }
+       .pipe(
+         es
+           .mapSync(function (line) {
+             try {
+               var doc = typeof transform === 'function' ? transform(line) : line;
+               // if doc is undefined we'll skip indexing it
+               if (typeof doc === 'undefined') {
+                 s.resume();
+                 return;
+               }
+
+               // the transform callback may return an array of docs so we can emit
+               // multiple docs from a single line
+               if (Array.isArray(doc)) {
+                 doc.forEach(function (d) { return indexer.add(d); });
+                 return;
+               }
+
+               indexer.add(doc);
+             } catch (e) {
+               console.log('error', e);
+             }
+           })
+           .on('error', function (err) {
+             console.log('Error while reading file.', err);
+           })
+           .on('end', function () {
+             if (verbose) { console.log('Read entire file: ', file); }
+             if (files.length > 0) {
+               startIndex(files);
+               return;
+             }
+
+             indexer.finish();
+             finished = true;
+           })
+       );

-           // the transform callback may return an array of docs so we can emit
-           // multiple docs from a single line
-           if (Array.isArray(doc)) {
-             doc.forEach(function (d) { return indexer.add(d); });
-             return;
-           }
+     indexer.queueEmitter.on('queue-size', async function (size) {
+       if (finished) { return; }
+       ingestQueueSize = size;

-           indexer.add(doc);
-         } catch (e) {
-           console.log('error', e);
-         }
-       })
-       .on('error', function (err) {
-         console.log('Error while reading file.', err);
-       })
-       .on('end', function () {
-         if (verbose) { console.log('Read entire file: ', file); }
-         indexer.finish();
-         if (files.length > 0) {
-           startIndex(files);
-         }
-       }));
+       if (ingestQueueSize < MAX_QUEUE_SIZE) {
+         s.resume();
+       } else {
+         s.pause();
+       }
+     });

      indexer.queueEmitter.on('resume', function () {
+       if (finished) { return; }
+       ingestQueueSize = 0;
        s.resume();
      });
    }
@@ -99,81 +140,139 @@ var EventEmitter = require('events');

  var queueEmitter = new EventEmitter();

+ var parallelCalls = 1;
+
  // a simple helper queue to bulk index documents
  function indexQueueFactory(ref) {
    var client = ref.targetClient;
    var targetIndexName = ref.targetIndexName;
-   var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = 1000;
+   var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
    var skipHeader = ref.skipHeader; if ( skipHeader === void 0 ) skipHeader = false;
    var verbose = ref.verbose; if ( verbose === void 0 ) verbose = true;

    var buffer = [];
    var queue = [];
-   var ingesting = false;
+   var ingesting = 0;
+   var ingestTimes = [];
+   var finished = false;

-   var ingest = async function (b) {
+   var ingest = function (b) {
      if (typeof b !== 'undefined') {
        queue.push(b);
        queueEmitter.emit('queue-size', queue.length);
      }

-     if (ingesting === false) {
+     if (ingestTimes.length > 5) { ingestTimes = ingestTimes.slice(-5); }
+
+     if (ingesting < parallelCalls) {
        var docs = queue.shift();
-       queueEmitter.emit('queue-size', queue.length);
-       ingesting = true;
-       if (verbose) { console.log(("bulk ingest docs: " + (docs.length / 2) + ", queue length: " + (queue.length))); }

-       try {
-         await client.bulk({ body: docs });
-         ingesting = false;
-         if (queue.length > 0) {
-           ingest();
-         }
-       } catch (err) {
-         console.log('bulk index error', err);
+       queueEmitter.emit('queue-size', queue.length);
+       if (queue.length <= 5) {
+         queueEmitter.emit('resume');
        }
-     }

-     // console.log(`ingest: queue.length ${queue.length}`);
-     if (queue.length === 0) {
-       queueEmitter.emit('queue-size', 0);
-       queueEmitter.emit('resume');
+       ingesting += 1;
+
+       if (verbose)
+         { console.log(("bulk ingest docs: " + (docs.length / 2) + ", queue length: " + (queue.length))); }
+
+       var start = Date.now();
+       client
+         .bulk({ body: docs })
+         .then(function () {
+           var end = Date.now();
+           var delta = end - start;
+           ingestTimes.push(delta);
+           ingesting -= 1;
+
+           var ingestTimesMovingAverage =
+             ingestTimes.length > 0
+               ? ingestTimes.reduce(function (p, c) { return p + c; }, 0) / ingestTimes.length
+               : 0;
+           var ingestTimesMovingAverageSeconds = Math.floor(ingestTimesMovingAverage / 1000);
+
+           if (
+             ingestTimes.length > 0 &&
+             ingestTimesMovingAverageSeconds < 30 &&
+             parallelCalls < 10
+           ) {
+             parallelCalls += 1;
+           } else if (
+             ingestTimes.length > 0 &&
+             ingestTimesMovingAverageSeconds >= 30 &&
+             parallelCalls > 1
+           ) {
+             parallelCalls -= 1;
+           }
+
+           if (queue.length > 0) {
+             ingest();
+           } else if (queue.length === 0 && finished) {
+             queueEmitter.emit('finish');
+           }
+         })
+         .catch(function (error) {
+           console.error(error);
+           ingesting -= 1;
+           parallelCalls = 1;
+           if (queue.length > 0) {
+             ingest();
+           }
+         });
      }
    };

    return {
      add: function (doc) {
+       if (finished) {
+         throw new Error('Unexpected doc added after indexer should finish.');
+       }
+
        if (!skipHeader) {
          var header = { index: { _index: targetIndexName } };
          buffer.push(header);
        }
        buffer.push(doc);

-       // console.log(`add: queue.length ${queue.length}`);
        if (queue.length === 0) {
          queueEmitter.emit('resume');
        }

-       if (buffer.length >= (bufferSize * 2)) {
+       if (buffer.length >= bufferSize * 2) {
          ingest(buffer);
          buffer = [];
        }
      },
-     finish: async function () {
-       await ingest(buffer);
-       buffer = [];
-       queueEmitter.emit('finish');
+     finish: function () {
+       finished = true;
+
+       if (buffer.length > 0) {
+         ingest(buffer);
+         buffer = [];
+       } else if (queue.length === 0 && ingesting === 0) {
+         queueEmitter.emit('finish');
+       }
      },
      queueEmitter: queueEmitter,
    };
  }
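The bulk queue above also tunes its own concurrency: a moving average over the last few bulk-request durations widens the number of parallel bulk calls while requests stay fast and narrows it when they slow down. A standalone sketch of that decision (the function name `adjustParallelCalls` is ours; the thresholds mirror the bundled code: 30s average, 1–10 parallel calls):

```javascript
// Sketch of the adaptive concurrency rule: average the last 5 bulk
// durations (in ms); scale up below a 30s average, back off at or above it.
function adjustParallelCalls(parallelCalls, ingestTimesMs) {
  var recent = ingestTimesMs.slice(-5);
  if (recent.length === 0) return parallelCalls;
  var avgSeconds = Math.floor(
    recent.reduce(function (p, c) { return p + c; }, 0) / recent.length / 1000
  );
  if (avgSeconds < 30 && parallelCalls < 10) return parallelCalls + 1;
  if (avgSeconds >= 30 && parallelCalls > 1) return parallelCalls - 1;
  return parallelCalls;
}
```

In the bundle, a failed bulk request additionally resets the parallelism back to 1.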

- var MAX_QUEUE_SIZE = 5;
+ var MAX_QUEUE_SIZE$1 = 15;

  // create a new progress bar instance and use shades_classic theme
  var progressBar = new cliProgress.SingleBar({}, cliProgress.Presets.shades_classic);

- function indexReaderFactory(indexer, sourceIndexName, transform, client) {
+ function indexReaderFactory(
+   indexer,
+   sourceIndexName,
+   transform,
+   client,
+   query,
+   bufferSize
+ ) {
+   if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
+
    return async function indexReader() {
      var responseQueue = [];
      var docsNum = 0;
@@ -182,7 +281,8 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
      return client.search({
        index: sourceIndexName,
        scroll: '30s',
-       size: 10000,
+       size: bufferSize,
+       query: query,
      });
    }

@@ -202,7 +302,7 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
    function processHit(hit) {
      docsNum += 1;
      try {
-       var doc = (typeof transform === 'function') ? transform(hit._source) : hit._source; // eslint-disable-line no-underscore-dangle
+       var doc = typeof transform === 'function' ? transform(hit._source) : hit._source; // eslint-disable-line no-underscore-dangle
        // if doc is undefined we'll skip indexing it
        if (typeof doc === 'undefined') {
          return;
@@ -236,15 +336,13 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
      progressBar.update(docsNum);

      // check to see if we have collected all of the docs
-     // console.log('check count', response.hits.total.value, docsNum);
      if (response.hits.total.value === docsNum) {
        indexer.finish();
-       progressBar.stop();
        break;
      }

-     if (ingestQueueSize < MAX_QUEUE_SIZE) {
-       // get the next response if there are more docs to fetch
+     if (ingestQueueSize < MAX_QUEUE_SIZE$1) {
+       // get the next response if there are more docs to fetch
        var sc = await scroll(response._scroll_id); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
        scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
        responseQueue.push(sc);
@@ -257,8 +355,8 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
      indexer.queueEmitter.on('queue-size', async function (size) {
        ingestQueueSize = size;

-       if (!readActive && ingestQueueSize < MAX_QUEUE_SIZE) {
-         // get the next response if there are more docs to fetch
+       if (!readActive && ingestQueueSize < MAX_QUEUE_SIZE$1) {
+         // get the next response if there are more docs to fetch
          var sc = await scroll(scrollId); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
          scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
          responseQueue.push(sc);
@@ -280,6 +378,10 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
      processResponseQueue();
    });

+   indexer.queueEmitter.on('finish', function () {
+     progressBar.stop();
+   });
+
    processResponseQueue();
  };
}
@@ -288,12 +390,15 @@ async function transformer(ref) {
  var deleteIndex = ref.deleteIndex; if ( deleteIndex === void 0 ) deleteIndex = false;
  var sourceClientConfig = ref.sourceClientConfig;
  var targetClientConfig = ref.targetClientConfig;
- var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = 1000;
+ var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
  var fileName = ref.fileName;
  var splitRegex = ref.splitRegex; if ( splitRegex === void 0 ) splitRegex = /\n/;
  var sourceIndexName = ref.sourceIndexName;
  var targetIndexName = ref.targetIndexName;
  var mappings = ref.mappings;
+ var mappingsOverride = ref.mappingsOverride; if ( mappingsOverride === void 0 ) mappingsOverride = false;
+ var indexMappingTotalFieldsLimit = ref.indexMappingTotalFieldsLimit;
+ var query = ref.query;
  var skipHeader = ref.skipHeader; if ( skipHeader === void 0 ) skipHeader = false;
  var transform = ref.transform;
  var verbose = ref.verbose; if ( verbose === void 0 ) verbose = true;
@@ -317,6 +422,8 @@ async function transformer(ref) {
    targetClient: targetClient,
    targetIndexName: targetIndexName,
    mappings: mappings,
+   mappingsOverride: mappingsOverride,
+   indexMappingTotalFieldsLimit: indexMappingTotalFieldsLimit,
    verbose: verbose,
  });
  var indexer = indexQueueFactory({
@@ -328,30 +435,16 @@ async function transformer(ref) {
  });

  function getReader() {
-   if (
-     typeof fileName !== 'undefined'
-     && typeof sourceIndexName !== 'undefined'
-   ) {
-     throw Error(
-       'Only either one of fileName or sourceIndexName can be specified.'
-     );
+   if (typeof fileName !== 'undefined' && typeof sourceIndexName !== 'undefined') {
+     throw Error('Only either one of fileName or sourceIndexName can be specified.');
    }

-   if (
-     typeof fileName === 'undefined'
-     && typeof sourceIndexName === 'undefined'
-   ) {
+   if (typeof fileName === 'undefined' && typeof sourceIndexName === 'undefined') {
      throw Error('Either fileName or sourceIndexName must be specified.');
    }

    if (typeof fileName !== 'undefined') {
-     return fileReaderFactory(
-       indexer,
-       fileName,
-       transform,
-       splitRegex,
-       verbose
-     );
+     return fileReaderFactory(indexer, fileName, transform, splitRegex, verbose);
    }

    if (typeof sourceIndexName !== 'undefined') {
@@ -359,7 +452,9 @@ async function transformer(ref) {
        indexer,
        sourceIndexName,
        transform,
-       sourceClient
+       sourceClient,
+       query,
+       bufferSize
      );
    }

@@ -4,20 +4,26 @@ import glob from 'glob';
  import cliProgress from 'cli-progress';
  import elasticsearch from '@elastic/elasticsearch';

+ var DEFAULT_BUFFER_SIZE = 1000;
+
  function createMappingFactory(ref) {
    var sourceClient = ref.sourceClient;
    var sourceIndexName = ref.sourceIndexName;
    var targetClient = ref.targetClient;
    var targetIndexName = ref.targetIndexName;
    var mappings = ref.mappings;
+   var mappingsOverride = ref.mappingsOverride;
+   var indexMappingTotalFieldsLimit = ref.indexMappingTotalFieldsLimit;
    var verbose = ref.verbose;

    return async function () {
-     var targetMappings = mappings;
+     var targetMappings = mappingsOverride ? undefined : mappings;

      if (sourceClient && sourceIndexName && typeof targetMappings === 'undefined') {
        try {
-         var mapping = await sourceClient.indices.getMapping({ index: sourceIndexName });
+         var mapping = await sourceClient.indices.getMapping({
+           index: sourceIndexName,
+         });
          targetMappings = mapping[sourceIndexName].mappings;
        } catch (err) {
          console.log('Error reading source mapping', err);
@@ -26,13 +32,24 @@ function createMappingFactory(ref) {
    }

    if (typeof targetMappings === 'object' && targetMappings !== null) {
+     if (mappingsOverride) {
+       targetMappings = Object.assign({}, targetMappings,
+         {properties: Object.assign({}, targetMappings.properties,
+           mappings)});
+     }
+
      try {
-       var resp = await targetClient.indices.create(
-         {
-           index: targetIndexName,
-           body: { mappings: targetMappings },
-         }
-       );
+       var resp = await targetClient.indices.create({
+         index: targetIndexName,
+         body: Object.assign({}, {mappings: targetMappings},
+           (indexMappingTotalFieldsLimit !== undefined
+             ? {
+                 settings: {
+                   'index.mapping.total_fields.limit': indexMappingTotalFieldsLimit,
+                 },
+               }
+             : {})),
+       });
        if (verbose) { console.log('Created target mapping', resp); }
      } catch (err) {
        console.log('Error creating target mapping', err);
@@ -41,45 +58,69 @@ function createMappingFactory(ref) {
    };
  }

+ var MAX_QUEUE_SIZE = 15;
+
  function fileReaderFactory(indexer, fileName, transform, splitRegex, verbose) {
    function startIndex(files) {
+     var ingestQueueSize = 0;
+     var finished = false;
+
      var file = files.shift();
-     var s = fs.createReadStream(file)
+     var s = fs
+       .createReadStream(file)
        .pipe(es.split(splitRegex))
-       .pipe(es.mapSync(function (line) {
-         s.pause();
-         try {
-           var doc = (typeof transform === 'function') ? transform(line) : line;
-           // if doc is undefined we'll skip indexing it
-           if (typeof doc === 'undefined') {
-             s.resume();
-             return;
-           }
+       .pipe(
+         es
+           .mapSync(function (line) {
+             try {
+               var doc = typeof transform === 'function' ? transform(line) : line;
+               // if doc is undefined we'll skip indexing it
+               if (typeof doc === 'undefined') {
+                 s.resume();
+                 return;
+               }
+
+               // the transform callback may return an array of docs so we can emit
+               // multiple docs from a single line
+               if (Array.isArray(doc)) {
+                 doc.forEach(function (d) { return indexer.add(d); });
+                 return;
+               }
+
+               indexer.add(doc);
+             } catch (e) {
+               console.log('error', e);
+             }
+           })
+           .on('error', function (err) {
+             console.log('Error while reading file.', err);
+           })
+           .on('end', function () {
+             if (verbose) { console.log('Read entire file: ', file); }
+             if (files.length > 0) {
+               startIndex(files);
+               return;
+             }
+
+             indexer.finish();
+             finished = true;
+           })
+       );

-           // the transform callback may return an array of docs so we can emit
-           // multiple docs from a single line
-           if (Array.isArray(doc)) {
-             doc.forEach(function (d) { return indexer.add(d); });
-             return;
-           }
+     indexer.queueEmitter.on('queue-size', async function (size) {
+       if (finished) { return; }
+       ingestQueueSize = size;

-           indexer.add(doc);
-         } catch (e) {
-           console.log('error', e);
-         }
-       })
-       .on('error', function (err) {
-         console.log('Error while reading file.', err);
-       })
-       .on('end', function () {
-         if (verbose) { console.log('Read entire file: ', file); }
-         indexer.finish();
-         if (files.length > 0) {
-           startIndex(files);
-         }
-       }));
+       if (ingestQueueSize < MAX_QUEUE_SIZE) {
+         s.resume();
+       } else {
+         s.pause();
+       }
+     });

      indexer.queueEmitter.on('resume', function () {
+       if (finished) { return; }
+       ingestQueueSize = 0;
        s.resume();
      });
    }
@@ -95,81 +136,139 @@ var EventEmitter = require('events');

  var queueEmitter = new EventEmitter();

+ var parallelCalls = 1;
+
  // a simple helper queue to bulk index documents
  function indexQueueFactory(ref) {
    var client = ref.targetClient;
    var targetIndexName = ref.targetIndexName;
-   var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = 1000;
+   var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
    var skipHeader = ref.skipHeader; if ( skipHeader === void 0 ) skipHeader = false;
    var verbose = ref.verbose; if ( verbose === void 0 ) verbose = true;

    var buffer = [];
    var queue = [];
-   var ingesting = false;
+   var ingesting = 0;
+   var ingestTimes = [];
+   var finished = false;

-   var ingest = async function (b) {
+   var ingest = function (b) {
      if (typeof b !== 'undefined') {
        queue.push(b);
        queueEmitter.emit('queue-size', queue.length);
      }

-     if (ingesting === false) {
+     if (ingestTimes.length > 5) { ingestTimes = ingestTimes.slice(-5); }
+
+     if (ingesting < parallelCalls) {
        var docs = queue.shift();
-       queueEmitter.emit('queue-size', queue.length);
-       ingesting = true;
-       if (verbose) { console.log(("bulk ingest docs: " + (docs.length / 2) + ", queue length: " + (queue.length))); }

-       try {
-         await client.bulk({ body: docs });
-         ingesting = false;
-         if (queue.length > 0) {
-           ingest();
-         }
-       } catch (err) {
-         console.log('bulk index error', err);
+       queueEmitter.emit('queue-size', queue.length);
+       if (queue.length <= 5) {
+         queueEmitter.emit('resume');
        }
-     }

-     // console.log(`ingest: queue.length ${queue.length}`);
-     if (queue.length === 0) {
-       queueEmitter.emit('queue-size', 0);
-       queueEmitter.emit('resume');
+       ingesting += 1;
+
+       if (verbose)
+         { console.log(("bulk ingest docs: " + (docs.length / 2) + ", queue length: " + (queue.length))); }
+
+       var start = Date.now();
+       client
+         .bulk({ body: docs })
+         .then(function () {
+           var end = Date.now();
+           var delta = end - start;
+           ingestTimes.push(delta);
+           ingesting -= 1;
+
+           var ingestTimesMovingAverage =
+             ingestTimes.length > 0
+               ? ingestTimes.reduce(function (p, c) { return p + c; }, 0) / ingestTimes.length
+               : 0;
+           var ingestTimesMovingAverageSeconds = Math.floor(ingestTimesMovingAverage / 1000);
+
+           if (
+             ingestTimes.length > 0 &&
+             ingestTimesMovingAverageSeconds < 30 &&
+             parallelCalls < 10
+           ) {
+             parallelCalls += 1;
+           } else if (
+             ingestTimes.length > 0 &&
+             ingestTimesMovingAverageSeconds >= 30 &&
+             parallelCalls > 1
+           ) {
+             parallelCalls -= 1;
+           }
+
+           if (queue.length > 0) {
+             ingest();
+           } else if (queue.length === 0 && finished) {
+             queueEmitter.emit('finish');
+           }
+         })
+         .catch(function (error) {
+           console.error(error);
+           ingesting -= 1;
+           parallelCalls = 1;
+           if (queue.length > 0) {
+             ingest();
+           }
+         });
      }
    };

    return {
      add: function (doc) {
+       if (finished) {
+         throw new Error('Unexpected doc added after indexer should finish.');
+       }
+
        if (!skipHeader) {
          var header = { index: { _index: targetIndexName } };
          buffer.push(header);
        }
        buffer.push(doc);

-       // console.log(`add: queue.length ${queue.length}`);
        if (queue.length === 0) {
          queueEmitter.emit('resume');
        }

-       if (buffer.length >= (bufferSize * 2)) {
+       if (buffer.length >= bufferSize * 2) {
          ingest(buffer);
          buffer = [];
        }
      },
-     finish: async function () {
-       await ingest(buffer);
-       buffer = [];
-       queueEmitter.emit('finish');
+     finish: function () {
+       finished = true;
+
+       if (buffer.length > 0) {
+         ingest(buffer);
+         buffer = [];
+       } else if (queue.length === 0 && ingesting === 0) {
+         queueEmitter.emit('finish');
+       }
      },
      queueEmitter: queueEmitter,
    };
  }

- var MAX_QUEUE_SIZE = 5;
+ var MAX_QUEUE_SIZE$1 = 15;

  // create a new progress bar instance and use shades_classic theme
  var progressBar = new cliProgress.SingleBar({}, cliProgress.Presets.shades_classic);

- function indexReaderFactory(indexer, sourceIndexName, transform, client) {
+ function indexReaderFactory(
+   indexer,
+   sourceIndexName,
+   transform,
+   client,
+   query,
+   bufferSize
+ ) {
+   if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
+
    return async function indexReader() {
      var responseQueue = [];
      var docsNum = 0;
@@ -178,7 +277,8 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
      return client.search({
        index: sourceIndexName,
        scroll: '30s',
-       size: 10000,
+       size: bufferSize,
+       query: query,
      });
    }

@@ -198,7 +298,7 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
    function processHit(hit) {
      docsNum += 1;
      try {
-       var doc = (typeof transform === 'function') ? transform(hit._source) : hit._source; // eslint-disable-line no-underscore-dangle
+       var doc = typeof transform === 'function' ? transform(hit._source) : hit._source; // eslint-disable-line no-underscore-dangle
        // if doc is undefined we'll skip indexing it
        if (typeof doc === 'undefined') {
          return;
@@ -232,15 +332,13 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
      progressBar.update(docsNum);

      // check to see if we have collected all of the docs
-     // console.log('check count', response.hits.total.value, docsNum);
      if (response.hits.total.value === docsNum) {
        indexer.finish();
-       progressBar.stop();
        break;
      }

-     if (ingestQueueSize < MAX_QUEUE_SIZE) {
-       // get the next response if there are more docs to fetch
+     if (ingestQueueSize < MAX_QUEUE_SIZE$1) {
+       // get the next response if there are more docs to fetch
        var sc = await scroll(response._scroll_id); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
        scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
        responseQueue.push(sc);
@@ -253,8 +351,8 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
      indexer.queueEmitter.on('queue-size', async function (size) {
254
352
  ingestQueueSize = size;
255
353
 
256
- if (!readActive && ingestQueueSize < MAX_QUEUE_SIZE) {
257
- // get the next response if there are more docs to fetch
354
+ if (!readActive && ingestQueueSize < MAX_QUEUE_SIZE$1) {
355
+ // get the next response if there are more docs to fetch
258
356
  var sc = await scroll(scrollId); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
259
357
  scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
260
358
  responseQueue.push(sc);
@@ -276,6 +374,10 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
276
374
  processResponseQueue();
277
375
  });
278
376
 
377
+ indexer.queueEmitter.on('finish', function () {
378
+ progressBar.stop();
379
+ });
380
+
279
381
  processResponseQueue();
280
382
  };
281
383
  }
@@ -284,12 +386,15 @@ async function transformer(ref) {
284
386
  var deleteIndex = ref.deleteIndex; if ( deleteIndex === void 0 ) deleteIndex = false;
285
387
  var sourceClientConfig = ref.sourceClientConfig;
286
388
  var targetClientConfig = ref.targetClientConfig;
287
- var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = 1000;
389
+ var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
288
390
  var fileName = ref.fileName;
289
391
  var splitRegex = ref.splitRegex; if ( splitRegex === void 0 ) splitRegex = /\n/;
290
392
  var sourceIndexName = ref.sourceIndexName;
291
393
  var targetIndexName = ref.targetIndexName;
292
394
  var mappings = ref.mappings;
395
+ var mappingsOverride = ref.mappingsOverride; if ( mappingsOverride === void 0 ) mappingsOverride = false;
396
+ var indexMappingTotalFieldsLimit = ref.indexMappingTotalFieldsLimit;
397
+ var query = ref.query;
293
398
  var skipHeader = ref.skipHeader; if ( skipHeader === void 0 ) skipHeader = false;
294
399
  var transform = ref.transform;
295
400
  var verbose = ref.verbose; if ( verbose === void 0 ) verbose = true;
@@ -313,6 +418,8 @@ async function transformer(ref) {
313
418
  targetClient: targetClient,
314
419
  targetIndexName: targetIndexName,
315
420
  mappings: mappings,
421
+ mappingsOverride: mappingsOverride,
422
+ indexMappingTotalFieldsLimit: indexMappingTotalFieldsLimit,
316
423
  verbose: verbose,
317
424
  });
318
425
  var indexer = indexQueueFactory({
@@ -324,30 +431,16 @@ async function transformer(ref) {
324
431
  });
325
432
 
326
433
  function getReader() {
327
- if (
328
- typeof fileName !== 'undefined'
329
- && typeof sourceIndexName !== 'undefined'
330
- ) {
331
- throw Error(
332
- 'Only either one of fileName or sourceIndexName can be specified.'
333
- );
434
+ if (typeof fileName !== 'undefined' && typeof sourceIndexName !== 'undefined') {
435
+ throw Error('Only either one of fileName or sourceIndexName can be specified.');
334
436
  }
335
437
 
336
- if (
337
- typeof fileName === 'undefined'
338
- && typeof sourceIndexName === 'undefined'
339
- ) {
438
+ if (typeof fileName === 'undefined' && typeof sourceIndexName === 'undefined') {
340
439
  throw Error('Either fileName or sourceIndexName must be specified.');
341
440
  }
342
441
 
343
442
  if (typeof fileName !== 'undefined') {
344
- return fileReaderFactory(
345
- indexer,
346
- fileName,
347
- transform,
348
- splitRegex,
349
- verbose
350
- );
443
+ return fileReaderFactory(indexer, fileName, transform, splitRegex, verbose);
351
444
  }
352
445
 
353
446
  if (typeof sourceIndexName !== 'undefined') {
@@ -355,7 +448,9 @@ async function transformer(ref) {
355
448
  indexer,
356
449
  sourceIndexName,
357
450
  transform,
358
- sourceClient
451
+ sourceClient,
452
+ query,
453
+ bufferSize
359
454
  );
360
455
  }
361
456
 
package/package.json CHANGED
@@ -14,22 +14,30 @@
   "license": "Apache-2.0",
   "author": "Walter Rafelsberger <walter@rafelsberger.at>",
   "contributors": [],
-  "version": "1.0.0-alpha9",
+  "version": "1.0.0-beta1",
   "main": "dist/node-es-transformer.cjs.js",
   "module": "dist/node-es-transformer.esm.js",
   "dependencies": {
-    "@elastic/elasticsearch": "^8.8.1",
+    "@elastic/elasticsearch": "^8.10.0",
     "cli-progress": "^3.12.0",
     "event-stream": "3.3.4",
     "glob": "7.1.2"
   },
   "devDependencies": {
     "acorn": "^6.4.2",
-    "eslint": "8.2.0",
+    "commit-and-tag-version": "^11.3.0",
+    "cz-conventional-changelog": "^3.3.0",
+    "eslint": "^8.51.0",
     "eslint-config-airbnb": "19.0.4",
+    "eslint-config-prettier": "^9.0.0",
     "eslint-plugin-import": "2.27.5",
+    "eslint-plugin-jest": "^27.4.2",
     "eslint-plugin-jsx-a11y": "6.7.1",
+    "eslint-plugin-prettier": "^3.3.1",
     "eslint-plugin-react": "7.32.2",
+    "frisby": "^2.1.3",
+    "jest": "^29.7.0",
+    "prettier": "^2.2.1",
     "rollup": "0.66.6",
     "rollup-plugin-buble": "0.19.6",
     "rollup-plugin-commonjs": "8.0.2",
@@ -38,10 +46,18 @@
   "scripts": {
     "build": "rollup -c",
     "dev": "rollup -c -w",
-    "test": "node test/test.js",
-    "pretest": "npm run build"
+    "test": "jest",
+    "pretest": "npm run build",
+    "release": "commit-and-tag-version",
+    "create-sample-data-10000": "node scripts/create_sample_data_10000",
+    "create-sample-data-100": "node scripts/create_sample_data_100"
   },
   "files": [
     "dist"
-  ]
+  ],
+  "config": {
+    "commitizen": {
+      "path": "./node_modules/cz-conventional-changelog"
+    }
+  }
 }