node-es-transformer 1.0.0-alpha7 → 1.0.0-alpha8

package/README.md CHANGED
@@ -12,7 +12,6 @@ If you're looking for a nodejs based tool which allows you to ingest large CSV/J
12
12
 
13
13
  While I'd generally recommend using [Logstash](https://www.elastic.co/products/logstash), [filebeat](https://www.elastic.co/products/beats/filebeat) or [Ingest Nodes](https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest.html) for established use cases, this tool may be of help, especially if you feel more at home in the JavaScript/nodejs universe and have use cases with customized ingestion and data transformation needs.
14
14
 
15
-
16
15
  **This is experimental code, use at your own risk. Nonetheless, I encourage you to give it a try so I can gather some feedback.**
17
16
 
18
17
  ### So why is this still _alpha_?
@@ -21,13 +20,13 @@ While I'd generally recommend using [Logstash](https://www.elastic.co/products/l
21
20
  - The code needs more safety measures to avoid possible accidental data-loss scenarios.
22
21
  - No test coverage yet.
23
22
 
24
- ----
23
+ ---
25
24
 
26
25
  Now that we've talked about the caveats, let's have a look at what you actually get with this tool:
27
26
 
28
27
  ## Features
29
28
 
30
- - Buffering/Streaming for both reading and indexing. Files are read using streaming and Elasticsearch ingestion is done using buffered bulk indexing. This is tailored towards ingestion of large files. Successfully tested so far with JSON and CSV files in the range of 20-30 GBytes. On a single machine running both `node-es-transformer` and Elasticsearch, ingestion rates of up to 20k documents/second were achieved (2.9 GHz Intel Core i7, 16 GByte RAM, SSD).
29
+ - Buffering/Streaming for both reading and indexing. Files are read using streaming and Elasticsearch ingestion is done using buffered bulk indexing. This is tailored towards ingestion of large files. Successfully tested so far with JSON and CSV files in the range of 20-30 GBytes. On a single machine running both `node-es-transformer` and Elasticsearch, ingestion rates of up to 20k documents/second were achieved (2.9 GHz Intel Core i7, 16 GByte RAM, SSD).
31
30
  - Supports wildcards to ingest/transform a range of files in one go.
32
31
  - Supports fetching documents from existing indices using search/scroll. This allows you to reindex with custom data transformations using just JavaScript in the `transform` callback.
33
32
  - The `transform` callback gives you each source document, but you can split it up into multiple ones and return an array of documents. An example use case: each source document is a tweet and you want to transform that into an entity-centric index based on hashtags.
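
For illustration, here is a minimal sketch of such a splitting `transform` callback. The file pattern, index name and tweet shape (`created_at`, `text`, `entities.hashtags`) are hypothetical and only serve to show the mechanics:

```js
const transformer = require('node-es-transformer');

transformer({
  // wildcards let you ingest several files in one go
  fileName: 'tweets-*.ndjson', // hypothetical file pattern
  targetIndexName: 'tweets-by-hashtag', // hypothetical index name
  transform(line) {
    const tweet = JSON.parse(line);
    const hashtags = (tweet.entities && tweet.entities.hashtags) || [];

    // returning undefined skips indexing this line
    if (hashtags.length === 0) return undefined;

    // returning an array emits one document per hashtag,
    // turning a tweet-centric source into a hashtag-centric index
    return hashtags.map((hashtag) => ({
      '@timestamp': tweet.created_at,
      hashtag: hashtag.text,
      text: tweet.text,
    }));
  },
});
```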
@@ -46,9 +45,8 @@ const transformer = require('node-es-transformer');
46
45
  transformer({
47
46
  fileName: 'filename.json',
48
47
  targetIndexName: 'my-index',
49
- typeName: 'doc',
50
48
  mappings: {
51
- doc: {
49
+ _doc: {
52
50
  properties: {
53
51
  '@timestamp': {
54
52
  type: 'date'
@@ -82,9 +80,9 @@ const transformer = require('node-es-transformer');
82
80
  transformer({
83
81
  sourceIndexName: 'my-source-index',
84
82
  targetIndexName: 'my-target-index',
85
- typeName: 'doc',
83
+ // optional, if you skip mappings, they will be fetched from the source index.
86
84
  mappings: {
87
- doc: {
85
+ _doc: {
88
86
  properties: {
89
87
  '@timestamp': {
90
88
  type: 'date'
@@ -112,15 +110,18 @@ transformer({
112
110
 
113
111
  ### Options
114
112
 
115
- - `deleteIndex`: Setting to automatically delete an existing index, default is `false`.
116
- - `host`: Elasticsearch host, defaults to `localhost`.
117
- - `port`: Elasticsearch port, defaults to `9200`.
113
+ - `deleteIndex`: Setting to automatically delete an existing index, default is `false`.
114
+ - `protocol`/`targetProtocol`: Elasticsearch protocol, defaults to `http`.
115
+ - `host`/`targetHost`: Elasticsearch host, defaults to `localhost`.
116
+ - `port`/`targetPort`: Elasticsearch port, defaults to `9200`.
117
+ - `auth`/`targetAuth`: Optional Elasticsearch authorization object, for example `{ username: 'elastic', password: 'changeme' }`.
118
+ - `rejectUnauthorized`: Elasticsearch TLS option; set to `false` to accept self-signed or otherwise unverifiable certificates. Defaults to `true`.
119
+ - `ca`: Optional path to a certificate used for TLS configuration.
118
120
  - `bufferSize`: The number of documents inserted with each Elasticsearch bulk insert request, default is `1000`.
119
121
  - `fileName`: Source filename to ingest, supports wildcards. If this is set, `sourceIndexName` is not allowed.
120
122
  - `splitRegex`: Custom line split regex, defaults to `/\n/`.
121
123
  - `sourceIndexName`: The source Elasticsearch index to reindex from. If this is set, `fileName` is not allowed.
122
124
  - `targetIndexName`: The target Elasticsearch index where documents will be indexed.
123
- - `typeName`: Elasticsearch document type name.
124
125
  - `mappings`: Elasticsearch document mapping.
125
126
  - `skipHeader`: If true, skips the first line of the source file. Defaults to `false`.
126
127
  - `transform(line)`: A callback function which allows the transformation of a source line into one or several documents.
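
Combining the connection options introduced in this release, here is a minimal sketch of a secured reindex run; the credentials and certificate path are placeholders:

```js
const transformer = require('node-es-transformer');

transformer({
  protocol: 'https',
  host: 'localhost',
  port: '9200',
  auth: { username: 'elastic', password: 'changeme' }, // placeholder credentials
  ca: '/path/to/http_ca.crt', // placeholder certificate path
  sourceIndexName: 'my-source-index',
  targetIndexName: 'my-target-index',
  bufferSize: 1000,
  // mappings omitted, so they are fetched from the source index
});
```

Judging from the bundled code in this release, `transformer()` is now async and resolves to `{ events }`, an event emitter which among other things emits a `finish` event once the ingest queue has drained.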
@@ -138,10 +139,10 @@ yarn
138
139
 
139
140
  `yarn build` builds the library to `dist`, generating two files:
140
141
 
141
- * `dist/node-es-transformer.cjs.js`
142
- A CommonJS bundle, suitable for use in Node.js, that `require`s the external dependency. This corresponds to the `"main"` field in package.json
143
- * `dist/node-es-transformer.esm.js`
144
- An ES module bundle, suitable for use in other people's libraries and applications, that `import`s the external dependency. This corresponds to the `"module"` field in package.json
142
+ - `dist/node-es-transformer.cjs.js`
143
+ A CommonJS bundle, suitable for use in Node.js, that `require`s the external dependency. This corresponds to the `"main"` field in package.json
144
+ - `dist/node-es-transformer.esm.js`
145
+ An ES module bundle, suitable for use in other people's libraries and applications, that `import`s the external dependency. This corresponds to the `"module"` field in package.json
145
146
 
146
147
  `yarn dev` builds the library, then keeps rebuilding it whenever the source files change using [rollup-watch](https://github.com/rollup/rollup-watch).
147
148
 
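
Both bundles expose the same default export; a minimal sketch of consuming each entry point:

```js
// CommonJS bundle, resolved via the "main" field:
const transformer = require('node-es-transformer');

// ES module bundle, resolved via the "module" field (e.g. by bundlers):
// import transformer from 'node-es-transformer';
```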
package/dist/node-es-transformer.cjs.js CHANGED
@@ -5,40 +5,47 @@ function _interopDefault (ex) { return (ex && (typeof ex === 'object') && 'defau
5
5
  var fs = _interopDefault(require('fs'));
6
6
  var es = _interopDefault(require('event-stream'));
7
7
  var glob = _interopDefault(require('glob'));
8
- var elasticsearch = _interopDefault(require('elasticsearch'));
8
+ var cliProgress = _interopDefault(require('cli-progress'));
9
+ var elasticsearch = _interopDefault(require('@elastic/elasticsearch'));
9
10
 
10
11
  function createMappingFactory(ref) {
11
- var client = ref.client;
12
+ var sourceClient = ref.sourceClient;
13
+ var sourceIndexName = ref.sourceIndexName;
14
+ var targetClient = ref.targetClient;
12
15
  var targetIndexName = ref.targetIndexName;
13
16
  var mappings = ref.mappings;
14
17
  var verbose = ref.verbose;
15
18
 
16
- return function () { return (new Promise(function (resolve, reject) {
17
- console.log('targetIndexName', targetIndexName);
18
- if (
19
- typeof mappings === 'object'
20
- && mappings !== null
21
- ) {
22
- client.indices.create({
23
- index: targetIndexName,
24
- body: { mappings: mappings },
25
- }, function (err, resp) {
26
- if (err) {
27
- console.log('Error creating mapping', err);
28
- reject();
29
- return;
30
- }
31
- if (verbose) { console.log('Created mapping', resp); }
32
- resolve();
33
- });
34
- } else {
35
- resolve();
19
+ return async function () {
20
+ var targetMappings = mappings;
21
+
22
+ if (sourceClient && sourceIndexName && typeof targetMappings === 'undefined') {
23
+ try {
24
+ var mapping = await sourceClient.indices.getMapping({ index: sourceIndexName });
25
+ targetMappings = mapping[sourceIndexName].mappings;
26
+ } catch (err) {
27
+ console.log('Error reading source mapping', err);
28
+ return;
29
+ }
36
30
  }
37
- })); };
31
+
32
+ if (typeof targetMappings === 'object' && targetMappings !== null) {
33
+ try {
34
+ var resp = await targetClient.indices.create(
35
+ {
36
+ index: targetIndexName,
37
+ body: { mappings: targetMappings },
38
+ }
39
+ );
40
+ if (verbose) { console.log('Created target mapping', resp); }
41
+ } catch (err) {
42
+ console.log('Error creating target mapping', err);
43
+ }
44
+ }
45
+ };
38
46
  }
39
47
 
40
48
  function fileReaderFactory(indexer, fileName, transform, splitRegex, verbose) {
41
- console.log('splitRegex', splitRegex);
42
49
  function startIndex(files) {
43
50
  var file = files.shift();
44
51
  var s = fs.createReadStream(file)
@@ -48,15 +55,11 @@ function fileReaderFactory(indexer, fileName, transform, splitRegex, verbose) {
48
55
  try {
49
56
  var doc = (typeof transform === 'function') ? transform(line) : line;
50
57
  // if doc is undefined we'll skip indexing it
51
- if (
52
- typeof doc === 'undefined'
53
- || (Array.isArray(doc) && doc.length === 0)
54
- ) {
58
+ if (typeof doc === 'undefined') {
55
59
  s.resume();
56
60
  return;
57
61
  }
58
62
 
59
- //console.log('continue?');
60
63
  // the transform callback may return an array of docs so we can emit
61
64
  // multiple docs from a single line
62
65
  if (Array.isArray(doc)) {
@@ -81,7 +84,6 @@ function fileReaderFactory(indexer, fileName, transform, splitRegex, verbose) {
81
84
  }));
82
85
 
83
86
  indexer.queueEmitter.on('resume', function () {
84
- //console.log('on resume');
85
87
  s.resume();
86
88
  });
87
89
  }
@@ -99,9 +101,8 @@ var queueEmitter = new EventEmitter();
99
101
 
100
102
  // a simple helper queue to bulk index documents
101
103
  function indexQueueFactory(ref) {
102
- var client = ref.client;
104
+ var client = ref.targetClient;
103
105
  var targetIndexName = ref.targetIndexName;
104
- var typeName = ref.typeName; if ( typeName === void 0 ) typeName = 'doc';
105
106
  var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = 1000;
106
107
  var skipHeader = ref.skipHeader; if ( skipHeader === void 0 ) skipHeader = false;
107
108
  var verbose = ref.verbose; if ( verbose === void 0 ) verbose = true;
@@ -110,36 +111,40 @@ function indexQueueFactory(ref) {
110
111
  var queue = [];
111
112
  var ingesting = false;
112
113
 
113
- var ingest = function (b) {
114
+ var ingest = async function (b) {
114
115
  if (typeof b !== 'undefined') {
115
116
  queue.push(b);
117
+ queueEmitter.emit('queue-size', queue.length);
116
118
  }
117
119
 
118
120
  if (ingesting === false) {
119
121
  var docs = queue.shift();
122
+ queueEmitter.emit('queue-size', queue.length);
120
123
  ingesting = true;
121
124
  if (verbose) { console.log(("bulk ingest docs: " + (docs.length / 2) + ", queue length: " + (queue.length))); }
122
125
 
123
- client.bulk({ body: docs }, function () {
126
+ try {
127
+ await client.bulk({ body: docs });
124
128
  ingesting = false;
125
129
  if (queue.length > 0) {
126
130
  ingest();
127
131
  }
128
- });
132
+ } catch (err) {
133
+ console.log('bulk index error', err);
134
+ }
129
135
  }
130
136
 
131
137
  // console.log(`ingest: queue.length ${queue.length}`);
132
138
  if (queue.length === 0) {
139
+ queueEmitter.emit('queue-size', 0);
133
140
  queueEmitter.emit('resume');
134
141
  }
135
-
136
- return [];
137
142
  };
138
143
 
139
144
  return {
140
145
  add: function (doc) {
141
146
  if (!skipHeader) {
142
- var header = { index: { _index: targetIndexName, _type: typeName } };
147
+ var header = { index: { _index: targetIndexName } };
143
148
  buffer.push(header);
144
149
  }
145
150
  buffer.push(doc);
@@ -150,16 +155,24 @@ function indexQueueFactory(ref) {
150
155
  }
151
156
 
152
157
  if (buffer.length >= (bufferSize * 2)) {
153
- buffer = ingest(buffer);
158
+ ingest(buffer);
159
+ buffer = [];
154
160
  }
155
161
  },
156
- finish: function () {
157
- buffer = ingest(buffer);
162
+ finish: async function () {
163
+ await ingest(buffer);
164
+ buffer = [];
165
+ queueEmitter.emit('finish');
158
166
  },
159
167
  queueEmitter: queueEmitter,
160
168
  };
161
169
  }
162
170
 
171
+ var MAX_QUEUE_SIZE = 5;
172
+
173
+ // create a new progress bar instance and use shades_classic theme
174
+ var progressBar = new cliProgress.SingleBar({}, cliProgress.Presets.shades_classic);
175
+
163
176
  function indexReaderFactory(indexer, sourceIndexName, transform, client) {
164
177
  return async function indexReader() {
165
178
  var responseQueue = [];
@@ -169,12 +182,13 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
169
182
  return client.search({
170
183
  index: sourceIndexName,
171
184
  scroll: '30s',
185
+ size: 10000,
172
186
  });
173
187
  }
174
188
 
175
189
  function scroll(id) {
176
190
  return client.scroll({
177
- scrollId: id,
191
+ scroll_id: id,
178
192
  scroll: '30s',
179
193
  });
180
194
  }
@@ -183,11 +197,12 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
183
197
  // our first response into the queue to be processed
184
198
  var se = await search();
185
199
  responseQueue.push(se);
200
+ progressBar.start(se.hits.total.value, 0);
186
201
 
187
202
  function processHit(hit) {
188
203
  docsNum += 1;
189
204
  try {
190
- var doc = (typeof transform === 'function') ? transform(hit._source) : hit._source; // eslint-disable-line no-underscore-dangle,max-len
205
+ var doc = (typeof transform === 'function') ? transform(hit._source) : hit._source; // eslint-disable-line no-underscore-dangle
191
206
  // if doc is undefined we'll skip indexing it
192
207
  if (typeof doc === 'undefined') {
193
208
  return;
@@ -206,36 +221,86 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
206
221
  }
207
222
  }
208
223
 
209
- while (responseQueue.length) {
210
- var response = responseQueue.shift();
224
+ var ingestQueueSize = 0;
225
+ var scrollId = se._scroll_id; // eslint-disable-line no-underscore-dangle
226
+ var readActive = false;
227
+
228
+ async function processResponseQueue() {
229
+ while (responseQueue.length) {
230
+ readActive = true;
231
+ var response = responseQueue.shift();
232
+
233
+ // collect the docs from this response
234
+ response.hits.hits.forEach(processHit);
235
+
236
+ progressBar.update(docsNum);
237
+
238
+ // check to see if we have collected all of the docs
239
+ // console.log('check count', response.hits.total.value, docsNum);
240
+ if (response.hits.total.value === docsNum) {
241
+ indexer.finish();
242
+ progressBar.stop();
243
+ break;
244
+ }
245
+
246
+ if (ingestQueueSize < MAX_QUEUE_SIZE) {
247
+ // get the next response if there are more docs to fetch
248
+ var sc = await scroll(response._scroll_id); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
249
+ scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
250
+ responseQueue.push(sc);
251
+ } else {
252
+ readActive = false;
253
+ }
254
+ }
255
+ }
256
+
257
+ indexer.queueEmitter.on('queue-size', async function (size) {
258
+ ingestQueueSize = size;
259
+
260
+ if (!readActive && ingestQueueSize < MAX_QUEUE_SIZE) {
261
+ // get the next response if there are more docs to fetch
262
+ var sc = await scroll(scrollId); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
263
+ scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
264
+ responseQueue.push(sc);
265
+ processResponseQueue();
266
+ }
267
+ });
211
268
 
212
- // collect the docs from this response
213
- response.hits.hits.forEach(processHit);
269
+ indexer.queueEmitter.on('resume', async function () {
270
+ ingestQueueSize = 0;
214
271
 
215
- // check to see if we have collected all of the docs
216
- if (response.hits.total === docsNum) {
217
- console.log('finished scrolling.');
218
- indexer.finish();
219
- break;
272
+ if (readActive) {
273
+ return;
220
274
  }
221
275
 
222
276
  // get the next response if there are more docs to fetch
223
- var sc = await scroll(response._scroll_id); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
277
+ var sc = await scroll(scrollId); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
278
+ scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
224
279
  responseQueue.push(sc);
225
- }
280
+ processResponseQueue();
281
+ });
282
+
283
+ processResponseQueue();
226
284
  };
227
285
  }
228
286
 
229
- function transformer(ref) {
287
+ async function transformer(ref) {
230
288
  var deleteIndex = ref.deleteIndex; if ( deleteIndex === void 0 ) deleteIndex = false;
289
+ var protocol = ref.protocol; if ( protocol === void 0 ) protocol = 'http';
231
290
  var host = ref.host; if ( host === void 0 ) host = 'localhost';
232
291
  var port = ref.port; if ( port === void 0 ) port = '9200';
292
+ var auth = ref.auth;
293
+ var rejectUnauthorized = ref.rejectUnauthorized; if ( rejectUnauthorized === void 0 ) rejectUnauthorized = true;
294
+ var ca = ref.ca;
295
+ var targetProtocol = ref.targetProtocol;
296
+ var targetHost = ref.targetHost;
297
+ var targetPort = ref.targetPort;
298
+ var targetAuth = ref.targetAuth;
233
299
  var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = 1000;
234
300
  var fileName = ref.fileName;
235
301
  var splitRegex = ref.splitRegex; if ( splitRegex === void 0 ) splitRegex = /\n/;
236
302
  var sourceIndexName = ref.sourceIndexName;
237
303
  var targetIndexName = ref.targetIndexName;
238
- var typeName = ref.typeName;
239
304
  var mappings = ref.mappings;
240
305
  var skipHeader = ref.skipHeader; if ( skipHeader === void 0 ) skipHeader = false;
241
306
  var transform = ref.transform;
@@ -245,30 +310,70 @@ function transformer(ref) {
245
310
  throw Error('targetIndexName must be specified.');
246
311
  }
247
312
 
248
- var client = new elasticsearch.Client({ host: (host + ":" + port) });
313
+ var sourceNode = protocol + "://" + host + ":" + port;
314
+ var sourceClient = new elasticsearch.Client({
315
+ node: sourceNode,
316
+ auth: auth,
317
+ tls: { ca: ca, rejectUnauthorized: rejectUnauthorized },
318
+ });
319
+
320
+ var targetNode = (typeof targetProtocol === 'string' ? targetProtocol : protocol) + "://" + (typeof targetHost === 'string' ? targetHost : host) + ":" + (typeof targetPort === 'string' ? targetPort : port);
321
+ var targetClient = new elasticsearch.Client({
322
+ node: targetNode,
323
+ auth: targetAuth !== undefined ? targetAuth : auth,
324
+ tls: { ca: ca, rejectUnauthorized: rejectUnauthorized },
325
+ });
249
326
 
250
327
  var createMapping = createMappingFactory({
251
- client: client, targetIndexName: targetIndexName, mappings: mappings, verbose: verbose,
328
+ sourceClient: sourceClient,
329
+ sourceIndexName: sourceIndexName,
330
+ targetClient: targetClient,
331
+ targetIndexName: targetIndexName,
332
+ mappings: mappings,
333
+ verbose: verbose,
252
334
  });
253
335
  var indexer = indexQueueFactory({
254
- client: client, targetIndexName: targetIndexName, typeName: typeName, bufferSize: bufferSize, skipHeader: skipHeader, verbose: verbose,
336
+ targetClient: targetClient,
337
+ targetIndexName: targetIndexName,
338
+ bufferSize: bufferSize,
339
+ skipHeader: skipHeader,
340
+ verbose: verbose,
255
341
  });
256
342
 
257
343
  function getReader() {
258
- if (typeof fileName !== 'undefined' && typeof sourceIndexName !== 'undefined') {
259
- throw Error('Only either one of fileName or sourceIndexName can be specified.');
344
+ if (
345
+ typeof fileName !== 'undefined'
346
+ && typeof sourceIndexName !== 'undefined'
347
+ ) {
348
+ throw Error(
349
+ 'Only either one of fileName or sourceIndexName can be specified.'
350
+ );
260
351
  }
261
352
 
262
- if (typeof fileName === 'undefined' && typeof sourceIndexName === 'undefined') {
353
+ if (
354
+ typeof fileName === 'undefined'
355
+ && typeof sourceIndexName === 'undefined'
356
+ ) {
263
357
  throw Error('Either fileName or sourceIndexName must be specified.');
264
358
  }
265
359
 
266
360
  if (typeof fileName !== 'undefined') {
267
- return fileReaderFactory(indexer, fileName, transform, splitRegex, verbose);
361
+ return fileReaderFactory(
362
+ indexer,
363
+ fileName,
364
+ transform,
365
+ splitRegex,
366
+ verbose
367
+ );
268
368
  }
269
369
 
270
370
  if (typeof sourceIndexName !== 'undefined') {
271
- return indexReaderFactory(indexer, sourceIndexName, transform, client);
371
+ return indexReaderFactory(
372
+ indexer,
373
+ sourceIndexName,
374
+ transform,
375
+ sourceClient
376
+ );
272
377
  }
273
378
 
274
379
  return null;
@@ -276,17 +381,26 @@ function transformer(ref) {
276
381
 
277
382
  var reader = getReader();
278
383
 
279
- client.indices.exists({ index: targetIndexName }, function (err, resp) {
280
- if (resp === false) {
281
- createMapping().then(reader);
384
+ try {
385
+ var indexExists = await targetClient.indices.exists({ index: targetIndexName });
386
+
387
+ if (indexExists === false) {
388
+ await createMapping();
389
+ reader();
282
390
  } else if (deleteIndex === true) {
283
- client.indices.delete({ index: targetIndexName }, function () {
284
- createMapping().then(reader);
285
- });
391
+ await targetClient.indices.delete({ index: targetIndexName });
392
+ await createMapping();
393
+ reader();
286
394
  } else {
287
395
  reader();
288
396
  }
289
- });
397
+ } catch (error) {
398
+ console.error('Error checking index existence:', error);
399
+ } finally {
400
+ // targetClient.close();
401
+ }
402
+
403
+ return { events: indexer.queueEmitter };
290
404
  }
291
405
 
292
406
  module.exports = transformer;
package/dist/node-es-transformer.esm.js CHANGED
@@ -1,40 +1,47 @@
1
1
  import fs from 'fs';
2
2
  import es from 'event-stream';
3
3
  import glob from 'glob';
4
- import elasticsearch from 'elasticsearch';
4
+ import cliProgress from 'cli-progress';
5
+ import elasticsearch from '@elastic/elasticsearch';
5
6
 
6
7
  function createMappingFactory(ref) {
7
- var client = ref.client;
8
+ var sourceClient = ref.sourceClient;
9
+ var sourceIndexName = ref.sourceIndexName;
10
+ var targetClient = ref.targetClient;
8
11
  var targetIndexName = ref.targetIndexName;
9
12
  var mappings = ref.mappings;
10
13
  var verbose = ref.verbose;
11
14
 
12
- return function () { return (new Promise(function (resolve, reject) {
13
- console.log('targetIndexName', targetIndexName);
14
- if (
15
- typeof mappings === 'object'
16
- && mappings !== null
17
- ) {
18
- client.indices.create({
19
- index: targetIndexName,
20
- body: { mappings: mappings },
21
- }, function (err, resp) {
22
- if (err) {
23
- console.log('Error creating mapping', err);
24
- reject();
25
- return;
26
- }
27
- if (verbose) { console.log('Created mapping', resp); }
28
- resolve();
29
- });
30
- } else {
31
- resolve();
15
+ return async function () {
16
+ var targetMappings = mappings;
17
+
18
+ if (sourceClient && sourceIndexName && typeof targetMappings === 'undefined') {
19
+ try {
20
+ var mapping = await sourceClient.indices.getMapping({ index: sourceIndexName });
21
+ targetMappings = mapping[sourceIndexName].mappings;
22
+ } catch (err) {
23
+ console.log('Error reading source mapping', err);
24
+ return;
25
+ }
32
26
  }
33
- })); };
27
+
28
+ if (typeof targetMappings === 'object' && targetMappings !== null) {
29
+ try {
30
+ var resp = await targetClient.indices.create(
31
+ {
32
+ index: targetIndexName,
33
+ body: { mappings: targetMappings },
34
+ }
35
+ );
36
+ if (verbose) { console.log('Created target mapping', resp); }
37
+ } catch (err) {
38
+ console.log('Error creating target mapping', err);
39
+ }
40
+ }
41
+ };
34
42
  }
35
43
 
36
44
  function fileReaderFactory(indexer, fileName, transform, splitRegex, verbose) {
37
- console.log('splitRegex', splitRegex);
38
45
  function startIndex(files) {
39
46
  var file = files.shift();
40
47
  var s = fs.createReadStream(file)
@@ -44,15 +51,11 @@ function fileReaderFactory(indexer, fileName, transform, splitRegex, verbose) {
44
51
  try {
45
52
  var doc = (typeof transform === 'function') ? transform(line) : line;
46
53
  // if doc is undefined we'll skip indexing it
47
- if (
48
- typeof doc === 'undefined'
49
- || (Array.isArray(doc) && doc.length === 0)
50
- ) {
54
+ if (typeof doc === 'undefined') {
51
55
  s.resume();
52
56
  return;
53
57
  }
54
58
 
55
- //console.log('continue?');
56
59
  // the transform callback may return an array of docs so we can emit
57
60
  // multiple docs from a single line
58
61
  if (Array.isArray(doc)) {
@@ -77,7 +80,6 @@ function fileReaderFactory(indexer, fileName, transform, splitRegex, verbose) {
77
80
  }));
78
81
 
79
82
  indexer.queueEmitter.on('resume', function () {
80
- //console.log('on resume');
81
83
  s.resume();
82
84
  });
83
85
  }
@@ -95,9 +97,8 @@ var queueEmitter = new EventEmitter();
95
97
 
96
98
  // a simple helper queue to bulk index documents
97
99
  function indexQueueFactory(ref) {
98
- var client = ref.client;
100
+ var client = ref.targetClient;
99
101
  var targetIndexName = ref.targetIndexName;
100
- var typeName = ref.typeName; if ( typeName === void 0 ) typeName = 'doc';
101
102
  var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = 1000;
102
103
  var skipHeader = ref.skipHeader; if ( skipHeader === void 0 ) skipHeader = false;
103
104
  var verbose = ref.verbose; if ( verbose === void 0 ) verbose = true;
@@ -106,36 +107,40 @@ function indexQueueFactory(ref) {
106
107
  var queue = [];
107
108
  var ingesting = false;
108
109
 
109
- var ingest = function (b) {
110
+ var ingest = async function (b) {
110
111
  if (typeof b !== 'undefined') {
111
112
  queue.push(b);
113
+ queueEmitter.emit('queue-size', queue.length);
112
114
  }
113
115
 
114
116
  if (ingesting === false) {
115
117
  var docs = queue.shift();
118
+ queueEmitter.emit('queue-size', queue.length);
116
119
  ingesting = true;
117
120
  if (verbose) { console.log(("bulk ingest docs: " + (docs.length / 2) + ", queue length: " + (queue.length))); }
118
121
 
119
- client.bulk({ body: docs }, function () {
122
+ try {
123
+ await client.bulk({ body: docs });
120
124
  ingesting = false;
121
125
  if (queue.length > 0) {
122
126
  ingest();
123
127
  }
124
- });
128
+ } catch (err) {
129
+ console.log('bulk index error', err);
130
+ }
125
131
  }
126
132
 
127
133
  // console.log(`ingest: queue.length ${queue.length}`);
128
134
  if (queue.length === 0) {
135
+ queueEmitter.emit('queue-size', 0);
129
136
  queueEmitter.emit('resume');
130
137
  }
131
-
132
- return [];
133
138
  };
134
139
 
135
140
  return {
136
141
  add: function (doc) {
137
142
  if (!skipHeader) {
138
- var header = { index: { _index: targetIndexName, _type: typeName } };
143
+ var header = { index: { _index: targetIndexName } };
139
144
  buffer.push(header);
140
145
  }
141
146
  buffer.push(doc);
@@ -146,16 +151,24 @@ function indexQueueFactory(ref) {
146
151
  }
147
152
 
148
153
  if (buffer.length >= (bufferSize * 2)) {
149
- buffer = ingest(buffer);
154
+ ingest(buffer);
155
+ buffer = [];
150
156
  }
151
157
  },
152
- finish: function () {
153
- buffer = ingest(buffer);
158
+ finish: async function () {
159
+ await ingest(buffer);
160
+ buffer = [];
161
+ queueEmitter.emit('finish');
154
162
  },
155
163
  queueEmitter: queueEmitter,
156
164
  };
157
165
  }
158
166
 
167
+ var MAX_QUEUE_SIZE = 5;
168
+
169
+ // create a new progress bar instance and use shades_classic theme
170
+ var progressBar = new cliProgress.SingleBar({}, cliProgress.Presets.shades_classic);
171
+
159
172
  function indexReaderFactory(indexer, sourceIndexName, transform, client) {
160
173
  return async function indexReader() {
161
174
  var responseQueue = [];
@@ -165,12 +178,13 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
165
178
  return client.search({
166
179
  index: sourceIndexName,
167
180
  scroll: '30s',
181
+ size: 10000,
168
182
  });
169
183
  }
170
184
 
171
185
  function scroll(id) {
172
186
  return client.scroll({
173
- scrollId: id,
187
+ scroll_id: id,
174
188
  scroll: '30s',
175
189
  });
176
190
  }
@@ -179,11 +193,12 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
179
193
  // our first response into the queue to be processed
180
194
  var se = await search();
181
195
  responseQueue.push(se);
196
+ progressBar.start(se.hits.total.value, 0);
182
197
 
183
198
  function processHit(hit) {
184
199
  docsNum += 1;
185
200
  try {
186
- var doc = (typeof transform === 'function') ? transform(hit._source) : hit._source; // eslint-disable-line no-underscore-dangle,max-len
201
+ var doc = (typeof transform === 'function') ? transform(hit._source) : hit._source; // eslint-disable-line no-underscore-dangle
187
202
  // if doc is undefined we'll skip indexing it
188
203
  if (typeof doc === 'undefined') {
189
204
  return;
@@ -202,36 +217,86 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
202
217
  }
203
218
  }
204
219
 
205
- while (responseQueue.length) {
206
- var response = responseQueue.shift();
220
+ var ingestQueueSize = 0;
221
+ var scrollId = se._scroll_id; // eslint-disable-line no-underscore-dangle
222
+ var readActive = false;
223
+
224
+ async function processResponseQueue() {
225
+ while (responseQueue.length) {
226
+ readActive = true;
227
+ var response = responseQueue.shift();
228
+
229
+ // collect the docs from this response
230
+ response.hits.hits.forEach(processHit);
231
+
232
+ progressBar.update(docsNum);
233
+
234
+ // check to see if we have collected all of the docs
235
+ // console.log('check count', response.hits.total.value, docsNum);
236
+ if (response.hits.total.value === docsNum) {
237
+ indexer.finish();
238
+ progressBar.stop();
239
+ break;
240
+ }
241
+
242
+ if (ingestQueueSize < MAX_QUEUE_SIZE) {
243
+ // get the next response if there are more docs to fetch
244
+ var sc = await scroll(response._scroll_id); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
245
+ scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
246
+ responseQueue.push(sc);
247
+ } else {
248
+ readActive = false;
249
+ }
250
+ }
251
+ }
252
+
253
+ indexer.queueEmitter.on('queue-size', async function (size) {
254
+ ingestQueueSize = size;
255
+
256
+ if (!readActive && ingestQueueSize < MAX_QUEUE_SIZE) {
257
+ // get the next response if there are more docs to fetch
258
+ var sc = await scroll(scrollId); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
259
+ scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
260
+ responseQueue.push(sc);
261
+ processResponseQueue();
262
+ }
263
+ });
207
264
 
208
- // collect the docs from this response
209
- response.hits.hits.forEach(processHit);
265
+ indexer.queueEmitter.on('resume', async function () {
266
+ ingestQueueSize = 0;
210
267
 
211
- // check to see if we have collected all of the docs
212
- if (response.hits.total === docsNum) {
213
- console.log('finished scrolling.');
214
- indexer.finish();
215
- break;
268
+ if (readActive) {
269
+ return;
216
270
  }
217
271
 
218
272
  // get the next response if there are more docs to fetch
219
- var sc = await scroll(response._scroll_id); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
273
+ var sc = await scroll(scrollId); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
274
+ scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
220
275
  responseQueue.push(sc);
221
- }
276
+ processResponseQueue();
277
+ });
278
+
279
+ processResponseQueue();
222
280
  };
223
281
  }
224
282
 
225
- function transformer(ref) {
283
+ async function transformer(ref) {
226
284
  var deleteIndex = ref.deleteIndex; if ( deleteIndex === void 0 ) deleteIndex = false;
285
+ var protocol = ref.protocol; if ( protocol === void 0 ) protocol = 'http';
227
286
  var host = ref.host; if ( host === void 0 ) host = 'localhost';
228
287
  var port = ref.port; if ( port === void 0 ) port = '9200';
288
+ var auth = ref.auth;
289
+ var rejectUnauthorized = ref.rejectUnauthorized; if ( rejectUnauthorized === void 0 ) rejectUnauthorized = true;
290
+ var ca = ref.ca;
291
+ var targetProtocol = ref.targetProtocol;
292
+ var targetHost = ref.targetHost;
293
+ var targetPort = ref.targetPort;
294
+ var targetAuth = ref.targetAuth;
229
295
  var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = 1000;
230
296
  var fileName = ref.fileName;
231
297
  var splitRegex = ref.splitRegex; if ( splitRegex === void 0 ) splitRegex = /\n/;
232
298
  var sourceIndexName = ref.sourceIndexName;
233
299
  var targetIndexName = ref.targetIndexName;
234
- var typeName = ref.typeName;
235
300
  var mappings = ref.mappings;
236
301
  var skipHeader = ref.skipHeader; if ( skipHeader === void 0 ) skipHeader = false;
237
302
  var transform = ref.transform;
@@ -241,30 +306,70 @@ function transformer(ref) {
241
306
  throw Error('targetIndexName must be specified.');
242
307
  }
243
308
 
244
- var client = new elasticsearch.Client({ host: (host + ":" + port) });
309
+ var sourceNode = protocol + "://" + host + ":" + port;
310
+ var sourceClient = new elasticsearch.Client({
311
+ node: sourceNode,
312
+ auth: auth,
313
+ tls: { ca: ca, rejectUnauthorized: rejectUnauthorized },
314
+ });
315
+
316
+ var targetNode = (typeof targetProtocol === 'string' ? targetProtocol : protocol) + "://" + (typeof targetHost === 'string' ? targetHost : host) + ":" + (typeof targetPort === 'string' ? targetPort : port);
317
+ var targetClient = new elasticsearch.Client({
318
+ node: targetNode,
319
+ auth: targetAuth !== undefined ? targetAuth : auth,
320
+ tls: { ca: ca, rejectUnauthorized: rejectUnauthorized },
321
+ });
245
322
 
246
323
  var createMapping = createMappingFactory({
247
- client: client, targetIndexName: targetIndexName, mappings: mappings, verbose: verbose,
324
+ sourceClient: sourceClient,
325
+ sourceIndexName: sourceIndexName,
326
+ targetClient: targetClient,
327
+ targetIndexName: targetIndexName,
328
+ mappings: mappings,
329
+ verbose: verbose,
248
330
  });
249
331
  var indexer = indexQueueFactory({
250
- client: client, targetIndexName: targetIndexName, typeName: typeName, bufferSize: bufferSize, skipHeader: skipHeader, verbose: verbose,
332
+ targetClient: targetClient,
333
+ targetIndexName: targetIndexName,
334
+ bufferSize: bufferSize,
335
+ skipHeader: skipHeader,
336
+ verbose: verbose,
251
337
  });
252
338
 
253
339
  function getReader() {
254
- if (typeof fileName !== 'undefined' && typeof sourceIndexName !== 'undefined') {
255
- throw Error('Only either one of fileName or sourceIndexName can be specified.');
340
+ if (
341
+ typeof fileName !== 'undefined'
342
+ && typeof sourceIndexName !== 'undefined'
343
+ ) {
344
+ throw Error(
345
+ 'Only either one of fileName or sourceIndexName can be specified.'
346
+ );
256
347
  }
257
348
 
258
- if (typeof fileName === 'undefined' && typeof sourceIndexName === 'undefined') {
349
+ if (
350
+ typeof fileName === 'undefined'
351
+ && typeof sourceIndexName === 'undefined'
352
+ ) {
259
353
  throw Error('Either fileName or sourceIndexName must be specified.');
260
354
  }
261
355
 
262
356
  if (typeof fileName !== 'undefined') {
263
- return fileReaderFactory(indexer, fileName, transform, splitRegex, verbose);
357
+ return fileReaderFactory(
358
+ indexer,
359
+ fileName,
360
+ transform,
361
+ splitRegex,
362
+ verbose
363
+ );
264
364
  }
265
365
 
266
366
  if (typeof sourceIndexName !== 'undefined') {
267
- return indexReaderFactory(indexer, sourceIndexName, transform, client);
367
+ return indexReaderFactory(
368
+ indexer,
369
+ sourceIndexName,
370
+ transform,
371
+ sourceClient
372
+ );
268
373
  }
269
374
 
270
375
  return null;
@@ -272,17 +377,26 @@ function transformer(ref) {
272
377
 
273
378
  var reader = getReader();
274
379
 
275
- client.indices.exists({ index: targetIndexName }, function (err, resp) {
276
- if (resp === false) {
277
- createMapping().then(reader);
380
+ try {
381
+ var indexExists = await targetClient.indices.exists({ index: targetIndexName });
382
+
383
+ if (indexExists === false) {
384
+ await createMapping();
385
+ reader();
278
386
  } else if (deleteIndex === true) {
279
- client.indices.delete({ index: targetIndexName }, function () {
280
- createMapping().then(reader);
281
- });
387
+ await targetClient.indices.delete({ index: targetIndexName });
388
+ await createMapping();
389
+ reader();
282
390
  } else {
283
391
  reader();
284
392
  }
285
- });
393
+ } catch (error) {
394
+ console.error('Error checking index existence:', error);
395
+ } finally {
396
+ // targetClient.close();
397
+ }
398
+
399
+ return { events: indexer.queueEmitter };
286
400
  }
287
401
 
288
402
  export default transformer;
package/package.json CHANGED
@@ -14,23 +14,24 @@
14
14
  "license": "Apache-2.0",
15
15
  "author": "Walter Rafelsberger <walter@rafelsberger.at>",
16
16
  "contributors": [],
17
- "version": "1.0.0-alpha7",
17
+ "version": "1.0.0-alpha8",
18
18
  "main": "dist/node-es-transformer.cjs.js",
19
19
  "module": "dist/node-es-transformer.esm.js",
20
20
  "dependencies": {
21
- "elasticsearch": "15.0.0",
21
+ "@elastic/elasticsearch": "^8.8.1",
22
+ "cli-progress": "^3.12.0",
22
23
  "event-stream": "3.3.4",
23
24
  "glob": "7.1.2"
24
25
  },
25
26
  "devDependencies": {
26
- "acorn": "6.0.0",
27
- "eslint": "4.19.1",
28
- "eslint-config-airbnb": "17.1.0",
29
- "eslint-plugin-jsx-a11y": "6.1.1",
30
- "eslint-plugin-react": "7.11.0",
31
- "eslint-plugin-import": "2.12.0",
27
+ "acorn": "^6.4.2",
28
+ "eslint": "8.2.0",
29
+ "eslint-config-airbnb": "19.0.4",
30
+ "eslint-plugin-import": "2.27.5",
31
+ "eslint-plugin-jsx-a11y": "6.7.1",
32
+ "eslint-plugin-react": "7.32.2",
32
33
  "rollup": "0.66.6",
33
- "rollup-plugin-buble": "0.19.4",
34
+ "rollup-plugin-buble": "0.19.6",
34
35
  "rollup-plugin-commonjs": "8.0.2",
35
36
  "rollup-plugin-node-resolve": "3.0.0"
36
37
  },