node-es-transformer 1.0.0-alpha6 → 1.0.0-alpha8

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -12,7 +12,6 @@ If you're looking for a nodejs based tool which allows you to ingest large CSV/J
 
 While I'd generally recommend using [Logstash](https://www.elastic.co/products/logstash), [filebeat](https://www.elastic.co/products/beats/filebeat) or [Ingest Nodes](https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest.html) for established use cases, this tool may be of help especially if you feel more at home in the JavaScript/nodejs universe and have use cases with customized ingestion and data transformation needs.
 
-
 **This is experimental code, use at your own risk. Nonetheless, I encourage you to give it a try so I can gather some feedback.**
 
 ### So why is this still _alpha_?
@@ -21,13 +20,13 @@ While I'd generally recommend using [Logstash](https://www.elastic.co/products/l
 - The code needs some more safety measures to avoid some possible accidental data loss scenarios.
 - No test coverage yet.
 
-----
+---
 
 Now that we've talked about the caveats, let's have a look what you actually get with this tool:
 
 ## Features
 
-- Buffering/Streaming for both reading and indexing. Files are read using streaming and Elasticsearch ingestion is done using buffered bulk indexing. This is tailored towards ingestion of large files. Successfully tested so far with JSON and CSV files in the range of 20-30 GBytes. On a single machine running both `node-es-transformer` and Elasticsearch ingestion rates up to 20k documents/second were achieved (2,9 GHz Intel Core i7, 16GByte RAM, SSD).
+- Buffering/Streaming for both reading and indexing. Files are read using streaming and Elasticsearch ingestion is done using buffered bulk indexing. This is tailored towards ingestion of large files. Successfully tested so far with JSON and CSV files in the range of 20-30 GBytes. On a single machine running both `node-es-transformer` and Elasticsearch ingestion rates up to 20k documents/second were achieved (2,9 GHz Intel Core i7, 16GByte RAM, SSD).
 - Supports wildcards to ingest/transform a range of files in one go.
 - Supports fetching documents from existing indices using search/scroll. This allows you to reindex with custom data transformations just using JavaScript in the `transform` callback.
 - The `transform` callback gives you each source document, but you can split it up in multiple ones and return an array of documents. An example use case for this: Each source document is a Tweet and you want to transform that into an entity centric index based on Hashtags.
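The entity-centric use case mentioned in the last feature bullet can be sketched as a `transform` callback returning an array; the field names here are hypothetical, invented for illustration:

```javascript
// Hypothetical transform: split one tweet document into one document
// per hashtag, producing an entity-centric index.
function hashtagTransform(doc) {
  return (doc.hashtags || []).map(function (tag) {
    return { hashtag: tag, '@timestamp': doc['@timestamp'] };
  });
}

var docs = hashtagTransform({ '@timestamp': '2020-01-01', hashtags: ['a', 'b'] });
// docs.length → 2
```

Returning an empty array (or `undefined`) from the callback skips the source document entirely.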
@@ -46,9 +45,8 @@ const transformer = require('node-es-transformer');
 transformer({
   fileName: 'filename.json',
   targetIndexName: 'my-index',
-  typeName: 'doc',
   mappings: {
-    doc: {
+    _doc: {
       properties: {
         '@timestamp': {
           type: 'date'
@@ -82,9 +80,9 @@ const transformer = require('node-es-transformer');
 transformer({
   sourceIndexName: 'my-source-index',
   targetIndexName: 'my-target-index',
-  typeName: 'doc',
+  // optional, if you skip mappings, they will be fetched from the source index.
   mappings: {
-    doc: {
+    _doc: {
       properties: {
         '@timestamp': {
           type: 'date'
@@ -112,15 +110,18 @@ transformer({
 
 ### Options
 
-- `deleteIndex`: Setting to automatically delete an existing index, default is `false`.
-- `host`: Elasticsearch host, defaults to `localhost`.
-- `port`: Elasticsearch port, defaults to `9200`.
+- `deleteIndex`: Setting to automatically delete an existing index, default is `false`.
+- `protocol`/`targetProtocol`: Elasticsearch protocol, defaults to `http`.
+- `host`/`targetHost`: Elasticsearch host, defaults to `localhost`.
+- `port`/`targetPort`: Elasticsearch port, defaults to `9200`.
+- `auth`/`targetAuth`: Optional Elasticsearch authorization object, for example `{ username: 'elastic', password: 'changeme' }`.
+- `rejectUnauthorized`: Elasticsearch TLS option, defaults to `true`.
+- `ca`: Optional path to the certificate used for TLS configuration.
 - `bufferSize`: The amount of documents inserted with each Elasticsearch bulk insert request, default is `1000`.
 - `fileName`: Source filename to ingest, supports wildcards. If this is set, `sourceIndexName` is not allowed.
 - `splitRegex`: Custom line split regex, defaults to `/\n/`.
 - `sourceIndexName`: The source Elasticsearch index to reindex from. If this is set, `fileName` is not allowed.
 - `targetIndexName`: The target Elasticsearch index where documents will be indexed.
-- `typeName`: Elasticsearch document type name.
 - `mappings`: Elasticsearch document mapping.
 - `skipHeader`: If true, skips the first line of the source file. Defaults to `false`.
 - `transform(line)`: A callback function which allows the transformation of a source line into one or several documents.
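The fallback behaviour of the new `target*` connection options can be sketched as follows; `resolveNodes` is a hypothetical helper mirroring the logic in the bundle, not part of the package API:

```javascript
// Hypothetical helper: targetProtocol/targetHost/targetPort fall back to
// the source-side protocol/host/port when unset, as in the bundle code.
function resolveNodes(options) {
  var protocol = options.protocol || 'http';
  var host = options.host || 'localhost';
  var port = options.port || '9200';
  var target = (typeof options.targetProtocol === 'string' ? options.targetProtocol : protocol)
    + '://' + (typeof options.targetHost === 'string' ? options.targetHost : host)
    + ':' + (typeof options.targetPort === 'string' ? options.targetPort : port);
  return { source: protocol + '://' + host + ':' + port, target: target };
}

var nodes = resolveNodes({ host: 'source.example', targetHost: 'target.example' });
// nodes.source → 'http://source.example:9200'
// nodes.target → 'http://target.example:9200'
```

This means a plain single-cluster setup needs no `target*` options at all.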
@@ -138,10 +139,10 @@ yarn
 
 `yarn build` builds the library to `dist`, generating two files:
 
-* `dist/node-es-transformer.cjs.js`
-  A CommonJS bundle, suitable for use in Node.js, that `require`s the external dependency. This corresponds to the `"main"` field in package.json
-* `dist/node-es-transformer.esm.js`
-  an ES module bundle, suitable for use in other people's libraries and applications, that `import`s the external dependency. This corresponds to the `"module"` field in package.json
+- `dist/node-es-transformer.cjs.js`
+  A CommonJS bundle, suitable for use in Node.js, that `require`s the external dependency. This corresponds to the `"main"` field in package.json
+- `dist/node-es-transformer.esm.js`
+  an ES module bundle, suitable for use in other people's libraries and applications, that `import`s the external dependency. This corresponds to the `"module"` field in package.json
 
 `yarn dev` builds the library, then keeps rebuilding it whenever the source files change using [rollup-watch](https://github.com/rollup/rollup-watch).
 
@@ -5,48 +5,55 @@ function _interopDefault (ex) { return (ex && (typeof ex === 'object') && 'defau
 var fs = _interopDefault(require('fs'));
 var es = _interopDefault(require('event-stream'));
 var glob = _interopDefault(require('glob'));
-var elasticsearch = _interopDefault(require('elasticsearch'));
+var cliProgress = _interopDefault(require('cli-progress'));
+var elasticsearch = _interopDefault(require('@elastic/elasticsearch'));
 
 function createMappingFactory(ref) {
-  var client = ref.client;
+  var sourceClient = ref.sourceClient;
+  var sourceIndexName = ref.sourceIndexName;
+  var targetClient = ref.targetClient;
   var targetIndexName = ref.targetIndexName;
   var mappings = ref.mappings;
   var verbose = ref.verbose;
 
-  return function () { return (new Promise(function (resolve, reject) {
-    console.log('targetIndexName', targetIndexName);
-    if (
-      typeof mappings === 'object'
-      && mappings !== null
-    ) {
-      client.indices.create({
-        index: targetIndexName,
-        body: { mappings: mappings },
-      }, function (err, resp) {
-        if (err) {
-          console.log('Error creating mapping', err);
-          reject();
-          return;
-        }
-        if (verbose) { console.log('Created mapping', resp); }
-        resolve();
-      });
-    } else {
-      resolve();
+  return async function () {
+    var targetMappings = mappings;
+
+    if (sourceClient && sourceIndexName && typeof targetMappings === 'undefined') {
+      try {
+        var mapping = await sourceClient.indices.getMapping({ index: sourceIndexName });
+        targetMappings = mapping[sourceIndexName].mappings;
+      } catch (err) {
+        console.log('Error reading source mapping', err);
+        return;
+      }
     }
-  })); };
+
+    if (typeof targetMappings === 'object' && targetMappings !== null) {
+      try {
+        var resp = await targetClient.indices.create(
+          {
+            index: targetIndexName,
+            body: { mappings: targetMappings },
+          }
+        );
+        if (verbose) { console.log('Created target mapping', resp); }
+      } catch (err) {
+        console.log('Error creating target mapping', err);
+      }
+    }
+  };
 }
 
 function fileReaderFactory(indexer, fileName, transform, splitRegex, verbose) {
   function startIndex(files) {
     var file = files.shift();
     var s = fs.createReadStream(file)
-      .pipe(es.split(/\n/))
+      .pipe(es.split(splitRegex))
       .pipe(es.mapSync(function (line) {
         s.pause();
         try {
           var doc = (typeof transform === 'function') ? transform(line) : line;
-          console.log('doc', doc);
           // if doc is undefined we'll skip indexing it
           if (typeof doc === 'undefined') {
             s.resume();
@@ -94,9 +101,8 @@ var queueEmitter = new EventEmitter();
 
 // a simple helper queue to bulk index documents
 function indexQueueFactory(ref) {
-  var client = ref.client;
+  var client = ref.targetClient;
   var targetIndexName = ref.targetIndexName;
-  var typeName = ref.typeName; if ( typeName === void 0 ) typeName = 'doc';
   var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = 1000;
   var skipHeader = ref.skipHeader; if ( skipHeader === void 0 ) skipHeader = false;
   var verbose = ref.verbose; if ( verbose === void 0 ) verbose = true;
@@ -105,36 +111,40 @@ function indexQueueFactory(ref) {
   var queue = [];
   var ingesting = false;
 
-  var ingest = function (b) {
+  var ingest = async function (b) {
     if (typeof b !== 'undefined') {
       queue.push(b);
+      queueEmitter.emit('queue-size', queue.length);
     }
 
     if (ingesting === false) {
       var docs = queue.shift();
+      queueEmitter.emit('queue-size', queue.length);
       ingesting = true;
       if (verbose) { console.log(("bulk ingest docs: " + (docs.length / 2) + ", queue length: " + (queue.length))); }
 
-      client.bulk({ body: docs }, function () {
+      try {
+        await client.bulk({ body: docs });
         ingesting = false;
         if (queue.length > 0) {
           ingest();
         }
-      });
+      } catch (err) {
+        console.log('bulk index error', err);
+      }
     }
 
     // console.log(`ingest: queue.length ${queue.length}`);
     if (queue.length === 0) {
+      queueEmitter.emit('queue-size', 0);
      queueEmitter.emit('resume');
     }
-
-    return [];
   };
 
   return {
     add: function (doc) {
       if (!skipHeader) {
-        var header = { index: { _index: targetIndexName, _type: typeName } };
+        var header = { index: { _index: targetIndexName } };
         buffer.push(header);
       }
       buffer.push(doc);
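With `_type` gone, each bulk action header carries only the index name. The header/document pairing the queue builds can be sketched in isolation (`buildBulkBody` is a hypothetical helper, not part of the bundle):

```javascript
// Sketch: the bulk body alternates action headers and documents, which
// is why the buffer flushes once it holds bufferSize * 2 entries.
function buildBulkBody(docs, targetIndexName) {
  var body = [];
  docs.forEach(function (doc) {
    body.push({ index: { _index: targetIndexName } });
    body.push(doc);
  });
  return body;
}

var body = buildBulkBody([{ a: 1 }, { a: 2 }], 'my-index');
// body.length → 4 (two header/document pairs)
```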
@@ -145,16 +155,24 @@ function indexQueueFactory(ref) {
     }
 
     if (buffer.length >= (bufferSize * 2)) {
-      buffer = ingest(buffer);
+      ingest(buffer);
+      buffer = [];
     }
   },
-  finish: function () {
-    buffer = ingest(buffer);
+  finish: async function () {
+    await ingest(buffer);
+    buffer = [];
+    queueEmitter.emit('finish');
   },
   queueEmitter: queueEmitter,
  };
}
 
+var MAX_QUEUE_SIZE = 5;
+
+// create a new progress bar instance and use shades_classic theme
+var progressBar = new cliProgress.SingleBar({}, cliProgress.Presets.shades_classic);
+
 function indexReaderFactory(indexer, sourceIndexName, transform, client) {
   return async function indexReader() {
     var responseQueue = [];
@@ -164,12 +182,13 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
     return client.search({
       index: sourceIndexName,
       scroll: '30s',
+      size: 10000,
     });
   }
 
   function scroll(id) {
     return client.scroll({
-      scrollId: id,
+      scroll_id: id,
       scroll: '30s',
     });
   }
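The search/scroll pagination these two helpers drive can be sketched against a stubbed client; the stub and its two pages of data are invented for illustration:

```javascript
// Minimal search/scroll loop: fetch the first page, then keep scrolling
// with the returned scroll id until all hits are collected.
async function scrollAll(client) {
  var hits = [];
  var resp = await client.search({ scroll: '30s', size: 10000 });
  while (true) {
    hits = hits.concat(resp.hits.hits);
    if (hits.length >= resp.hits.total.value) { break; }
    resp = await client.scroll({ scroll_id: resp._scroll_id, scroll: '30s' });
  }
  return hits;
}

// Stub client returning two pages of three total hits.
var pages = [
  { hits: { total: { value: 3 }, hits: [{ _source: { a: 1 } }, { _source: { a: 2 } }] }, _scroll_id: 's1' },
  { hits: { total: { value: 3 }, hits: [{ _source: { a: 3 } }] }, _scroll_id: 's2' },
];
var stub = {
  search: function () { return Promise.resolve(pages[0]); },
  scroll: function () { return Promise.resolve(pages[1]); },
};

scrollAll(stub).then(function (hits) {
  console.log(hits.length); // 3
});
```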
@@ -178,11 +197,12 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
   // our first response into the queue to be processed
   var se = await search();
   responseQueue.push(se);
+  progressBar.start(se.hits.total.value, 0);
 
   function processHit(hit) {
     docsNum += 1;
     try {
-      var doc = (typeof transform === 'function') ? transform(hit._source) : hit._source; // eslint-disable-line no-underscore-dangle,max-len
+      var doc = (typeof transform === 'function') ? transform(hit._source) : hit._source; // eslint-disable-line no-underscore-dangle
       // if doc is undefined we'll skip indexing it
       if (typeof doc === 'undefined') {
         return;
@@ -201,36 +221,86 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
       }
     }
 
-    while (responseQueue.length) {
-      var response = responseQueue.shift();
+    var ingestQueueSize = 0;
+    var scrollId = se._scroll_id; // eslint-disable-line no-underscore-dangle
+    var readActive = false;
+
+    async function processResponseQueue() {
+      while (responseQueue.length) {
+        readActive = true;
+        var response = responseQueue.shift();
+
+        // collect the docs from this response
+        response.hits.hits.forEach(processHit);
+
+        progressBar.update(docsNum);
+
+        // check to see if we have collected all of the docs
+        // console.log('check count', response.hits.total.value, docsNum);
+        if (response.hits.total.value === docsNum) {
+          indexer.finish();
+          progressBar.stop();
+          break;
+        }
+
+        if (ingestQueueSize < MAX_QUEUE_SIZE) {
+          // get the next response if there are more docs to fetch
+          var sc = await scroll(response._scroll_id); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
+          scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
+          responseQueue.push(sc);
+        } else {
+          readActive = false;
+        }
+      }
+    }
+
+    indexer.queueEmitter.on('queue-size', async function (size) {
+      ingestQueueSize = size;
+
+      if (!readActive && ingestQueueSize < MAX_QUEUE_SIZE) {
+        // get the next response if there are more docs to fetch
+        var sc = await scroll(scrollId); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
+        scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
+        responseQueue.push(sc);
+        processResponseQueue();
+      }
+    });
 
-      // collect the docs from this response
-      response.hits.hits.forEach(processHit);
+    indexer.queueEmitter.on('resume', async function () {
+      ingestQueueSize = 0;
 
-      // check to see if we have collected all of the docs
-      if (response.hits.total === docsNum) {
-        console.log('finished scrolling.');
-        indexer.finish();
-        break;
+      if (readActive) {
+        return;
       }
 
       // get the next response if there are more docs to fetch
-      var sc = await scroll(response._scroll_id); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
+      var sc = await scroll(scrollId); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
+      scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
       responseQueue.push(sc);
-    }
+      processResponseQueue();
+    });
+
+    processResponseQueue();
   };
 }
 
-function transformer(ref) {
+async function transformer(ref) {
   var deleteIndex = ref.deleteIndex; if ( deleteIndex === void 0 ) deleteIndex = false;
+  var protocol = ref.protocol; if ( protocol === void 0 ) protocol = 'http';
   var host = ref.host; if ( host === void 0 ) host = 'localhost';
   var port = ref.port; if ( port === void 0 ) port = '9200';
+  var auth = ref.auth;
+  var rejectUnauthorized = ref.rejectUnauthorized; if ( rejectUnauthorized === void 0 ) rejectUnauthorized = true;
+  var ca = ref.ca;
+  var targetProtocol = ref.targetProtocol;
+  var targetHost = ref.targetHost;
+  var targetPort = ref.targetPort;
+  var targetAuth = ref.targetAuth;
   var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = 1000;
   var fileName = ref.fileName;
   var splitRegex = ref.splitRegex; if ( splitRegex === void 0 ) splitRegex = /\n/;
   var sourceIndexName = ref.sourceIndexName;
   var targetIndexName = ref.targetIndexName;
-  var typeName = ref.typeName;
   var mappings = ref.mappings;
   var skipHeader = ref.skipHeader; if ( skipHeader === void 0 ) skipHeader = false;
   var transform = ref.transform;
@@ -240,30 +310,70 @@ function transformer(ref) {
     throw Error('targetIndexName must be specified.');
   }
 
-  var client = new elasticsearch.Client({ host: (host + ":" + port) });
+  var sourceNode = protocol + "://" + host + ":" + port;
+  var sourceClient = new elasticsearch.Client({
+    node: sourceNode,
+    auth: auth,
+    tls: { ca: ca, rejectUnauthorized: rejectUnauthorized },
+  });
+
+  var targetNode = (typeof targetProtocol === 'string' ? targetProtocol : protocol) + "://" + (typeof targetHost === 'string' ? targetHost : host) + ":" + (typeof targetPort === 'string' ? targetPort : port);
+  var targetClient = new elasticsearch.Client({
+    node: targetNode,
+    auth: targetAuth !== undefined ? targetAuth : auth,
+    tls: { ca: ca, rejectUnauthorized: rejectUnauthorized },
+  });
 
   var createMapping = createMappingFactory({
-    client: client, targetIndexName: targetIndexName, mappings: mappings, verbose: verbose,
+    sourceClient: sourceClient,
+    sourceIndexName: sourceIndexName,
+    targetClient: targetClient,
+    targetIndexName: targetIndexName,
+    mappings: mappings,
+    verbose: verbose,
   });
   var indexer = indexQueueFactory({
-    client: client, targetIndexName: targetIndexName, typeName: typeName, bufferSize: bufferSize, skipHeader: skipHeader, verbose: verbose,
+    targetClient: targetClient,
+    targetIndexName: targetIndexName,
+    bufferSize: bufferSize,
+    skipHeader: skipHeader,
+    verbose: verbose,
   });
 
   function getReader() {
-    if (typeof fileName !== 'undefined' && typeof sourceIndexName !== 'undefined') {
-      throw Error('Only either one of fileName or sourceIndexName can be specified.');
+    if (
+      typeof fileName !== 'undefined'
+      && typeof sourceIndexName !== 'undefined'
+    ) {
+      throw Error(
+        'Only either one of fileName or sourceIndexName can be specified.'
+      );
     }
 
-    if (typeof fileName === 'undefined' && typeof sourceIndexName === 'undefined') {
+    if (
+      typeof fileName === 'undefined'
+      && typeof sourceIndexName === 'undefined'
+    ) {
      throw Error('Either fileName or sourceIndexName must be specified.');
     }
 
     if (typeof fileName !== 'undefined') {
-      return fileReaderFactory(indexer, fileName, transform, splitRegex, verbose);
+      return fileReaderFactory(
+        indexer,
+        fileName,
+        transform,
+        splitRegex,
+        verbose
+      );
     }
 
     if (typeof sourceIndexName !== 'undefined') {
-      return indexReaderFactory(indexer, sourceIndexName, transform, client);
+      return indexReaderFactory(
+        indexer,
+        sourceIndexName,
+        transform,
+        sourceClient
+      );
     }
 
     return null;
@@ -271,17 +381,26 @@ function transformer(ref) {
 
   var reader = getReader();
 
-  client.indices.exists({ index: targetIndexName }, function (err, resp) {
-    if (resp === false) {
-      createMapping().then(reader);
+  try {
+    var indexExists = await targetClient.indices.exists({ index: targetIndexName });
+
+    if (indexExists === false) {
+      await createMapping();
+      reader();
     } else if (deleteIndex === true) {
-      client.indices.delete({ index: targetIndexName }, function () {
-        createMapping().then(reader);
-      });
+      await targetClient.indices.delete({ index: targetIndexName });
+      await createMapping();
+      reader();
     } else {
       reader();
     }
-  });
+  } catch (error) {
+    console.error('Error checking index existence:', error);
+  } finally {
+    // targetClient.close();
+  }
+
+  return { events: indexer.queueEmitter };
 }
 
 module.exports = transformer;
@@ -1,48 +1,55 @@
 import fs from 'fs';
 import es from 'event-stream';
 import glob from 'glob';
-import elasticsearch from 'elasticsearch';
+import cliProgress from 'cli-progress';
+import elasticsearch from '@elastic/elasticsearch';
 
 function createMappingFactory(ref) {
-  var client = ref.client;
+  var sourceClient = ref.sourceClient;
+  var sourceIndexName = ref.sourceIndexName;
+  var targetClient = ref.targetClient;
   var targetIndexName = ref.targetIndexName;
   var mappings = ref.mappings;
   var verbose = ref.verbose;
 
-  return function () { return (new Promise(function (resolve, reject) {
-    console.log('targetIndexName', targetIndexName);
-    if (
-      typeof mappings === 'object'
-      && mappings !== null
-    ) {
-      client.indices.create({
-        index: targetIndexName,
-        body: { mappings: mappings },
-      }, function (err, resp) {
-        if (err) {
-          console.log('Error creating mapping', err);
-          reject();
-          return;
-        }
-        if (verbose) { console.log('Created mapping', resp); }
-        resolve();
-      });
-    } else {
-      resolve();
+  return async function () {
+    var targetMappings = mappings;
+
+    if (sourceClient && sourceIndexName && typeof targetMappings === 'undefined') {
+      try {
+        var mapping = await sourceClient.indices.getMapping({ index: sourceIndexName });
+        targetMappings = mapping[sourceIndexName].mappings;
+      } catch (err) {
+        console.log('Error reading source mapping', err);
+        return;
+      }
     }
-  })); };
+
+    if (typeof targetMappings === 'object' && targetMappings !== null) {
+      try {
+        var resp = await targetClient.indices.create(
+          {
+            index: targetIndexName,
+            body: { mappings: targetMappings },
+          }
+        );
+        if (verbose) { console.log('Created target mapping', resp); }
+      } catch (err) {
+        console.log('Error creating target mapping', err);
+      }
+    }
+  };
 }
 
 function fileReaderFactory(indexer, fileName, transform, splitRegex, verbose) {
   function startIndex(files) {
     var file = files.shift();
     var s = fs.createReadStream(file)
-      .pipe(es.split(/\n/))
+      .pipe(es.split(splitRegex))
       .pipe(es.mapSync(function (line) {
         s.pause();
         try {
           var doc = (typeof transform === 'function') ? transform(line) : line;
-          console.log('doc', doc);
           // if doc is undefined we'll skip indexing it
           if (typeof doc === 'undefined') {
             s.resume();
@@ -90,9 +97,8 @@ var queueEmitter = new EventEmitter();
 
 // a simple helper queue to bulk index documents
 function indexQueueFactory(ref) {
-  var client = ref.client;
+  var client = ref.targetClient;
   var targetIndexName = ref.targetIndexName;
-  var typeName = ref.typeName; if ( typeName === void 0 ) typeName = 'doc';
   var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = 1000;
   var skipHeader = ref.skipHeader; if ( skipHeader === void 0 ) skipHeader = false;
   var verbose = ref.verbose; if ( verbose === void 0 ) verbose = true;
@@ -101,36 +107,40 @@ function indexQueueFactory(ref) {
   var queue = [];
   var ingesting = false;
 
-  var ingest = function (b) {
+  var ingest = async function (b) {
     if (typeof b !== 'undefined') {
       queue.push(b);
+      queueEmitter.emit('queue-size', queue.length);
     }
 
     if (ingesting === false) {
       var docs = queue.shift();
+      queueEmitter.emit('queue-size', queue.length);
       ingesting = true;
       if (verbose) { console.log(("bulk ingest docs: " + (docs.length / 2) + ", queue length: " + (queue.length))); }
 
-      client.bulk({ body: docs }, function () {
+      try {
+        await client.bulk({ body: docs });
         ingesting = false;
         if (queue.length > 0) {
           ingest();
         }
-      });
+      } catch (err) {
+        console.log('bulk index error', err);
+      }
     }
 
     // console.log(`ingest: queue.length ${queue.length}`);
     if (queue.length === 0) {
+      queueEmitter.emit('queue-size', 0);
      queueEmitter.emit('resume');
     }
-
-    return [];
   };
 
   return {
     add: function (doc) {
       if (!skipHeader) {
-        var header = { index: { _index: targetIndexName, _type: typeName } };
+        var header = { index: { _index: targetIndexName } };
         buffer.push(header);
       }
       buffer.push(doc);
@@ -141,16 +151,24 @@ function indexQueueFactory(ref) {
     }
 
     if (buffer.length >= (bufferSize * 2)) {
-      buffer = ingest(buffer);
+      ingest(buffer);
+      buffer = [];
     }
   },
-  finish: function () {
-    buffer = ingest(buffer);
+  finish: async function () {
+    await ingest(buffer);
+    buffer = [];
+    queueEmitter.emit('finish');
   },
   queueEmitter: queueEmitter,
  };
}
 
+var MAX_QUEUE_SIZE = 5;
+
+// create a new progress bar instance and use shades_classic theme
+var progressBar = new cliProgress.SingleBar({}, cliProgress.Presets.shades_classic);
+
 function indexReaderFactory(indexer, sourceIndexName, transform, client) {
   return async function indexReader() {
     var responseQueue = [];
@@ -160,12 +178,13 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
     return client.search({
       index: sourceIndexName,
       scroll: '30s',
+      size: 10000,
     });
   }
 
   function scroll(id) {
     return client.scroll({
-      scrollId: id,
+      scroll_id: id,
       scroll: '30s',
     });
   }
@@ -174,11 +193,12 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
   // our first response into the queue to be processed
   var se = await search();
   responseQueue.push(se);
+  progressBar.start(se.hits.total.value, 0);
 
   function processHit(hit) {
     docsNum += 1;
     try {
-      var doc = (typeof transform === 'function') ? transform(hit._source) : hit._source; // eslint-disable-line no-underscore-dangle,max-len
+      var doc = (typeof transform === 'function') ? transform(hit._source) : hit._source; // eslint-disable-line no-underscore-dangle
       // if doc is undefined we'll skip indexing it
       if (typeof doc === 'undefined') {
         return;
@@ -197,36 +217,86 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
       }
     }
 
-    while (responseQueue.length) {
-      var response = responseQueue.shift();
+    var ingestQueueSize = 0;
+    var scrollId = se._scroll_id; // eslint-disable-line no-underscore-dangle
+    var readActive = false;
+
+    async function processResponseQueue() {
+      while (responseQueue.length) {
+        readActive = true;
+        var response = responseQueue.shift();
+
+        // collect the docs from this response
+        response.hits.hits.forEach(processHit);
+
+        progressBar.update(docsNum);
+
+        // check to see if we have collected all of the docs
+        // console.log('check count', response.hits.total.value, docsNum);
+        if (response.hits.total.value === docsNum) {
+          indexer.finish();
+          progressBar.stop();
+          break;
+        }
+
+        if (ingestQueueSize < MAX_QUEUE_SIZE) {
+          // get the next response if there are more docs to fetch
+          var sc = await scroll(response._scroll_id); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
+          scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
+          responseQueue.push(sc);
+        } else {
+          readActive = false;
+        }
+      }
+    }
+
+    indexer.queueEmitter.on('queue-size', async function (size) {
+      ingestQueueSize = size;
+
+      if (!readActive && ingestQueueSize < MAX_QUEUE_SIZE) {
+        // get the next response if there are more docs to fetch
+        var sc = await scroll(scrollId); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
+        scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
+        responseQueue.push(sc);
+        processResponseQueue();
+      }
+    });
 
-      // collect the docs from this response
-      response.hits.hits.forEach(processHit);
+    indexer.queueEmitter.on('resume', async function () {
+      ingestQueueSize = 0;
 
-      // check to see if we have collected all of the docs
-      if (response.hits.total === docsNum) {
-        console.log('finished scrolling.');
-        indexer.finish();
-        break;
+      if (readActive) {
+        return;
       }
 
       // get the next response if there are more docs to fetch
-      var sc = await scroll(response._scroll_id); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
+      var sc = await scroll(scrollId); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
+      scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
      responseQueue.push(sc);
-    }
+      processResponseQueue();
+    });
+
+    processResponseQueue();
   };
 }
 
-function transformer(ref) {
+async function transformer(ref) {
   var deleteIndex = ref.deleteIndex; if ( deleteIndex === void 0 ) deleteIndex = false;
+  var protocol = ref.protocol; if ( protocol === void 0 ) protocol = 'http';
   var host = ref.host; if ( host === void 0 ) host = 'localhost';
   var port = ref.port; if ( port === void 0 ) port = '9200';
+  var auth = ref.auth;
+  var rejectUnauthorized = ref.rejectUnauthorized; if ( rejectUnauthorized === void 0 ) rejectUnauthorized = true;
+  var ca = ref.ca;
+  var targetProtocol = ref.targetProtocol;
+  var targetHost = ref.targetHost;
+  var targetPort = ref.targetPort;
+  var targetAuth = ref.targetAuth;
   var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = 1000;
   var fileName = ref.fileName;
   var splitRegex = ref.splitRegex; if ( splitRegex === void 0 ) splitRegex = /\n/;
   var sourceIndexName = ref.sourceIndexName;
   var targetIndexName = ref.targetIndexName;
-  var typeName = ref.typeName;
   var mappings = ref.mappings;
   var skipHeader = ref.skipHeader; if ( skipHeader === void 0 ) skipHeader = false;
   var transform = ref.transform;
@@ -236,30 +306,70 @@ function transformer(ref) {
236
306
  throw Error('targetIndexName must be specified.');
237
307
  }
238
308
 
239
- var client = new elasticsearch.Client({ host: (host + ":" + port) });
309
+ var sourceNode = protocol + "://" + host + ":" + port;
310
+ var sourceClient = new elasticsearch.Client({
311
+ node: sourceNode,
312
+ auth: auth,
313
+ tls: { ca: ca, rejectUnauthorized: rejectUnauthorized },
314
+ });
315
+
316
+ var targetNode = (typeof targetProtocol === 'string' ? targetProtocol : protocol) + "://" + (typeof targetHost === 'string' ? targetHost : host) + ":" + (typeof targetPort === 'string' ? targetPort : port);
317
+ var targetClient = new elasticsearch.Client({
318
+ node: targetNode,
319
+ auth: targetAuth !== undefined ? targetAuth : auth,
320
+ tls: { ca: ca, rejectUnauthorized: rejectUnauthorized },
321
+ });
240
322
 
241
323
  var createMapping = createMappingFactory({
242
- client: client, targetIndexName: targetIndexName, mappings: mappings, verbose: verbose,
324
+ sourceClient: sourceClient,
325
+ sourceIndexName: sourceIndexName,
326
+ targetClient: targetClient,
327
+ targetIndexName: targetIndexName,
328
+ mappings: mappings,
329
+ verbose: verbose,
243
330
  });
244
331
  var indexer = indexQueueFactory({
245
- client: client, targetIndexName: targetIndexName, typeName: typeName, bufferSize: bufferSize, skipHeader: skipHeader, verbose: verbose,
332
+ targetClient: targetClient,
333
+ targetIndexName: targetIndexName,
334
+ bufferSize: bufferSize,
335
+ skipHeader: skipHeader,
336
+ verbose: verbose,
246
337
  });
247
338
 
248
339
  function getReader() {
249
- if (typeof fileName !== 'undefined' && typeof sourceIndexName !== 'undefined') {
250
- throw Error('Only either one of fileName or sourceIndexName can be specified.');
340
+ if (
341
+ typeof fileName !== 'undefined'
342
+ && typeof sourceIndexName !== 'undefined'
343
+ ) {
344
+ throw Error(
345
+ 'Only either one of fileName or sourceIndexName can be specified.'
346
+ );
251
347
  }
252
348
 
253
- if (typeof fileName === 'undefined' && typeof sourceIndexName === 'undefined') {
349
+ if (
350
+ typeof fileName === 'undefined'
351
+ && typeof sourceIndexName === 'undefined'
352
+ ) {
254
353
  throw Error('Either fileName or sourceIndexName must be specified.');
255
354
  }
256
355
 
257
356
  if (typeof fileName !== 'undefined') {
258
- return fileReaderFactory(indexer, fileName, transform, splitRegex, verbose);
357
+ return fileReaderFactory(
358
+ indexer,
359
+ fileName,
360
+ transform,
361
+ splitRegex,
362
+ verbose
363
+ );
259
364
  }
260
365
 
261
366
  if (typeof sourceIndexName !== 'undefined') {
262
- return indexReaderFactory(indexer, sourceIndexName, transform, client);
367
+ return indexReaderFactory(
368
+ indexer,
369
+ sourceIndexName,
370
+ transform,
371
+ sourceClient
372
+ );
263
373
  }
264
374
 
265
375
  return null;
@@ -267,17 +377,26 @@ function transformer(ref) {
267
377
 
268
378
  var reader = getReader();
269
379
 
270
- client.indices.exists({ index: targetIndexName }, function (err, resp) {
271
- if (resp === false) {
272
- createMapping().then(reader);
380
+ try {
381
+ var indexExists = await targetClient.indices.exists({ index: targetIndexName });
382
+
383
+ if (indexExists === false) {
384
+ await createMapping();
385
+ reader();
273
386
  } else if (deleteIndex === true) {
274
- client.indices.delete({ index: targetIndexName }, function () {
275
- createMapping().then(reader);
276
- });
387
+ await targetClient.indices.delete({ index: targetIndexName });
388
+ await createMapping();
389
+ reader();
277
390
  } else {
278
391
  reader();
279
392
  }
280
- });
393
+ } catch (error) {
394
+ console.error('Error checking index existence:', error);
395
+ } finally {
396
+ // targetClient.close();
397
+ }
398
+
399
+ return { events: indexer.queueEmitter };
281
400
  }
282
401
 
283
402
  export default transformer;
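The diff above splits the single `client` into a `sourceClient` and a `targetClient`, where each `target*` connection option falls back to its source counterpart when unset. A minimal sketch of that fallback logic (the `resolveNodes` helper is hypothetical, written here only to illustrate the derivation of the two node URLs):

```javascript
// Mirrors the targetNode fallback in the alpha8 transformer: any target*
// option that is not a string falls back to the corresponding source option.
function resolveNodes({ protocol = 'http', host = 'localhost', port = '9200',
  targetProtocol, targetHost, targetPort }) {
  const sourceNode = `${protocol}://${host}:${port}`;
  const targetNode = `${typeof targetProtocol === 'string' ? targetProtocol : protocol}://`
    + `${typeof targetHost === 'string' ? targetHost : host}:`
    + `${typeof targetPort === 'string' ? targetPort : port}`;
  return { sourceNode, targetNode };
}

// Reindexing within one cluster: the target falls back to the source node.
console.log(resolveNodes({ host: 'es1.example.com' }).targetNode);
// → 'http://es1.example.com:9200'

// Cross-cluster reindexing: only the differing parts need to be set.
console.log(resolveNodes({ host: 'es1.example.com', targetHost: 'es2.example.com' }).targetNode);
// → 'http://es2.example.com:9200'
```

The same pattern applies to `targetAuth`, which falls back to `auth` when undefined, so a single-cluster call needs no `target*` options at all.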
package/package.json CHANGED
@@ -14,25 +14,26 @@
   "license": "Apache-2.0",
   "author": "Walter Rafelsberger <walter@rafelsberger.at>",
   "contributors": [],
-  "version": "1.0.0-alpha6",
+  "version": "1.0.0-alpha8",
   "main": "dist/node-es-transformer.cjs.js",
   "module": "dist/node-es-transformer.esm.js",
   "dependencies": {
-    "elasticsearch": "^15.0.0",
-    "event-stream": "^3.3.4",
-    "glob": "^7.1.2"
+    "@elastic/elasticsearch": "^8.8.1",
+    "cli-progress": "^3.12.0",
+    "event-stream": "3.3.4",
+    "glob": "7.1.2"
   },
   "devDependencies": {
-    "acorn": "^6.0.0",
-    "eslint": "^4.19.1",
-    "eslint-config-airbnb": "^17.1.0",
-    "eslint-plugin-jsx-a11y": "^6.1.1",
-    "eslint-plugin-react": "^7.11.0",
-    "eslint-plugin-import": "^2.12.0",
-    "rollup": "^0.66.6",
-    "rollup-plugin-buble": "^0.19.4",
-    "rollup-plugin-commonjs": "^8.0.2",
-    "rollup-plugin-node-resolve": "^3.0.0"
+    "acorn": "^6.4.2",
+    "eslint": "8.2.0",
+    "eslint-config-airbnb": "19.0.4",
+    "eslint-plugin-import": "2.27.5",
+    "eslint-plugin-jsx-a11y": "6.7.1",
+    "eslint-plugin-react": "7.32.2",
+    "rollup": "0.66.6",
+    "rollup-plugin-buble": "0.19.6",
+    "rollup-plugin-commonjs": "8.0.2",
+    "rollup-plugin-node-resolve": "3.0.0"
   },
   "scripts": {
     "build": "rollup -c",