node-es-transformer 1.0.0-alpha1 → 1.0.0-alpha11

package/CHANGELOG.md ADDED
@@ -0,0 +1,32 @@
1
+ # Changelog
2
+
3
+ All notable changes to this project will be documented in this file. See [commit-and-tag-version](https://github.com/absolute-version/commit-and-tag-version) for commit guidelines.
4
+
5
+ ## [1.0.0-alpha11](https://github.com/walterra/node-es-transformer/compare/v1.0.0-alpha10...v1.0.0-alpha11) (2023-10-12)
6
+
7
+ ### Features
8
+
9
+ - new option 'indexMappingTotalFieldsLimit' ([92edad1](https://github.com/walterra/node-es-transformer/commit/92edad18da7186d3881fc181e6e88b7929bed2d4))
10
+
11
+ ### Bug Fixes
12
+
13
+ - fixes bufferSize to be applied to index reader too ([ffc3749](https://github.com/walterra/node-es-transformer/commit/ffc3749e296cd39f39924571c197986addc756ff))
14
+
15
+ ## [`v1.0.0-alpha10`](https://github.com/walterra/node-es-transformer/releases/tag/v1.0.0-alpha10)
16
+
17
+ - New option `mappingsOverride` (0b951e1).
18
+ - New option `query` (45f91db).
19
+
20
+ ## [`v1.0.0-alpha9`](https://github.com/walterra/node-es-transformer/releases/tag/v1.0.0-alpha9)
21
+
22
+ - Source and target configs are now expected to be passed in as complete client configs instead of individual parameters (5e6d0c7).
23
+
24
+ ## [`v1.0.0-alpha8`](https://github.com/walterra/node-es-transformer/releases/tag/v1.0.0-alpha8)
25
+
26
+ - Exposes events and introduces `finish` event (a3e5810).
27
+ - Drop support for `_type` from `6.x` indices (3a26a84).
28
+
29
+ ## [`v1.0.0-alpha7`](https://github.com/walterra/node-es-transformer/releases/tag/v1.0.0-alpha7)
30
+
31
+ - This version locks down `event-stream` to version `3.3.4` because of the security issue described here: https://github.com/dominictarr/event-stream/issues/116
32
+ - Last version to support `_type` from `6.x` indices.
package/README.md CHANGED
@@ -1,12 +1,36 @@
1
1
  [![npm](https://img.shields.io/npm/v/node-es-transformer.svg?maxAge=2592000)](https://www.npmjs.com/package/node-es-transformer)
2
2
  [![npm](https://img.shields.io/npm/l/node-es-transformer.svg?maxAge=2592000)](https://www.npmjs.com/package/node-es-transformer)
3
3
  [![npm](https://img.shields.io/npm/dt/node-es-transformer.svg?maxAge=2592000)](https://www.npmjs.com/package/node-es-transformer)
4
+ [![Commitizen friendly](https://img.shields.io/badge/commitizen-friendly-brightgreen.svg)](http://commitizen.github.io/cz-cli/)
4
5
 
5
6
  # node-es-transformer
6
7
 
7
8
  A nodejs based library to (re)index and transform data from/to Elasticsearch.
8
9
 
9
- **This is experimental code, use at your own risk.**
10
+ ### Why another reindex/ingestion tool?
11
+
12
+ If you're looking for a nodejs based tool that can ingest large CSV/JSON files in the gigabyte range, you've come to the right place. Everything else I've tried with larger files runs out of JS heap, hammers ES with too many single requests, times out, or tries to do everything in a single bulk request.
13
+
14
+ While I'd generally recommend using [Logstash](https://www.elastic.co/products/logstash), [filebeat](https://www.elastic.co/products/beats/filebeat), [Ingest Nodes](https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest.html), [Elastic Agent](https://www.elastic.co/guide/en/fleet/current/fleet-overview.html) or [Elasticsearch Transforms](https://www.elastic.co/guide/en/elasticsearch/reference/current/transforms.html) for established use cases, this tool may be of help especially if you feel more at home in the JavaScript/nodejs universe and have use cases with customized ingestion and data transformation needs.
15
+
16
+ **This is experimental code, use at your own risk. Nonetheless, I encourage you to give it a try so I can gather some feedback.**
17
+
18
+ ### So why is this still _alpha_?
19
+
20
+ - The API is not quite final and might change from release to release.
21
+ - The code needs some more safety measures to avoid some possible accidental data loss scenarios.
22
+ - No test coverage yet.
23
+
24
+ ---
25
+
26
+ Now that we've talked about the caveats, let's have a look at what you actually get with this tool:
27
+
28
+ ## Features
29
+
30
+ - Buffering/Streaming for both reading and indexing. Files are read using streaming and Elasticsearch ingestion is done using buffered bulk indexing. This is tailored towards ingestion of large files. Successfully tested so far with JSON and CSV files in the range of 20-30 GBytes. On a single machine running both `node-es-transformer` and Elasticsearch, ingestion rates of up to 20k documents/second were achieved (2.9 GHz Intel Core i7, 16 GByte RAM, SSD), depending on document size.
31
+ - Supports wildcards to ingest/transform a range of files in one go.
32
+ - Supports fetching documents from existing indices using search/scroll. This allows you to reindex with custom data transformations just using JavaScript in the `transform` callback.
33
+ - The `transform` callback gives you each source document, but you can split it into multiple documents by returning an array. An example use case for this: Each source document is a Tweet and you want to transform that into an entity-centric index based on Hashtags.
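
The last feature can be sketched as follows. This is a minimal, hypothetical `transform` callback, assuming each source document is an already-parsed tweet object with `id`, `text` and a `hashtags` array (these field names are illustrative and not part of the library). Returning an array emits one target document per hashtag; returning `undefined` skips the source document, which matches how the bundled readers in this diff treat the callback's return value.

```javascript
// Hypothetical transform callback: one tweet in, one document per hashtag out.
// The field names (id, text, hashtags) are assumptions for illustration only.
function tweetToHashtagDocs(tweet) {
  if (!Array.isArray(tweet.hashtags) || tweet.hashtags.length === 0) {
    // Returning undefined tells the reader to skip this source document.
    return undefined;
  }

  // Returning an array emits several target documents for one source document.
  return tweet.hashtags.map((hashtag) => ({
    hashtag: hashtag.toLowerCase(),
    tweet_id: tweet.id,
    text: tweet.text,
  }));
}
```

It would be passed as `transform: tweetToHashtagDocs` alongside `sourceIndexName` and `targetIndexName`.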
10
34
 
11
35
  ## Getting started
12
36
 
@@ -14,31 +38,90 @@ In your node-js project, add `node-es-transformer` as a dependency (`yarn add no
14
38
 
15
39
  Use the library in your code like:
16
40
 
41
+ ### Read from a file
42
+
17
43
  ```javascript
18
- const { transformer } = require('node-es-transformer');
44
+ const transformer = require('node-es-transformer');
19
45
 
20
46
  transformer({
21
47
  fileName: 'filename.json',
22
- indexName: 'my-index',
23
- typeName: 'doc',
48
+ targetIndexName: 'my-index',
49
+ mappings: {
50
+ properties: {
51
+ '@timestamp': {
52
+ type: 'date'
53
+ },
54
+ 'first_name': {
55
+ type: 'keyword'
56
+ },
57
+ 'last_name': {
58
+ type: 'keyword'
59
+ },
60
+ 'full_name': {
61
+ type: 'keyword'
62
+ }
63
+ }
64
+ },
65
+ transform(line) {
66
+ return {
67
+ ...line,
68
+ full_name: `${line.first_name} ${line.last_name}`
69
+ }
70
+ }
71
+ });
72
+ ```
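
Judging from the bundled file reader later in this diff, the callback receives each line as a raw string, so a newline-delimited JSON file likely has to be parsed before its fields can be read. A minimal sketch, assuming one JSON object per line and the `first_name`/`last_name` fields from the example above; `parseLine` is a hypothetical helper name, not something the library exports:

```javascript
// Hypothetical transform for a newline-delimited JSON file.
// Assumption: every non-empty line holds exactly one JSON object.
function parseLine(line) {
  if (typeof line !== 'string' || line.trim() === '') {
    return undefined; // skip blank lines, e.g. after a trailing newline
  }

  let doc;
  try {
    doc = JSON.parse(line);
  } catch (e) {
    return undefined; // skip lines that are not valid JSON
  }

  if (doc === null || typeof doc !== 'object' || Array.isArray(doc)) {
    return undefined; // skip JSON values that are not plain objects
  }

  return {
    ...doc,
    full_name: `${doc.first_name} ${doc.last_name}`,
  };
}
```

Passing `transform: parseLine` would then index parsed objects rather than raw strings, and silently drop lines that cannot be parsed. Whether dropping is acceptable depends on the data; logging the offending line may be preferable.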
73
+
74
+ ### Read from another index
75
+
76
+ ```javascript
77
+ const transformer = require('node-es-transformer');
78
+
79
+ transformer({
80
+ sourceIndexName: 'my-source-index',
81
+ targetIndexName: 'my-target-index',
82
+ // optional, if you skip mappings, they will be fetched from the source index.
24
83
  mappings: {
25
- doc: {
26
- properties: {
27
- '@timestamp': {
28
- type: 'date'
29
- },
30
- 'field1': {
31
- type: 'keyword'
32
- },
33
- 'field2': {
34
- type: 'keyword'
35
- }
84
+ properties: {
85
+ '@timestamp': {
86
+ type: 'date'
87
+ },
88
+ 'first_name': {
89
+ type: 'keyword'
90
+ },
91
+ 'last_name': {
92
+ type: 'keyword'
36
93
  },
94
+ 'full_name': {
95
+ type: 'keyword'
96
+ }
97
+ }
98
+ },
99
+ transform(doc) {
100
+ return {
101
+ ...doc,
102
+ full_name: `${doc.first_name} ${doc.last_name}`
37
103
  }
38
104
  }
39
105
  });
40
106
  ```
41
107
 
108
+ ### Options
109
+
110
+ - `deleteIndex`: Setting to automatically delete an existing index, default is `false`.
111
+ - `sourceClientConfig`/`targetClientConfig`: Optional Elasticsearch client options, defaults to `{ node: 'http://localhost:9200' }`.
112
+ - `bufferSize`: The number of documents inserted with each Elasticsearch bulk insert request, default is `1000`.
113
+ - `fileName`: Source filename to ingest, supports wildcards. If this is set, `sourceIndexName` is not allowed.
114
+ - `splitRegex`: Custom line split regex, defaults to `/\n/`.
115
+ - `sourceIndexName`: The source Elasticsearch index to reindex from. If this is set, `fileName` is not allowed.
116
+ - `targetIndexName`: The target Elasticsearch index where documents will be indexed.
117
+ - `mappings`: Optional Elasticsearch document mappings. If not set and you're reindexing from another index, the mappings from the existing index will be used.
118
+ - `mappingsOverride`: If you're reindexing and this is set to `true`, `mappings` will be applied on top of the source index's mappings. Defaults to `false`.
119
+ - `indexMappingTotalFieldsLimit`: Optional field limit for the target index to be created that will be passed on as the `index.mapping.total_fields.limit` setting.
120
+ - `query`: Optional Elasticsearch [DSL query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html) to filter documents from the source index.
121
+ - `skipHeader`: If true, skips the first line of the source file. Defaults to `false`.
122
+ - `transform(line)`: A callback function which allows the transformation of a source line into one or several documents.
123
+ - `verbose`: Logging verbosity, defaults to `true`.
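
How `mappings`, `mappingsOverride` and `indexMappingTotalFieldsLimit` combine can be sketched with a small standalone function. It mirrors what the bundled `createMappingFactory` later in this diff appears to do, but `buildCreateIndexBody` and its `sourceMappings` parameter are hypothetical names used only for illustration. Note that in override mode the bundled code seems to merge `mappings` directly into `properties`, so this sketch assumes bare field definitions are passed in that case.

```javascript
// Hypothetical helper mirroring how the bundled code appears to assemble the
// body of the create-index request. Not part of the library's public API.
function buildCreateIndexBody({
  sourceMappings,
  mappings,
  mappingsOverride = false,
  indexMappingTotalFieldsLimit,
}) {
  // Without override, explicit mappings win; otherwise start from the source.
  let targetMappings = mappingsOverride ? sourceMappings : mappings || sourceMappings;

  if (mappingsOverride) {
    // Override: merge the given field definitions into the source properties.
    targetMappings = {
      ...targetMappings,
      properties: { ...(targetMappings || {}).properties, ...mappings },
    };
  }

  return {
    mappings: targetMappings,
    // Only send the setting when a limit was actually provided.
    ...(indexMappingTotalFieldsLimit !== undefined
      ? { settings: { 'index.mapping.total_fields.limit': indexMappingTotalFieldsLimit } }
      : {}),
  };
}
```

For example, calling it with source properties `{ a: ... }`, `mappings: { b: ... }`, `mappingsOverride: true` and a limit of `2000` yields a body whose `properties` contain both `a` and `b`, plus the `index.mapping.total_fields.limit` setting.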
124
+
42
125
  ## Development
43
126
 
44
127
  Clone this repository and install its dependencies:
@@ -49,12 +132,12 @@ cd node-es-transformer
49
132
  yarn
50
133
  ```
51
134
 
52
- `yarn build` builds the library to `dist`, generating three files:
135
+ `yarn build` builds the library to `dist`, generating two files:
53
136
 
54
- * `dist/node-es-transformer.cjs.js`
55
- A CommonJS bundle, suitable for use in Node.js, that `require`s the external dependency. This corresponds to the `"main"` field in package.json
56
- * `dist/node-es-transformer.esm.js`
57
- an ES module bundle, suitable for use in other people's libraries and applications, that `import`s the external dependency. This corresponds to the `"module"` field in package.json
137
+ - `dist/node-es-transformer.cjs.js`
138
+ A CommonJS bundle, suitable for use in Node.js, that `require`s the external dependency. This corresponds to the `"main"` field in package.json
139
+ - `dist/node-es-transformer.esm.js`
140
+ an ES module bundle, suitable for use in other people's libraries and applications, that `import`s the external dependency. This corresponds to the `"module"` field in package.json
58
141
 
59
142
  `yarn dev` builds the library, then keeps rebuilding it whenever the source files change using [rollup-watch](https://github.com/rollup/rollup-watch).
60
143
 
@@ -1,104 +1,418 @@
1
1
  'use strict';
2
2
 
3
- Object.defineProperty(exports, '__esModule', { value: true });
4
-
5
3
  function _interopDefault (ex) { return (ex && (typeof ex === 'object') && 'default' in ex) ? ex['default'] : ex; }
6
4
 
7
5
  var fs = _interopDefault(require('fs'));
8
6
  var es = _interopDefault(require('event-stream'));
9
- var elasticsearch = _interopDefault(require('elasticsearch'));
10
-
11
- function transformer(ref) {
12
- var deleteIndex = ref.deleteIndex; if ( deleteIndex === void 0 ) deleteIndex = false;
13
- var host = ref.host; if ( host === void 0 ) host = 'localhost';
14
- var port = ref.port; if ( port === void 0 ) port = '9200';
15
- var fileName = ref.fileName;
16
- var indexName = ref.indexName;
17
- var typeName = ref.typeName;
18
- var mappings = ref.mappings;
19
- var transform = ref.transform;
20
- var verbose = ref.verbose; if ( verbose === void 0 ) verbose = true;
21
-
22
-
23
- var client = new elasticsearch.Client({
24
- host: (host + ":" + port)
25
- });
26
-
27
- client.indices.exists({
28
- index: indexName
29
- }, function (err, resp) {
30
- if (resp === false) {
31
- createMapping();
32
- } else {
33
- if (deleteIndex === true) {
34
- client.indices.delete({
35
- index: indexName
36
- }, function (err, resp) {
37
- createMapping();
38
- });
39
- } else {
40
- indexFile();
41
- }
42
- }
43
- });
44
-
45
- function createMapping() {
46
- if (
47
- typeof mappings === 'object' &&
48
- mappings !== null
49
- ) {
50
- client.indices.create({
51
- index: indexName,
52
- body: {
53
- mappings: mappings
54
- }
55
- }, function (err, resp) {
56
- console.log('create mapping', err, resp);
57
- indexFile();
58
- });
59
- } else {
60
- indexFile();
61
- }
62
- }
63
-
64
- function indexFile() {
65
- var docs = [];
66
- var s = fs.createReadStream(fileName)
67
- .pipe(es.split())
68
- .pipe(es.mapSync(function (line) {
69
- s.pause();
70
-
71
- if (line) {
72
- try {
73
- var header = { index: { _index: indexName, _type: typeName } };
74
-
75
- var doc = (typeof transform === 'function') ? transform(line) : line;
76
-
77
- docs.push(header);
78
- docs.push(doc);
79
- } catch (e) {
80
- console.log('error', e);
81
- }
82
- }
83
-
84
- // resume the readstream, possibly from a callback
85
- s.resume();
86
- })
87
- .on('error', function (err) {
88
- console.log('Error while reading file.', err);
89
- })
90
- .on('end', function () {
91
- verbose && console.log('Read entire file.');
92
- client.bulk({
93
- body: docs
94
- }, function (err, resp) {
95
- if (err) {
96
- console.log('Ingest Error:', err);
97
- }
98
- });
99
- })
100
- );
101
- }
7
+ var glob = _interopDefault(require('glob'));
8
+ var cliProgress = _interopDefault(require('cli-progress'));
9
+ var elasticsearch = _interopDefault(require('@elastic/elasticsearch'));
10
+
11
+ var DEFAULT_BUFFER_SIZE = 1000;
12
+
13
+ function createMappingFactory(ref) {
14
+ var sourceClient = ref.sourceClient;
15
+ var sourceIndexName = ref.sourceIndexName;
16
+ var targetClient = ref.targetClient;
17
+ var targetIndexName = ref.targetIndexName;
18
+ var mappings = ref.mappings;
19
+ var mappingsOverride = ref.mappingsOverride;
20
+ var indexMappingTotalFieldsLimit = ref.indexMappingTotalFieldsLimit;
21
+ var verbose = ref.verbose;
22
+
23
+ return async function () {
24
+ var targetMappings = mappingsOverride ? undefined : mappings;
25
+
26
+ if (sourceClient && sourceIndexName && typeof targetMappings === 'undefined') {
27
+ try {
28
+ var mapping = await sourceClient.indices.getMapping({
29
+ index: sourceIndexName,
30
+ });
31
+ targetMappings = mapping[sourceIndexName].mappings;
32
+ } catch (err) {
33
+ console.log('Error reading source mapping', err);
34
+ return;
35
+ }
36
+ }
37
+
38
+ if (typeof targetMappings === 'object' && targetMappings !== null) {
39
+ if (mappingsOverride) {
40
+ targetMappings = Object.assign({}, targetMappings,
41
+ {properties: Object.assign({}, targetMappings.properties,
42
+ mappings)});
43
+ }
44
+
45
+ try {
46
+ var resp = await targetClient.indices.create({
47
+ index: targetIndexName,
48
+ body: Object.assign({}, {mappings: targetMappings},
49
+ (indexMappingTotalFieldsLimit !== undefined
50
+ ? {
51
+ settings: {
52
+ 'index.mapping.total_fields.limit': indexMappingTotalFieldsLimit,
53
+ },
54
+ }
55
+ : {})),
56
+ });
57
+ if (verbose) { console.log('Created target mapping', resp); }
58
+ } catch (err) {
59
+ console.log('Error creating target mapping', err);
60
+ }
61
+ }
62
+ };
63
+ }
64
+
65
+ function fileReaderFactory(indexer, fileName, transform, splitRegex, verbose) {
66
+ function startIndex(files) {
67
+ var file = files.shift();
68
+ var s = fs
69
+ .createReadStream(file)
70
+ .pipe(es.split(splitRegex))
71
+ .pipe(
72
+ es
73
+ .mapSync(function (line) {
74
+ s.pause();
75
+ try {
76
+ var doc = typeof transform === 'function' ? transform(line) : line;
77
+ // if doc is undefined we'll skip indexing it
78
+ if (typeof doc === 'undefined') {
79
+ s.resume();
80
+ return;
81
+ }
82
+
83
+ // the transform callback may return an array of docs so we can emit
84
+ // multiple docs from a single line
85
+ if (Array.isArray(doc)) {
86
+ doc.forEach(function (d) { return indexer.add(d); });
87
+ return;
88
+ }
89
+
90
+ indexer.add(doc);
91
+ } catch (e) {
92
+ console.log('error', e);
93
+ }
94
+ })
95
+ .on('error', function (err) {
96
+ console.log('Error while reading file.', err);
97
+ })
98
+ .on('end', function () {
99
+ if (verbose) { console.log('Read entire file: ', file); }
100
+ indexer.finish();
101
+ if (files.length > 0) {
102
+ startIndex(files);
103
+ }
104
+ })
105
+ );
106
+
107
+ indexer.queueEmitter.on('resume', function () {
108
+ s.resume();
109
+ });
110
+ }
111
+
112
+ return function () {
113
+ glob(fileName, function (er, files) {
114
+ startIndex(files);
115
+ });
116
+ };
117
+ }
118
+
119
+ var EventEmitter = require('events');
120
+
121
+ var queueEmitter = new EventEmitter();
122
+
123
+ // a simple helper queue to bulk index documents
124
+ function indexQueueFactory(ref) {
125
+ var client = ref.targetClient;
126
+ var targetIndexName = ref.targetIndexName;
127
+ var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
128
+ var skipHeader = ref.skipHeader; if ( skipHeader === void 0 ) skipHeader = false;
129
+ var verbose = ref.verbose; if ( verbose === void 0 ) verbose = true;
130
+
131
+ var buffer = [];
132
+ var queue = [];
133
+ var ingesting = false;
134
+
135
+ var ingest = async function (b) {
136
+ if (typeof b !== 'undefined') {
137
+ queue.push(b);
138
+ queueEmitter.emit('queue-size', queue.length);
139
+ }
140
+
141
+ if (ingesting === false) {
142
+ var docs = queue.shift();
143
+ queueEmitter.emit('queue-size', queue.length);
144
+ ingesting = true;
145
+ if (verbose)
146
+ { console.log(("bulk ingest docs: " + (docs.length / 2) + ", queue length: " + (queue.length))); }
147
+
148
+ try {
149
+ await client.bulk({ body: docs });
150
+ ingesting = false;
151
+ if (queue.length > 0) {
152
+ ingest();
153
+ }
154
+ } catch (err) {
155
+ console.log('bulk index error', err);
156
+ }
157
+ }
158
+
159
+ // console.log(`ingest: queue.length ${queue.length}`);
160
+ if (queue.length === 0) {
161
+ queueEmitter.emit('queue-size', 0);
162
+ queueEmitter.emit('resume');
163
+ }
164
+ };
165
+
166
+ return {
167
+ add: function (doc) {
168
+ if (!skipHeader) {
169
+ var header = { index: { _index: targetIndexName } };
170
+ buffer.push(header);
171
+ }
172
+ buffer.push(doc);
173
+
174
+ // console.log(`add: queue.length ${queue.length}`);
175
+ if (queue.length === 0) {
176
+ queueEmitter.emit('resume');
177
+ }
178
+
179
+ if (buffer.length >= bufferSize * 2) {
180
+ ingest(buffer);
181
+ buffer = [];
182
+ }
183
+ },
184
+ finish: async function () {
185
+ await ingest(buffer);
186
+ buffer = [];
187
+ queueEmitter.emit('finish');
188
+ },
189
+ queueEmitter: queueEmitter,
190
+ };
191
+ }
192
+
193
+ var MAX_QUEUE_SIZE = 5;
194
+
195
+ // create a new progress bar instance and use shades_classic theme
196
+ var progressBar = new cliProgress.SingleBar({}, cliProgress.Presets.shades_classic);
197
+
198
+ function indexReaderFactory(
199
+ indexer,
200
+ sourceIndexName,
201
+ transform,
202
+ client,
203
+ query,
204
+ bufferSize
205
+ ) {
206
+ if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
207
+
208
+ return async function indexReader() {
209
+ var responseQueue = [];
210
+ var docsNum = 0;
211
+
212
+ function search() {
213
+ return client.search({
214
+ index: sourceIndexName,
215
+ scroll: '30s',
216
+ size: bufferSize,
217
+ query: query,
218
+ });
219
+ }
220
+
221
+ function scroll(id) {
222
+ return client.scroll({
223
+ scroll_id: id,
224
+ scroll: '30s',
225
+ });
226
+ }
227
+
228
+ // start things off by searching, setting a scroll timeout, and pushing
229
+ // our first response into the queue to be processed
230
+ var se = await search();
231
+ responseQueue.push(se);
232
+ progressBar.start(se.hits.total.value, 0);
233
+
234
+ function processHit(hit) {
235
+ docsNum += 1;
236
+ try {
237
+ var doc = typeof transform === 'function' ? transform(hit._source) : hit._source; // eslint-disable-line no-underscore-dangle
238
+ // if doc is undefined we'll skip indexing it
239
+ if (typeof doc === 'undefined') {
240
+ return;
241
+ }
242
+
243
+ // the transform callback may return an array of docs so we can emit
244
+ // multiple docs from a single line
245
+ if (Array.isArray(doc)) {
246
+ doc.forEach(function (d) { return indexer.add(d); });
247
+ return;
248
+ }
249
+
250
+ indexer.add(doc);
251
+ } catch (e) {
252
+ console.log('error', e);
253
+ }
254
+ }
255
+
256
+ var ingestQueueSize = 0;
257
+ var scrollId = se._scroll_id; // eslint-disable-line no-underscore-dangle
258
+ var readActive = false;
259
+
260
+ async function processResponseQueue() {
261
+ while (responseQueue.length) {
262
+ readActive = true;
263
+ var response = responseQueue.shift();
264
+
265
+ // collect the docs from this response
266
+ response.hits.hits.forEach(processHit);
267
+
268
+ progressBar.update(docsNum);
269
+
270
+ // check to see if we have collected all of the docs
271
+ // console.log('check count', response.hits.total.value, docsNum);
272
+ if (response.hits.total.value === docsNum) {
273
+ indexer.finish();
274
+ progressBar.stop();
275
+ break;
276
+ }
277
+
278
+ if (ingestQueueSize < MAX_QUEUE_SIZE) {
279
+ // get the next response if there are more docs to fetch
280
+ var sc = await scroll(response._scroll_id); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
281
+ scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
282
+ responseQueue.push(sc);
283
+ } else {
284
+ readActive = false;
285
+ }
286
+ }
287
+ }
288
+
289
+ indexer.queueEmitter.on('queue-size', async function (size) {
290
+ ingestQueueSize = size;
291
+
292
+ if (!readActive && ingestQueueSize < MAX_QUEUE_SIZE) {
293
+ // get the next response if there are more docs to fetch
294
+ var sc = await scroll(scrollId); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
295
+ scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
296
+ responseQueue.push(sc);
297
+ processResponseQueue();
298
+ }
299
+ });
300
+
301
+ indexer.queueEmitter.on('resume', async function () {
302
+ ingestQueueSize = 0;
303
+
304
+ if (readActive) {
305
+ return;
306
+ }
307
+
308
+ // get the next response if there are more docs to fetch
309
+ var sc = await scroll(scrollId); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
310
+ scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
311
+ responseQueue.push(sc);
312
+ processResponseQueue();
313
+ });
314
+
315
+ processResponseQueue();
316
+ };
317
+ }
318
+
319
+ async function transformer(ref) {
320
+ var deleteIndex = ref.deleteIndex; if ( deleteIndex === void 0 ) deleteIndex = false;
321
+ var sourceClientConfig = ref.sourceClientConfig;
322
+ var targetClientConfig = ref.targetClientConfig;
323
+ var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
324
+ var fileName = ref.fileName;
325
+ var splitRegex = ref.splitRegex; if ( splitRegex === void 0 ) splitRegex = /\n/;
326
+ var sourceIndexName = ref.sourceIndexName;
327
+ var targetIndexName = ref.targetIndexName;
328
+ var mappings = ref.mappings;
329
+ var mappingsOverride = ref.mappingsOverride; if ( mappingsOverride === void 0 ) mappingsOverride = false;
330
+ var indexMappingTotalFieldsLimit = ref.indexMappingTotalFieldsLimit;
331
+ var query = ref.query;
332
+ var skipHeader = ref.skipHeader; if ( skipHeader === void 0 ) skipHeader = false;
333
+ var transform = ref.transform;
334
+ var verbose = ref.verbose; if ( verbose === void 0 ) verbose = true;
335
+
336
+ if (typeof targetIndexName === 'undefined') {
337
+ throw Error('targetIndexName must be specified.');
338
+ }
339
+
340
+ var defaultClientConfig = {
341
+ node: 'http://localhost:9200',
342
+ };
343
+
344
+ var sourceClient = new elasticsearch.Client(sourceClientConfig || defaultClientConfig);
345
+ var targetClient = new elasticsearch.Client(
346
+ targetClientConfig || sourceClientConfig || defaultClientConfig
347
+ );
348
+
349
+ var createMapping = createMappingFactory({
350
+ sourceClient: sourceClient,
351
+ sourceIndexName: sourceIndexName,
352
+ targetClient: targetClient,
353
+ targetIndexName: targetIndexName,
354
+ mappings: mappings,
355
+ mappingsOverride: mappingsOverride,
356
+ indexMappingTotalFieldsLimit: indexMappingTotalFieldsLimit,
357
+ verbose: verbose,
358
+ });
359
+ var indexer = indexQueueFactory({
360
+ targetClient: targetClient,
361
+ targetIndexName: targetIndexName,
362
+ bufferSize: bufferSize,
363
+ skipHeader: skipHeader,
364
+ verbose: verbose,
365
+ });
366
+
367
+ function getReader() {
368
+ if (typeof fileName !== 'undefined' && typeof sourceIndexName !== 'undefined') {
369
+ throw Error('Only either one of fileName or sourceIndexName can be specified.');
370
+ }
371
+
372
+ if (typeof fileName === 'undefined' && typeof sourceIndexName === 'undefined') {
373
+ throw Error('Either fileName or sourceIndexName must be specified.');
374
+ }
375
+
376
+ if (typeof fileName !== 'undefined') {
377
+ return fileReaderFactory(indexer, fileName, transform, splitRegex, verbose);
378
+ }
379
+
380
+ if (typeof sourceIndexName !== 'undefined') {
381
+ return indexReaderFactory(
382
+ indexer,
383
+ sourceIndexName,
384
+ transform,
385
+ sourceClient,
386
+ query,
387
+ bufferSize
388
+ );
389
+ }
390
+
391
+ return null;
392
+ }
393
+
394
+ var reader = getReader();
395
+
396
+ try {
397
+ var indexExists = await targetClient.indices.exists({ index: targetIndexName });
398
+
399
+ if (indexExists === false) {
400
+ await createMapping();
401
+ reader();
402
+ } else if (deleteIndex === true) {
403
+ await targetClient.indices.delete({ index: targetIndexName });
404
+ await createMapping();
405
+ reader();
406
+ } else {
407
+ reader();
408
+ }
409
+ } catch (error) {
410
+ console.error('Error checking index existence:', error);
411
+ } finally {
412
+ // targetClient.close();
413
+ }
414
+
415
+ return { events: indexer.queueEmitter };
102
416
  }
103
417
 
104
- exports.transformer = transformer;
418
+ module.exports = transformer;
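
Because the new `transformer` is declared `async` and returns `{ events: indexer.queueEmitter }`, a caller can presumably wait for the `finish` event mentioned in the changelog. A minimal sketch, assuming only Node's built-in `events` module; `waitForFinish` is a hypothetical helper, and the emitter below merely stands in for the one the library returns:

```javascript
const { EventEmitter, once } = require('events');

// Hypothetical helper: resolves once the given emitter fires 'finish'.
async function waitForFinish(events) {
  await once(events, 'finish');
  return 'finished';
}

// Stand-in for the emitter returned by the library, for demonstration only.
const fakeEvents = new EventEmitter();
const done = waitForFinish(fakeEvents);

// Simulate the indexer signalling that its last buffer was handed off.
setImmediate(() => fakeEvents.emit('finish'));
done.then((status) => console.log(status));
```

With the real library this would look roughly like `const { events } = await transformer({ ... }); await waitForFinish(events);`. Two caveats follow from the bundled code: the emitter is created once at module level, and with a wildcard `fileName` the file reader calls `indexer.finish()` at the end of every file, so `finish` may fire more than once.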
@@ -1,98 +1,414 @@
1
1
  import fs from 'fs';
2
2
  import es from 'event-stream';
3
- import elasticsearch from 'elasticsearch';
4
-
5
- function transformer(ref) {
6
- var deleteIndex = ref.deleteIndex; if ( deleteIndex === void 0 ) deleteIndex = false;
7
- var host = ref.host; if ( host === void 0 ) host = 'localhost';
8
- var port = ref.port; if ( port === void 0 ) port = '9200';
9
- var fileName = ref.fileName;
10
- var indexName = ref.indexName;
11
- var typeName = ref.typeName;
12
- var mappings = ref.mappings;
13
- var transform = ref.transform;
14
- var verbose = ref.verbose; if ( verbose === void 0 ) verbose = true;
15
-
16
-
17
- var client = new elasticsearch.Client({
18
- host: (host + ":" + port)
19
- });
20
-
21
- client.indices.exists({
22
- index: indexName
23
- }, function (err, resp) {
24
- if (resp === false) {
25
- createMapping();
26
- } else {
27
- if (deleteIndex === true) {
28
- client.indices.delete({
29
- index: indexName
30
- }, function (err, resp) {
31
- createMapping();
32
- });
33
- } else {
34
- indexFile();
35
- }
36
- }
37
- });
38
-
39
- function createMapping() {
40
- if (
41
- typeof mappings === 'object' &&
42
- mappings !== null
43
- ) {
44
- client.indices.create({
45
- index: indexName,
46
- body: {
47
- mappings: mappings
48
- }
49
- }, function (err, resp) {
50
- console.log('create mapping', err, resp);
51
- indexFile();
52
- });
53
- } else {
54
- indexFile();
55
- }
56
- }
57
-
58
- function indexFile() {
59
- var docs = [];
60
- var s = fs.createReadStream(fileName)
61
- .pipe(es.split())
62
- .pipe(es.mapSync(function (line) {
63
- s.pause();
64
-
65
- if (line) {
66
- try {
67
- var header = { index: { _index: indexName, _type: typeName } };
68
-
69
- var doc = (typeof transform === 'function') ? transform(line) : line;
70
-
71
- docs.push(header);
72
- docs.push(doc);
73
- } catch (e) {
74
- console.log('error', e);
75
- }
76
- }
77
-
78
- // resume the readstream, possibly from a callback
79
- s.resume();
80
- })
81
- .on('error', function (err) {
82
- console.log('Error while reading file.', err);
83
- })
84
- .on('end', function () {
85
- verbose && console.log('Read entire file.');
86
- client.bulk({
87
- body: docs
88
- }, function (err, resp) {
89
- if (err) {
90
- console.log('Ingest Error:', err);
91
- }
92
- });
93
- })
94
- );
95
- }
3
+ import glob from 'glob';
4
+ import cliProgress from 'cli-progress';
5
+ import elasticsearch from '@elastic/elasticsearch';
6
+
7
+ var DEFAULT_BUFFER_SIZE = 1000;
8
+
9
+ function createMappingFactory(ref) {
10
+ var sourceClient = ref.sourceClient;
11
+ var sourceIndexName = ref.sourceIndexName;
12
+ var targetClient = ref.targetClient;
13
+ var targetIndexName = ref.targetIndexName;
14
+ var mappings = ref.mappings;
15
+ var mappingsOverride = ref.mappingsOverride;
16
+ var indexMappingTotalFieldsLimit = ref.indexMappingTotalFieldsLimit;
17
+ var verbose = ref.verbose;
18
+
19
+ return async function () {
20
+ var targetMappings = mappingsOverride ? undefined : mappings;
21
+
22
+ if (sourceClient && sourceIndexName && typeof targetMappings === 'undefined') {
23
+ try {
24
+ var mapping = await sourceClient.indices.getMapping({
25
+ index: sourceIndexName,
26
+ });
27
+ targetMappings = mapping[sourceIndexName].mappings;
28
+ } catch (err) {
29
+ console.log('Error reading source mapping', err);
30
+ return;
31
+ }
32
+ }
33
+
34
+ if (typeof targetMappings === 'object' && targetMappings !== null) {
35
+ if (mappingsOverride) {
36
+ targetMappings = Object.assign({}, targetMappings,
37
+ {properties: Object.assign({}, targetMappings.properties,
38
+ mappings)});
39
+ }
40
+
41
+ try {
42
+ var resp = await targetClient.indices.create({
43
+ index: targetIndexName,
44
+ body: Object.assign({}, {mappings: targetMappings},
45
+ (indexMappingTotalFieldsLimit !== undefined
46
+ ? {
47
+ settings: {
48
+ 'index.mapping.total_fields.limit': indexMappingTotalFieldsLimit,
49
+ },
50
+ }
51
+ : {})),
52
+ });
53
+ if (verbose) { console.log('Created target mapping', resp); }
54
+ } catch (err) {
55
+ console.log('Error creating target mapping', err);
56
+ }
57
+ }
58
+ };
59
+ }
60
+
61
+ function fileReaderFactory(indexer, fileName, transform, splitRegex, verbose) {
62
+ function startIndex(files) {
63
+ var file = files.shift();
64
+ var s = fs
65
+ .createReadStream(file)
66
+ .pipe(es.split(splitRegex))
67
+ .pipe(
68
+ es
69
+ .mapSync(function (line) {
70
+ s.pause();
71
+ try {
72
+ var doc = typeof transform === 'function' ? transform(line) : line;
73
+ // if doc is undefined we'll skip indexing it
74
+ if (typeof doc === 'undefined') {
75
+ s.resume();
76
+ return;
77
+ }
78
+
79
+ // the transform callback may return an array of docs so we can emit
80
+ // multiple docs from a single line
81
+ if (Array.isArray(doc)) {
82
+ doc.forEach(function (d) { return indexer.add(d); });
83
+ return;
84
+ }
85
+
86
+ indexer.add(doc);
87
+ } catch (e) {
88
+ console.log('error', e);
89
+ }
90
+ })
91
+ .on('error', function (err) {
92
+ console.log('Error while reading file.', err);
93
+ })
94
+ .on('end', function () {
95
+ if (verbose) { console.log('Read entire file: ', file); }
96
+ indexer.finish();
97
+ if (files.length > 0) {
98
+ startIndex(files);
99
+ }
100
+ })
101
+ );
102
+
103
+ indexer.queueEmitter.on('resume', function () {
104
+ s.resume();
105
+ });
106
+ }
107
+
108
+ return function () {
109
+ glob(fileName, function (er, files) {
110
+ startIndex(files);
111
+ });
112
+ };
113
+ }
114
+
115
+ var EventEmitter = require('events');
116
+
117
+ var queueEmitter = new EventEmitter();
118
+
119
+ // a simple helper queue to bulk index documents
120
+ function indexQueueFactory(ref) {
121
+ var client = ref.targetClient;
122
+ var targetIndexName = ref.targetIndexName;
123
+ var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
124
+ var skipHeader = ref.skipHeader; if ( skipHeader === void 0 ) skipHeader = false;
125
+ var verbose = ref.verbose; if ( verbose === void 0 ) verbose = true;
126
+
127
+ var buffer = [];
128
+ var queue = [];
129
+ var ingesting = false;
130
+
131
+ var ingest = async function (b) {
132
+ if (typeof b !== 'undefined') {
133
+ queue.push(b);
134
+ queueEmitter.emit('queue-size', queue.length);
135
+ }
136
+
137
+ if (ingesting === false) {
138
+ var docs = queue.shift();
139
+ queueEmitter.emit('queue-size', queue.length);
140
+ ingesting = true;
141
+ if (verbose)
142
+ { console.log(("bulk ingest docs: " + (docs.length / 2) + ", queue length: " + (queue.length))); }
143
+
144
+ try {
145
+ await client.bulk({ body: docs });
146
+ ingesting = false;
147
+ if (queue.length > 0) {
148
+ ingest();
149
+ }
150
+ } catch (err) {
151
+ console.log('bulk index error', err);
152
+ }
153
+ }
154
+
155
+ // console.log(`ingest: queue.length ${queue.length}`);
156
+ if (queue.length === 0) {
157
+ queueEmitter.emit('queue-size', 0);
158
+ queueEmitter.emit('resume');
159
+ }
160
+ };
161
+
162
+ return {
163
+ add: function (doc) {
164
+ if (!skipHeader) {
165
+ var header = { index: { _index: targetIndexName } };
166
+ buffer.push(header);
167
+ }
168
+ buffer.push(doc);
169
+
170
+ // console.log(`add: queue.length ${queue.length}`);
171
+ if (queue.length === 0) {
172
+ queueEmitter.emit('resume');
173
+ }
174
+
175
+ if (buffer.length >= bufferSize * 2) {
176
+ ingest(buffer);
177
+ buffer = [];
178
+ }
179
+ },
180
+ finish: async function () {
181
+ await ingest(buffer);
182
+ buffer = [];
183
+ queueEmitter.emit('finish');
184
+ },
185
+ queueEmitter: queueEmitter,
186
+ };
187
+ }
188
+
189
+ var MAX_QUEUE_SIZE = 5;
190
+
191
+ // create a new progress bar instance and use shades_classic theme
192
+ var progressBar = new cliProgress.SingleBar({}, cliProgress.Presets.shades_classic);
193
+
194
+ function indexReaderFactory(
195
+ indexer,
196
+ sourceIndexName,
197
+ transform,
198
+ client,
199
+ query,
200
+ bufferSize
201
+ ) {
202
+ if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
203
+
204
+ return async function indexReader() {
205
+ var responseQueue = [];
206
+ var docsNum = 0;
207
+
208
+ function search() {
209
+ return client.search({
210
+ index: sourceIndexName,
211
+ scroll: '30s',
212
+ size: bufferSize,
213
+ query: query,
214
+ });
215
+ }
216
+
217
+ function scroll(id) {
218
+ return client.scroll({
219
+ scroll_id: id,
220
+ scroll: '30s',
221
+ });
222
+ }
223
+
224
+ // start things off by searching, setting a scroll timeout, and pushing
225
+ // our first response into the queue to be processed
226
+ var se = await search();
227
+ responseQueue.push(se);
228
+ progressBar.start(se.hits.total.value, 0);
229
+
230
+ function processHit(hit) {
231
+ docsNum += 1;
232
+ try {
233
+ var doc = typeof transform === 'function' ? transform(hit._source) : hit._source; // eslint-disable-line no-underscore-dangle
234
+ // if doc is undefined we'll skip indexing it
235
+ if (typeof doc === 'undefined') {
236
+ return;
237
+ }
238
+
239
+ // the transform callback may return an array of docs so we can emit
240
+ // multiple docs from a single hit
241
+ if (Array.isArray(doc)) {
242
+ doc.forEach(function (d) { return indexer.add(d); });
243
+ return;
244
+ }
245
+
246
+ indexer.add(doc);
247
+ } catch (e) {
248
+ console.log('error', e);
249
+ }
250
+ }
251
+
252
+ var ingestQueueSize = 0;
253
+ var scrollId = se._scroll_id; // eslint-disable-line no-underscore-dangle
254
+ var readActive = false;
255
+
256
+ async function processResponseQueue() {
257
+ while (responseQueue.length) {
258
+ readActive = true;
259
+ var response = responseQueue.shift();
260
+
261
+ // collect the docs from this response
262
+ response.hits.hits.forEach(processHit);
263
+
264
+ progressBar.update(docsNum);
265
+
266
+ // check to see if we have collected all of the docs
267
+ // console.log('check count', response.hits.total.value, docsNum);
268
+ if (response.hits.total.value === docsNum) {
269
+ indexer.finish();
270
+ progressBar.stop();
271
+ break;
272
+ }
273
+
274
+ if (ingestQueueSize < MAX_QUEUE_SIZE) {
275
+ // get the next response if there are more docs to fetch
276
+ var sc = await scroll(response._scroll_id); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
277
+ scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
278
+ responseQueue.push(sc);
279
+ } else {
280
+ readActive = false;
281
+ }
282
+ }
283
+ }
284
+
285
+ indexer.queueEmitter.on('queue-size', async function (size) {
286
+ ingestQueueSize = size;
287
+
288
+ if (!readActive && ingestQueueSize < MAX_QUEUE_SIZE) {
289
+ // get the next response if there are more docs to fetch
290
+ var sc = await scroll(scrollId); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
291
+ scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
292
+ responseQueue.push(sc);
293
+ processResponseQueue();
294
+ }
295
+ });
296
+
297
+ indexer.queueEmitter.on('resume', async function () {
298
+ ingestQueueSize = 0;
299
+
300
+ if (readActive) {
301
+ return;
302
+ }
303
+
304
+ // get the next response if there are more docs to fetch
305
+ var sc = await scroll(scrollId); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
306
+ scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
307
+ responseQueue.push(sc);
308
+ processResponseQueue();
309
+ });
310
+
311
+ processResponseQueue();
312
+ };
313
+ }
314
+
315
+ async function transformer(ref) {
316
+ var deleteIndex = ref.deleteIndex; if ( deleteIndex === void 0 ) deleteIndex = false;
317
+ var sourceClientConfig = ref.sourceClientConfig;
318
+ var targetClientConfig = ref.targetClientConfig;
319
+ var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
320
+ var fileName = ref.fileName;
321
+ var splitRegex = ref.splitRegex; if ( splitRegex === void 0 ) splitRegex = /\n/;
322
+ var sourceIndexName = ref.sourceIndexName;
323
+ var targetIndexName = ref.targetIndexName;
324
+ var mappings = ref.mappings;
325
+ var mappingsOverride = ref.mappingsOverride; if ( mappingsOverride === void 0 ) mappingsOverride = false;
326
+ var indexMappingTotalFieldsLimit = ref.indexMappingTotalFieldsLimit;
327
+ var query = ref.query;
328
+ var skipHeader = ref.skipHeader; if ( skipHeader === void 0 ) skipHeader = false;
329
+ var transform = ref.transform;
330
+ var verbose = ref.verbose; if ( verbose === void 0 ) verbose = true;
331
+
332
+ if (typeof targetIndexName === 'undefined') {
333
+ throw Error('targetIndexName must be specified.');
334
+ }
335
+
336
+ var defaultClientConfig = {
337
+ node: 'http://localhost:9200',
338
+ };
339
+
340
+ var sourceClient = new elasticsearch.Client(sourceClientConfig || defaultClientConfig);
341
+ var targetClient = new elasticsearch.Client(
342
+ targetClientConfig || sourceClientConfig || defaultClientConfig
343
+ );
344
+
345
+ var createMapping = createMappingFactory({
346
+ sourceClient: sourceClient,
347
+ sourceIndexName: sourceIndexName,
348
+ targetClient: targetClient,
349
+ targetIndexName: targetIndexName,
350
+ mappings: mappings,
351
+ mappingsOverride: mappingsOverride,
352
+ indexMappingTotalFieldsLimit: indexMappingTotalFieldsLimit,
353
+ verbose: verbose,
354
+ });
355
+ var indexer = indexQueueFactory({
356
+ targetClient: targetClient,
357
+ targetIndexName: targetIndexName,
358
+ bufferSize: bufferSize,
359
+ skipHeader: skipHeader,
360
+ verbose: verbose,
361
+ });
362
+
363
+ function getReader() {
364
+ if (typeof fileName !== 'undefined' && typeof sourceIndexName !== 'undefined') {
365
+ throw Error('Only one of fileName or sourceIndexName can be specified.');
366
+ }
367
+
368
+ if (typeof fileName === 'undefined' && typeof sourceIndexName === 'undefined') {
369
+ throw Error('Either fileName or sourceIndexName must be specified.');
370
+ }
371
+
372
+ if (typeof fileName !== 'undefined') {
373
+ return fileReaderFactory(indexer, fileName, transform, splitRegex, verbose);
374
+ }
375
+
376
+ if (typeof sourceIndexName !== 'undefined') {
377
+ return indexReaderFactory(
378
+ indexer,
379
+ sourceIndexName,
380
+ transform,
381
+ sourceClient,
382
+ query,
383
+ bufferSize
384
+ );
385
+ }
386
+
387
+ return null;
388
+ }
389
+
390
+ var reader = getReader();
391
+
392
+ try {
393
+ var indexExists = await targetClient.indices.exists({ index: targetIndexName });
394
+
395
+ if (indexExists === false) {
396
+ await createMapping();
397
+ reader();
398
+ } else if (deleteIndex === true) {
399
+ await targetClient.indices.delete({ index: targetIndexName });
400
+ await createMapping();
401
+ reader();
402
+ } else {
403
+ reader();
404
+ }
405
+ } catch (error) {
406
+ console.error('Error checking index existence:', error);
407
+ } finally {
408
+ // targetClient.close();
409
+ }
410
+
411
+ return { events: indexer.queueEmitter };
96
412
  }
97
413
 
98
- export { transformer };
414
+ export default transformer;
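Note on the bundled code above: both readers hand each transformed document to the indexer, which buffers a bulk action header plus the document and flushes once the buffer holds `bufferSize * 2` entries. The following is a minimal standalone sketch of that pattern, not the package's implementation. The names `createBulkBuffer`, `applyTransform` and `flush` are hypothetical, the `skipHeader` option and the Elasticsearch client are left out, and unlike the bundled `finish`, this sketch skips flushing an empty buffer.

```javascript
// Hypothetical helper, for illustration only; not part of node-es-transformer.
// Buffers bulk action/document pairs and hands full batches to `flush`.
function createBulkBuffer({ targetIndexName, bufferSize, flush }) {
  let buffer = [];

  return {
    add(doc) {
      // Every document is preceded by its bulk action header.
      buffer.push({ index: { _index: targetIndexName } });
      buffer.push(doc);

      // Two entries per document, hence the comparison against bufferSize * 2.
      if (buffer.length >= bufferSize * 2) {
        flush(buffer);
        buffer = [];
      }
    },
    finish() {
      // Assumption of this sketch: only flush when something is left over.
      if (buffer.length > 0) {
        flush(buffer);
        buffer = [];
      }
    },
  };
}

// Hypothetical helper mirroring the transform contract used by both readers:
// undefined skips the input, an array emits several docs, anything else one doc.
function applyTransform(transform, input, add) {
  const doc = typeof transform === 'function' ? transform(input) : input;
  if (typeof doc === 'undefined') {
    return 0;
  }
  const docs = Array.isArray(doc) ? doc : [doc];
  docs.forEach((d) => add(d));
  return docs.length;
}
```

In the bundle, the flushed batch is what gets passed to `client.bulk({ body: docs })`; here `flush` is any callback, which keeps the sketch runnable without a cluster.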
package/package.json CHANGED
@@ -1,28 +1,58 @@
1
1
  {
2
2
  "name": "node-es-transformer",
3
- "version": "1.0.0-alpha1",
3
+ "description": "A nodejs based library to (re)index and transform data from/to Elasticsearch.",
4
+ "keywords": [
5
+ "elasticsearch",
6
+ "data-transformation"
7
+ ],
8
+ "private": false,
9
+ "homepage": "https://github.com/walterra/node-es-transformer",
10
+ "repository": "https://github.com/walterra/node-es-transformer",
11
+ "bugs": {
12
+ "url": "https://github.com/walterra/node-es-transformer/issues"
13
+ },
14
+ "license": "Apache-2.0",
15
+ "author": "Walter Rafelsberger <walter@rafelsberger.at>",
16
+ "contributors": [],
17
+ "version": "1.0.0-alpha11",
4
18
  "main": "dist/node-es-transformer.cjs.js",
5
19
  "module": "dist/node-es-transformer.esm.js",
6
20
  "dependencies": {
7
- "elasticsearch": "^15.0.0",
8
- "event-stream": "^3.3.4"
21
+ "@elastic/elasticsearch": "^8.8.1",
22
+ "cli-progress": "^3.12.0",
23
+ "event-stream": "3.3.4",
24
+ "glob": "7.1.2"
9
25
  },
10
26
  "devDependencies": {
11
- "eslint": "^4.19.1",
12
- "eslint-config-airbnb": "^16.1.0",
13
- "eslint-plugin-import": "^2.12.0",
14
- "rollup": "^0.46.0",
15
- "rollup-plugin-buble": "^0.15.0",
16
- "rollup-plugin-commonjs": "^8.0.2",
17
- "rollup-plugin-node-resolve": "^3.0.0"
27
+ "acorn": "^6.4.2",
28
+ "commit-and-tag-version": "^11.3.0",
29
+ "cz-conventional-changelog": "^3.3.0",
30
+ "eslint": "8.2.0",
31
+ "eslint-config-airbnb": "19.0.4",
32
+ "eslint-config-prettier": "^9.0.0",
33
+ "eslint-plugin-import": "2.27.5",
34
+ "eslint-plugin-jsx-a11y": "6.7.1",
35
+ "eslint-plugin-prettier": "^3.3.1",
36
+ "eslint-plugin-react": "7.32.2",
37
+ "prettier": "^2.2.1",
38
+ "rollup": "0.66.6",
39
+ "rollup-plugin-buble": "0.19.6",
40
+ "rollup-plugin-commonjs": "8.0.2",
41
+ "rollup-plugin-node-resolve": "3.0.0"
18
42
  },
19
43
  "scripts": {
20
44
  "build": "rollup -c",
21
45
  "dev": "rollup -c -w",
22
46
  "test": "node test/test.js",
23
- "pretest": "npm run build"
47
+ "pretest": "npm run build",
48
+ "release": "commit-and-tag-version"
24
49
  },
25
50
  "files": [
26
51
  "dist"
27
- ]
52
+ ],
53
+ "config": {
54
+ "commitizen": {
55
+ "path": "./node_modules/cz-conventional-changelog"
56
+ }
57
+ }
28
58
  }