node-es-transformer 1.0.0-alpha9 → 1.0.0-beta2
- package/README.md +20 -4
- package/dist/node-es-transformer.cjs.js +241 -102
- package/dist/node-es-transformer.esm.js +241 -102
- package/package.json +27 -6
package/README.md
CHANGED
|
@@ -1,6 +1,8 @@
|
|
|
1
1
|
[](https://www.npmjs.com/package/node-es-transformer)
|
|
2
2
|
[](https://www.npmjs.com/package/node-es-transformer)
|
|
3
3
|
[](https://www.npmjs.com/package/node-es-transformer)
|
|
4
|
+
[](http://commitizen.github.io/cz-cli/)
|
|
5
|
+
[](https://github.com/walterra/node-es-transformer/actions)
|
|
4
6
|
|
|
5
7
|
# node-es-transformer
|
|
6
8
|
|
|
@@ -10,7 +12,7 @@ A nodejs based library to (re)index and transform data from/to Elasticsearch.
|
|
|
10
12
|
|
|
11
13
|
If you're looking for a nodejs based tool that lets you ingest large CSV/JSON files in the gigabyte range, you've come to the right place. Everything else I've tried with larger files runs out of JS heap, hammers ES with too many single requests, times out, or tries to do everything in a single bulk request.
|
|
12
14
|
|
|
13
|
-
While I'd generally recommend using [Logstash](https://www.elastic.co/products/logstash), [filebeat](https://www.elastic.co/products/beats/filebeat)
|
|
15
|
+
While I'd generally recommend using [Logstash](https://www.elastic.co/products/logstash), [filebeat](https://www.elastic.co/products/beats/filebeat), [Ingest Nodes](https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest.html), [Elastic Agent](https://www.elastic.co/guide/en/fleet/current/fleet-overview.html) or [Elasticsearch Transforms](https://www.elastic.co/guide/en/elasticsearch/reference/current/transforms.html) for established use cases, this tool may be of help especially if you feel more at home in the JavaScript/nodejs universe and have use cases with customized ingestion and data transformation needs.
|
|
14
16
|
|
|
15
17
|
**This is experimental code, use at your own risk. Nonetheless, I encourage you to give it a try so I can gather some feedback.**
|
|
16
18
|
|
|
@@ -26,7 +28,7 @@ Now that we've talked about the caveats, let's have a look what you actually get
|
|
|
26
28
|
|
|
27
29
|
## Features
|
|
28
30
|
|
|
29
|
-
- Buffering/Streaming for both reading and indexing. Files are read using streaming and Elasticsearch ingestion is done using buffered bulk indexing. This is tailored towards ingestion of large files. Successfully tested so far with JSON and CSV files in the range of 20-30 GBytes. On a single machine running both `node-es-transformer` and Elasticsearch ingestion rates up to 20k documents/second were achieved (2,9 GHz Intel Core i7, 16GByte RAM, SSD).
|
|
31
|
+
- Buffering/Streaming for both reading and indexing. Files are read using streaming and Elasticsearch ingestion is done using buffered bulk indexing. This is tailored towards ingestion of large files. Successfully tested so far with JSON and CSV files in the range of 20-30 GBytes. On a single machine running both `node-es-transformer` and Elasticsearch ingestion rates up to 20k documents/second were achieved (2,9 GHz Intel Core i7, 16GByte RAM, SSD), depending on document size.
|
|
30
32
|
- Supports wildcards to ingest/transform a range of files in one go.
|
|
31
33
|
- Supports fetching documents from existing indices using search/scroll. This allows you to reindex with custom data transformations just using JavaScript in the `transform` callback.
|
|
32
34
|
- The `transform` callback gives you each source document, but you can split it up into multiple documents and return them as an array. An example use case: each source document is a Tweet and you want to transform it into an entity-centric index based on hashtags.
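The Tweet-to-hashtag use case from the last feature can be sketched as a plain `transform` callback. This is only an illustration: the tweet shape (`id`, `text`, `hashtags`) and the function name `tweetToHashtagDocs` are assumptions, not something the README defines.

```javascript
// Hypothetical tweet shape: { id, text, hashtags: [...] }. These field names
// are assumptions for illustration, not part of node-es-transformer.
function tweetToHashtagDocs(tweet) {
  // Returning undefined means there is nothing to index for this tweet.
  if (!Array.isArray(tweet.hashtags) || tweet.hashtags.length === 0) {
    return undefined;
  }
  // One output document per hashtag, each pointing back to its tweet.
  return tweet.hashtags.map(function (hashtag) {
    return { hashtag: hashtag.toLowerCase(), tweetId: tweet.id, text: tweet.text };
  });
}

console.log(tweetToHashtagDocs({ id: 1, text: 'hello', hashtags: ['ES', 'Node'] }));
```

Passing a function like this as `transform` would, per the feature description, turn each source document into several target documents.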
|
|
@@ -113,7 +115,11 @@ transformer({
|
|
|
113
115
|
- `splitRegex`: Custom line split regex, defaults to `/\n/`.
|
|
114
116
|
- `sourceIndexName`: The source Elasticsearch index to reindex from. If this is set, `fileName` is not allowed.
|
|
115
117
|
- `targetIndexName`: The target Elasticsearch index where documents will be indexed.
|
|
116
|
-
- `mappings`: Elasticsearch document
|
|
118
|
+
- `mappings`: Optional Elasticsearch document mappings. If not set and you're reindexing from another index, the mappings from the existing index will be used.
|
|
119
|
+
- `mappingsOverride`: If you're reindexing and this is set to `true`, `mappings` will be applied on top of the source index's mappings. Defaults to `false`.
|
|
120
|
+
- `indexMappingTotalFieldsLimit`: Optional field limit for the target index to be created that will be passed on as the `index.mapping.total_fields.limit` setting.
|
|
121
|
+
- `populatedFields`: If `true`, fetches a set of random documents to identify which fields are actually used by documents. Can be useful for indices with lots of field mappings to increase query/reindex performance. Defaults to `false`.
|
|
122
|
+
- `query`: Optional Elasticsearch [DSL query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html) to filter documents from the source index.
|
|
117
123
|
- `skipHeader`: If true, skips the first line of the source file. Defaults to `false`.
|
|
118
124
|
- `transform(line)`: A callback function which allows the transformation of a source line into one or several documents.
|
|
119
125
|
- `verbose`: Logging verbosity, defaults to `true`
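As a rough sketch of how the options added in this release might be combined for a reindex, the object below uses the option names from the list above. Index names, field names, and values are made up, and the commented-out call assumes the package's default export is the `transformer` function shown in the README. Note that the bundled code later in this diff merges `mappings` into the source mapping's `properties` when `mappingsOverride` is set, so the example passes field definitions directly.

```javascript
// Option names come from the README list; everything else is illustrative.
const options = {
  sourceIndexName: 'my-source-index',
  targetIndexName: 'my-target-index',
  // Only copy documents matching this query DSL clause.
  query: { term: { status: 'published' } },
  // Apply `mappings` on top of the mappings read from the source index.
  mappingsOverride: true,
  mappings: { title: { type: 'keyword' } },
  // Passed on as `index.mapping.total_fields.limit` for the new index.
  indexMappingTotalFieldsLimit: 2000,
  // Sample random documents first to find out which fields hold data.
  populatedFields: true,
  // Runs once per source document; here it adds a derived field.
  transform: function (doc) {
    return Object.assign({}, doc, { titleLength: String(doc.title || '').length });
  },
};

// With a reachable cluster, this object would be handed to the library, e.g.:
// const transformer = require('node-es-transformer');
// transformer(options);
console.log(options.transform({ title: 'abc' }));
```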
|
|
@@ -137,7 +143,17 @@ yarn
|
|
|
137
143
|
|
|
138
144
|
`yarn dev` builds the library, then keeps rebuilding it whenever the source files change using [rollup-watch](https://github.com/rollup/rollup-watch).
|
|
139
145
|
|
|
140
|
-
`yarn test`
|
|
146
|
+
`yarn test` runs the tests. The tests expect that you have an Elasticsearch instance running without security at `http://localhost:9200`. Using docker, you can set this up with:
|
|
147
|
+
|
|
148
|
+
```bash
|
|
149
|
+
# Download the docker image
|
|
150
|
+
docker pull docker.elastic.co/elasticsearch/elasticsearch:8.10.4
|
|
151
|
+
|
|
152
|
+
# Run the container
|
|
153
|
+
docker run --name es01 --net elastic -p 9200:9200 -it -m 1GB -e "discovery.type=single-node" -e "xpack.security.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:8.10.4
|
|
154
|
+
```
|
|
155
|
+
|
|
156
|
+
To commit, use `cz`. To prepare a release, use e.g. `yarn release -- --release-as 1.0.0-beta2`.
|
|
141
157
|
|
|
142
158
|
## License
|
|
143
159
|
|
package/dist/node-es-transformer.cjs.js
CHANGED
|
@@ -8,20 +8,26 @@ var glob = _interopDefault(require('glob'));
|
|
|
8
8
|
var cliProgress = _interopDefault(require('cli-progress'));
|
|
9
9
|
var elasticsearch = _interopDefault(require('@elastic/elasticsearch'));
|
|
10
10
|
|
|
11
|
+
var DEFAULT_BUFFER_SIZE = 1000;
|
|
12
|
+
|
|
11
13
|
function createMappingFactory(ref) {
|
|
12
14
|
var sourceClient = ref.sourceClient;
|
|
13
15
|
var sourceIndexName = ref.sourceIndexName;
|
|
14
16
|
var targetClient = ref.targetClient;
|
|
15
17
|
var targetIndexName = ref.targetIndexName;
|
|
16
18
|
var mappings = ref.mappings;
|
|
19
|
+
var mappingsOverride = ref.mappingsOverride;
|
|
20
|
+
var indexMappingTotalFieldsLimit = ref.indexMappingTotalFieldsLimit;
|
|
17
21
|
var verbose = ref.verbose;
|
|
18
22
|
|
|
19
23
|
return async function () {
|
|
20
|
-
var targetMappings = mappings;
|
|
24
|
+
var targetMappings = mappingsOverride ? undefined : mappings;
|
|
21
25
|
|
|
22
26
|
if (sourceClient && sourceIndexName && typeof targetMappings === 'undefined') {
|
|
23
27
|
try {
|
|
24
|
-
var mapping = await sourceClient.indices.getMapping({
|
|
28
|
+
var mapping = await sourceClient.indices.getMapping({
|
|
29
|
+
index: sourceIndexName,
|
|
30
|
+
});
|
|
25
31
|
targetMappings = mapping[sourceIndexName].mappings;
|
|
26
32
|
} catch (err) {
|
|
27
33
|
console.log('Error reading source mapping', err);
|
|
@@ -30,13 +36,24 @@ function createMappingFactory(ref) {
|
|
|
30
36
|
}
|
|
31
37
|
|
|
32
38
|
if (typeof targetMappings === 'object' && targetMappings !== null) {
|
|
39
|
+
if (mappingsOverride) {
|
|
40
|
+
targetMappings = Object.assign({}, targetMappings,
|
|
41
|
+
{properties: Object.assign({}, targetMappings.properties,
|
|
42
|
+
mappings)});
|
|
43
|
+
}
|
|
44
|
+
|
|
33
45
|
try {
|
|
34
|
-
var resp = await targetClient.indices.create(
|
|
35
|
-
|
|
36
|
-
|
|
37
|
-
|
|
38
|
-
|
|
39
|
-
|
|
46
|
+
var resp = await targetClient.indices.create({
|
|
47
|
+
index: targetIndexName,
|
|
48
|
+
body: Object.assign({}, {mappings: targetMappings},
|
|
49
|
+
(indexMappingTotalFieldsLimit !== undefined
|
|
50
|
+
? {
|
|
51
|
+
settings: {
|
|
52
|
+
'index.mapping.total_fields.limit': indexMappingTotalFieldsLimit,
|
|
53
|
+
},
|
|
54
|
+
}
|
|
55
|
+
: {})),
|
|
56
|
+
});
|
|
40
57
|
if (verbose) { console.log('Created target mapping', resp); }
|
|
41
58
|
} catch (err) {
|
|
42
59
|
console.log('Error creating target mapping', err);
|
|
@@ -45,45 +62,78 @@ function createMappingFactory(ref) {
|
|
|
45
62
|
};
|
|
46
63
|
}
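The mapping logic added to `createMappingFactory` can be read as two small pure steps: pick the target mappings, then build the body for `indices.create`. The sketch below isolates them under simplifying assumptions: `sourceMappings` stands in for the result of `indices.getMapping()`, and the helper names `resolveTargetMappings` and `buildCreateIndexBody` are hypothetical.

```javascript
// Decide which mappings the target index gets. `sourceMappings` stands in for
// what the bundle reads from the source index.
function resolveTargetMappings(sourceMappings, mappings, mappingsOverride) {
  // Without override, explicitly passed mappings win.
  var target = mappingsOverride ? undefined : mappings;
  if (typeof target === 'undefined') {
    target = sourceMappings;
  }
  // With override, `mappings` is merged into the source `properties`.
  if (mappingsOverride && typeof target === 'object' && target !== null) {
    target = Object.assign({}, target, {
      properties: Object.assign({}, target.properties, mappings),
    });
  }
  return target;
}

// Only add the settings block when a field limit was actually provided.
function buildCreateIndexBody(targetMappings, totalFieldsLimit) {
  return Object.assign(
    {},
    { mappings: targetMappings },
    totalFieldsLimit !== undefined
      ? { settings: { 'index.mapping.total_fields.limit': totalFieldsLimit } }
      : {}
  );
}

var merged = resolveTargetMappings(
  { properties: { title: { type: 'text' } } },
  { views: { type: 'long' } },
  true
);
console.log(JSON.stringify(buildCreateIndexBody(merged, 2000)));
```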
|
|
47
64
|
|
|
65
|
+
var MAX_QUEUE_SIZE = 15;
|
|
66
|
+
|
|
48
67
|
function fileReaderFactory(indexer, fileName, transform, splitRegex, verbose) {
|
|
49
68
|
function startIndex(files) {
|
|
69
|
+
var ingestQueueSize = 0;
|
|
70
|
+
var finished = false;
|
|
71
|
+
|
|
50
72
|
var file = files.shift();
|
|
51
|
-
var s = fs
|
|
73
|
+
var s = fs
|
|
74
|
+
.createReadStream(file)
|
|
52
75
|
.pipe(es.split(splitRegex))
|
|
53
|
-
.pipe(
|
|
54
|
-
|
|
55
|
-
|
|
56
|
-
|
|
57
|
-
|
|
58
|
-
|
|
59
|
-
|
|
60
|
-
|
|
61
|
-
|
|
76
|
+
.pipe(
|
|
77
|
+
es
|
|
78
|
+
.mapSync(function (line) {
|
|
79
|
+
try {
|
|
80
|
+
// skip empty lines
|
|
81
|
+
if (line === '') {
|
|
82
|
+
return;
|
|
83
|
+
}
|
|
84
|
+
|
|
85
|
+
var doc =
|
|
86
|
+
typeof transform === 'function'
|
|
87
|
+
? JSON.stringify(transform(JSON.parse(line)))
|
|
88
|
+
: line;
|
|
89
|
+
|
|
90
|
+
// if doc is undefined we'll skip indexing it
|
|
91
|
+
if (typeof doc === 'undefined') {
|
|
92
|
+
s.resume();
|
|
93
|
+
return;
|
|
94
|
+
}
|
|
95
|
+
|
|
96
|
+
// the transform callback may return an array of docs so we can emit
|
|
97
|
+
// multiple docs from a single line
|
|
98
|
+
if (Array.isArray(doc)) {
|
|
99
|
+
doc.forEach(function (d) { return indexer.add(d); });
|
|
100
|
+
return;
|
|
101
|
+
}
|
|
102
|
+
|
|
103
|
+
indexer.add(doc);
|
|
104
|
+
} catch (e) {
|
|
105
|
+
console.log('error', e);
|
|
106
|
+
}
|
|
107
|
+
})
|
|
108
|
+
.on('error', function (err) {
|
|
109
|
+
console.log('Error while reading file.', err);
|
|
110
|
+
})
|
|
111
|
+
.on('end', function () {
|
|
112
|
+
if (verbose) { console.log('Read entire file: ', file); }
|
|
113
|
+
if (files.length > 0) {
|
|
114
|
+
startIndex(files);
|
|
115
|
+
return;
|
|
116
|
+
}
|
|
117
|
+
|
|
118
|
+
indexer.finish();
|
|
119
|
+
finished = true;
|
|
120
|
+
})
|
|
121
|
+
);
|
|
62
122
|
|
|
63
|
-
|
|
64
|
-
|
|
65
|
-
|
|
66
|
-
doc.forEach(function (d) { return indexer.add(d); });
|
|
67
|
-
return;
|
|
68
|
-
}
|
|
123
|
+
indexer.queueEmitter.on('queue-size', async function (size) {
|
|
124
|
+
if (finished) { return; }
|
|
125
|
+
ingestQueueSize = size;
|
|
69
126
|
|
|
70
|
-
|
|
71
|
-
|
|
72
|
-
|
|
73
|
-
|
|
74
|
-
}
|
|
75
|
-
|
|
76
|
-
console.log('Error while reading file.', err);
|
|
77
|
-
})
|
|
78
|
-
.on('end', function () {
|
|
79
|
-
if (verbose) { console.log('Read entire file: ', file); }
|
|
80
|
-
indexer.finish();
|
|
81
|
-
if (files.length > 0) {
|
|
82
|
-
startIndex(files);
|
|
83
|
-
}
|
|
84
|
-
}));
|
|
127
|
+
if (ingestQueueSize < MAX_QUEUE_SIZE) {
|
|
128
|
+
s.resume();
|
|
129
|
+
} else {
|
|
130
|
+
s.pause();
|
|
131
|
+
}
|
|
132
|
+
});
|
|
85
133
|
|
|
86
134
|
indexer.queueEmitter.on('resume', function () {
|
|
135
|
+
if (finished) { return; }
|
|
136
|
+
ingestQueueSize = 0;
|
|
87
137
|
s.resume();
|
|
88
138
|
});
|
|
89
139
|
}
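The new `queue-size` handler in `fileReaderFactory` is a simple backpressure scheme: reading pauses once the ingest queue reaches `MAX_QUEUE_SIZE` and resumes when it drains. A minimal, self-contained sketch of that idea follows. It leaves out the `finished` guard from the bundle, and `applyBackpressure` and `fakeSource` are hypothetical names; the fake source replaces a real read stream so the example runs without any files.

```javascript
var EventEmitter = require('events');

var MAX_QUEUE_SIZE = 15; // same threshold as in the diff

// Wires a pausable source to queue events. `source` only needs pause() and
// resume(), so a real read stream or a test double both work.
function applyBackpressure(queueEmitter, source) {
  queueEmitter.on('queue-size', function (size) {
    if (size < MAX_QUEUE_SIZE) {
      source.resume();
    } else {
      source.pause();
    }
  });
  queueEmitter.on('resume', function () {
    source.resume();
  });
}

// Test double that records its state instead of reading a file.
var fakeSource = {
  paused: false,
  pause: function () { this.paused = true; },
  resume: function () { this.paused = false; },
};

var emitter = new EventEmitter();
applyBackpressure(emitter, fakeSource);

emitter.emit('queue-size', 15);
console.log(fakeSource.paused); // true: the queue is full, reading stops
emitter.emit('queue-size', 3);
console.log(fakeSource.paused); // false: the queue drained, reading continues
```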
|
|
@@ -99,110 +149,202 @@ var EventEmitter = require('events');
|
|
|
99
149
|
|
|
100
150
|
var queueEmitter = new EventEmitter();
|
|
101
151
|
|
|
152
|
+
var parallelCalls = 1;
|
|
153
|
+
|
|
102
154
|
// a simple helper queue to bulk index documents
|
|
103
155
|
function indexQueueFactory(ref) {
|
|
104
156
|
var client = ref.targetClient;
|
|
105
157
|
var targetIndexName = ref.targetIndexName;
|
|
106
|
-
var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize =
|
|
158
|
+
var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
|
|
107
159
|
var skipHeader = ref.skipHeader; if ( skipHeader === void 0 ) skipHeader = false;
|
|
108
160
|
var verbose = ref.verbose; if ( verbose === void 0 ) verbose = true;
|
|
109
161
|
|
|
110
162
|
var buffer = [];
|
|
111
163
|
var queue = [];
|
|
112
|
-
var ingesting =
|
|
164
|
+
var ingesting = 0;
|
|
165
|
+
var ingestTimes = [];
|
|
166
|
+
var finished = false;
|
|
113
167
|
|
|
114
|
-
var ingest =
|
|
168
|
+
var ingest = function (b) {
|
|
115
169
|
if (typeof b !== 'undefined') {
|
|
116
170
|
queue.push(b);
|
|
117
171
|
queueEmitter.emit('queue-size', queue.length);
|
|
118
172
|
}
|
|
119
173
|
|
|
120
|
-
if (
|
|
174
|
+
if (ingestTimes.length > 5) { ingestTimes = ingestTimes.slice(-5); }
|
|
175
|
+
|
|
176
|
+
if (ingesting < parallelCalls) {
|
|
121
177
|
var docs = queue.shift();
|
|
122
|
-
queueEmitter.emit('queue-size', queue.length);
|
|
123
|
-
ingesting = true;
|
|
124
|
-
if (verbose) { console.log(("bulk ingest docs: " + (docs.length / 2) + ", queue length: " + (queue.length))); }
|
|
125
178
|
|
|
126
|
-
|
|
127
|
-
|
|
128
|
-
|
|
129
|
-
if (queue.length > 0) {
|
|
130
|
-
ingest();
|
|
131
|
-
}
|
|
132
|
-
} catch (err) {
|
|
133
|
-
console.log('bulk index error', err);
|
|
179
|
+
queueEmitter.emit('queue-size', queue.length);
|
|
180
|
+
if (queue.length <= 5) {
|
|
181
|
+
queueEmitter.emit('resume');
|
|
134
182
|
}
|
|
135
|
-
}
|
|
136
183
|
|
|
137
|
-
|
|
138
|
-
|
|
139
|
-
|
|
140
|
-
|
|
184
|
+
ingesting += 1;
|
|
185
|
+
|
|
186
|
+
if (verbose)
|
|
187
|
+
{ console.log(("bulk ingest docs: " + (docs.length / 2) + ", queue length: " + (queue.length))); }
|
|
188
|
+
|
|
189
|
+
var start = Date.now();
|
|
190
|
+
client
|
|
191
|
+
.bulk({ body: docs })
|
|
192
|
+
.then(function () {
|
|
193
|
+
var end = Date.now();
|
|
194
|
+
var delta = end - start;
|
|
195
|
+
ingestTimes.push(delta);
|
|
196
|
+
ingesting -= 1;
|
|
197
|
+
|
|
198
|
+
var ingestTimesMovingAverage =
|
|
199
|
+
ingestTimes.length > 0
|
|
200
|
+
? ingestTimes.reduce(function (p, c) { return p + c; }, 0) / ingestTimes.length
|
|
201
|
+
: 0;
|
|
202
|
+
var ingestTimesMovingAverageSeconds = Math.floor(ingestTimesMovingAverage / 1000);
|
|
203
|
+
|
|
204
|
+
if (
|
|
205
|
+
ingestTimes.length > 0 &&
|
|
206
|
+
ingestTimesMovingAverageSeconds < 30 &&
|
|
207
|
+
parallelCalls < 10
|
|
208
|
+
) {
|
|
209
|
+
parallelCalls += 1;
|
|
210
|
+
} else if (
|
|
211
|
+
ingestTimes.length > 0 &&
|
|
212
|
+
ingestTimesMovingAverageSeconds >= 30 &&
|
|
213
|
+
parallelCalls > 1
|
|
214
|
+
) {
|
|
215
|
+
parallelCalls -= 1;
|
|
216
|
+
}
|
|
217
|
+
|
|
218
|
+
if (queue.length > 0) {
|
|
219
|
+
ingest();
|
|
220
|
+
} else if (queue.length === 0 && finished) {
|
|
221
|
+
queueEmitter.emit('finish');
|
|
222
|
+
}
|
|
223
|
+
})
|
|
224
|
+
.catch(function (error) {
|
|
225
|
+
console.error(error);
|
|
226
|
+
ingesting -= 1;
|
|
227
|
+
parallelCalls = 1;
|
|
228
|
+
if (queue.length > 0) {
|
|
229
|
+
ingest();
|
|
230
|
+
}
|
|
231
|
+
});
|
|
141
232
|
}
|
|
142
233
|
};
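The `.then` handler above adjusts `parallelCalls` from a moving average of recent bulk request durations: below 30 seconds it ramps up to at most 10 parallel calls, otherwise it backs off towards 1. Pulled out as a pure function, the rule might look like the sketch below. The name `nextParallelCalls` is hypothetical, and the caller is assumed to keep the window of recent timings, as the bundle does with `ingestTimes`.

```javascript
// Returns the new number of allowed parallel bulk calls.
// `ingestTimes` holds recent request durations in milliseconds.
function nextParallelCalls(ingestTimes, parallelCalls) {
  if (ingestTimes.length === 0) {
    return parallelCalls;
  }
  var averageMs =
    ingestTimes.reduce(function (sum, t) { return sum + t; }, 0) / ingestTimes.length;
  var averageSeconds = Math.floor(averageMs / 1000);

  if (averageSeconds < 30 && parallelCalls < 10) {
    return parallelCalls + 1; // requests are fast, allow one more in flight
  }
  if (averageSeconds >= 30 && parallelCalls > 1) {
    return parallelCalls - 1; // requests are slow, back off by one
  }
  return parallelCalls; // already at a bound
}

console.log(nextParallelCalls([1000, 2000], 1)); // 2
console.log(nextParallelCalls([40000, 35000], 4)); // 3
```

In the bundle a failed bulk request resets `parallelCalls` to 1 outright, which this sketch does not model.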
|
|
143
234
|
|
|
144
235
|
return {
|
|
145
236
|
add: function (doc) {
|
|
237
|
+
if (finished) {
|
|
238
|
+
throw new Error('Unexpected doc added after indexer should finish.');
|
|
239
|
+
}
|
|
240
|
+
|
|
146
241
|
if (!skipHeader) {
|
|
147
242
|
var header = { index: { _index: targetIndexName } };
|
|
148
243
|
buffer.push(header);
|
|
149
244
|
}
|
|
150
245
|
buffer.push(doc);
|
|
151
246
|
|
|
152
|
-
// console.log(`add: queue.length ${queue.length}`);
|
|
153
247
|
if (queue.length === 0) {
|
|
154
248
|
queueEmitter.emit('resume');
|
|
155
249
|
}
|
|
156
250
|
|
|
157
|
-
if (buffer.length >=
|
|
251
|
+
if (buffer.length >= bufferSize * 2) {
|
|
158
252
|
ingest(buffer);
|
|
159
253
|
buffer = [];
|
|
160
254
|
}
|
|
161
255
|
},
|
|
162
|
-
finish:
|
|
163
|
-
|
|
164
|
-
|
|
165
|
-
|
|
256
|
+
finish: function () {
|
|
257
|
+
finished = true;
|
|
258
|
+
|
|
259
|
+
if (buffer.length > 0) {
|
|
260
|
+
ingest(buffer);
|
|
261
|
+
buffer = [];
|
|
262
|
+
} else if (queue.length === 0 && ingesting === 0) {
|
|
263
|
+
queueEmitter.emit('finish');
|
|
264
|
+
}
|
|
166
265
|
},
|
|
167
266
|
queueEmitter: queueEmitter,
|
|
168
267
|
};
|
|
169
268
|
}
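The `bufferSize * 2` comparison in `add` makes sense once you recall that every document occupies two entries in a bulk body: an action header and the document itself. The sketch below shows just that buffering step, with the Elasticsearch call replaced by a `flush` callback. `createBulkBuffer` is a hypothetical name and the sketch ignores `skipHeader`, queueing, and the `finished` guard.

```javascript
// Collects header/document pairs and hands full batches to `flush`.
function createBulkBuffer(indexName, bufferSize, flush) {
  var buffer = [];
  return {
    add: function (doc) {
      // Each document contributes two entries to the bulk body.
      buffer.push({ index: { _index: indexName } });
      buffer.push(doc);
      if (buffer.length >= bufferSize * 2) {
        flush(buffer);
        buffer = [];
      }
    },
    finish: function () {
      // Flush whatever is left once the reader is done.
      if (buffer.length > 0) {
        flush(buffer);
        buffer = [];
      }
    },
  };
}

var batches = [];
var bulkBuffer = createBulkBuffer('my-index', 2, function (batch) { batches.push(batch); });
bulkBuffer.add({ n: 1 });
bulkBuffer.add({ n: 2 });
bulkBuffer.add({ n: 3 });
bulkBuffer.finish();
console.log(batches.length); // 2: one full batch of two docs, one remainder
```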
|
|
170
269
|
|
|
171
|
-
var MAX_QUEUE_SIZE =
|
|
270
|
+
var MAX_QUEUE_SIZE$1 = 15;
|
|
172
271
|
|
|
173
272
|
// create a new progress bar instance and use shades_classic theme
|
|
174
273
|
var progressBar = new cliProgress.SingleBar({}, cliProgress.Presets.shades_classic);
|
|
175
274
|
|
|
176
|
-
function indexReaderFactory(
|
|
275
|
+
function indexReaderFactory(
|
|
276
|
+
indexer,
|
|
277
|
+
sourceIndexName,
|
|
278
|
+
transform,
|
|
279
|
+
client,
|
|
280
|
+
query,
|
|
281
|
+
bufferSize,
|
|
282
|
+
populatedFields
|
|
283
|
+
) {
|
|
284
|
+
if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
|
|
285
|
+
if ( populatedFields === void 0 ) populatedFields = false;
|
|
286
|
+
|
|
177
287
|
return async function indexReader() {
|
|
178
288
|
var responseQueue = [];
|
|
179
289
|
var docsNum = 0;
|
|
180
290
|
|
|
181
|
-
function
|
|
182
|
-
|
|
183
|
-
|
|
184
|
-
|
|
185
|
-
|
|
186
|
-
|
|
291
|
+
async function fetchPopulatedFields() {
|
|
292
|
+
try {
|
|
293
|
+
var response = await client.search({
|
|
294
|
+
index: sourceIndexName,
|
|
295
|
+
size: bufferSize,
|
|
296
|
+
query: {
|
|
297
|
+
function_score: {
|
|
298
|
+
query: query,
|
|
299
|
+
random_score: {},
|
|
300
|
+
},
|
|
301
|
+
},
|
|
302
|
+
});
|
|
303
|
+
|
|
304
|
+
// Get all field names for each returned doc and flatten it
|
|
305
|
+
// to a list of unique field names used across all docs.
|
|
306
|
+
return new Set(response.hits.hits.map(function (d) { return Object.keys(d._source); }).flat(1));
|
|
307
|
+
} catch (e) {
|
|
308
|
+
console.log('error', e);
|
|
309
|
+
}
|
|
310
|
+
}
|
|
311
|
+
|
|
312
|
+
function search(fields) {
|
|
313
|
+
return client.search(Object.assign({}, {index: sourceIndexName,
|
|
314
|
+
scroll: '600s',
|
|
315
|
+
size: bufferSize,
|
|
316
|
+
query: query},
|
|
317
|
+
(fields ? { _source: fields } : {})));
|
|
187
318
|
}
|
|
188
319
|
|
|
189
320
|
function scroll(id) {
|
|
190
321
|
return client.scroll({
|
|
191
322
|
scroll_id: id,
|
|
192
|
-
scroll: '
|
|
323
|
+
scroll: '600s',
|
|
193
324
|
});
|
|
194
325
|
}
|
|
195
326
|
|
|
327
|
+
var fieldsWithData;
|
|
328
|
+
|
|
329
|
+
// identify populated fields
|
|
330
|
+
if (populatedFields) {
|
|
331
|
+
fieldsWithData = await fetchPopulatedFields();
|
|
332
|
+
console.log('fieldsWithData', fieldsWithData);
|
|
333
|
+
}
|
|
334
|
+
|
|
196
335
|
// start things off by searching, setting a scroll timeout, and pushing
|
|
197
336
|
// our first response into the queue to be processed
|
|
198
|
-
var se = await search();
|
|
337
|
+
var se = await search(fieldsWithData);
|
|
199
338
|
responseQueue.push(se);
|
|
200
339
|
progressBar.start(se.hits.total.value, 0);
|
|
340
|
+
console.log('se', se.hits.hits[0]);
|
|
201
341
|
|
|
202
342
|
function processHit(hit) {
|
|
203
343
|
docsNum += 1;
|
|
204
344
|
try {
|
|
205
|
-
var doc =
|
|
345
|
+
var doc = typeof transform === 'function' ? transform(hit._source) : hit._source; // eslint-disable-line no-underscore-dangle
|
|
346
|
+
// console.log('doc', doc);
|
|
347
|
+
|
|
206
348
|
// if doc is undefined we'll skip indexing it
|
|
207
349
|
if (typeof doc === 'undefined') {
|
|
208
350
|
return;
|
|
@@ -236,15 +378,13 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
|
|
|
236
378
|
progressBar.update(docsNum);
|
|
237
379
|
|
|
238
380
|
// check to see if we have collected all of the docs
|
|
239
|
-
// console.log('check count', response.hits.total.value, docsNum);
|
|
240
381
|
if (response.hits.total.value === docsNum) {
|
|
241
382
|
indexer.finish();
|
|
242
|
-
progressBar.stop();
|
|
243
383
|
break;
|
|
244
384
|
}
|
|
245
385
|
|
|
246
|
-
if (ingestQueueSize < MAX_QUEUE_SIZE) {
|
|
247
|
-
|
|
386
|
+
if (ingestQueueSize < MAX_QUEUE_SIZE$1) {
|
|
387
|
+
// get the next response if there are more docs to fetch
|
|
248
388
|
var sc = await scroll(response._scroll_id); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
|
|
249
389
|
scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
|
|
250
390
|
responseQueue.push(sc);
|
|
@@ -257,8 +397,8 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
|
|
|
257
397
|
indexer.queueEmitter.on('queue-size', async function (size) {
|
|
258
398
|
ingestQueueSize = size;
|
|
259
399
|
|
|
260
|
-
if (!readActive && ingestQueueSize < MAX_QUEUE_SIZE) {
|
|
261
|
-
|
|
400
|
+
if (!readActive && ingestQueueSize < MAX_QUEUE_SIZE$1) {
|
|
401
|
+
// get the next response if there are more docs to fetch
|
|
262
402
|
var sc = await scroll(scrollId); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
|
|
263
403
|
scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
|
|
264
404
|
responseQueue.push(sc);
|
|
@@ -280,6 +420,10 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
|
|
|
280
420
|
processResponseQueue();
|
|
281
421
|
});
|
|
282
422
|
|
|
423
|
+
indexer.queueEmitter.on('finish', function () {
|
|
424
|
+
progressBar.stop();
|
|
425
|
+
});
|
|
426
|
+
|
|
283
427
|
processResponseQueue();
|
|
284
428
|
};
|
|
285
429
|
}
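The core of `fetchPopulatedFields` is the step that reduces a sample of search hits to a set of field names. Taken on its own, and fed with hand-written hits instead of a real search response, it might be sketched like this. `collectFieldNames` is a hypothetical name; like the bundled code, it only looks at top-level keys of `_source`.

```javascript
// Collect the unique top-level field names used across a list of hits.
function collectFieldNames(hits) {
  return new Set(
    hits
      .map(function (hit) { return Object.keys(hit._source); })
      .flat(1)
  );
}

var sampleHits = [
  { _source: { title: 'a', views: 1 } },
  { _source: { views: 2, author: 'b' } },
];
console.log(Array.from(collectFieldNames(sampleHits))); // [ 'title', 'views', 'author' ]
```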
|
|
@@ -288,12 +432,16 @@ async function transformer(ref) {
|
|
|
288
432
|
var deleteIndex = ref.deleteIndex; if ( deleteIndex === void 0 ) deleteIndex = false;
|
|
289
433
|
var sourceClientConfig = ref.sourceClientConfig;
|
|
290
434
|
var targetClientConfig = ref.targetClientConfig;
|
|
291
|
-
var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize =
|
|
435
|
+
var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
|
|
292
436
|
var fileName = ref.fileName;
|
|
293
437
|
var splitRegex = ref.splitRegex; if ( splitRegex === void 0 ) splitRegex = /\n/;
|
|
294
438
|
var sourceIndexName = ref.sourceIndexName;
|
|
295
439
|
var targetIndexName = ref.targetIndexName;
|
|
296
440
|
var mappings = ref.mappings;
|
|
441
|
+
var mappingsOverride = ref.mappingsOverride; if ( mappingsOverride === void 0 ) mappingsOverride = false;
|
|
442
|
+
var indexMappingTotalFieldsLimit = ref.indexMappingTotalFieldsLimit;
|
|
443
|
+
var populatedFields = ref.populatedFields; if ( populatedFields === void 0 ) populatedFields = false;
|
|
444
|
+
var query = ref.query;
|
|
297
445
|
var skipHeader = ref.skipHeader; if ( skipHeader === void 0 ) skipHeader = false;
|
|
298
446
|
var transform = ref.transform;
|
|
299
447
|
var verbose = ref.verbose; if ( verbose === void 0 ) verbose = true;
|
|
@@ -317,6 +465,8 @@ async function transformer(ref) {
|
|
|
317
465
|
targetClient: targetClient,
|
|
318
466
|
targetIndexName: targetIndexName,
|
|
319
467
|
mappings: mappings,
|
|
468
|
+
mappingsOverride: mappingsOverride,
|
|
469
|
+
indexMappingTotalFieldsLimit: indexMappingTotalFieldsLimit,
|
|
320
470
|
verbose: verbose,
|
|
321
471
|
});
|
|
322
472
|
var indexer = indexQueueFactory({
|
|
@@ -328,30 +478,16 @@ async function transformer(ref) {
|
|
|
328
478
|
});
|
|
329
479
|
|
|
330
480
|
function getReader() {
|
|
331
|
-
if (
|
|
332
|
-
|
|
333
|
-
&& typeof sourceIndexName !== 'undefined'
|
|
334
|
-
) {
|
|
335
|
-
throw Error(
|
|
336
|
-
'Only either one of fileName or sourceIndexName can be specified.'
|
|
337
|
-
);
|
|
481
|
+
if (typeof fileName !== 'undefined' && typeof sourceIndexName !== 'undefined') {
|
|
482
|
+
throw Error('Only either one of fileName or sourceIndexName can be specified.');
|
|
338
483
|
}
|
|
339
484
|
|
|
340
|
-
if (
|
|
341
|
-
typeof fileName === 'undefined'
|
|
342
|
-
&& typeof sourceIndexName === 'undefined'
|
|
343
|
-
) {
|
|
485
|
+
if (typeof fileName === 'undefined' && typeof sourceIndexName === 'undefined') {
|
|
344
486
|
throw Error('Either fileName or sourceIndexName must be specified.');
|
|
345
487
|
}
|
|
346
488
|
|
|
347
489
|
if (typeof fileName !== 'undefined') {
|
|
348
|
-
return fileReaderFactory(
|
|
349
|
-
indexer,
|
|
350
|
-
fileName,
|
|
351
|
-
transform,
|
|
352
|
-
splitRegex,
|
|
353
|
-
verbose
|
|
354
|
-
);
|
|
490
|
+
return fileReaderFactory(indexer, fileName, transform, splitRegex, verbose);
|
|
355
491
|
}
|
|
356
492
|
|
|
357
493
|
if (typeof sourceIndexName !== 'undefined') {
|
|
@@ -359,7 +495,10 @@ async function transformer(ref) {
|
|
|
359
495
|
indexer,
|
|
360
496
|
sourceIndexName,
|
|
361
497
|
transform,
|
|
362
|
-
sourceClient
|
|
498
|
+
sourceClient,
|
|
499
|
+
query,
|
|
500
|
+
bufferSize,
|
|
501
|
+
populatedFields
|
|
363
502
|
);
|
|
364
503
|
}
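The two checks at the top of `getReader` enforce that exactly one of `fileName` and `sourceIndexName` is set. As a standalone sketch, with the hypothetical helper name `resolveSourceType` and the error messages copied from the diff:

```javascript
// Returns 'file' or 'index' depending on which source option is set,
// and throws if both or neither are given.
function resolveSourceType(options) {
  var hasFile = typeof options.fileName !== 'undefined';
  var hasIndex = typeof options.sourceIndexName !== 'undefined';

  if (hasFile && hasIndex) {
    throw Error('Only either one of fileName or sourceIndexName can be specified.');
  }
  if (!hasFile && !hasIndex) {
    throw Error('Either fileName or sourceIndexName must be specified.');
  }
  return hasFile ? 'file' : 'index';
}

console.log(resolveSourceType({ fileName: 'data/*.json' })); // file
console.log(resolveSourceType({ sourceIndexName: 'my-source-index' })); // index
```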
|
|
365
504
|
|
package/dist/node-es-transformer.esm.js
CHANGED
|
@@ -4,20 +4,26 @@ import glob from 'glob';
|
|
|
4
4
|
import cliProgress from 'cli-progress';
|
|
5
5
|
import elasticsearch from '@elastic/elasticsearch';
|
|
6
6
|
|
|
7
|
+
var DEFAULT_BUFFER_SIZE = 1000;
|
|
8
|
+
|
|
7
9
|
function createMappingFactory(ref) {
|
|
8
10
|
var sourceClient = ref.sourceClient;
|
|
9
11
|
var sourceIndexName = ref.sourceIndexName;
|
|
10
12
|
var targetClient = ref.targetClient;
|
|
11
13
|
var targetIndexName = ref.targetIndexName;
|
|
12
14
|
var mappings = ref.mappings;
|
|
15
|
+
var mappingsOverride = ref.mappingsOverride;
|
|
16
|
+
var indexMappingTotalFieldsLimit = ref.indexMappingTotalFieldsLimit;
|
|
13
17
|
var verbose = ref.verbose;
|
|
14
18
|
|
|
15
19
|
return async function () {
|
|
16
|
-
var targetMappings = mappings;
|
|
20
|
+
var targetMappings = mappingsOverride ? undefined : mappings;
|
|
17
21
|
|
|
18
22
|
if (sourceClient && sourceIndexName && typeof targetMappings === 'undefined') {
|
|
19
23
|
try {
|
|
20
|
-
var mapping = await sourceClient.indices.getMapping({
|
|
24
|
+
var mapping = await sourceClient.indices.getMapping({
|
|
25
|
+
index: sourceIndexName,
|
|
26
|
+
});
|
|
21
27
|
targetMappings = mapping[sourceIndexName].mappings;
|
|
22
28
|
} catch (err) {
|
|
23
29
|
console.log('Error reading source mapping', err);
|
|
@@ -26,13 +32,24 @@ function createMappingFactory(ref) {
|
|
|
26
32
|
}
|
|
27
33
|
|
|
28
34
|
if (typeof targetMappings === 'object' && targetMappings !== null) {
|
|
35
|
+
if (mappingsOverride) {
|
|
36
|
+
targetMappings = Object.assign({}, targetMappings,
|
|
37
|
+
{properties: Object.assign({}, targetMappings.properties,
|
|
38
|
+
mappings)});
|
|
39
|
+
}
|
|
40
|
+
|
|
29
41
|
try {
|
|
30
|
-
var resp = await targetClient.indices.create(
|
|
31
|
-
|
|
32
|
-
|
|
33
|
-
|
|
34
|
-
|
|
35
|
-
|
|
42
|
+
var resp = await targetClient.indices.create({
|
|
43
|
+
index: targetIndexName,
|
|
44
|
+
body: Object.assign({}, {mappings: targetMappings},
|
|
45
|
+
(indexMappingTotalFieldsLimit !== undefined
|
|
46
|
+
? {
|
|
47
|
+
settings: {
|
|
48
|
+
'index.mapping.total_fields.limit': indexMappingTotalFieldsLimit,
|
|
49
|
+
},
|
|
50
|
+
}
|
|
51
|
+
: {})),
|
|
52
|
+
});
|
|
36
53
|
if (verbose) { console.log('Created target mapping', resp); }
|
|
37
54
|
} catch (err) {
|
|
38
55
|
console.log('Error creating target mapping', err);
|
|
@@ -41,45 +58,78 @@ function createMappingFactory(ref) {
|
|
|
41
58
|
};
|
|
42
59
|
}
|
|
43
60
|
|
|
61
|
+
var MAX_QUEUE_SIZE = 15;
|
|
62
|
+
|
|
44
63
|
function fileReaderFactory(indexer, fileName, transform, splitRegex, verbose) {
|
|
45
64
|
function startIndex(files) {
|
|
65
|
+
var ingestQueueSize = 0;
|
|
66
|
+
var finished = false;
|
|
67
|
+
|
|
46
68
|
var file = files.shift();
|
|
47
|
-
var s = fs
|
|
69
|
+
var s = fs
|
|
70
|
+
.createReadStream(file)
|
|
48
71
|
.pipe(es.split(splitRegex))
|
|
49
|
-
.pipe(
|
|
50
|
-
|
|
51
|
-
|
|
52
|
-
|
|
53
|
-
|
|
54
|
-
|
|
55
|
-
|
|
56
|
-
|
|
57
|
-
|
|
72
|
+
.pipe(
|
|
73
|
+
es
|
|
74
|
+
.mapSync(function (line) {
|
|
75
|
+
try {
|
|
76
|
+
// skip empty lines
|
|
77
|
+
if (line === '') {
|
|
78
|
+
return;
|
|
79
|
+
}
|
|
80
|
+
|
|
81
|
+
var doc =
|
|
82
|
+
typeof transform === 'function'
|
|
83
|
+
? JSON.stringify(transform(JSON.parse(line)))
|
|
84
|
+
: line;
|
|
85
|
+
|
|
86
|
+
// if doc is undefined we'll skip indexing it
|
|
87
|
+
if (typeof doc === 'undefined') {
|
|
88
|
+
s.resume();
|
|
89
|
+
return;
|
|
90
|
+
}
|
|
91
|
+
|
|
92
|
+
// the transform callback may return an array of docs so we can emit
|
|
93
|
+
// multiple docs from a single line
|
|
94
|
+
if (Array.isArray(doc)) {
|
|
95
|
+
doc.forEach(function (d) { return indexer.add(d); });
|
|
96
|
+
return;
|
|
97
|
+
}
|
|
98
|
+
|
|
99
|
+
indexer.add(doc);
|
|
100
|
+
} catch (e) {
|
|
101
|
+
console.log('error', e);
|
|
102
|
+
}
|
|
103
|
+
})
|
|
104
|
+
.on('error', function (err) {
|
|
105
|
+
console.log('Error while reading file.', err);
|
|
106
|
+
})
|
|
107
|
+
.on('end', function () {
|
|
108
|
+
if (verbose) { console.log('Read entire file: ', file); }
|
|
109
|
+
if (files.length > 0) {
|
|
110
|
+
startIndex(files);
|
|
111
|
+
return;
|
|
112
|
+
}
|
|
113
|
+
|
|
114
|
+
indexer.finish();
|
|
115
|
+
finished = true;
|
|
116
|
+
})
|
|
117
|
+
);
|
|
58
118
|
|
|
59
|
-
|
|
60
|
-
|
|
61
|
-
|
|
62
|
-
doc.forEach(function (d) { return indexer.add(d); });
|
|
63
|
-
return;
|
|
64
|
-
}
|
|
119
|
+
indexer.queueEmitter.on('queue-size', async function (size) {
|
|
120
|
+
if (finished) { return; }
|
|
121
|
+
ingestQueueSize = size;
|
|
65
122
|
|
|
66
|
-
|
|
67
|
-
|
|
68
|
-
|
|
69
|
-
|
|
70
|
-
}
|
|
71
|
-
|
|
72
|
-
console.log('Error while reading file.', err);
|
|
73
|
-
})
|
|
74
|
-
.on('end', function () {
|
|
75
|
-
if (verbose) { console.log('Read entire file: ', file); }
|
|
76
|
-
indexer.finish();
|
|
77
|
-
if (files.length > 0) {
|
|
78
|
-
startIndex(files);
|
|
79
|
-
}
|
|
80
|
-
}));
|
|
123
|
+
if (ingestQueueSize < MAX_QUEUE_SIZE) {
|
|
124
|
+
s.resume();
|
|
125
|
+
} else {
|
|
126
|
+
s.pause();
|
|
127
|
+
}
|
|
128
|
+
});
|
|
81
129
|
|
|
82
130
|
indexer.queueEmitter.on('resume', function () {
|
|
131
|
+
if (finished) { return; }
|
|
132
|
+
ingestQueueSize = 0;
|
|
83
133
|
s.resume();
|
|
84
134
|
});
|
|
85
135
|
}
|
|
@@ -95,110 +145,202 @@ var EventEmitter = require('events');
|
|
|
95
145
|
|
|
96
146
|
var queueEmitter = new EventEmitter();
|
|
97
147
|
|
|
148
|
+
var parallelCalls = 1;
|
|
149
|
+
|
|
98
150
|
// a simple helper queue to bulk index documents
|
|
99
151
|
function indexQueueFactory(ref) {
|
|
100
152
|
var client = ref.targetClient;
|
|
101
153
|
var targetIndexName = ref.targetIndexName;
|
|
102
|
-
var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize =
|
|
154
|
+
var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
|
|
103
155
|
var skipHeader = ref.skipHeader; if ( skipHeader === void 0 ) skipHeader = false;
|
|
104
156
|
var verbose = ref.verbose; if ( verbose === void 0 ) verbose = true;
|
|
105
157
|
|
|
106
158
|
var buffer = [];
|
|
107
159
|
var queue = [];
|
|
108
|
-
var ingesting =
|
|
160
|
+
var ingesting = 0;
|
|
161
|
+
var ingestTimes = [];
|
|
162
|
+
var finished = false;
|
|
109
163
|
|
|
110
|
-
var ingest =
|
|
164
|
+
var ingest = function (b) {
|
|
111
165
|
if (typeof b !== 'undefined') {
|
|
112
166
|
queue.push(b);
|
|
113
167
|
queueEmitter.emit('queue-size', queue.length);
|
|
114
168
|
}
|
|
115
169
|
|
|
116
|
-
if (
|
|
170
|
+
if (ingestTimes.length > 5) { ingestTimes = ingestTimes.slice(-5); }
|
|
171
|
+
|
|
172
|
+
if (ingesting < parallelCalls) {
|
|
117
173
|
var docs = queue.shift();
|
|
118
|
-
queueEmitter.emit('queue-size', queue.length);
|
|
119
|
-
ingesting = true;
|
|
120
|
-
if (verbose) { console.log(("bulk ingest docs: " + (docs.length / 2) + ", queue length: " + (queue.length))); }
|
|
121
174
|
|
|
122
|
-
|
|
123
|
-
|
|
124
|
-
|
|
125
|
-
if (queue.length > 0) {
|
|
126
|
-
ingest();
|
|
127
|
-
}
|
|
128
|
-
} catch (err) {
|
|
129
|
-
console.log('bulk index error', err);
|
|
175
|
+
queueEmitter.emit('queue-size', queue.length);
|
|
176
|
+
if (queue.length <= 5) {
|
|
177
|
+
queueEmitter.emit('resume');
|
|
130
178
|
}
|
|
131
|
-
}
|
|
132
179
|
|
|
133
|
-
|
|
134
|
-
|
|
135
|
-
|
|
136
|
-
|
|
180
|
+
ingesting += 1;
|
|
181
|
+
|
|
182
|
+
if (verbose)
|
|
183
|
+
{ console.log(("bulk ingest docs: " + (docs.length / 2) + ", queue length: " + (queue.length))); }
|
|
184
|
+
|
|
185
|
+
var start = Date.now();
|
|
186
|
+
client
|
|
187
|
+
.bulk({ body: docs })
|
|
188
|
+
.then(function () {
|
|
189
|
+
var end = Date.now();
|
|
190
|
+
var delta = end - start;
|
|
191
|
+
ingestTimes.push(delta);
|
|
192
|
+
ingesting -= 1;
|
|
193
|
+
|
|
194
|
+
var ingestTimesMovingAverage =
|
|
195
|
+
ingestTimes.length > 0
|
|
196
|
+
? ingestTimes.reduce(function (p, c) { return p + c; }, 0) / ingestTimes.length
|
|
197
|
+
: 0;
|
|
198
|
+
var ingestTimesMovingAverageSeconds = Math.floor(ingestTimesMovingAverage / 1000);
|
|
199
|
+
|
|
200
|
+
if (
|
|
201
|
+
ingestTimes.length > 0 &&
|
|
202
|
+
ingestTimesMovingAverageSeconds < 30 &&
|
|
203
|
+
parallelCalls < 10
|
|
204
|
+
) {
|
|
205
|
+
parallelCalls += 1;
|
|
206
|
+
} else if (
|
|
207
|
+
ingestTimes.length > 0 &&
|
|
208
|
+
ingestTimesMovingAverageSeconds >= 30 &&
|
|
209
|
+
parallelCalls > 1
|
|
210
|
+
) {
|
|
211
|
+
parallelCalls -= 1;
|
|
212
|
+
}
|
|
213
|
+
|
|
214
|
+
if (queue.length > 0) {
|
|
215
|
+
ingest();
|
|
216
|
+
} else if (queue.length === 0 && finished) {
|
|
217
|
+
queueEmitter.emit('finish');
|
|
218
|
+
}
|
|
219
|
+
})
|
|
220
|
+
.catch(function (error) {
|
|
221
|
+
console.error(error);
|
|
222
|
+
ingesting -= 1;
|
|
223
|
+
parallelCalls = 1;
|
|
224
|
+
if (queue.length > 0) {
|
|
225
|
+
ingest();
|
|
226
|
+
}
|
|
227
|
+
});
|
|
137
228
|
}
|
|
138
229
|
};
|
|
139
230
|
|
|
140
231
|
return {
|
|
141
232
|
add: function (doc) {
|
|
233
|
+
if (finished) {
|
|
234
|
+
throw new Error('Unexpected doc added after indexer should finish.');
|
|
235
|
+
}
|
|
236
|
+
|
|
142
237
|
if (!skipHeader) {
|
|
143
238
|
var header = { index: { _index: targetIndexName } };
|
|
144
239
|
buffer.push(header);
|
|
145
240
|
}
|
|
146
241
|
buffer.push(doc);
|
|
147
242
|
|
|
148
|
-
// console.log(`add: queue.length ${queue.length}`);
|
|
149
243
|
if (queue.length === 0) {
|
|
150
244
|
queueEmitter.emit('resume');
|
|
151
245
|
}
|
|
152
246
|
|
|
153
|
-
if (buffer.length >=
|
|
247
|
+
if (buffer.length >= bufferSize * 2) {
|
|
154
248
|
ingest(buffer);
|
|
155
249
|
buffer = [];
|
|
156
250
|
}
|
|
157
251
|
},
|
|
158
|
-
finish:
|
|
159
|
-
|
|
160
|
-
|
|
161
|
-
|
|
252
|
+
finish: function () {
|
|
253
|
+
finished = true;
|
|
254
|
+
|
|
255
|
+
if (buffer.length > 0) {
|
|
256
|
+
ingest(buffer);
|
|
257
|
+
buffer = [];
|
|
258
|
+
} else if (queue.length === 0 && ingesting === 0) {
|
|
259
|
+
queueEmitter.emit('finish');
|
|
260
|
+
}
|
|
162
261
|
},
|
|
163
262
|
queueEmitter: queueEmitter,
|
|
164
263
|
};
|
|
165
264
|
}
|
|
166
265
|
|
|
167
|
-
var MAX_QUEUE_SIZE =
|
|
266
|
+
var MAX_QUEUE_SIZE$1 = 15;
|
|
168
267
|
|
|
169
268
|
// create a new progress bar instance and use shades_classic theme
|
|
170
269
|
var progressBar = new cliProgress.SingleBar({}, cliProgress.Presets.shades_classic);
|
|
171
270
|
|
|
172
|
-
function indexReaderFactory(
|
|
271
|
+
function indexReaderFactory(
|
|
272
|
+
indexer,
|
|
273
|
+
sourceIndexName,
|
|
274
|
+
transform,
|
|
275
|
+
client,
|
|
276
|
+
query,
|
|
277
|
+
bufferSize,
|
|
278
|
+
populatedFields
|
|
279
|
+
) {
|
|
280
|
+
if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
|
|
281
|
+
if ( populatedFields === void 0 ) populatedFields = false;
|
|
282
|
+
|
|
173
283
|
return async function indexReader() {
|
|
174
284
|
var responseQueue = [];
|
|
175
285
|
var docsNum = 0;
|
|
176
286
|
|
|
177
|
-
function
|
|
178
|
-
|
|
179
|
-
|
|
180
|
-
|
|
181
|
-
|
|
182
|
-
|
|
287
|
+
async function fetchPopulatedFields() {
|
|
288
|
+
try {
|
|
289
|
+
var response = await client.search({
|
|
290
|
+
index: sourceIndexName,
|
|
291
|
+
size: bufferSize,
|
|
292
|
+
query: {
|
|
293
|
+
function_score: {
|
|
294
|
+
query: query,
|
|
295
|
+
random_score: {},
|
|
296
|
+
},
|
|
297
|
+
},
|
|
298
|
+
});
|
|
299
|
+
|
|
300
|
+
// Get all field names for each returned doc and flatten it
|
|
301
|
+
// to a list of unique field names used across all docs.
|
|
302
|
+
return new Set(response.hits.hits.map(function (d) { return Object.keys(d._source); }).flat(1));
|
|
303
|
+
} catch (e) {
|
|
304
|
+
console.log('error', e);
|
|
305
|
+
}
|
|
306
|
+
}
|
|
307
|
+
|
|
308
|
+
function search(fields) {
|
|
309
|
+
return client.search(Object.assign({}, {index: sourceIndexName,
|
|
310
|
+
scroll: '600s',
|
|
311
|
+
size: bufferSize,
|
|
312
|
+
query: query},
|
|
313
|
+
(fields ? { _source: fields } : {})));
|
|
183
314
|
}
|
|
184
315
|
|
|
185
316
|
function scroll(id) {
|
|
186
317
|
return client.scroll({
|
|
187
318
|
scroll_id: id,
|
|
188
|
-
scroll: '
|
|
319
|
+
scroll: '600s',
|
|
189
320
|
});
|
|
190
321
|
}
|
|
191
322
|
|
|
323
|
+
var fieldsWithData;
|
|
324
|
+
|
|
325
|
+
// identify populated fields
|
|
326
|
+
if (populatedFields) {
|
|
327
|
+
fieldsWithData = await fetchPopulatedFields();
|
|
328
|
+
console.log('fieldsWithData', fieldsWithData);
|
|
329
|
+
}
|
|
330
|
+
|
|
192
331
|
// start things off by searching, setting a scroll timeout, and pushing
|
|
193
332
|
// our first response into the queue to be processed
|
|
194
|
-
var se = await search();
|
|
333
|
+
var se = await search(fieldsWithData);
|
|
195
334
|
responseQueue.push(se);
|
|
196
335
|
progressBar.start(se.hits.total.value, 0);
|
|
336
|
+
console.log('se', se.hits.hits[0]);
|
|
197
337
|
|
|
198
338
|
function processHit(hit) {
|
|
199
339
|
docsNum += 1;
|
|
200
340
|
try {
|
|
201
|
-
var doc =
|
|
341
|
+
var doc = typeof transform === 'function' ? transform(hit._source) : hit._source; // eslint-disable-line no-underscore-dangle
|
|
342
|
+
// console.log('doc', doc);
|
|
343
|
+
|
|
202
344
|
// if doc is undefined we'll skip indexing it
|
|
203
345
|
if (typeof doc === 'undefined') {
|
|
204
346
|
return;
|
|
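Reviewer note: the new `ingest` function in the hunk above adjusts `parallelCalls` after every successful bulk request, based on the average of recent ingest times. A minimal sketch of that rule as a pure function follows. The thresholds (30 seconds, at most 10 parallel calls, a window of 5) are taken from the diff; the function and constant names are hypothetical, and trimming the window at computation time is a simplification, since the diff trims `ingestTimes` at the start of `ingest` instead.

```javascript
// Thresholds copied from the diff; all names here are hypothetical.
const WINDOW_SIZE = 5;
const SLOW_AVERAGE_SECONDS = 30;
const MAX_PARALLEL_CALLS = 10;

// Returns the next parallelCalls value given the current one and the list
// of recent bulk request durations in milliseconds.
function nextParallelCalls(parallelCalls, ingestTimesMs) {
  const recent = ingestTimesMs.slice(-WINDOW_SIZE);
  if (recent.length === 0) {
    return parallelCalls; // nothing measured yet, keep the current value
  }

  const averageMs = recent.reduce((sum, t) => sum + t, 0) / recent.length;
  const averageSeconds = Math.floor(averageMs / 1000);

  if (averageSeconds < SLOW_AVERAGE_SECONDS && parallelCalls < MAX_PARALLEL_CALLS) {
    return parallelCalls + 1; // requests are fast, allow one more in flight
  }
  if (averageSeconds >= SLOW_AVERAGE_SECONDS && parallelCalls > 1) {
    return parallelCalls - 1; // requests are slow, back off by one
  }
  return parallelCalls;
}
```

In the diff, a failed bulk request bypasses this rule and resets `parallelCalls` to 1 in the `.catch` handler.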
@@ -232,15 +374,13 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
|
|
|
232
374
|
progressBar.update(docsNum);
|
|
233
375
|
|
|
234
376
|
// check to see if we have collected all of the docs
|
|
235
|
-
// console.log('check count', response.hits.total.value, docsNum);
|
|
236
377
|
if (response.hits.total.value === docsNum) {
|
|
237
378
|
indexer.finish();
|
|
238
|
-
progressBar.stop();
|
|
239
379
|
break;
|
|
240
380
|
}
|
|
241
381
|
|
|
242
|
-
if (ingestQueueSize < MAX_QUEUE_SIZE) {
|
|
243
|
-
|
|
382
|
+
if (ingestQueueSize < MAX_QUEUE_SIZE$1) {
|
|
383
|
+
// get the next response if there are more docs to fetch
|
|
244
384
|
var sc = await scroll(response._scroll_id); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
|
|
245
385
|
scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
|
|
246
386
|
responseQueue.push(sc);
|
|
@@ -253,8 +393,8 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
|
|
|
253
393
|
indexer.queueEmitter.on('queue-size', async function (size) {
|
|
254
394
|
ingestQueueSize = size;
|
|
255
395
|
|
|
256
|
-
if (!readActive && ingestQueueSize < MAX_QUEUE_SIZE) {
|
|
257
|
-
|
|
396
|
+
if (!readActive && ingestQueueSize < MAX_QUEUE_SIZE$1) {
|
|
397
|
+
// get the next response if there are more docs to fetch
|
|
258
398
|
var sc = await scroll(scrollId); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
|
|
259
399
|
scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
|
|
260
400
|
responseQueue.push(sc);
|
|
@@ -276,6 +416,10 @@ function indexReaderFactory(indexer, sourceIndexName, transform, client) {
|
|
|
276
416
|
processResponseQueue();
|
|
277
417
|
});
|
|
278
418
|
|
|
419
|
+
indexer.queueEmitter.on('finish', function () {
|
|
420
|
+
progressBar.stop();
|
|
421
|
+
});
|
|
422
|
+
|
|
279
423
|
processResponseQueue();
|
|
280
424
|
};
|
|
281
425
|
}
|
|
@@ -284,12 +428,16 @@ async function transformer(ref) {
|
|
|
284
428
|
var deleteIndex = ref.deleteIndex; if ( deleteIndex === void 0 ) deleteIndex = false;
|
|
285
429
|
var sourceClientConfig = ref.sourceClientConfig;
|
|
286
430
|
var targetClientConfig = ref.targetClientConfig;
|
|
287
|
-
var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize =
|
|
431
|
+
var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
|
|
288
432
|
var fileName = ref.fileName;
|
|
289
433
|
var splitRegex = ref.splitRegex; if ( splitRegex === void 0 ) splitRegex = /\n/;
|
|
290
434
|
var sourceIndexName = ref.sourceIndexName;
|
|
291
435
|
var targetIndexName = ref.targetIndexName;
|
|
292
436
|
var mappings = ref.mappings;
|
|
437
|
+
var mappingsOverride = ref.mappingsOverride; if ( mappingsOverride === void 0 ) mappingsOverride = false;
|
|
438
|
+
var indexMappingTotalFieldsLimit = ref.indexMappingTotalFieldsLimit;
|
|
439
|
+
var populatedFields = ref.populatedFields; if ( populatedFields === void 0 ) populatedFields = false;
|
|
440
|
+
var query = ref.query;
|
|
293
441
|
var skipHeader = ref.skipHeader; if ( skipHeader === void 0 ) skipHeader = false;
|
|
294
442
|
var transform = ref.transform;
|
|
295
443
|
var verbose = ref.verbose; if ( verbose === void 0 ) verbose = true;
|
|
@@ -313,6 +461,8 @@ async function transformer(ref) {
|
|
|
313
461
|
targetClient: targetClient,
|
|
314
462
|
targetIndexName: targetIndexName,
|
|
315
463
|
mappings: mappings,
|
|
464
|
+
mappingsOverride: mappingsOverride,
|
|
465
|
+
indexMappingTotalFieldsLimit: indexMappingTotalFieldsLimit,
|
|
316
466
|
verbose: verbose,
|
|
317
467
|
});
|
|
318
468
|
var indexer = indexQueueFactory({
|
|
@@ -324,30 +474,16 @@ async function transformer(ref) {
|
|
|
324
474
|
});
|
|
325
475
|
|
|
326
476
|
function getReader() {
|
|
327
|
-
if (
|
|
328
|
-
|
|
329
|
-
&& typeof sourceIndexName !== 'undefined'
|
|
330
|
-
) {
|
|
331
|
-
throw Error(
|
|
332
|
-
'Only either one of fileName or sourceIndexName can be specified.'
|
|
333
|
-
);
|
|
477
|
+
if (typeof fileName !== 'undefined' && typeof sourceIndexName !== 'undefined') {
|
|
478
|
+
throw Error('Only either one of fileName or sourceIndexName can be specified.');
|
|
334
479
|
}
|
|
335
480
|
|
|
336
|
-
if (
|
|
337
|
-
typeof fileName === 'undefined'
|
|
338
|
-
&& typeof sourceIndexName === 'undefined'
|
|
339
|
-
) {
|
|
481
|
+
if (typeof fileName === 'undefined' && typeof sourceIndexName === 'undefined') {
|
|
340
482
|
throw Error('Either fileName or sourceIndexName must be specified.');
|
|
341
483
|
}
|
|
342
484
|
|
|
343
485
|
if (typeof fileName !== 'undefined') {
|
|
344
|
-
return fileReaderFactory(
|
|
345
|
-
indexer,
|
|
346
|
-
fileName,
|
|
347
|
-
transform,
|
|
348
|
-
splitRegex,
|
|
349
|
-
verbose
|
|
350
|
-
);
|
|
486
|
+
return fileReaderFactory(indexer, fileName, transform, splitRegex, verbose);
|
|
351
487
|
}
|
|
352
488
|
|
|
353
489
|
if (typeof sourceIndexName !== 'undefined') {
|
|
@@ -355,7 +491,10 @@ async function transformer(ref) {
|
|
|
355
491
|
indexer,
|
|
356
492
|
sourceIndexName,
|
|
357
493
|
transform,
|
|
358
|
-
sourceClient
|
|
494
|
+
sourceClient,
|
|
495
|
+
query,
|
|
496
|
+
bufferSize,
|
|
497
|
+
populatedFields
|
|
359
498
|
);
|
|
360
499
|
}
|
|
361
500
|
|
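Reviewer note: with `populatedFields` enabled, `fetchPopulatedFields` in the diff above samples documents and reduces them to the set of field names that actually occur, which is then passed to `search` as `_source`. A minimal sketch of that reduction step on plain objects follows. The helper name and sample hits are hypothetical, and `flatMap` stands in for the `map(...).flat(1)` pair used in the diff.

```javascript
// Hypothetical helper: collects the unique top-level field names found
// across a page of search hits shaped like { _source: { ... } }.
function collectFieldNames(hits) {
  return new Set(hits.flatMap((hit) => Object.keys(hit._source)));
}

// Made-up sample hits, for illustration only.
const sampleHits = [
  { _source: { title: 'a', price: 1 } },
  { _source: { title: 'b', tags: ['x'] } },
];

const fields = collectFieldNames(sampleHits);
const fieldList = Array.from(fields); // array form, e.g. for JSON payloads
```

Note that `JSON.stringify` serializes a `Set` as `{}`. Whether that matters here depends on how the client serializes `_source`, which this diff does not show, so converting with `Array.from` as above is only a precaution.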
package/package.json
CHANGED
|
@@ -14,22 +14,30 @@
|
|
|
14
14
|
"license": "Apache-2.0",
|
|
15
15
|
"author": "Walter Rafelsberger <walter@rafelsberger.at>",
|
|
16
16
|
"contributors": [],
|
|
17
|
-
"version": "1.0.0-
|
|
17
|
+
"version": "1.0.0-beta2",
|
|
18
18
|
"main": "dist/node-es-transformer.cjs.js",
|
|
19
19
|
"module": "dist/node-es-transformer.esm.js",
|
|
20
20
|
"dependencies": {
|
|
21
|
-
"@elastic/elasticsearch": "^8.
|
|
21
|
+
"@elastic/elasticsearch": "^8.10.0",
|
|
22
22
|
"cli-progress": "^3.12.0",
|
|
23
23
|
"event-stream": "3.3.4",
|
|
24
24
|
"glob": "7.1.2"
|
|
25
25
|
},
|
|
26
26
|
"devDependencies": {
|
|
27
27
|
"acorn": "^6.4.2",
|
|
28
|
-
"
|
|
28
|
+
"async-retry": "^1.3.3",
|
|
29
|
+
"commit-and-tag-version": "^11.3.0",
|
|
30
|
+
"cz-conventional-changelog": "^3.3.0",
|
|
31
|
+
"eslint": "^8.51.0",
|
|
29
32
|
"eslint-config-airbnb": "19.0.4",
|
|
33
|
+
"eslint-config-prettier": "^9.0.0",
|
|
30
34
|
"eslint-plugin-import": "2.27.5",
|
|
35
|
+
"eslint-plugin-jest": "^27.4.2",
|
|
31
36
|
"eslint-plugin-jsx-a11y": "6.7.1",
|
|
37
|
+
"eslint-plugin-prettier": "^3.3.1",
|
|
32
38
|
"eslint-plugin-react": "7.32.2",
|
|
39
|
+
"jest": "^29.7.0",
|
|
40
|
+
"prettier": "^2.2.1",
|
|
33
41
|
"rollup": "0.66.6",
|
|
34
42
|
"rollup-plugin-buble": "0.19.6",
|
|
35
43
|
"rollup-plugin-commonjs": "8.0.2",
|
|
@@ -38,10 +46,23 @@
|
|
|
38
46
|
"scripts": {
|
|
39
47
|
"build": "rollup -c",
|
|
40
48
|
"dev": "rollup -c -w",
|
|
41
|
-
"test": "
|
|
42
|
-
"pretest": "npm run build"
|
|
49
|
+
"test": "jest --runInBand --detectOpenHandles --forceExit",
|
|
50
|
+
"pretest": "npm run build",
|
|
51
|
+
"release": "commit-and-tag-version",
|
|
52
|
+
"create-sample-data-10000": "node scripts/create_sample_data_10000",
|
|
53
|
+
"create-sample-data-100": "node scripts/create_sample_data_100"
|
|
43
54
|
},
|
|
44
55
|
"files": [
|
|
45
56
|
"dist"
|
|
46
|
-
]
|
|
57
|
+
],
|
|
58
|
+
"config": {
|
|
59
|
+
"commitizen": {
|
|
60
|
+
"path": "./node_modules/cz-conventional-changelog"
|
|
61
|
+
}
|
|
62
|
+
},
|
|
63
|
+
"jest": {
|
|
64
|
+
"testMatch": [
|
|
65
|
+
"**/__tests__/**/*.test.js"
|
|
66
|
+
]
|
|
67
|
+
}
|
|
47
68
|
}
|