node-es-transformer 1.0.0-beta2 → 1.0.0-beta4
This diff shows the content of publicly available package versions as released to their respective public registries. It is provided for informational purposes only and reflects the changes between the two package versions.
- package/README.md +8 -17
- package/dist/node-es-transformer.cjs.js +234 -182
- package/dist/node-es-transformer.esm.js +234 -182
- package/package.json +6 -5
package/README.md
CHANGED
@@ -14,23 +14,12 @@ If you're looking for a nodejs based tool which allows you to ingest large CSV/J
 
 While I'd generally recommend using [Logstash](https://www.elastic.co/products/logstash), [filebeat](https://www.elastic.co/products/beats/filebeat), [Ingest Nodes](https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest.html), [Elastic Agent](https://www.elastic.co/guide/en/fleet/current/fleet-overview.html) or [Elasticsearch Transforms](https://www.elastic.co/guide/en/elasticsearch/reference/current/transforms.html) for established use cases, this tool may be of help especially if you feel more at home in the JavaScript/nodejs universe and have use cases with customized ingestion and data transformation needs.
 
-**This is experimental code, use at your own risk. Nonetheless, I encourage you to give it a try so I can gather some feedback.**
-
-### So why is this still _alpha_?
-
-- The API is not quite final and might change from release to release.
-- The code needs some more safety measures to avoid some possible accidental data loss scenarios.
-- No test coverage yet.
-
----
-
-Now that we've talked about the caveats, let's have a look what you actually get with this tool:
-
 ## Features
 
 - Buffering/Streaming for both reading and indexing. Files are read using streaming and Elasticsearch ingestion is done using buffered bulk indexing. This is tailored towards ingestion of large files. Successfully tested so far with JSON and CSV files in the range of 20-30 GBytes. On a single machine running both `node-es-transformer` and Elasticsearch ingestion rates up to 20k documents/second were achieved (2,9 GHz Intel Core i7, 16GByte RAM, SSD), depending on document size.
 - Supports wildcards to ingest/transform a range of files in one go.
 - Supports fetching documents from existing indices using search/scroll. This allows you to reindex with custom data transformations just using JavaScript in the `transform` callback.
+- Supports ingesting docs based on a nodejs stream.
 - The `transform` callback gives you each source document, but you can split it up in multiple ones and return an array of documents. An example use case for this: Each source document is a Tweet and you want to transform that into an entity centric index based on Hashtags.
 
 ## Getting started
@@ -110,10 +99,12 @@ transformer({
 
 - `deleteIndex`: Setting to automatically delete an existing index, default is `false`.
 - `sourceClientConfig`/`targetClientConfig`: Optional Elasticsearch client options, defaults to `{ node: 'http://localhost:9200' }`.
-- `bufferSize`: The
-- `
+- `bufferSize`: The threshold to flush bulk index request in KBytes, defaults to `5120`.
+- `searchSize`: The amount of documents to be fetched with each search request when reindexing from another source index.
+- `fileName`: Source filename to ingest, supports wildcards. If this is set, `sourceIndexName` and `stream` are not allowed.
+- `stream`: Source nodejs stream to ingest. If this is set, `sourceIndexName` and `fileName` are not allowed.
 - `splitRegex`: Custom line split regex, defaults to `/\n/`.
-- `sourceIndexName`: The source Elasticsearch index to reindex from. If this is set, `fileName`
+- `sourceIndexName`: The source Elasticsearch index to reindex from. If this is set, `fileName` and `stream` are not allowed.
 - `targetIndexName`: The target Elasticsearch index where documents will be indexed.
 - `mappings`: Optional Elasticsearch document mappings. If not set and you're reindexing from another index, the mappings from the existing index will be used.
 - `mappingsOverride`: If you're reindexing and this is set to `true`, `mappings` will be applied on top of the source index's mappings. Defaults to `false`.
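For orientation, here is a minimal usage sketch based on the option list above. It assumes the package's default export is imported as `transformer` (matching the README's own `transformer({ ... })` snippet); the file name, index name and transform logic are hypothetical.

```js
const transformer = require('node-es-transformer');

transformer({
  fileName: 'tweets-*.ndjson', // hypothetical source files; wildcards are supported
  targetIndexName: 'tweets',   // hypothetical target index
  bufferSize: 5120,            // bulk flush threshold in KBytes (the default)
  splitRegex: /\n/,            // default line split
  // fileName, sourceIndexName and stream are mutually exclusive source options
  transform(doc) {
    // may return a single doc, an array of docs, or undefined to skip the line
    return doc;
  },
});
```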
@@ -147,10 +138,10 @@ yarn
 
 ```bash
 # Download the docker image
-docker pull docker.elastic.co/elasticsearch/elasticsearch:8.
+docker pull docker.elastic.co/elasticsearch/elasticsearch:8.17.0
 
 # Run the container
-docker run --name es01 --net elastic -p 9200:9200 -it -m 1GB -e "discovery.type=single-node" -e "xpack.security.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:8.
+docker run --name es01 --net elastic -p 9200:9200 -it -m 1GB -e "discovery.type=single-node" -e "xpack.security.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:8.17.0
 ```
 
 To commit, use `cz`. To prepare a release, use e.g. `yarn release -- --release-as 1.0.0-beta2`.
package/dist/node-es-transformer.cjs.js
CHANGED

@@ -5,10 +5,20 @@ function _interopDefault (ex) { return (ex && (typeof ex === 'object') && 'defau
 var fs = _interopDefault(require('fs'));
 var es = _interopDefault(require('event-stream'));
 var glob = _interopDefault(require('glob'));
+var split = _interopDefault(require('split2'));
+var stream = require('stream');
 var cliProgress = _interopDefault(require('cli-progress'));
 var elasticsearch = _interopDefault(require('@elastic/elasticsearch'));
 
-
+// In earlier versions this was used to set the number of docs to index in a
+// single bulk request. Since we switched to use the helpers.bulk() method from
+// the ES client, this now translates to the `flushBytes` option of the helper.
+// However, for kind of a backwards compability with the old values, this uses
+// KBytes instead of Bytes. It will be multiplied by 1024 in the index queue.
+var DEFAULT_BUFFER_SIZE = 5120;
+
+// The default number of docs to fetch in a single search request when reindexing.
+var DEFAULT_SEARCH_SIZE = 1000;
 
 function createMappingFactory(ref) {
   var sourceClient = ref.sourceClient;
@@ -19,6 +29,7 @@ function createMappingFactory(ref) {
   var mappingsOverride = ref.mappingsOverride;
   var indexMappingTotalFieldsLimit = ref.indexMappingTotalFieldsLimit;
   var verbose = ref.verbose;
+  var deleteIndex = ref.deleteIndex;
 
   return async function () {
     var targetMappings = mappingsOverride ? undefined : mappings;
@@ -28,7 +39,14 @@ function createMappingFactory(ref) {
       var mapping = await sourceClient.indices.getMapping({
         index: sourceIndexName,
       });
-
+      if (mapping[sourceIndexName]) {
+        targetMappings = mapping[sourceIndexName].mappings;
+      } else {
+        var allMappings = Object.values(mapping);
+        if (allMappings.length > 0) {
+          targetMappings = Object.values(mapping)[0].mappings;
+        }
+      }
     } catch (err) {
       console.log('Error reading source mapping', err);
       return;
@@ -43,18 +61,28 @@ function createMappingFactory(ref) {
     }
 
     try {
-      var
-
-
-
-
-
-
-
-
-
-
-
+      var indexExists = await targetClient.indices.exists({ index: targetIndexName });
+
+      if (indexExists === true && deleteIndex === true) {
+        await targetClient.indices.delete({ index: targetIndexName });
+      }
+
+      if (indexExists === false || deleteIndex === true) {
+        var resp = await targetClient.indices.create({
+          index: targetIndexName,
+          body: Object.assign({}, {mappings: targetMappings},
+            (indexMappingTotalFieldsLimit !== undefined
+              ? {
+                  settings: {
+                    'index.mapping.total_fields.limit': indexMappingTotalFieldsLimit,
+                    'index.number_of_shards': 1,
+                    'index.number_of_replicas': 0,
+                  },
+                }
+              : {})),
+        });
+        if (verbose) { console.log('Created target mapping', resp); }
+      }
     } catch (err) {
       console.log('Error creating target mapping', err);
     }
@@ -62,17 +90,14 @@ function createMappingFactory(ref) {
   };
 }
 
-var MAX_QUEUE_SIZE = 15;
-
 function fileReaderFactory(indexer, fileName, transform, splitRegex, verbose) {
   function startIndex(files) {
-    var ingestQueueSize = 0;
     var finished = false;
 
     var file = files.shift();
     var s = fs
       .createReadStream(file)
-      .pipe(
+      .pipe(split(splitRegex))
       .pipe(
         es
           .mapSync(function (line) {
@@ -120,20 +145,13 @@ function fileReaderFactory(indexer, fileName, transform, splitRegex, verbose) {
           })
       );
 
-    indexer.queueEmitter.on('
+    indexer.queueEmitter.on('pause', function () {
       if (finished) { return; }
-
-
-      if (ingestQueueSize < MAX_QUEUE_SIZE) {
-        s.resume();
-      } else {
-        s.pause();
-      }
+      s.pause();
    });
 
     indexer.queueEmitter.on('resume', function () {
       if (finished) { return; }
-      ingestQueueSize = 0;
       s.resume();
     });
   }
@@ -149,7 +167,7 @@ var EventEmitter = require('events');
 
 var queueEmitter = new EventEmitter();
 
-var parallelCalls =
+var parallelCalls = 5;
 
 // a simple helper queue to bulk index documents
 function indexQueueFactory(ref) {
@@ -159,78 +177,74 @@ function indexQueueFactory(ref) {
   var skipHeader = ref.skipHeader; if ( skipHeader === void 0 ) skipHeader = false;
   var verbose = ref.verbose; if ( verbose === void 0 ) verbose = true;
 
-  var
-  var
-  var ingesting = 0;
-  var ingestTimes = [];
-  var finished = false;
+  var flushBytes = bufferSize * 1024; // Convert KB to Bytes
+  var highWaterMark = flushBytes * parallelCalls;
 
-
-
-
-
-
+  // Create a Readable stream
+  var stream$$1 = new stream.Readable({
+    read: function read() {}, // Implement read but we manage pushing manually
+    highWaterMark: highWaterMark, // Buffer size for backpressure management
+  });
 
-
+  async function* ndjsonStreamIterator(readableStream) {
+    var buffer = ''; // To hold the incomplete data
+    var skippedHeader = false;
 
-
-
+    // Iterate over the stream using async iteration
+    for await (var chunk of readableStream) {
+      buffer += chunk.toString(); // Accumulate the chunk data in the buffer
 
-
-
-      queueEmitter.emit('resume');
-    }
+      // Split the buffer into lines (NDJSON items)
+      var lines = buffer.split('\n');
 
-
-
-    if (verbose)
-      { console.log(("bulk ingest docs: " + (docs.length / 2) + ", queue length: " + (queue.length))); }
-
-    var start = Date.now();
-    client
-      .bulk({ body: docs })
-      .then(function () {
-        var end = Date.now();
-        var delta = end - start;
-        ingestTimes.push(delta);
-        ingesting -= 1;
-
-        var ingestTimesMovingAverage =
-          ingestTimes.length > 0
-            ? ingestTimes.reduce(function (p, c) { return p + c; }, 0) / ingestTimes.length
-            : 0;
-        var ingestTimesMovingAverageSeconds = Math.floor(ingestTimesMovingAverage / 1000);
-
-        if (
-          ingestTimes.length > 0 &&
-          ingestTimesMovingAverageSeconds < 30 &&
-          parallelCalls < 10
-        ) {
-          parallelCalls += 1;
-        } else if (
-          ingestTimes.length > 0 &&
-          ingestTimesMovingAverageSeconds >= 30 &&
-          parallelCalls > 1
-        ) {
-          parallelCalls -= 1;
-        }
+      // The last line might be incomplete, so hold it back in the buffer
+      buffer = lines.pop();
 
-
-
-
-
-
-
-
-
-
-
-
-        ingest();
+      // Yield each complete JSON object
+      for (var line of lines) {
+        if (line.trim()) {
+          try {
+            if (!skipHeader || (skipHeader && !skippedHeader)) {
+              yield JSON.parse(line); // Parse and yield the JSON object
+              skippedHeader = true;
+            }
+          } catch (err) {
+            // Handle JSON parse errors if necessary
+            console.error('Failed to parse JSON:', err);
           }
-      }
+        }
+      }
     }
-
+
+    // Handle any remaining data in the buffer after the stream ends
+    if (buffer.trim()) {
+      try {
+        yield JSON.parse(buffer);
+      } catch (err) {
+        console.error('Failed to parse final JSON:', err);
+      }
+    }
+  }
+
+  var finished = false;
+
+  // Async IIFE to start bulk indexing
+  (async function () {
+    await client.helpers.bulk({
+      concurrency: parallelCalls,
+      flushBytes: flushBytes,
+      flushInterval: 1000,
+      refreshOnCompletion: true,
+      datasource: ndjsonStreamIterator(stream$$1),
+      onDocument: function onDocument(doc) {
+        return {
+          index: { _index: targetIndexName },
+        };
+      },
+    });
+
+    queueEmitter.emit('finish');
+  })();
 
   return {
     add: function (doc) {
@@ -238,37 +252,22 @@ function indexQueueFactory(ref) {
         throw new Error('Unexpected doc added after indexer should finish.');
       }
 
-
-
-
-
-
-
-      if (queue.length === 0) {
-        queueEmitter.emit('resume');
-      }
-
-      if (buffer.length >= bufferSize * 2) {
-        ingest(buffer);
-        buffer = [];
+      var canContinue = stream$$1.push(((JSON.stringify(doc)) + "\n"));
+      if (!canContinue) {
+        queueEmitter.emit('pause');
+        stream$$1.once('drain', function () {
+          queueEmitter.emit('resume');
+        });
       }
     },
     finish: function () {
       finished = true;
-
-      if (buffer.length > 0) {
-        ingest(buffer);
-        buffer = [];
-      } else if (queue.length === 0 && ingesting === 0) {
-        queueEmitter.emit('finish');
-      }
+      stream$$1.push(null);
     },
     queueEmitter: queueEmitter,
   };
 }
 
-var MAX_QUEUE_SIZE$1 = 15;
-
 // create a new progress bar instance and use shades_classic theme
 var progressBar = new cliProgress.SingleBar({}, cliProgress.Presets.shades_classic);
 
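The rewritten `indexQueueFactory` above delegates batching, concurrency and flushing to the official client's bulk helper, fed from an async generator. As a self-contained sketch of that helper pattern (connection details and index name are placeholders, not package code):

```js
const { Client } = require('@elastic/elasticsearch');

const client = new Client({ node: 'http://localhost:9200' });

// Any async iterable works as a datasource for the bulk helper.
async function* docs() {
  for (let i = 0; i < 1000; i += 1) {
    yield { id: i };
  }
}

async function run() {
  // The helper pulls from the datasource, flushes a bulk request whenever
  // flushBytes or flushInterval is reached, and keeps up to `concurrency`
  // requests in flight.
  const stats = await client.helpers.bulk({
    datasource: docs(),
    concurrency: 5,
    flushBytes: 5120 * 1024, // mirrors the package's bufferSize (KBytes) * 1024
    flushInterval: 1000,
    refreshOnCompletion: true,
    onDocument() {
      return { index: { _index: 'my-index' } };
    },
  });
  console.log(stats); // totals for indexed/failed docs, retries, bytes, time
}

run().catch(console.error);
```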
@@ -278,32 +277,33 @@ function indexReaderFactory(
   transform,
   client,
   query,
-
+  searchSize,
   populatedFields
 ) {
-  if (
+  if ( searchSize === void 0 ) searchSize = DEFAULT_SEARCH_SIZE;
   if ( populatedFields === void 0 ) populatedFields = false;
 
   return async function indexReader() {
-    var responseQueue = [];
     var docsNum = 0;
+    var scrollId;
+    var finished = false;
+    var readActive = false;
+    var backPressurePause = false;
 
     async function fetchPopulatedFields() {
       try {
-
-
-
-
-
-
-
-        },
+        // Get all populated fields from the index
+        var response = await client.fieldCaps(
+          {
+            index: sourceIndexName,
+            fields: '*',
+            include_empty_fields: false,
+            filters: '-metadata',
           },
-
+          { maxRetries: 0 }
+        );
 
-
-        // to a list of unique field names used across all docs.
-        return new Set(response.hits.hits.map(function (d) { return Object.keys(d._source); }).flat(1));
+        return Object.keys(response.fields);
       } catch (e) {
         console.log('error', e);
       }
@@ -312,7 +312,7 @@ function indexReaderFactory(
     function search(fields) {
       return client.search(Object.assign({}, {index: sourceIndexName,
         scroll: '600s',
-        size:
+        size: searchSize,
         query: query},
         (fields ? { _source: fields } : {})));
     }
@@ -329,21 +329,14 @@ function indexReaderFactory(
     // identify populated fields
     if (populatedFields) {
       fieldsWithData = await fetchPopulatedFields();
-      console.log('fieldsWithData', fieldsWithData);
     }
 
-
-    // our first response into the queue to be processed
-    var se = await search(fieldsWithData);
-    responseQueue.push(se);
-    progressBar.start(se.hits.total.value, 0);
-    console.log('se', se.hits.hits[0]);
+    await fetchNextResponse();
 
     function processHit(hit) {
      docsNum += 1;
       try {
         var doc = typeof transform === 'function' ? transform(hit._source) : hit._source; // eslint-disable-line no-underscore-dangle
-        // console.log('doc', doc);
 
         // if doc is undefined we'll skip indexing it
         if (typeof doc === 'undefined') {
@@ -363,68 +356,116 @@ function indexReaderFactory(
       }
     }
 
-
-
-    var readActive = false;
-
-    async function processResponseQueue() {
-      while (responseQueue.length) {
-        readActive = true;
-        var response = responseQueue.shift();
+    async function fetchNextResponse() {
+      readActive = true;
 
-
-      response.hits.hits.forEach(processHit);
+      var sc = scrollId ? await scroll(scrollId) : await search(fieldsWithData);
 
-
+      if (!scrollId) {
+        progressBar.start(sc.hits.total.value, 0);
+      }
 
-
-
-        indexer.finish();
-        break;
-      }
+      scrollId = sc._scroll_id;
+      readActive = false;
 
-
-        // get the next response if there are more docs to fetch
-        var sc = await scroll(response._scroll_id); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
-        scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
-        responseQueue.push(sc);
-      } else {
-        readActive = false;
-      }
-    }
+      processResponse(sc);
     }
 
-
-
+    async function processResponse(response) {
+      // collect the docs from this response
+      response.hits.hits.forEach(processHit);
+
+      progressBar.update(docsNum);
+
+      // check to see if we have collected all of the docs
+      if (response.hits.total.value === docsNum) {
+        indexer.finish();
+        return;
+      }
 
-      if (!
-
-      var sc = await scroll(scrollId); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
-      scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
-      responseQueue.push(sc);
-      processResponseQueue();
+      if (!backPressurePause) {
+        await fetchNextResponse();
       }
+    }
+
+    indexer.queueEmitter.on('pause', async function () {
+      backPressurePause = true;
     });
 
     indexer.queueEmitter.on('resume', async function () {
-
+      backPressurePause = false;
 
-      if (readActive) {
+      if (readActive || finished) {
         return;
       }
 
-
-      var sc = await scroll(scrollId); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
-      scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
-      responseQueue.push(sc);
-      processResponseQueue();
+      await fetchNextResponse();
     });
 
     indexer.queueEmitter.on('finish', function () {
+      finished = true;
       progressBar.stop();
     });
+  };
+}
+
+function streamReaderFactory(indexer, stream$$1, transform, splitRegex, verbose) {
+  function startIndex() {
+    var finished = false;
+
+    var s = stream$$1.pipe(split(splitRegex)).pipe(
+      es
+        .mapSync(function (line) {
+          try {
+            // skip empty lines
+            if (line === '') {
+              return;
+            }
+
+            var doc =
+              typeof transform === 'function' ? JSON.stringify(transform(JSON.parse(line))) : line;
 
-
+            // if doc is undefined we'll skip indexing it
+            if (typeof doc === 'undefined') {
+              s.resume();
+              return;
+            }
+
+            // the transform callback may return an array of docs so we can emit
+            // multiple docs from a single line
+            if (Array.isArray(doc)) {
+              doc.forEach(function (d) { return indexer.add(d); });
+              return;
+            }
+
+            indexer.add(doc);
+          } catch (e) {
+            console.log('error', e);
+          }
+        })
+        .on('error', function (err) {
+          console.log('Error while reading stream.', err);
+        })
+        .on('end', function () {
+          if (verbose) { console.log('Read entire stream.'); }
+          indexer.finish();
+          finished = true;
+        })
+    );
+
+    indexer.queueEmitter.on('pause', function () {
+      if (finished) { return; }
+      s.pause();
+    });
+
+    indexer.queueEmitter.on('resume', function () {
+      if (finished) { return; }
+      s.resume();
    });
+  }
+
+  return function () {
+    startIndex();
   };
 }
 
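The new `streamReaderFactory` above is wired into `transformer()` in the hunks that follow: a `stream` option joins `fileName` and `sourceIndexName` as a third, mutually exclusive document source. A hedged usage sketch (the call shape follows the README's `transformer({ ... })` snippet; the stream contents and index name are hypothetical):

```js
const { Readable } = require('stream');
const transformer = require('node-es-transformer');

// Any nodejs Readable emitting NDJSON lines works as a source.
const stream = Readable.from(['{"id":1}\n', '{"id":2}\n']);

transformer({
  stream,                  // mutually exclusive with fileName/sourceIndexName
  targetIndexName: 'docs', // hypothetical target index
});
```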
@@ -433,6 +474,8 @@ async function transformer(ref) {
   var sourceClientConfig = ref.sourceClientConfig;
   var targetClientConfig = ref.targetClientConfig;
   var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
+  var searchSize = ref.searchSize; if ( searchSize === void 0 ) searchSize = DEFAULT_SEARCH_SIZE;
+  var stream$$1 = ref.stream;
   var fileName = ref.fileName;
   var splitRegex = ref.splitRegex; if ( splitRegex === void 0 ) splitRegex = /\n/;
   var sourceIndexName = ref.sourceIndexName;
@@ -468,6 +511,7 @@ async function transformer(ref) {
     mappingsOverride: mappingsOverride,
     indexMappingTotalFieldsLimit: indexMappingTotalFieldsLimit,
     verbose: verbose,
+    deleteIndex: deleteIndex,
   });
   var indexer = indexQueueFactory({
     targetClient: targetClient,
@@ -482,8 +526,12 @@ async function transformer(ref) {
     throw Error('Only either one of fileName or sourceIndexName can be specified.');
   }
 
-  if (
-
+  if (
+    (typeof fileName !== 'undefined' && typeof sourceIndexName !== 'undefined') ||
+    (typeof fileName !== 'undefined' && typeof stream$$1 !== 'undefined') ||
+    (typeof sourceIndexName !== 'undefined' && typeof stream$$1 !== 'undefined')
+  ) {
+    throw Error('Only one of fileName, sourceIndexName, or stream can be specified.');
   }
 
   if (typeof fileName !== 'undefined') {
@@ -497,11 +545,15 @@ async function transformer(ref) {
       transform,
       sourceClient,
       query,
-
+      searchSize,
      populatedFields
     );
   }
 
+  if (typeof stream$$1 !== 'undefined') {
+    return streamReaderFactory(indexer, stream$$1, transform, splitRegex, verbose);
+  }
+
   return null;
 }
 
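The indexer's `add()` in the diff above leans on Node's standard backpressure signal: `push()` returning `false` once the internal buffer exceeds `highWaterMark`, after which readers are paused until the buffer drains. The canonical form of that contract, on a Writable, looks like this (illustration only, not package code):

```js
const { Writable } = require('stream');

// A deliberately slow sink so backpressure kicks in quickly.
const sink = new Writable({
  highWaterMark: 16,
  write(chunk, _encoding, callback) {
    setTimeout(callback, 10); // acknowledge each chunk after a delay
  },
});

function produce(i = 0) {
  while (i < 1000) {
    const canContinue = sink.write(`doc ${i}\n`);
    i += 1;
    if (!canContinue) {
      // Buffer full: stop producing until the sink emits 'drain'.
      sink.once('drain', () => produce(i));
      return;
    }
  }
  sink.end();
}

produce();
```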
package/dist/node-es-transformer.esm.js
CHANGED

@@ -1,10 +1,20 @@
 import fs from 'fs';
 import es from 'event-stream';
 import glob from 'glob';
+import split from 'split2';
+import { Readable } from 'stream';
 import cliProgress from 'cli-progress';
 import elasticsearch from '@elastic/elasticsearch';
 
-
+// In earlier versions this was used to set the number of docs to index in a
+// single bulk request. Since we switched to use the helpers.bulk() method from
+// the ES client, this now translates to the `flushBytes` option of the helper.
+// However, for kind of a backwards compability with the old values, this uses
+// KBytes instead of Bytes. It will be multiplied by 1024 in the index queue.
+var DEFAULT_BUFFER_SIZE = 5120;
+
+// The default number of docs to fetch in a single search request when reindexing.
+var DEFAULT_SEARCH_SIZE = 1000;
 
 function createMappingFactory(ref) {
   var sourceClient = ref.sourceClient;
@@ -15,6 +25,7 @@ function createMappingFactory(ref) {
   var mappingsOverride = ref.mappingsOverride;
   var indexMappingTotalFieldsLimit = ref.indexMappingTotalFieldsLimit;
   var verbose = ref.verbose;
+  var deleteIndex = ref.deleteIndex;
 
   return async function () {
     var targetMappings = mappingsOverride ? undefined : mappings;
@@ -24,7 +35,14 @@ function createMappingFactory(ref) {
       var mapping = await sourceClient.indices.getMapping({
         index: sourceIndexName,
       });
-
+      if (mapping[sourceIndexName]) {
+        targetMappings = mapping[sourceIndexName].mappings;
+      } else {
+        var allMappings = Object.values(mapping);
+        if (allMappings.length > 0) {
+          targetMappings = Object.values(mapping)[0].mappings;
+        }
+      }
     } catch (err) {
       console.log('Error reading source mapping', err);
       return;
@@ -39,18 +57,28 @@ function createMappingFactory(ref) {
     }
 
     try {
-      var
-
-
-
-
-
-
-
-
-
-
-
+      var indexExists = await targetClient.indices.exists({ index: targetIndexName });
+
+      if (indexExists === true && deleteIndex === true) {
+        await targetClient.indices.delete({ index: targetIndexName });
+      }
+
+      if (indexExists === false || deleteIndex === true) {
+        var resp = await targetClient.indices.create({
+          index: targetIndexName,
+          body: Object.assign({}, {mappings: targetMappings},
+            (indexMappingTotalFieldsLimit !== undefined
+              ? {
+                  settings: {
+                    'index.mapping.total_fields.limit': indexMappingTotalFieldsLimit,
+                    'index.number_of_shards': 1,
+                    'index.number_of_replicas': 0,
+                  },
+                }
+              : {})),
+        });
+        if (verbose) { console.log('Created target mapping', resp); }
+      }
     } catch (err) {
       console.log('Error creating target mapping', err);
     }
@@ -58,17 +86,14 @@ function createMappingFactory(ref) {
   };
 }
 
-var MAX_QUEUE_SIZE = 15;
-
 function fileReaderFactory(indexer, fileName, transform, splitRegex, verbose) {
   function startIndex(files) {
-    var ingestQueueSize = 0;
     var finished = false;
 
     var file = files.shift();
     var s = fs
       .createReadStream(file)
-      .pipe(
+      .pipe(split(splitRegex))
       .pipe(
         es
           .mapSync(function (line) {
@@ -116,20 +141,13 @@ function fileReaderFactory(indexer, fileName, transform, splitRegex, verbose) {
          })
       );
 
-    indexer.queueEmitter.on('
+    indexer.queueEmitter.on('pause', function () {
       if (finished) { return; }
-
-
-      if (ingestQueueSize < MAX_QUEUE_SIZE) {
-        s.resume();
-      } else {
-        s.pause();
-      }
+      s.pause();
     });
 
     indexer.queueEmitter.on('resume', function () {
       if (finished) { return; }
-      ingestQueueSize = 0;
       s.resume();
     });
   }
@@ -145,7 +163,7 @@ var EventEmitter = require('events');
 
 var queueEmitter = new EventEmitter();
 
-var parallelCalls =
+var parallelCalls = 5;
 
 // a simple helper queue to bulk index documents
 function indexQueueFactory(ref) {
@@ -155,78 +173,74 @@ function indexQueueFactory(ref) {
   var skipHeader = ref.skipHeader; if ( skipHeader === void 0 ) skipHeader = false;
   var verbose = ref.verbose; if ( verbose === void 0 ) verbose = true;
 
-  var
-  var
-  var ingesting = 0;
-  var ingestTimes = [];
-  var finished = false;
+  var flushBytes = bufferSize * 1024; // Convert KB to Bytes
+  var highWaterMark = flushBytes * parallelCalls;
 
-
-
-
-
-
+  // Create a Readable stream
+  var stream = new Readable({
+    read: function read() {}, // Implement read but we manage pushing manually
+    highWaterMark: highWaterMark, // Buffer size for backpressure management
+  });
 
-
+  async function* ndjsonStreamIterator(readableStream) {
+    var buffer = ''; // To hold the incomplete data
+    var skippedHeader = false;
 
-
-
+    // Iterate over the stream using async iteration
+    for await (var chunk of readableStream) {
+      buffer += chunk.toString(); // Accumulate the chunk data in the buffer
 
-
-
-      queueEmitter.emit('resume');
-    }
+      // Split the buffer into lines (NDJSON items)
+      var lines = buffer.split('\n');
 
-
-
-    if (verbose)
-      { console.log(("bulk ingest docs: " + (docs.length / 2) + ", queue length: " + (queue.length))); }
-
-    var start = Date.now();
-    client
-      .bulk({ body: docs })
-      .then(function () {
-        var end = Date.now();
-        var delta = end - start;
-        ingestTimes.push(delta);
-        ingesting -= 1;
-
-        var ingestTimesMovingAverage =
-          ingestTimes.length > 0
-            ? ingestTimes.reduce(function (p, c) { return p + c; }, 0) / ingestTimes.length
-            : 0;
-        var ingestTimesMovingAverageSeconds = Math.floor(ingestTimesMovingAverage / 1000);
-
-        if (
-          ingestTimes.length > 0 &&
-          ingestTimesMovingAverageSeconds < 30 &&
-          parallelCalls < 10
-        ) {
-          parallelCalls += 1;
-        } else if (
-          ingestTimes.length > 0 &&
-          ingestTimesMovingAverageSeconds >= 30 &&
-          parallelCalls > 1
-        ) {
-          parallelCalls -= 1;
-        }
+      // The last line might be incomplete, so hold it back in the buffer
+      buffer = lines.pop();
 
-
-
-
-
-
-
-
-
-
-
-
-        ingest();
+      // Yield each complete JSON object
+      for (var line of lines) {
+        if (line.trim()) {
+          try {
+            if (!skipHeader || (skipHeader && !skippedHeader)) {
+              yield JSON.parse(line); // Parse and yield the JSON object
+              skippedHeader = true;
+            }
+          } catch (err) {
+            // Handle JSON parse errors if necessary
+            console.error('Failed to parse JSON:', err);
           }
-      }
+        }
+      }
     }
-
+
+    // Handle any remaining data in the buffer after the stream ends
+    if (buffer.trim()) {
+      try {
+        yield JSON.parse(buffer);
+      } catch (err) {
+        console.error('Failed to parse final JSON:', err);
+      }
+    }
+  }
+
+  var finished = false;
+
+  // Async IIFE to start bulk indexing
+  (async function () {
+    await client.helpers.bulk({
+      concurrency: parallelCalls,
+      flushBytes: flushBytes,
+      flushInterval: 1000,
+      refreshOnCompletion: true,
+      datasource: ndjsonStreamIterator(stream),
+      onDocument: function onDocument(doc) {
+        return {
+          index: { _index: targetIndexName },
+        };
+      },
+    });
+
+    queueEmitter.emit('finish');
+  })();
 
   return {
     add: function (doc) {
@@ -234,37 +248,22 @@ function indexQueueFactory(ref) {
        throw new Error('Unexpected doc added after indexer should finish.');
       }
 
-
-
-
-
-
-
-      if (queue.length === 0) {
-        queueEmitter.emit('resume');
-      }
-
-      if (buffer.length >= bufferSize * 2) {
-        ingest(buffer);
-        buffer = [];
+      var canContinue = stream.push(((JSON.stringify(doc)) + "\n"));
+      if (!canContinue) {
+        queueEmitter.emit('pause');
+        stream.once('drain', function () {
+          queueEmitter.emit('resume');
+        });
       }
     },
     finish: function () {
       finished = true;
-
-      if (buffer.length > 0) {
-        ingest(buffer);
-        buffer = [];
-      } else if (queue.length === 0 && ingesting === 0) {
-        queueEmitter.emit('finish');
-      }
+      stream.push(null);
     },
     queueEmitter: queueEmitter,
   };
 }
 
-var MAX_QUEUE_SIZE$1 = 15;
-
 // create a new progress bar instance and use shades_classic theme
 var progressBar = new cliProgress.SingleBar({}, cliProgress.Presets.shades_classic);
 
@@ -274,32 +273,33 @@ function indexReaderFactory(
   transform,
   client,
   query,
-
+  searchSize,
   populatedFields
 ) {
-  if (
+  if ( searchSize === void 0 ) searchSize = DEFAULT_SEARCH_SIZE;
   if ( populatedFields === void 0 ) populatedFields = false;
 
   return async function indexReader() {
-    var responseQueue = [];
     var docsNum = 0;
+    var scrollId;
+    var finished = false;
+    var readActive = false;
+    var backPressurePause = false;
 
     async function fetchPopulatedFields() {
       try {
-
-
-
-
-
-
-
-        },
+        // Get all populated fields from the index
+        var response = await client.fieldCaps(
+          {
+            index: sourceIndexName,
+            fields: '*',
+            include_empty_fields: false,
+            filters: '-metadata',
          },
-
+          { maxRetries: 0 }
+        );
 
-
-        // to a list of unique field names used across all docs.
-        return new Set(response.hits.hits.map(function (d) { return Object.keys(d._source); }).flat(1));
+        return Object.keys(response.fields);
       } catch (e) {
         console.log('error', e);
       }
@@ -308,7 +308,7 @@ function indexReaderFactory(
     function search(fields) {
      return client.search(Object.assign({}, {index: sourceIndexName,
        scroll: '600s',
-        size:
+        size: searchSize,
        query: query},
        (fields ? { _source: fields } : {})));
     }
@@ -325,21 +325,14 @@ function indexReaderFactory(
     // identify populated fields
     if (populatedFields) {
       fieldsWithData = await fetchPopulatedFields();
-      console.log('fieldsWithData', fieldsWithData);
     }
 
-
-    // our first response into the queue to be processed
-    var se = await search(fieldsWithData);
-    responseQueue.push(se);
-    progressBar.start(se.hits.total.value, 0);
-    console.log('se', se.hits.hits[0]);
+    await fetchNextResponse();
 
     function processHit(hit) {
       docsNum += 1;
       try {
         var doc = typeof transform === 'function' ? transform(hit._source) : hit._source; // eslint-disable-line no-underscore-dangle
-        // console.log('doc', doc);
 
         // if doc is undefined we'll skip indexing it
         if (typeof doc === 'undefined') {
@@ -359,68 +352,116 @@ function indexReaderFactory(
       }
     }
 
-
-
-    var readActive = false;
-
-    async function processResponseQueue() {
-      while (responseQueue.length) {
-        readActive = true;
-        var response = responseQueue.shift();
+    async function fetchNextResponse() {
+      readActive = true;
 
-
-      response.hits.hits.forEach(processHit);
+      var sc = scrollId ? await scroll(scrollId) : await search(fieldsWithData);
 
-
+      if (!scrollId) {
+        progressBar.start(sc.hits.total.value, 0);
+      }
 
-
-
-        indexer.finish();
-        break;
-      }
+      scrollId = sc._scroll_id;
+      readActive = false;
 
-
-        // get the next response if there are more docs to fetch
-        var sc = await scroll(response._scroll_id); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
-        scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
-        responseQueue.push(sc);
-      } else {
-        readActive = false;
-      }
-    }
+      processResponse(sc);
     }
 
-
-
+    async function processResponse(response) {
+      // collect the docs from this response
+      response.hits.hits.forEach(processHit);
+
+      progressBar.update(docsNum);
+
+      // check to see if we have collected all of the docs
+      if (response.hits.total.value === docsNum) {
+        indexer.finish();
+        return;
+      }
 
-      if (!
-
-      var sc = await scroll(scrollId); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
-      scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
-      responseQueue.push(sc);
-      processResponseQueue();
+      if (!backPressurePause) {
+        await fetchNextResponse();
      }
+    }
+
+    indexer.queueEmitter.on('pause', async function () {
+      backPressurePause = true;
     });
 
     indexer.queueEmitter.on('resume', async function () {
-
+      backPressurePause = false;
 
-      if (readActive) {
+      if (readActive || finished) {
        return;
      }
 
-
-      var sc = await scroll(scrollId); // eslint-disable-line no-await-in-loop,no-underscore-dangle,max-len
-      scrollId = sc._scroll_id; // eslint-disable-line no-underscore-dangle
-      responseQueue.push(sc);
-      processResponseQueue();
+      await fetchNextResponse();
     });
 
     indexer.queueEmitter.on('finish', function () {
+      finished = true;
      progressBar.stop();
    });
+  };
+}
+
+function streamReaderFactory(indexer, stream, transform, splitRegex, verbose) {
+  function startIndex() {
+    var finished = false;
+
+    var s = stream.pipe(split(splitRegex)).pipe(
+      es
+        .mapSync(function (line) {
+          try {
+            // skip empty lines
+            if (line === '') {
+              return;
+            }
+
+            var doc =
+              typeof transform === 'function' ? JSON.stringify(transform(JSON.parse(line))) : line;
 
-
+            // if doc is undefined we'll skip indexing it
+            if (typeof doc === 'undefined') {
+              s.resume();
+              return;
+            }
+
+            // the transform callback may return an array of docs so we can emit
+            // multiple docs from a single line
+            if (Array.isArray(doc)) {
+              doc.forEach(function (d) { return indexer.add(d); });
+              return;
+            }
+
+            indexer.add(doc);
+          } catch (e) {
+            console.log('error', e);
+          }
+        })
+        .on('error', function (err) {
+          console.log('Error while reading stream.', err);
+        })
+        .on('end', function () {
+          if (verbose) { console.log('Read entire stream.'); }
+          indexer.finish();
+          finished = true;
+        })
+    );
+
+    indexer.queueEmitter.on('pause', function () {
+      if (finished) { return; }
+      s.pause();
+    });
+
+    indexer.queueEmitter.on('resume', function () {
+      if (finished) { return; }
+      s.resume();
    });
+  }
+
+  return function () {
+    startIndex();
  };
 }
 
@@ -429,6 +470,8 @@ async function transformer(ref) {
   var sourceClientConfig = ref.sourceClientConfig;
   var targetClientConfig = ref.targetClientConfig;
   var bufferSize = ref.bufferSize; if ( bufferSize === void 0 ) bufferSize = DEFAULT_BUFFER_SIZE;
+  var searchSize = ref.searchSize; if ( searchSize === void 0 ) searchSize = DEFAULT_SEARCH_SIZE;
+  var stream = ref.stream;
   var fileName = ref.fileName;
   var splitRegex = ref.splitRegex; if ( splitRegex === void 0 ) splitRegex = /\n/;
   var sourceIndexName = ref.sourceIndexName;
@@ -464,6 +507,7 @@ async function transformer(ref) {
     mappingsOverride: mappingsOverride,
     indexMappingTotalFieldsLimit: indexMappingTotalFieldsLimit,
     verbose: verbose,
+    deleteIndex: deleteIndex,
   });
   var indexer = indexQueueFactory({
     targetClient: targetClient,
@@ -478,8 +522,12 @@ async function transformer(ref) {
     throw Error('Only either one of fileName or sourceIndexName can be specified.');
   }
 
-  if (
-
+  if (
+    (typeof fileName !== 'undefined' && typeof sourceIndexName !== 'undefined') ||
+    (typeof fileName !== 'undefined' && typeof stream !== 'undefined') ||
+    (typeof sourceIndexName !== 'undefined' && typeof stream !== 'undefined')
+  ) {
+    throw Error('Only one of fileName, sourceIndexName, or stream can be specified.');
   }
 
   if (typeof fileName !== 'undefined') {
@@ -493,11 +541,15 @@ async function transformer(ref) {
       transform,
       sourceClient,
      query,
-
+      searchSize,
      populatedFields
    );
   }
 
+  if (typeof stream !== 'undefined') {
+    return streamReaderFactory(indexer, stream, transform, splitRegex, verbose);
+  }
+
   return null;
 }
 
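One behavioral change shared by both bundles: `fetchPopulatedFields` now asks the field caps API which fields actually contain data, instead of sampling `_source` keys from search hits. A standalone sketch of that call (the index name is a placeholder):

```js
const { Client } = require('@elastic/elasticsearch');

const client = new Client({ node: 'http://localhost:9200' });

async function populatedFields(index) {
  // include_empty_fields: false restricts the response to fields with data;
  // the '-metadata' filter drops metadata fields from the result.
  const response = await client.fieldCaps(
    { index, fields: '*', include_empty_fields: false, filters: '-metadata' },
    { maxRetries: 0 }
  );
  return Object.keys(response.fields);
}

populatedFields('my-index').then(console.log).catch(console.error);
```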
package/package.json
CHANGED
@@ -14,20 +14,21 @@
   "license": "Apache-2.0",
   "author": "Walter Rafelsberger <walter@rafelsberger.at>",
   "contributors": [],
-  "version": "1.0.0-
+  "version": "1.0.0-beta4",
   "main": "dist/node-es-transformer.cjs.js",
   "module": "dist/node-es-transformer.esm.js",
   "dependencies": {
-    "@elastic/elasticsearch": "^8.
+    "@elastic/elasticsearch": "^8.17.0",
     "cli-progress": "^3.12.0",
     "event-stream": "3.3.4",
-    "
+    "git-cz": "^4.9.0",
+    "glob": "7.1.2",
+    "split2": "^4.2.0"
   },
   "devDependencies": {
     "acorn": "^6.4.2",
     "async-retry": "^1.3.3",
     "commit-and-tag-version": "^11.3.0",
-    "cz-conventional-changelog": "^3.3.0",
     "eslint": "^8.51.0",
     "eslint-config-airbnb": "19.0.4",
     "eslint-config-prettier": "^9.0.0",
@@ -57,7 +58,7 @@
   ],
   "config": {
     "commitizen": {
-      "path": "
+      "path": "git-cz"
     }
   },
   "jest": {