logstash-input-elasticsearch 5.0.2 → 5.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +6 -0
- data/docs/index.asciidoc +300 -5
- data/lib/logstash/inputs/elasticsearch/aggregation.rb +11 -8
- data/lib/logstash/inputs/elasticsearch/cursor_tracker.rb +58 -0
- data/lib/logstash/inputs/elasticsearch/esql.rb +153 -0
- data/lib/logstash/inputs/elasticsearch/paginated_search.rb +12 -2
- data/lib/logstash/inputs/elasticsearch.rb +130 -28
- data/logstash-input-elasticsearch.gemspec +1 -1
- data/spec/inputs/cursor_tracker_spec.rb +72 -0
- data/spec/inputs/elasticsearch_esql_spec.rb +180 -0
- data/spec/inputs/elasticsearch_spec.rb +130 -1
- data/spec/inputs/integration/elasticsearch_esql_spec.rb +150 -0
- data/spec/inputs/integration/elasticsearch_spec.rb +9 -1
- metadata +10 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 718adf02c14b980691bd1572ac7e46b14f0f7850d82cd267fa52ddbde8289892
+  data.tar.gz: e10f582747a7ae11d707c4268ffce6485ed35453afbe140d999857485e022cda
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e88b12e47cfad23b4a1128ab05c1510c1f89bd76d20511064ada02999b6fa694d118a8e37ec2fede700d2008beaef045255b3e345d21eadcfa5b492a250d01dd
+  data.tar.gz: f533570ba4268088ddbe73572ec1e49924fe7febf3f6dad1c6e957b6eff5d414410970749521b0543b6bd03264776d4ac6ea2082589704cf1e7ea5faf36c1f07
data/CHANGELOG.md
CHANGED
@@ -1,3 +1,9 @@
+## 5.2.0
+- ES|QL support [#233](https://github.com/logstash-plugins/logstash-input-elasticsearch/pull/233)
+
+## 5.1.0
+- Add "cursor"-like index tracking [#205](https://github.com/logstash-plugins/logstash-input-elasticsearch/pull/205)
+
 ## 5.0.2
 - Add elastic-transport client support used in elasticsearch-ruby 8.x [#223](https://github.com/logstash-plugins/logstash-input-elasticsearch/pull/223)
 
data/docs/index.asciidoc
CHANGED
@@ -48,7 +48,7 @@ This would create an Elasticsearch query with the following format:
     "sort": [ "_doc" ]
 }'
 
-
+[id="plugins-{type}s-{plugin}-scheduling"]
 ==== Scheduling
 
 Input from this plugin can be scheduled to run periodically according to a specific
@@ -103,6 +103,237 @@ Common causes are:
 - When the hit result contains top-level fields that are {logstash-ref}/processing.html#reserved-fields[reserved in Logstash] but do not have the expected shape. Use the <<plugins-{type}s-{plugin}-target>> directive to avoid conflicts with the top-level namespace.
 - When <<plugins-{type}s-{plugin}-docinfo>> is enabled and the docinfo fields cannot be merged into the hit result. Combine <<plugins-{type}s-{plugin}-target>> and <<plugins-{type}s-{plugin}-docinfo_target>> to avoid conflict.
 
+[id="plugins-{type}s-{plugin}-cursor"]
+==== Tracking a field's value across runs
+
+.Technical Preview: Tracking a field's value
+****
+The feature that allows tracking a field's value across runs is in _Technical Preview_.
+Configuration options and implementation details are subject to change in minor releases without being preceded by deprecation warnings.
+****
+
+Some use cases require tracking the value of a particular field between two jobs.
+Examples include:
+
+* avoiding the need to re-process the entire result set of a long query after an unplanned restart
+* grabbing only new data from an index instead of processing the entire set on each job.
+
+The Elasticsearch input plugin provides the <<plugins-{type}s-{plugin}-tracking_field>> and <<plugins-{type}s-{plugin}-tracking_field_seed>> options.
+When <<plugins-{type}s-{plugin}-tracking_field>> is set, the plugin records the value of that field for the last document retrieved in a run into
+a file.
+(The file location defaults to <<plugins-{type}s-{plugin}-last_run_metadata_path>>.)
+
+You can then inject this value in the query using the placeholder `:last_value`.
+The value will be injected into the query before execution, and then updated after the query completes if new data was found.
+
+This feature works best when:
+
+* the query sorts by the tracking field,
+* the timestamp field is added by {es}, and
+* the field type has enough resolution so that two events are unlikely to have the same value.
+
+Consider using a tracking field whose type is https://www.elastic.co/guide/en/elasticsearch/reference/current/date_nanos.html[date nanoseconds].
+If the tracking field is of this data type, you can use an extra placeholder called `:present` to inject the nano-second based value of "now-30s".
+This placeholder is useful as the right-hand side of a range filter, allowing the collection of
+new data but leaving partially-searchable bulk request data to the next scheduled job.
+
+[id="plugins-{type}s-{plugin}-tracking-sample"]
+===== Sample configuration: Track field value across runs
+
+This section contains a series of steps to help you set up the "tailing" of data being written to a set of indices, using a date nanosecond field added by an Elasticsearch ingest pipeline and the `tracking_field` capability of this plugin.
+
+. Create an ingest pipeline that adds Elasticsearch's `_ingest.timestamp` field to the documents as `event.ingested`:
++
+[source, json]
+PUT _ingest/pipeline/my-pipeline
+{
+  "processors": [
+    {
+      "script": {
+        "lang": "painless",
+        "source": "ctx.putIfAbsent(\"event\", [:]); ctx.event.ingested = metadata().now.format(DateTimeFormatter.ISO_INSTANT);"
+      }
+    }
+  ]
+}
+
+[start=2]
+. Create an index mapping where the tracking field is of date nanosecond type and invokes the defined pipeline:
++
+[source, json]
+PUT /_template/my_template
+{
+  "index_patterns": ["test-*"],
+  "settings": {
+    "index.default_pipeline": "my-pipeline"
+  },
+  "mappings": {
+    "properties": {
+      "event": {
+        "properties": {
+          "ingested": {
+            "type": "date_nanos",
+            "format": "strict_date_optional_time_nanos"
+          }
+        }
+      }
+    }
+  }
+}
+
+[start=3]
+. Define a query that looks at all data of the indices, sorted by the tracking field, and with a range filter since the last value seen until present:
++
+[source,json]
+{
+  "query": {
+    "range": {
+      "event.ingested": {
+        "gt": ":last_value",
+        "lt": ":present"
+      }
+    }
+  },
+  "sort": [
+    {
+      "event.ingested": {
+        "order": "asc",
+        "format": "strict_date_optional_time_nanos",
+        "numeric_type": "date_nanos"
+      }
+    }
+  ]
+}
+
+[start=4]
+. Configure the Elasticsearch input to query the indices with the query defined above, every minute, and track the `event.ingested` field:
++
+[source, ruby]
+input {
+  elasticsearch {
+    id => tail_test_index
+    hosts => [ 'https://..']
+    api_key => '....'
+    index => 'test-*'
+    query => '{ "query": { "range": { "event.ingested": { "gt": ":last_value", "lt": ":present"}}}, "sort": [ { "event.ingested": {"order": "asc", "format": "strict_date_optional_time_nanos", "numeric_type" : "date_nanos" } } ] }'
+    tracking_field => "[event][ingested]"
+    slices => 5 # optional use of slices to speed data processing, should be equal to or less than number of primary shards
+    schedule => '* * * * *' # every minute
+    schedule_overlap => false # don't accumulate jobs if one takes longer than 1 minute
+  }
+}
+
+With this sample setup, new documents are indexed into a `test-*` index.
+The next scheduled run:
+
+* selects all new documents since the last observed value of the tracking field,
+* uses {ref}/point-in-time-api.html#point-in-time-api[Point in time (PIT)] + {ref}/paginate-search-results.html#search-after[Search after] to paginate through all the data, and
+* updates the value of the field at the end of the pagination.
+
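To make the placeholder mechanics concrete, here is a minimal Ruby sketch of the substitution the plugin performs on the query string before each run (it mirrors the `gsub`-based `inject_cursor` in `cursor_tracker.rb`, shown later in this diff); the file path and timestamp values are illustrative only.

[source, ruby]
# Illustrative sketch only; the path and values are made up.
query_template = '{ "query": { "range": { "event.ingested": { "gt": ":last_value", "lt": ":present" }}}}'

# ":last_value" comes from the last_run_metadata_path file, or from tracking_field_seed on the first run
last_value = File.read("/usr/share/logstash/data/plugins/inputs/elasticsearch/main/last_run_value") rescue "1970-01-01T00:00:00.000000000Z"

# ":present" is "now - 30s", rendered with nanosecond precision
present = (Time.now.utc - 30).strftime("%Y-%m-%dT%H:%M:%S.%9NZ")

query = query_template.gsub(":last_value", last_value).gsub(":present", present)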
+[id="plugins-{type}s-{plugin}-esql"]
+==== {esql} support
+
+.Technical Preview
+****
+The {esql} feature that allows using ES|QL queries with this plugin is in Technical Preview.
+Configuration options and implementation details are subject to change in minor releases without being preceded by deprecation warnings.
+****
+
+{es} Query Language ({esql}) provides a SQL-like interface for querying your {es} data.
+
+To use {esql}, this plugin needs to be installed in {ls} 8.17.4 or newer, and must be connected to {es} 8.11 or newer.
+
+To configure an {esql} query in the plugin, set `query_type` to `esql` and provide your {esql} query in the `query` parameter.
+
+IMPORTANT: {esql} is evolving and may still have limitations with regard to result size or supported field types. We recommend understanding https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-limitations.html[ES|QL current limitations] before using it in production environments.
+
+The following is a basic scheduled {esql} query that runs hourly:
+[source, ruby]
+input {
+  elasticsearch {
+    id => hourly_cron_job
+    hosts => [ 'https://..']
+    api_key => '....'
+    query_type => 'esql'
+    query => '
+      FROM food-index
+        | WHERE spicy_level = "hot" AND @timestamp > NOW() - 1 hour
+        | LIMIT 500
+    '
+    schedule => '0 * * * *' # every hour at min 0
+  }
+}
+
+Set `config.support_escapes: true` in `logstash.yml` if you need to escape special chars in the query.
+
+NOTE: With an {esql} query, {ls} doesn't generate `event.original`.
+
+[id="plugins-{type}s-{plugin}-esql-event-mapping"]
+===== Mapping {esql} result to {ls} event
+{esql} returns query results in a structured tabular format, where data is organized into _columns_ (fields) and _values_ (entries).
+The plugin maps each value entry to an event, populating the corresponding fields.
+For example, a query might produce a table like:
+
+[cols="2,1,1,1,2",options="header"]
+|===
+|`timestamp` |`user_id` | `action` | `status.code` | `status.desc`
+
+|2025-04-10T12:00:00 |123 |login |200 | Success
+|2025-04-10T12:05:00 |456 |purchase |403 | Forbidden (unauthorized user)
+|===
+
+For this case, the plugin emits two events that look like:
+[source, json]
+[
+  {
+    "timestamp": "2025-04-10T12:00:00",
+    "user_id": 123,
+    "action": "login",
+    "status": {
+      "code": 200,
+      "desc": "Success"
+    }
+  },
+  {
+    "timestamp": "2025-04-10T12:05:00",
+    "user_id": 456,
+    "action": "purchase",
+    "status": {
+      "code": 403,
+      "desc": "Forbidden (unauthorized user)"
+    }
+  }
+]
+
+NOTE: If your index has a mapping with sub-objects where `status.code` and `status.desc` are actually dotted fields, they appear in {ls} events as a nested structure.
+
+[id="plugins-{type}s-{plugin}-esql-multifields"]
+===== Conflict on multi-fields
+
+An {esql} query fetches all parent and sub-fields if your {es} index has https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/multi-fields[multi-fields] or https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/subobjects[subobjects].
+Since {ls} events cannot contain a parent field's concrete value and sub-field values together, the plugin ignores the sub-fields with a warning and includes the parent.
+We recommend using the `RENAME` (or `DROP` to avoid warnings) keyword in your {esql} query to explicitly rename the fields and include sub-fields in the event.
+
+This is a common occurrence if your template or mapping follows the pattern of always indexing strings as "text" (`field`) + "keyword" (`field.keyword`) multi-fields.
+In this case it's recommended to do `KEEP field` if the string is identical and there is only one sub-field, as the engine will optimize and retrieve the keyword; otherwise you can do `KEEP field.keyword | RENAME field.keyword as field`.
+
+To illustrate the situation with an example, assume your mapping has a `time` field with `time.min` and `time.max` sub-fields as follows:
+[source, ruby]
+"properties": {
+  "time": { "type": "long" },
+  "time.min": { "type": "long" },
+  "time.max": { "type": "long" }
+}
+
+The {esql} result will contain all three fields but the plugin cannot map them into a {ls} event.
+To avoid this, you can use the `RENAME` keyword to rename the `time` parent field so that all three fields have unique names.
+[source, ruby]
+...
+query => 'FROM my-index | RENAME time AS time.current'
+...
+
+For comprehensive {esql} syntax reference and best practices, see the https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-syntax.html[{esql} documentation].
+
 [id="plugins-{type}s-{plugin}-options"]
 ==== Elasticsearch Input configuration options
 
@@ -126,12 +357,15 @@ Please check out <<plugins-{type}s-{plugin}-obsolete-options>> for details.
 | <<plugins-{type}s-{plugin}-ecs_compatibility>> |<<string,string>>|No
 | <<plugins-{type}s-{plugin}-hosts>> |<<array,array>>|No
 | <<plugins-{type}s-{plugin}-index>> |<<string,string>>|No
+| <<plugins-{type}s-{plugin}-last_run_metadata_path>> |<<string,string>>|No
 | <<plugins-{type}s-{plugin}-password>> |<<password,password>>|No
 | <<plugins-{type}s-{plugin}-proxy>> |<<uri,uri>>|No
 | <<plugins-{type}s-{plugin}-query>> |<<string,string>>|No
+| <<plugins-{type}s-{plugin}-query_type>> |<<string,string>>, one of `["dsl","esql"]`|No
 | <<plugins-{type}s-{plugin}-response_type>> |<<string,string>>, one of `["hits","aggregations"]`|No
 | <<plugins-{type}s-{plugin}-request_timeout_seconds>> | <<number,number>>|No
 | <<plugins-{type}s-{plugin}-schedule>> |<<string,string>>|No
+| <<plugins-{type}s-{plugin}-schedule_overlap>> |<<boolean,boolean>>|No
 | <<plugins-{type}s-{plugin}-scroll>> |<<string,string>>|No
 | <<plugins-{type}s-{plugin}-search_api>> |<<string,string>>, one of `["auto", "search_after", "scroll"]`|No
 | <<plugins-{type}s-{plugin}-size>> |<<number,number>>|No
@@ -151,6 +385,8 @@ Please check out <<plugins-{type}s-{plugin}-obsolete-options>> for details.
 | <<plugins-{type}s-{plugin}-ssl_verification_mode>> |<<string,string>>, one of `["full", "none"]`|No
 | <<plugins-{type}s-{plugin}-socket_timeout_seconds>> | <<number,number>>|No
 | <<plugins-{type}s-{plugin}-target>> | {logstash-ref}/field-references-deepdive.html[field reference] | No
+| <<plugins-{type}s-{plugin}-tracking_field>> |<<string,string>>|No
+| <<plugins-{type}s-{plugin}-tracking_field_seed>> |<<string,string>>|No
 | <<plugins-{type}s-{plugin}-retries>> | <<number,number>>|No
 | <<plugins-{type}s-{plugin}-user>> |<<string,string>>|No
 |=======================================================================
@@ -330,6 +566,17 @@ Check out {ref}/api-conventions.html#api-multi-index[Multi Indices
 documentation] in the Elasticsearch documentation for info on
 referencing multiple indices.
 
+[id="plugins-{type}s-{plugin}-last_run_metadata_path"]
+===== `last_run_metadata_path`
+
+* Value type is <<string,string>>
+* There is no default value for this setting.
+
+The path to store the last observed value of the tracking field, when used.
+By default this file is stored as `<path.data>/plugins/inputs/elasticsearch/<pipeline_id>/last_run_value`.
+
+This setting should point to a file, not a directory, and Logstash must have read+write access to this file.
+
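For reference, the file at this path holds only the raw last-seen value; the plugin reads it at startup and rewrites it at each checkpoint (see `cursor_tracker.rb` later in this diff). A small illustrative Ruby sketch, with a made-up pipeline id (`main`) and value:

[source, ruby]
# Hypothetical example; the path segment "main" is the pipeline id and the stored value is made up.
path = "/usr/share/logstash/data/plugins/inputs/elasticsearch/main/last_run_value"
File.read(path)
# => "2025-04-10T12:05:00.123456789Z"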
 [id="plugins-{type}s-{plugin}-password"]
 ===== `password`
 
@@ -356,22 +603,35 @@ environment variables e.g. `proxy => '${LS_PROXY:}'`.
 * Value type is <<string,string>>
 * Default value is `'{ "sort": [ "_doc" ] }'`
 
-The query to be executed.
-
+The query to be executed.
+Accepted query shape is DSL or {esql} (when `query_type => 'esql'`).
+Read the {ref}/query-dsl.html[{es} query DSL documentation] or {ref}/esql.html[{esql} documentation] for more information.
 
 When <<plugins-{type}s-{plugin}-search_api>> resolves to `search_after` and the query does not specify `sort`,
 the default sort `'{ "sort": { "_shard_doc": "asc" } }'` will be added to the query. Please refer to the {ref}/paginate-search-results.html#search-after[Elasticsearch search_after] parameter to know more.
 
+[id="plugins-{type}s-{plugin}-query_type"]
+===== `query_type`
+
+* Value can be `dsl` or `esql`
+* Default value is `dsl`
+
+Defines the <<plugins-{type}s-{plugin}-query>> shape.
+When `dsl`, the query shape must be a valid {es} JSON-style string.
+When `esql`, the query shape must be a valid {esql} string, and the `index`, `size`, `slices`, `search_api`, `docinfo`, `docinfo_target`, `docinfo_fields`, `response_type` and `tracking_field` parameters are not allowed.
+
 [id="plugins-{type}s-{plugin}-response_type"]
 ===== `response_type`
 
-* Value can be any of: `hits`, `aggregations`
+* Value can be any of: `hits`, `aggregations`, `esql`
 * Default value is `hits`
 
 Which part of the result to transform into Logstash events when processing the
 response from the query.
+
 The default `hits` will generate one event per returned document (i.e. "hit").
-
+
+When set to `aggregations`, a single {ls} event will be generated with the
 contents of the `aggregations` object of the query's response. In this case the
 `hits` object will be ignored. The parameter `size` will always be set to
 0 regardless of the default or user-defined value set in this plugin.
@@ -410,6 +670,19 @@ for example: "* * * * *" (execute query every minute, on the minute)
 There is no schedule by default. If no schedule is given, then the statement is run
 exactly once.
 
+[id="plugins-{type}s-{plugin}-schedule_overlap"]
+===== `schedule_overlap`
+
+* Value type is <<boolean,boolean>>
+* Default value is `true`
+
+Whether to allow queuing of a scheduled run while a run is occurring.
+This is ideal for ensuring a new run happens immediately after the previous one finishes when there
+is a lot of work to do; however, the queue is unbounded, so it may lead to an out-of-memory error over long periods of time
+if the queue grows continuously.
+
+When in doubt, set `schedule_overlap` to `false` (it may become the default value in the future).
+
 [id="plugins-{type}s-{plugin}-scroll"]
 ===== `scroll`
 
@@ -622,6 +895,28 @@ When the `target` is set to a field reference, the `_source` of the hit is place
 This option can be useful to avoid populating unknown fields when a downstream schema such as ECS is enforced.
 It is also possible to target an entry in the event's metadata, which will be available during event processing but not exported to your outputs (e.g., `target \=> "[@metadata][_source]"`).
 
+[id="plugins-{type}s-{plugin}-tracking_field"]
+===== `tracking_field`
+
+* Value type is <<string,string>>
+* There is no default value for this setting.
+
+Which field from the last event of a previous run will be used as a cursor value for the following run.
+The value of this field is injected into each query if the query uses the placeholder `:last_value`.
+For the first query after a pipeline is started, the value used is either read from the <<plugins-{type}s-{plugin}-last_run_metadata_path>> file,
+or taken from the <<plugins-{type}s-{plugin}-tracking_field_seed>> setting.
+
+Note: The tracking value is updated after each page is read and at the end of each Point in Time. In case of a crash, the last saved value will be used, so some duplication of data can occur. For this reason, the use of unique document IDs for each event is recommended in the downstream destination.
+
+[id="plugins-{type}s-{plugin}-tracking_field_seed"]
+===== `tracking_field_seed`
+
+* Value type is <<string,string>>
+* Default value is `"1970-01-01T00:00:00.000000000Z"`
+
+The starting value for the <<plugins-{type}s-{plugin}-tracking_field>> if there is no <<plugins-{type}s-{plugin}-last_run_metadata_path>> file already.
+This field defaults to the nanosecond-precision ISO8601 representation of `epoch`, or "1970-01-01T00:00:00.000000000Z", given that nanosecond-precision timestamps are the
+most reliable data format to use for this feature.
 
 [id="plugins-{type}s-{plugin}-user"]
 ===== `user`
data/lib/logstash/inputs/elasticsearch/aggregation.rb
CHANGED
@@ -12,14 +12,9 @@ module LogStash
       @client = client
       @plugin_params = plugin.params
 
+      @index = @plugin_params["index"]
       @size = @plugin_params["size"]
-      @query = @plugin_params["query"]
       @retries = @plugin_params["retries"]
-      @agg_options = {
-        :index => @plugin_params["index"],
-        :size => 0
-      }.merge(:body => @query)
-
       @plugin = plugin
     end
 
@@ -33,10 +28,18 @@ module LogStash
       false
     end
 
-    def
+    def aggregation_options(query_object)
+      {
+        :index => @index,
+        :size => 0,
+        :body => query_object
+      }
+    end
+
+    def do_run(output_queue, query_object)
       logger.info("Aggregation starting")
       r = retryable(AGGREGATION_JOB) do
-        @client.search(
+        @client.search(aggregation_options(query_object))
      end
      @plugin.push_hit(r, output_queue, 'aggregations') if r
    end
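The practical effect of this refactor is that the aggregation runner no longer freezes the query when it is constructed; the (possibly cursor-substituted) query object is handed in on every run. A rough sketch of how a run could be driven, assuming the class is `LogStash::Inputs::Elasticsearch::Aggregation` and that `plugin`, `es_client`, and `output_queue` are already set up:

[source, ruby]
# Hypothetical wiring; the plugin does this internally on each scheduled run.
runner = LogStash::Inputs::Elasticsearch::Aggregation.new(es_client, plugin)
query_object = LogStash::Json.load(plugin.params["query"])  # per-run query, e.g. after cursor injection
runner.do_run(output_queue, query_object)                   # issues a search with :index, :size => 0, :body => query_object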
data/lib/logstash/inputs/elasticsearch/cursor_tracker.rb
ADDED
@@ -0,0 +1,58 @@
require 'fileutils'

module LogStash; module Inputs; class Elasticsearch
  class CursorTracker
    include LogStash::Util::Loggable

    attr_reader :last_value

    def initialize(last_run_metadata_path:, tracking_field:, tracking_field_seed:)
      @last_run_metadata_path = last_run_metadata_path
      @last_value_hashmap = Java::java.util.concurrent.ConcurrentHashMap.new
      @last_value = IO.read(@last_run_metadata_path) rescue nil || tracking_field_seed
      @tracking_field = tracking_field
      logger.info "Starting value for cursor field \"#{@tracking_field}\": #{@last_value}"
      @mutex = Mutex.new
    end

    def checkpoint_cursor(intermediate: true)
      @mutex.synchronize do
        if intermediate
          # in intermediate checkpoints pick the smallest
          converge_last_value {|v1, v2| v1 < v2 ? v1 : v2}
        else
          # in the last search of a PIT choose the largest
          converge_last_value {|v1, v2| v1 > v2 ? v1 : v2}
          @last_value_hashmap.clear
        end
        IO.write(@last_run_metadata_path, @last_value)
      end
    end

    def converge_last_value(&block)
      return if @last_value_hashmap.empty?
      new_last_value = @last_value_hashmap.reduceValues(1000, &block)
      logger.debug? && logger.debug("converge_last_value: got #{@last_value_hashmap.values.inspect}. won: #{new_last_value}")
      return if new_last_value == @last_value
      @last_value = new_last_value
      logger.info "New cursor value for field \"#{@tracking_field}\" is: #{new_last_value}"
    end

    def record_last_value(event)
      value = event.get(@tracking_field)
      logger.trace? && logger.trace("storing last_value if #{@tracking_field} for #{Thread.current.object_id}: #{value}")
      @last_value_hashmap.put(Thread.current.object_id, value)
    end

    def inject_cursor(query_json)
      # ":present" means "now - 30s" to avoid grabbing partially visible data in the PIT
      result = query_json.gsub(":last_value", @last_value.to_s).gsub(":present", now_minus_30s)
      logger.debug("inject_cursor: injected values for ':last_value' and ':present'", :query => result)
      result
    end

    def now_minus_30s
      Java::java.time.Instant.now.minusSeconds(30).to_s
    end
  end
end; end; end
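To show how the pieces of `CursorTracker` fit together, here is a brief usage sketch based only on the public methods above; the plugin wires this up internally, and the path, field name, and variables (`raw_query_json`, `event`) are examples:

[source, ruby]
# Hypothetical wiring of the tracker around a scheduled run.
tracker = LogStash::Inputs::Elasticsearch::CursorTracker.new(
  last_run_metadata_path: "/usr/share/logstash/data/plugins/inputs/elasticsearch/main/last_run_value",
  tracking_field: "[event][ingested]",
  tracking_field_seed: "1970-01-01T00:00:00.000000000Z"
)

query = tracker.inject_cursor(raw_query_json)  # replaces :last_value and :present in the query string
# ... run the search; for every event built from a hit:
tracker.record_last_value(event)
# after each page of results:
tracker.checkpoint_cursor(intermediate: true)
# after the final page of the Point in Time:
tracker.checkpoint_cursor(intermediate: false)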
data/lib/logstash/inputs/elasticsearch/esql.rb
ADDED
@@ -0,0 +1,153 @@
require 'logstash/helpers/loggable_try'

module LogStash
  module Inputs
    class Elasticsearch
      class Esql
        include LogStash::Util::Loggable

        ESQL_JOB = "ES|QL job"

        ESQL_PARSERS_BY_TYPE = Hash.new(lambda { |x| x }).merge(
          'date' => ->(value) { value && LogStash::Timestamp.new(value) },
        )

        # Initialize the ESQL query executor
        # @param client [Elasticsearch::Client] The Elasticsearch client instance
        # @param plugin [LogStash::Inputs::Elasticsearch] The parent plugin instance
        def initialize(client, plugin)
          @client = client
          @event_decorator = plugin.method(:decorate_event)
          @retries = plugin.params["retries"]

          target_field = plugin.params["target"]
          if target_field
            def self.apply_target(path); "[#{target_field}][#{path}]"; end
          else
            def self.apply_target(path); path; end
          end

          @query = plugin.params["query"]
          unless @query.include?('METADATA')
            logger.info("`METADATA` not found the query. `_id`, `_version` and `_index` will not be available in the result", {:query => @query})
          end
          logger.debug("ES|QL executor initialized with", {:query => @query})
        end

        # Execute the ESQL query and process results
        # @param output_queue [Queue] The queue to push processed events to
        # @param query A query (to obey interface definition)
        def do_run(output_queue, query)
          logger.info("ES|QL executor has started")
          response = retryable(ESQL_JOB) do
            @client.esql.query({ body: { query: @query }, format: 'json', drop_null_columns: true })
          end
          # retriable already printed error details
          return if response == false

          if response&.headers&.dig("warning")
            logger.warn("ES|QL executor received warning", {:warning_message => response.headers["warning"]})
          end
          columns = response['columns']&.freeze
          values = response['values']&.freeze
          logger.debug("ES|QL query response size: #{values&.size}")

          process_response(columns, values, output_queue) if columns && values
        end

        # Execute a retryable operation with proper error handling
        # @param job_name [String] Name of the job for logging purposes
        # @yield The block to execute
        # @return [Boolean] true if successful, false otherwise
        def retryable(job_name, &block)
          stud_try = ::LogStash::Helpers::LoggableTry.new(logger, job_name)
          stud_try.try((@retries + 1).times) { yield }
        rescue => e
          error_details = {:message => e.message, :cause => e.cause}
          error_details[:backtrace] = e.backtrace if logger.debug?
          logger.error("#{job_name} failed with ", error_details)
          false
        end

        private

        # Process the ESQL response and push events to the output queue
        # @param columns [Array[Hash]] The ESQL query response columns
        # @param values [Array[Array]] The ESQL query response hits
        # @param output_queue [Queue] The queue to push processed events to
        def process_response(columns, values, output_queue)
          column_specs = columns.map { |column| ColumnSpec.new(column) }
          sub_element_mark_map = mark_sub_elements(column_specs)
          multi_fields = sub_element_mark_map.filter_map { |key, val| key.name if val == true }
          logger.warn("Multi-fields found in ES|QL result and they will not be available in the event. Please use `RENAME` command if you want to include them.", { :detected_multi_fields => multi_fields }) if multi_fields.any?

          values.each do |row|
            event = column_specs.zip(row).each_with_object(LogStash::Event.new) do |(column, value), event|
              # `unless value.nil?` is a part of `drop_null_columns` that if some of columns' values are not `nil`, `nil` values appear
              # we should continuously filter out them to achieve full `drop_null_columns` on each individual row (ideal `LIMIT 1` result)
              # we also exclude sub-elements of main field
              if value && sub_element_mark_map[column] == false
                field_reference = apply_target(column.field_reference)
                event.set(field_reference, ESQL_PARSERS_BY_TYPE[column.type].call(value))
              end
            end
            @event_decorator.call(event)
            output_queue << event
          rescue => e
            # if event creation fails with whatever reason, inform user and tag with failure and return entry as it is
            logger.warn("Event creation error, ", message: e.message, exception: e.class, data: { "columns" => columns, "values" => [row] })
            failed_event = LogStash::Event.new("columns" => columns, "values" => [row], "tags" => ['_elasticsearch_input_failure'])
            output_queue << failed_event
          end
        end

        # Determines whether each column in a collection is a nested sub-element (example "user.age")
        # of another column in the same collection (example "user").
        #
        # @param columns [Array<ColumnSpec>] An array of objects with a `name` attribute representing field paths.
        # @return [Hash<ColumnSpec, Boolean>] A hash mapping each column to `true` if it is a sub-element of another field, `false` otherwise.
        # Time complexity: (O(NlogN+N*K)) where K is the number of conflict depth
        # without (`prefix_set`) memoization, it would be O(N^2)
        def mark_sub_elements(columns)
          # Sort columns by name length (ascending)
          sorted_columns = columns.sort_by { |c| c.name.length }
          prefix_set = Set.new # memoization set

          sorted_columns.each_with_object({}) do |column, memo|
            # Split the column name into parts (e.g., "user.profile.age" → ["user", "profile", "age"])
            parts = column.name.split('.')

            # Generate all possible parent prefixes (e.g., "user", "user.profile")
            # and check if any parent prefix exists in the set
            parent_prefixes = (0...parts.size - 1).map { |i| parts[0..i].join('.') }
            memo[column] = parent_prefixes.any? { |prefix| prefix_set.include?(prefix) }
            prefix_set.add(column.name)
          end
        end
      end

      # Class representing a column specification in the ESQL response['columns']
      # The class's main purpose is to provide a structure for the event key
      # columns is an array with `name` and `type` pair (example: `{"name"=>"@timestamp", "type"=>"date"}`)
      # @attr_reader :name [String] The name of the column
      # @attr_reader :type [String] The type of the column
      class ColumnSpec
        attr_reader :name, :type

        def initialize(spec)
          @name = isolate(spec.fetch('name'))
          @type = isolate(spec.fetch('type'))
        end

        def field_reference
          @_field_reference ||= '[' + name.gsub('.', '][') + ']'
        end

        private
        def isolate(value)
          value.frozen? ? value : value.clone.freeze
        end
      end
    end
  end
end
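As a quick illustration of how dotted ES|QL column names turn into event fields, here is a short sketch assuming the classes above are loaded (it is not part of the released file):

[source, ruby]
# A dotted column name becomes a nested Logstash field reference.
spec = LogStash::Inputs::Elasticsearch::ColumnSpec.new({ "name" => "status.code", "type" => "long" })
spec.field_reference  # => "[status][code]"

# mark_sub_elements (above) would flag "status.code" as a sub-element when a "status" column
# is also present, so only the parent is written to the event unless the query RENAMEs or DROPs one of them.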