logstash-input-elasticsearch 4.21.2 → 4.23.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 70af2192f555f8afff4ef2f96072f2b215a2039207dfa12a9449f507f7b13f7b
4
- data.tar.gz: 73621246eccfd1fbb385be5e9ca5ef9a071cdb64008cb539a1e80a08c7a0ed34
3
+ metadata.gz: 330a1fd55cb3fa00918a73dcd41b66e63ce81d6fc79dc68f4209385429e588d4
4
+ data.tar.gz: 5ba0377bcaaa9b428a4a848e32fff5019353ca7f6b3c8bb77944156d05230d6f
5
5
  SHA512:
6
- metadata.gz: bbc5c842d77204339e0bb64174f98ffb8bb1728957a1f64d1f83e1f5bad27ad76fc24f44b23a64d23247b26a806cfee7cbd52a16ea34e5490f1355bcdbb98303
7
- data.tar.gz: 7b258f80ca64e5dd16593a65d7326a5f3695f840cbf32fdeac9363a6a19d4747de9135065a7b940602cd77f43a02910b74d667761184ccb846a864e128334a20
6
+ metadata.gz: 8be2dc35edde5b3b83c2c5711941c58c9aa3e45330e3785fef21269af134e13d031e10cdc324cf31e35b6c42a48f215f9ff2e8d58bc3a77fcc0c5a31c2084998
7
+ data.tar.gz: b0456a0a04f365a34e35d6b8f8040e75a4d9a0f73718a2958f3c87823e5e41a5b04308b8f1710b8d5aa57091015aba273f308c75bd79c61494fecea4baf00d8e
data/CHANGELOG.md CHANGED
@@ -1,3 +1,9 @@
1
+ ## 4.23.0
2
+ - ES|QL support [#235](https://github.com/logstash-plugins/logstash-input-elasticsearch/pull/235)
3
+
4
+ ## 4.22.0
5
+ - Add "cursor"-like index tracking [#205](https://github.com/logstash-plugins/logstash-input-elasticsearch/pull/205)
6
+
1
7
  ## 4.21.2
2
8
  - Add elastic-transport client support used in elasticsearch-ruby 8.x [#225](https://github.com/logstash-plugins/logstash-input-elasticsearch/pull/225)
3
9
 
data/docs/index.asciidoc CHANGED
@@ -48,7 +48,7 @@ This would create an Elasticsearch query with the following format:
48
48
  "sort": [ "_doc" ]
49
49
  }'
50
50
 
51
-
51
+ [id="plugins-{type}s-{plugin}-scheduling"]
52
52
  ==== Scheduling
53
53
 
54
54
  Input from this plugin can be scheduled to run periodically according to a specific
@@ -103,6 +103,237 @@ Common causes are:
103
103
  - When the hit result contains top-level fields that are {logstash-ref}/processing.html#reserved-fields[reserved in Logstash] but do not have the expected shape. Use the <<plugins-{type}s-{plugin}-target>> directive to avoid conflicts with the top-level namespace.
104
104
  - When <<plugins-{type}s-{plugin}-docinfo>> is enabled and the docinfo fields cannot be merged into the hit result. Combine <<plugins-{type}s-{plugin}-target>> and <<plugins-{type}s-{plugin}-docinfo_target>> to avoid conflict.
105
105
 
106
+ [id="plugins-{type}s-{plugin}-cursor"]
107
+ ==== Tracking a field's value across runs
108
+
109
+ .Technical Preview: Tracking a field's value
110
+ ****
111
+ The feature that allows tracking a field's value across runs is in _Technical Preview_.
112
+ Configuration options and implementation details are subject to change in minor releases without being preceded by deprecation warnings.
113
+ ****
114
+
115
+ Some use cases require tracking the value of a particular field between two jobs.
116
+ Examples include:
117
+
118
+ * avoiding the need to re-process the entire result set of a long query after an unplanned restart
119
+ * grabbing only new data from an index instead of processing the entire set on each job.
120
+
121
+ The Elasticsearch input plugin provides the <<plugins-{type}s-{plugin}-tracking_field>> and <<plugins-{type}s-{plugin}-tracking_field_seed>> options.
122
+ When <<plugins-{type}s-{plugin}-tracking_field>> is set, the plugin records the value of that field for the last document retrieved in a run into
123
+ a file.
124
+ (The file location defaults to the path described in <<plugins-{type}s-{plugin}-last_run_metadata_path>>.)
125
+
126
+ You can then inject this value in the query using the placeholder `:last_value`.
127
+ The value will be injected into the query before execution, and then updated after the query completes if new data was found.
128
+
129
+ This feature works best when:
130
+
131
+ * the query sorts by the tracking field,
132
+ * the timestamp field is added by {es}, and
133
+ * the field type has enough resolution so that two events are unlikely to have the same value.
134
+
135
+ Consider using a tracking field whose type is https://www.elastic.co/guide/en/elasticsearch/reference/current/date_nanos.html[date nanoseconds].
136
+ If the tracking field is of this data type, you can use an extra placeholder called `:present` to inject the nano-second based value of "now-30s".
137
+ This placeholder is useful as the right-hand side of a range filter, allowing the collection of
138
+ new data but leaving partially-searchable bulk request data to the next scheduled job.
139
+
140
+ [id="plugins-{type}s-{plugin}-tracking-sample"]
141
+ ===== Sample configuration: Track field value across runs
142
+
143
+ This section contains a series of steps to help you set up the "tailing" of data being written to a set of indices, using a date nanosecond field added by an Elasticsearch ingest pipeline and the `tracking_field` capability of this plugin.
144
+
145
+ . Create an ingest pipeline that adds Elasticsearch's `_ingest.timestamp` field to the documents as `event.ingested`:
146
+ +
147
+ [source, json]
148
+ PUT _ingest/pipeline/my-pipeline
149
+ {
150
+ "processors": [
151
+ {
152
+ "script": {
153
+ "lang": "painless",
154
+ "source": "ctx.putIfAbsent(\"event\", [:]); ctx.event.ingested = metadata().now.format(DateTimeFormatter.ISO_INSTANT);"
155
+ }
156
+ }
157
+ ]
158
+ }
159
+
160
+ [start=2]
161
+ . Create an index mapping where the tracking field is of date nanosecond type and invokes the defined pipeline:
162
+ +
163
+ [source, json]
164
+ PUT /_template/my_template
165
+ {
166
+ "index_patterns": ["test-*"],
167
+ "settings": {
168
+ "index.default_pipeline": "my-pipeline",
169
+ },
170
+ "mappings": {
171
+ "properties": {
172
+ "event": {
173
+ "properties": {
174
+ "ingested": {
175
+ "type": "date_nanos",
176
+ "format": "strict_date_optional_time_nanos"
177
+ }
178
+ }
179
+ }
180
+ }
181
+ }
182
+ }
183
+
184
+ [start=3]
185
+ . Define a query that looks at all data in the indices, sorted by the tracking field, with a range filter from the last value seen until the present:
186
+ +
187
+ [source,json]
188
+ {
189
+ "query": {
190
+ "range": {
191
+ "event.ingested": {
192
+ "gt": ":last_value",
193
+ "lt": ":present"
194
+ }
195
+ }
196
+ },
197
+ "sort": [
198
+ {
199
+ "event.ingested": {
200
+ "order": "asc",
201
+ "format": "strict_date_optional_time_nanos",
202
+ "numeric_type": "date_nanos"
203
+ }
204
+ }
205
+ ]
206
+ }
207
+
208
+ [start=4]
209
+ . Configure the Elasticsearch input to query the indices with the query defined above, every minute, and track the `event.ingested` field:
210
+ +
211
+ [source, ruby]
212
+ input {
213
+ elasticsearch {
214
+ id => tail_test_index
215
+ hosts => [ 'https://..']
216
+ api_key => '....'
217
+ index => 'test-*'
218
+ query => '{ "query": { "range": { "event.ingested": { "gt": ":last_value", "lt": ":present"}}}, "sort": [ { "event.ingested": {"order": "asc", "format": "strict_date_optional_time_nanos", "numeric_type" : "date_nanos" } } ] }'
219
+ tracking_field => "[event][ingested]"
220
+ slices => 5 # optional use of slices to speed data processing, should be equal to or less than number of primary shards
221
+ schedule => '* * * * *' # every minute
222
+ schedule_overlap => false # don't accumulate jobs if one takes longer than 1 minute
223
+ }
224
+ }
225
+
226
+ With this sample setup, new documents are indexed into a `test-*` index.
227
+ The next scheduled run:
228
+
229
+ * selects all new documents since the last observed value of the tracking field,
230
+ * uses {ref}/point-in-time-api.html#point-in-time-api[Point in time (PIT)] + {ref}/paginate-search-results.html#search-after[Search after] to paginate through all the data, and
231
+ * updates the value of the field at the end of the pagination.
232
+
233
+ [id="plugins-{type}s-{plugin}-esql"]
234
+ ==== {esql} support
235
+
236
+ .Technical Preview
237
+ ****
238
+ The {esql} feature that allows using ES|QL queries with this plugin is in Technical Preview.
239
+ Configuration options and implementation details are subject to change in minor releases without being preceded by deprecation warnings.
240
+ ****
241
+
242
+ {es} Query Language ({esql}) provides a SQL-like interface for querying your {es} data.
243
+
244
+ To use {esql}, this plugin needs to be installed in {ls} 8.17.4 or newer, and must be connected to {es} 8.11 or newer.
245
+
246
+ To configure an {esql} query in the plugin, set `query_type` to `esql` and provide your {esql} query in the `query` parameter.
247
+
248
+ IMPORTANT: {esql} is evolving and may still have limitations with regard to result size or supported field types. We recommend understanding https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-limitations.html[ES|QL current limitations] before using it in production environments.
249
+
250
+ The following is a basic scheduled {esql} query that runs hourly:
251
+ [source, ruby]
252
+ input {
253
+ elasticsearch {
254
+ id => hourly_cron_job
255
+ hosts => [ 'https://..']
256
+ api_key => '....'
257
+ query_type => 'esql'
258
+ query => '
259
+ FROM food-index
260
+ | WHERE spicy_level == "hot" AND @timestamp > NOW() - 1 hour
261
+ | LIMIT 500
262
+ '
263
+ schedule => '0 * * * *' # every hour at min 0
264
+ }
265
+ }
266
+
267
+ Set `config.support_escapes: true` in `logstash.yml` if you need to escape special characters in the query.
268
+
269
+ NOTE: With an {esql} query, {ls} doesn't generate `event.original`.
270
+
271
+ [id="plugins-{type}s-{plugin}-esql-event-mapping"]
272
+ ===== Mapping {esql} result to {ls} event
273
+ {esql} returns query results in a structured tabular format, where data is organized into _columns_ (fields) and _values_ (entries).
274
+ The plugin maps each value entry to an event, populating corresponding fields.
275
+ For example, a query might produce a table like:
276
+
277
+ [cols="2,1,1,1,2",options="header"]
278
+ |===
279
+ |`timestamp` |`user_id` | `action` | `status.code` | `status.desc`
280
+
281
+ |2025-04-10T12:00:00 |123 |login |200 | Success
282
+ |2025-04-10T12:05:00 |456 |purchase |403 | Forbidden (unauthorized user)
283
+ |===
284
+
285
+ In this case, the plugin emits two events that look like this:
286
+ [source, json]
287
+ [
288
+ {
289
+ "timestamp": "2025-04-10T12:00:00",
290
+ "user_id": 123,
291
+ "action": "login",
292
+ "status": {
293
+ "code": 200,
294
+ "desc": "Success"
295
+ }
296
+ },
297
+ {
298
+ "timestamp": "2025-04-10T12:05:00",
299
+ "user_id": 456,
300
+ "action": "purchase",
301
+ "status": {
302
+ "code": 403,
303
+ "desc": "Forbidden (unauthorized user)"
304
+ }
305
+ }
306
+ ]
307
+
308
+ NOTE: If your index has a mapping with sub-objects where `status.code` and `status.desc` are actually dotted fields, they appear in {ls} events as a nested structure.
309
+
310
+ [id="plugins-{type}s-{plugin}-esql-multifields"]
311
+ ===== Conflict on multi-fields
312
+
313
+ An {esql} query fetches all parent fields and sub-fields if your {es} index has https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/multi-fields[multi-fields] or https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/subobjects[subobjects].
314
+ Since {ls} events cannot contain both a parent field's concrete value and its sub-field values, the plugin ignores the sub-fields with a warning and includes only the parent.
315
+ We recommend using the `RENAME` (or `DROP`, to avoid warnings) keyword in your {esql} query to explicitly rename fields if you want sub-fields included in the event.
316
+
317
+ This is a common occurrence if your template or mapping follows the pattern of always indexing strings as a "text" (`field`) + "keyword" (`field.keyword`) multi-field.
318
+ In this case, it's recommended to do `KEEP field` if the string is identical and there is only one sub-field, as the engine will optimize and retrieve the keyword; otherwise you can do `KEEP field.keyword | RENAME field.keyword AS field`, as shown in the sketch below.
319
+
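+ For illustration only, a minimal sketch of this recommendation, assuming a hypothetical `message` text/keyword multi-field:
+ [source, ruby]
+ ...
+ query => 'FROM my-index | KEEP message.keyword | RENAME message.keyword AS message'
+ ...
+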
320
+ To illustrate the situation with another example, assume your mapping has a `time` field with `time.min` and `time.max` sub-fields, as follows:
321
+ [source, ruby]
322
+ "properties": {
323
+ "time": { "type": "long" },
324
+ "time.min": { "type": "long" },
325
+ "time.max": { "type": "long" }
326
+ }
327
+
328
+ The {esql} result will contain all three fields, but the plugin cannot map them into a {ls} event.
329
+ To avoid this, you can use the `RENAME` keyword to rename the `time` parent field so that all three fields have unique names.
330
+ [source, ruby]
331
+ ...
332
+ query => 'FROM my-index | RENAME time AS time.current'
333
+ ...
334
+
335
+ For a comprehensive {esql} syntax reference and best practices, see the https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-syntax.html[{esql} documentation].
336
+
106
337
  [id="plugins-{type}s-{plugin}-options"]
107
338
  ==== Elasticsearch Input configuration options
108
339
 
@@ -123,12 +354,15 @@ This plugin supports the following configuration options plus the <<plugins-{typ
123
354
  | <<plugins-{type}s-{plugin}-ecs_compatibility>> |<<string,string>>|No
124
355
  | <<plugins-{type}s-{plugin}-hosts>> |<<array,array>>|No
125
356
  | <<plugins-{type}s-{plugin}-index>> |<<string,string>>|No
357
+ | <<plugins-{type}s-{plugin}-last_run_metadata_path>> |<<string,string>>|No
126
358
  | <<plugins-{type}s-{plugin}-password>> |<<password,password>>|No
127
359
  | <<plugins-{type}s-{plugin}-proxy>> |<<uri,uri>>|No
128
360
  | <<plugins-{type}s-{plugin}-query>> |<<string,string>>|No
361
+ | <<plugins-{type}s-{plugin}-query_type>> |<<string,string>>, one of `["dsl","esql"]`|No
129
362
  | <<plugins-{type}s-{plugin}-response_type>> |<<string,string>>, one of `["hits","aggregations"]`|No
130
363
  | <<plugins-{type}s-{plugin}-request_timeout_seconds>> | <<number,number>>|No
131
364
  | <<plugins-{type}s-{plugin}-schedule>> |<<string,string>>|No
365
+ | <<plugins-{type}s-{plugin}-schedule_overlap>> |<<boolean,boolean>>|No
132
366
  | <<plugins-{type}s-{plugin}-scroll>> |<<string,string>>|No
133
367
  | <<plugins-{type}s-{plugin}-search_api>> |<<string,string>>, one of `["auto", "search_after", "scroll"]`|No
134
368
  | <<plugins-{type}s-{plugin}-size>> |<<number,number>>|No
@@ -148,6 +382,8 @@ This plugin supports the following configuration options plus the <<plugins-{typ
148
382
  | <<plugins-{type}s-{plugin}-ssl_verification_mode>> |<<string,string>>, one of `["full", "none"]`|No
149
383
  | <<plugins-{type}s-{plugin}-socket_timeout_seconds>> | <<number,number>>|No
150
384
  | <<plugins-{type}s-{plugin}-target>> | {logstash-ref}/field-references-deepdive.html[field reference] | No
385
+ | <<plugins-{type}s-{plugin}-tracking_field>> |<<string,string>>|No
386
+ | <<plugins-{type}s-{plugin}-tracking_field_seed>> |<<string,string>>|No
151
387
  | <<plugins-{type}s-{plugin}-retries>> | <<number,number>>|No
152
388
  | <<plugins-{type}s-{plugin}-user>> |<<string,string>>|No
153
389
  |=======================================================================
@@ -327,6 +563,17 @@ Check out {ref}/api-conventions.html#api-multi-index[Multi Indices
327
563
  documentation] in the Elasticsearch documentation for info on
328
564
  referencing multiple indices.
329
565
 
566
+ [id="plugins-{type}s-{plugin}-last_run_metadata_path"]
567
+ ===== `last_run_metadata_path`
568
+
569
+ * Value type is <<string,string>>
570
+ * There is no default value for this setting.
571
+
572
+ The path to store the last observed value of the tracking field, when used.
573
+ By default this file is stored as `<path.data>/plugins/inputs/elasticsearch/<pipeline_id>/last_run_value`.
574
+
575
+ This setting should point to a file, not a directory, and Logstash must have read+write access to this file.
576
+
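+ For illustration only, a minimal sketch (hypothetical path) of overriding the default location:
+ [source, ruby]
+ input {
+   elasticsearch {
+     tracking_field => "[event][ingested]"
+     last_run_metadata_path => "/var/lib/logstash/es_input_last_run_value" # hypothetical file; Logstash needs read+write access
+   }
+ }
+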
330
577
  [id="plugins-{type}s-{plugin}-password"]
331
578
  ===== `password`
332
579
 
@@ -353,22 +600,35 @@ environment variables e.g. `proxy => '${LS_PROXY:}'`.
353
600
  * Value type is <<string,string>>
354
601
  * Default value is `'{ "sort": [ "_doc" ] }'`
355
602
 
356
- The query to be executed. Read the {ref}/query-dsl.html[Elasticsearch query DSL
357
- documentation] for more information.
603
+ The query to be executed.
604
+ The accepted query shape is DSL or {esql} (when `query_type => 'esql'`).
605
+ Read the {ref}/query-dsl.html[{es} query DSL documentation] or {ref}/esql.html[{esql} documentation] for more information.
358
606
 
359
607
  When <<plugins-{type}s-{plugin}-search_api>> resolves to `search_after` and the query does not specify `sort`,
360
608
  the default sort `'{ "sort": { "_shard_doc": "asc" } }'` will be added to the query. Please refer to the {ref}/paginate-search-results.html#search-after[Elasticsearch search_after] parameter to know more.
361
609
 
610
+ [id="plugins-{type}s-{plugin}-query_type"]
611
+ ===== `query_type`
612
+
613
+ * Value can be `dsl` or `esql`
614
+ * Default value is `dsl`
615
+
616
+ Defines the <<plugins-{type}s-{plugin}-query>> shape.
617
+ When `dsl`, the query must be a valid {es} JSON-style string.
618
+ When `esql`, the query must be a valid {esql} string, and the `index`, `size`, `slices`, `search_api`, `docinfo`, `docinfo_target`, `docinfo_fields`, `response_type`, and `tracking_field` parameters are not allowed.
619
+
362
620
  [id="plugins-{type}s-{plugin}-response_type"]
363
621
  ===== `response_type`
364
622
 
365
- * Value can be any of: `hits`, `aggregations`
623
+ * Value can be any of: `hits`, `aggregations`, `esql`
366
624
  * Default value is `hits`
367
625
 
368
626
  Which part of the result to transform into Logstash events when processing the
369
627
  response from the query.
628
+
370
629
  The default `hits` will generate one event per returned document (i.e. "hit").
371
- When set to `aggregations`, a single Logstash event will be generated with the
630
+
631
+ When set to `aggregations`, a single {ls} event will be generated with the
372
632
  contents of the `aggregations` object of the query's response. In this case the
373
633
  `hits` object will be ignored. The parameter `size` will always be set to
374
634
  0 regardless of the default or user-defined value set in this plugin.
@@ -407,6 +667,19 @@ for example: "* * * * *" (execute query every minute, on the minute)
407
667
  There is no schedule by default. If no schedule is given, then the statement is run
408
668
  exactly once.
409
669
 
670
+ [id="plugins-{type}s-{plugin}-schedule_overlap"]
671
+ ===== `schedule_overlap`
672
+
673
+ * Value type is <<boolean,boolean>>
674
+ * Default value is `true`
675
+
676
+ Whether to allow queuing of a scheduled run if a run is occurring.
677
+ This is ideal for ensuring a new run happens immediately after the previous one finishes when there
678
+ is a lot of work to do, but because the queue is unbounded, it may lead to an out-of-memory error over long periods of time
679
+ if the queue grows continuously.
680
+
681
+ When in doubt, set `schedule_overlap` to false (it may become the default value in the future).
682
+
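+ For illustration only, a minimal sketch (hypothetical index name and schedule) of disabling overlap:
+ [source, ruby]
+ input {
+   elasticsearch {
+     index => 'my-index'          # hypothetical index
+     schedule => '*/5 * * * *'    # every five minutes
+     schedule_overlap => false    # do not queue a new run while the previous one is still running
+   }
+ }
+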
410
683
  [id="plugins-{type}s-{plugin}-scroll"]
411
684
  ===== `scroll`
412
685
 
@@ -617,6 +890,28 @@ When the `target` is set to a field reference, the `_source` of the hit is place
617
890
  This option can be useful to avoid populating unknown fields when a downstream schema such as ECS is enforced.
618
891
  It is also possible to target an entry in the event's metadata, which will be available during event processing but not exported to your outputs (e.g., `target \=> "[@metadata][_source]"`).
619
892
 
893
+ [id="plugins-{type}s-{plugin}-tracking_field"]
894
+ ===== `tracking_field`
895
+
896
+ * Value type is <<string,string>>
897
+ * There is no default value for this setting.
898
+
899
+ Which field from the last event of a previous run will be used as the cursor value for the following run.
900
+ The value of this field is injected into each query if the query uses the placeholder `:last_value`.
901
+ For the first query after a pipeline is started, the value used is either read from the <<plugins-{type}s-{plugin}-last_run_metadata_path>> file,
902
+ or taken from the <<plugins-{type}s-{plugin}-tracking_field_seed>> setting.
903
+
904
+ Note: The tracking value is updated after each page is read and at the end of each Point in Time. In case of a crash, the last saved value will be used, so some duplication of data can occur. For this reason, using unique document IDs for each event is recommended in the downstream destination.
905
+
906
+ [id="plugins-{type}s-{plugin}-tracking_field_seed"]
907
+ ===== `tracking_field_seed`
908
+
909
+ * Value type is <<string,string>>
910
+ * Default value is `"1970-01-01T00:00:00.000000000Z"`
911
+
912
+ The starting value for the <<plugins-{type}s-{plugin}-tracking_field>> if there is no <<plugins-{type}s-{plugin}-last_run_metadata_path>> file already.
913
+ This setting defaults to the nanosecond-precision ISO8601 representation of the epoch, or "1970-01-01T00:00:00.000000000Z", since nanosecond-precision timestamps are the
914
+ most reliable data format to use for this feature.
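+
+ For illustration only, a minimal sketch (hypothetical seed date) that skips documents older than a chosen starting point on the first run:
+ [source, ruby]
+ input {
+   elasticsearch {
+     tracking_field => "[event][ingested]"
+     tracking_field_seed => "2025-01-01T00:00:00.000000000Z" # hypothetical value, used as `:last_value` until a last run file exists
+   }
+ }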
620
915
 
621
916
  [id="plugins-{type}s-{plugin}-user"]
622
917
  ===== `user`
@@ -12,14 +12,9 @@ module LogStash
12
12
  @client = client
13
13
  @plugin_params = plugin.params
14
14
 
15
+ @index = @plugin_params["index"]
15
16
  @size = @plugin_params["size"]
16
- @query = @plugin_params["query"]
17
17
  @retries = @plugin_params["retries"]
18
- @agg_options = {
19
- :index => @plugin_params["index"],
20
- :size => 0
21
- }.merge(:body => @query)
22
-
23
18
  @plugin = plugin
24
19
  end
25
20
 
@@ -33,10 +28,18 @@ module LogStash
33
28
  false
34
29
  end
35
30
 
36
- def do_run(output_queue)
31
+ def aggregation_options(query_object)
32
+ {
33
+ :index => @index,
34
+ :size => 0,
35
+ :body => query_object
36
+ }
37
+ end
38
+
39
+ def do_run(output_queue, query_object)
37
40
  logger.info("Aggregation starting")
38
41
  r = retryable(AGGREGATION_JOB) do
39
- @client.search(@agg_options)
42
+ @client.search(aggregation_options(query_object))
40
43
  end
41
44
  @plugin.push_hit(r, output_queue, 'aggregations') if r
42
45
  end
@@ -0,0 +1,58 @@
1
+ require 'fileutils'
2
+
3
+ module LogStash; module Inputs; class Elasticsearch
4
+ class CursorTracker
5
+ include LogStash::Util::Loggable
6
+
7
+ attr_reader :last_value
8
+
9
+ def initialize(last_run_metadata_path:, tracking_field:, tracking_field_seed:)
10
+ @last_run_metadata_path = last_run_metadata_path
11
+ @last_value_hashmap = Java::java.util.concurrent.ConcurrentHashMap.new
12
+ @last_value = IO.read(@last_run_metadata_path) rescue nil || tracking_field_seed
13
+ @tracking_field = tracking_field
14
+ logger.info "Starting value for cursor field \"#{@tracking_field}\": #{@last_value}"
15
+ @mutex = Mutex.new
16
+ end
17
+
18
+ def checkpoint_cursor(intermediate: true)
19
+ @mutex.synchronize do
20
+ if intermediate
21
+ # in intermediate checkpoints pick the smallest
22
+ converge_last_value {|v1, v2| v1 < v2 ? v1 : v2}
23
+ else
24
+ # in the last search of a PIT choose the largest
25
+ converge_last_value {|v1, v2| v1 > v2 ? v1 : v2}
26
+ @last_value_hashmap.clear
27
+ end
28
+ IO.write(@last_run_metadata_path, @last_value)
29
+ end
30
+ end
31
+
32
+ def converge_last_value(&block)
33
+ return if @last_value_hashmap.empty?
34
+ new_last_value = @last_value_hashmap.reduceValues(1000, &block)
35
+ logger.debug? && logger.debug("converge_last_value: got #{@last_value_hashmap.values.inspect}. won: #{new_last_value}")
36
+ return if new_last_value == @last_value
37
+ @last_value = new_last_value
38
+ logger.info "New cursor value for field \"#{@tracking_field}\" is: #{new_last_value}"
39
+ end
40
+
41
+ def record_last_value(event)
42
+ value = event.get(@tracking_field)
43
+ logger.trace? && logger.trace("storing last_value of #{@tracking_field} for #{Thread.current.object_id}: #{value}")
44
+ @last_value_hashmap.put(Thread.current.object_id, value)
45
+ end
46
+
47
+ def inject_cursor(query_json)
48
+ # ":present" means "now - 30s" to avoid grabbing partially visible data in the PIT
49
+ result = query_json.gsub(":last_value", @last_value.to_s).gsub(":present", now_minus_30s)
50
+ logger.debug("inject_cursor: injected values for ':last_value' and ':present'", :query => result)
51
+ result
52
+ end
53
+
54
+ def now_minus_30s
55
+ Java::java.time.Instant.now.minusSeconds(30).to_s
56
+ end
57
+ end
58
+ end; end; end
@@ -0,0 +1,153 @@
1
+ require 'logstash/helpers/loggable_try'
2
+
3
+ module LogStash
4
+ module Inputs
5
+ class Elasticsearch
6
+ class Esql
7
+ include LogStash::Util::Loggable
8
+
9
+ ESQL_JOB = "ES|QL job"
10
+
11
+ ESQL_PARSERS_BY_TYPE = Hash.new(lambda { |x| x }).merge(
12
+ 'date' => ->(value) { value && LogStash::Timestamp.new(value) },
13
+ )
14
+
15
+ # Initialize the ESQL query executor
16
+ # @param client [Elasticsearch::Client] The Elasticsearch client instance
17
+ # @param plugin [LogStash::Inputs::Elasticsearch] The parent plugin instance
18
+ def initialize(client, plugin)
19
+ @client = client
20
+ @event_decorator = plugin.method(:decorate_event)
21
+ @retries = plugin.params["retries"]
22
+
23
+ target_field = plugin.params["target"]
24
+ if target_field
25
+ def self.apply_target(path); "[#{target_field}][#{path}]"; end
26
+ else
27
+ def self.apply_target(path); path; end
28
+ end
29
+
30
+ @query = plugin.params["query"]
31
+ unless @query.include?('METADATA')
32
+ logger.info("`METADATA` not found the query. `_id`, `_version` and `_index` will not be available in the result", {:query => @query})
33
+ end
34
+ logger.debug("ES|QL executor initialized with", {:query => @query})
35
+ end
36
+
37
+ # Execute the ESQL query and process results
38
+ # @param output_queue [Queue] The queue to push processed events to
39
+ # @param query A query (to obey interface definition)
40
+ def do_run(output_queue, query)
41
+ logger.info("ES|QL executor has started")
42
+ response = retryable(ESQL_JOB) do
43
+ @client.esql.query({ body: { query: @query }, format: 'json', drop_null_columns: true })
44
+ end
45
+ # retriable already printed error details
46
+ return if response == false
47
+
48
+ if response&.headers&.dig("warning")
49
+ logger.warn("ES|QL executor received warning", {:warning_message => response.headers["warning"]})
50
+ end
51
+ columns = response['columns']&.freeze
52
+ values = response['values']&.freeze
53
+ logger.debug("ES|QL query response size: #{values&.size}")
54
+
55
+ process_response(columns, values, output_queue) if columns && values
56
+ end
57
+
58
+ # Execute a retryable operation with proper error handling
59
+ # @param job_name [String] Name of the job for logging purposes
60
+ # @yield The block to execute
61
+ # @return [Boolean] true if successful, false otherwise
62
+ def retryable(job_name, &block)
63
+ stud_try = ::LogStash::Helpers::LoggableTry.new(logger, job_name)
64
+ stud_try.try((@retries + 1).times) { yield }
65
+ rescue => e
66
+ error_details = {:message => e.message, :cause => e.cause}
67
+ error_details[:backtrace] = e.backtrace if logger.debug?
68
+ logger.error("#{job_name} failed with ", error_details)
69
+ false
70
+ end
71
+
72
+ private
73
+
74
+ # Process the ESQL response and push events to the output queue
75
+ # @param columns [Array[Hash]] The ESQL query response columns
76
+ # @param values [Array[Array]] The ESQL query response hits
77
+ # @param output_queue [Queue] The queue to push processed events to
78
+ def process_response(columns, values, output_queue)
79
+ column_specs = columns.map { |column| ColumnSpec.new(column) }
80
+ sub_element_mark_map = mark_sub_elements(column_specs)
81
+ multi_fields = sub_element_mark_map.filter_map { |key, val| key.name if val == true }
82
+ logger.warn("Multi-fields found in ES|QL result and they will not be available in the event. Please use `RENAME` command if you want to include them.", { :detected_multi_fields => multi_fields }) if multi_fields.any?
83
+
84
+ values.each do |row|
85
+ event = column_specs.zip(row).each_with_object(LogStash::Event.new) do |(column, value), event|
86
+ # even with `drop_null_columns`, a column is kept when any of its values is non-`nil`, so individual rows can still contain `nil` values
87
+ # we filter those out per row to achieve the effect of `drop_null_columns` on each individual row (as an ideal `LIMIT 1` result would)
88
+ # we also exclude sub-elements of main field
89
+ if value && sub_element_mark_map[column] == false
90
+ field_reference = apply_target(column.field_reference)
91
+ event.set(field_reference, ESQL_PARSERS_BY_TYPE[column.type].call(value))
92
+ end
93
+ end
94
+ @event_decorator.call(event)
95
+ output_queue << event
96
+ rescue => e
97
+ # if event creation fails for whatever reason, inform the user, tag with failure, and return the entry as it is
98
+ logger.warn("Event creation error, ", message: e.message, exception: e.class, data: { "columns" => columns, "values" => [row] })
99
+ failed_event = LogStash::Event.new("columns" => columns, "values" => [row], "tags" => ['_elasticsearch_input_failure'])
100
+ output_queue << failed_event
101
+ end
102
+ end
103
+
104
+ # Determines whether each column in a collection is a nested sub-element (example "user.age")
105
+ # of another column in the same collection (example "user").
106
+ #
107
+ # @param columns [Array<ColumnSpec>] An array of objects with a `name` attribute representing field paths.
108
+ # @return [Hash<ColumnSpec, Boolean>] A hash mapping each column to `true` if it is a sub-element of another field, `false` otherwise.
109
+ # Time complexity: (O(NlogN+N*K)) where K is the number of conflict depth
110
+ # without (`prefix_set`) memoization, it would be O(N^2)
111
+ def mark_sub_elements(columns)
112
+ # Sort columns by name length (ascending)
113
+ sorted_columns = columns.sort_by { |c| c.name.length }
114
+ prefix_set = Set.new # memoization set
115
+
116
+ sorted_columns.each_with_object({}) do |column, memo|
117
+ # Split the column name into parts (e.g., "user.profile.age" → ["user", "profile", "age"])
118
+ parts = column.name.split('.')
119
+
120
+ # Generate all possible parent prefixes (e.g., "user", "user.profile")
121
+ # and check if any parent prefix exists in the set
122
+ parent_prefixes = (0...parts.size - 1).map { |i| parts[0..i].join('.') }
123
+ memo[column] = parent_prefixes.any? { |prefix| prefix_set.include?(prefix) }
124
+ prefix_set.add(column.name)
125
+ end
126
+ end
127
+ end
128
+
129
+ # Class representing a column specification in the ESQL response['columns']
130
+ # The class's main purpose is to provide a structure for the event key
131
+ # columns is an array with `name` and `type` pair (example: `{"name"=>"@timestamp", "type"=>"date"}`)
132
+ # @attr_reader :name [String] The name of the column
133
+ # @attr_reader :type [String] The type of the column
134
+ class ColumnSpec
135
+ attr_reader :name, :type
136
+
137
+ def initialize(spec)
138
+ @name = isolate(spec.fetch('name'))
139
+ @type = isolate(spec.fetch('type'))
140
+ end
141
+
142
+ def field_reference
143
+ @_field_reference ||= '[' + name.gsub('.', '][') + ']'
144
+ end
145
+
146
+ private
147
+ def isolate(value)
148
+ value.frozen? ? value : value.clone.freeze
149
+ end
150
+ end
151
+ end
152
+ end
153
+ end