logstash-input-elasticsearch 5.0.2 → 5.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 435dd3535dc912d60a6b363a3f04d61261106e844b21fc250b2b68867c12e3e6
4
- data.tar.gz: 2b1b4677f75e491f70cfcf119c78576c5060dcfb3bb88aa4d8ac6953ec188f43
3
+ metadata.gz: 718adf02c14b980691bd1572ac7e46b14f0f7850d82cd267fa52ddbde8289892
4
+ data.tar.gz: e10f582747a7ae11d707c4268ffce6485ed35453afbe140d999857485e022cda
5
5
  SHA512:
6
- metadata.gz: cff6d761bea5229ce1bced81889c747a5fe04ac481d7009b1d33110b2fb598c15a0db366c3ce1fb17e26d16ef713281683cfd894b46dfb499d2de3a5af61695d
7
- data.tar.gz: fcd0202d0b949194a361d1b05e813196ea8958147ba6b107b2e68d83861975144ff30139ee5f302e5abce626fbdecf05529a0f95284bdb6c7a7f1f11c0f33325
6
+ metadata.gz: e88b12e47cfad23b4a1128ab05c1510c1f89bd76d20511064ada02999b6fa694d118a8e37ec2fede700d2008beaef045255b3e345d21eadcfa5b492a250d01dd
7
+ data.tar.gz: f533570ba4268088ddbe73572ec1e49924fe7febf3f6dad1c6e957b6eff5d414410970749521b0543b6bd03264776d4ac6ea2082589704cf1e7ea5faf36c1f07
data/CHANGELOG.md CHANGED
@@ -1,3 +1,9 @@
1
+ ## 5.2.0
2
+ - ES|QL support [#233](https://github.com/logstash-plugins/logstash-input-elasticsearch/pull/233)
3
+
4
+ ## 5.1.0
5
+ - Add "cursor"-like index tracking [#205](https://github.com/logstash-plugins/logstash-input-elasticsearch/pull/205)
6
+
1
7
  ## 5.0.2
2
8
  - Add elastic-transport client support used in elasticsearch-ruby 8.x [#223](https://github.com/logstash-plugins/logstash-input-elasticsearch/pull/223)
3
9
 
data/docs/index.asciidoc CHANGED
@@ -48,7 +48,7 @@ This would create an Elasticsearch query with the following format:
48
48
  "sort": [ "_doc" ]
49
49
  }'
50
50
 
51
-
51
+ [id="plugins-{type}s-{plugin}-scheduling"]
52
52
  ==== Scheduling
53
53
 
54
54
  Input from this plugin can be scheduled to run periodically according to a specific
@@ -103,6 +103,237 @@ Common causes are:
103
103
  - When the hit result contains top-level fields that are {logstash-ref}/processing.html#reserved-fields[reserved in Logstash] but do not have the expected shape. Use the <<plugins-{type}s-{plugin}-target>> directive to avoid conflicts with the top-level namespace.
104
104
  - When <<plugins-{type}s-{plugin}-docinfo>> is enabled and the docinfo fields cannot be merged into the hit result. Combine <<plugins-{type}s-{plugin}-target>> and <<plugins-{type}s-{plugin}-docinfo_target>> to avoid conflict.
105
105
 
106
+ [id="plugins-{type}s-{plugin}-cursor"]
107
+ ==== Tracking a field's value across runs
108
+
109
+ .Technical Preview: Tracking a field's value
110
+ ****
111
+ The feature that allows tracking a field's value across runs is in _Technical Preview_.
112
+ Configuration options and implementation details are subject to change in minor releases without being preceded by deprecation warnings.
113
+ ****
114
+
115
+ Some use cases require tracking the value of a particular field between two jobs.
116
+ Examples include:
117
+
118
+ * avoiding the need to re-process the entire result set of a long query after an unplanned restart
119
+ * grabbing only new data from an index instead of processing the entire set on each job.
120
+
121
+ The Elasticsearch input plugin provides the <<plugins-{type}s-{plugin}-tracking_field>> and <<plugins-{type}s-{plugin}-tracking_field_seed>> options.
122
+ When <<plugins-{type}s-{plugin}-tracking_field>> is set, the plugin records the value of that field for the last document retrieved in a run into
123
+ a file.
124
+ (The file location is controlled by <<plugins-{type}s-{plugin}-last_run_metadata_path>>.)
125
+
126
+ You can then inject this value in the query using the placeholder `:last_value`.
127
+ The value will be injected into the query before execution, and then updated after the query completes if new data was found.
128
+
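+ For illustration, a minimal sketch of how the placeholder can appear in the `query` option (the field name here is hypothetical; the sample configuration below shows a complete setup):
+
+ [source, ruby]
+ query => '{ "query": { "range": { "event.ingested": { "gt": ":last_value" } } }, "sort": [ { "event.ingested": "asc" } ] }'
+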
129
+ This feature works best when:
130
+
131
+ * the query sorts by the tracking field,
132
+ * the timestamp field is added by {es}, and
133
+ * the field type has enough resolution so that two events are unlikely to have the same value.
134
+
135
+ Consider using a tracking field whose type is https://www.elastic.co/guide/en/elasticsearch/reference/current/date_nanos.html[date nanoseconds].
136
+ If the tracking field is of this data type, you can use an extra placeholder called `:present` to inject the nanosecond-based value of "now-30s".
137
+ This placeholder is useful as the right-hand side of a range filter, allowing the collection of
138
+ new data but leaving partially-searchable bulk request data to the next scheduled job.
139
+
140
+ [id="plugins-{type}s-{plugin}-tracking-sample"]
141
+ ===== Sample configuration: Track field value across runs
142
+
143
+ This section contains a series of steps to help you set up the "tailing" of data being written to a set of indices, using a date nanosecond field added by an Elasticsearch ingest pipeline and the `tracking_field` capability of this plugin.
144
+
145
+ . Create an ingest pipeline that adds Elasticsearch's `_ingest.timestamp` field to the documents as `event.ingested`:
146
+ +
147
+ [source, json]
148
+ PUT _ingest/pipeline/my-pipeline
149
+ {
150
+ "processors": [
151
+ {
152
+ "script": {
153
+ "lang": "painless",
154
+ "source": "ctx.putIfAbsent(\"event\", [:]); ctx.event.ingested = metadata().now.format(DateTimeFormatter.ISO_INSTANT);"
155
+ }
156
+ }
157
+ ]
158
+ }
159
+
160
+ [start=2]
161
+ . Create an index template where the tracking field is mapped as a date nanosecond type and the defined pipeline is set as the default:
162
+ +
163
+ [source, json]
164
+ PUT /_template/my_template
165
+ {
166
+ "index_patterns": ["test-*"],
167
+ "settings": {
168
+ "index.default_pipeline": "my-pipeline",
169
+ },
170
+ "mappings": {
171
+ "properties": {
172
+ "event": {
173
+ "properties": {
174
+ "ingested": {
175
+ "type": "date_nanos",
176
+ "format": "strict_date_optional_time_nanos"
177
+ }
178
+ }
179
+ }
180
+ }
181
+ }
182
+ }
183
+
184
+ [start=3]
185
+ . Define a query that covers all data in the indices, sorted by the tracking field, with a range filter from the last value seen to the present:
186
+ +
187
+ [source,json]
188
+ {
189
+ "query": {
190
+ "range": {
191
+ "event.ingested": {
192
+ "gt": ":last_value",
193
+ "lt": ":present"
194
+ }
195
+ }
196
+ },
197
+ "sort": [
198
+ {
199
+ "event.ingested": {
200
+ "order": "asc",
201
+ "format": "strict_date_optional_time_nanos",
202
+ "numeric_type": "date_nanos"
203
+ }
204
+ }
205
+ ]
206
+ }
207
+
208
+ [start=4]
209
+ . Configure the Elasticsearch input to query the indices with the query defined above, every minute, and track the `event.ingested` field:
210
+ +
211
+ [source, ruby]
212
+ input {
213
+ elasticsearch {
214
+ id => "tail_test_index"
215
+ hosts => [ 'https://..']
216
+ api_key => '....'
217
+ index => 'test-*'
218
+ query => '{ "query": { "range": { "event.ingested": { "gt": ":last_value", "lt": ":present"}}}, "sort": [ { "event.ingested": {"order": "asc", "format": "strict_date_optional_time_nanos", "numeric_type" : "date_nanos" } } ] }'
219
+ tracking_field => "[event][ingested]"
220
+ slices => 5 # optional: slices speed up data processing; should be equal to or less than the number of primary shards
221
+ schedule => '* * * * *' # every minute
222
+ schedule_overlap => false # don't accumulate jobs if one takes longer than 1 minute
223
+ }
224
+ }
225
+
226
+ With this sample setup, new documents are indexed into a `test-*` index.
227
+ The next scheduled run:
228
+
229
+ * selects all new documents since the last observed value of the tracking field,
230
+ * uses {ref}/point-in-time-api.html#point-in-time-api[Point in time (PIT)] + {ref}/paginate-search-results.html#search-after[Search after] to paginate through all the data, and
231
+ * updates the value of the field at the end of the pagination.
232
+
233
+ [id="plugins-{type}s-{plugin}-esql"]
234
+ ==== {esql} support
235
+
236
+ .Technical Preview
237
+ ****
238
+ The {esql} feature that allows using ES|QL queries with this plugin is in Technical Preview.
239
+ Configuration options and implementation details are subject to change in minor releases without being preceded by deprecation warnings.
240
+ ****
241
+
242
+ {es} Query Language ({esql}) provides a SQL-like interface for querying your {es} data.
243
+
244
+ To use {esql}, this plugin needs to be installed in {ls} 8.17.4 or newer, and must be connected to {es} 8.11 or newer.
245
+
246
+ To configure {esql} query in the plugin, set the `query_type` to `esql` and provide your {esql} query in the `query` parameter.
247
+
248
+ IMPORTANT: {esql} is evolving and may still have limitations with regard to result size or supported field types. We recommend understanding https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-limitations.html[ES|QL current limitations] before using it in production environments.
249
+
250
+ The following is a basic scheduled {esql} query that runs hourly:
251
+ [source, ruby]
252
+ input {
253
+ elasticsearch {
254
+ id => "hourly_cron_job"
255
+ hosts => [ 'https://..']
256
+ api_key => '....'
257
+ query_type => 'esql'
258
+ query => '
259
+ FROM food-index
260
+ | WHERE spicy_level == "hot" AND @timestamp > NOW() - 1 hour
261
+ | LIMIT 500
262
+ '
263
+ schedule => '0 * * * *' # every hour at min 0
264
+ }
265
+ }
266
+
267
+ Set `config.support_escapes: true` in `logstash.yml` if you need to escape special characters in the query.
268
+
269
+ NOTE: With an {esql} query, {ls} doesn't generate `event.original`.
270
+
271
+ [id="plugins-{type}s-{plugin}-esql-event-mapping"]
272
+ ===== Mapping {esql} result to {ls} event
273
+ {esql} returns query results in a structured tabular format, where data is organized into _columns_ (fields) and _values_ (entries).
274
+ The plugin maps each value entry to an event, populating corresponding fields.
275
+ For example, a query might produce a table like:
276
+
277
+ [cols="2,1,1,1,2",options="header"]
278
+ |===
279
+ |`timestamp` |`user_id` | `action` | `status.code` | `status.desc`
280
+
281
+ |2025-04-10T12:00:00 |123 |login |200 | Success
282
+ |2025-04-10T12:05:00 |456 |purchase |403 | Forbidden (unauthorized user)
283
+ |===
284
+
285
+ For this case, the plugin emits two events that look like this:
286
+ [source, json]
287
+ [
288
+ {
289
+ "timestamp": "2025-04-10T12:00:00",
290
+ "user_id": 123,
291
+ "action": "login",
292
+ "status": {
293
+ "code": 200,
294
+ "desc": "Success"
295
+ }
296
+ },
297
+ {
298
+ "timestamp": "2025-04-10T12:05:00",
299
+ "user_id": 456,
300
+ "action": "purchase",
301
+ "status": {
302
+ "code": 403,
303
+ "desc": "Forbidden (unauthorized user)"
304
+ }
305
+ }
306
+ ]
307
+
308
+ NOTE: If your index has a mapping with sub-objects where `status.code` and `status.desc` are actually dotted fields, they appear in {ls} events as a nested structure.
309
+
310
+ [id="plugins-{type}s-{plugin}-esql-multifields"]
311
+ ===== Conflict on multi-fields
312
+
313
+ An {esql} query fetches all parent and sub-fields if your {es} index has https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/multi-fields[multi-fields] or https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/subobjects[subobjects].
314
+ Since {ls} events cannot contain a parent field's concrete value and sub-field values together, the plugin ignores the sub-fields with a warning and includes only the parent field.
315
+ We recommend using the `RENAME` (or `DROP`, to avoid warnings) keyword in your {esql} query to explicitly rename the fields if you want to include sub-fields in the event.
316
+
317
+ This is a common occurrence if your template or mapping follows the pattern of always indexing strings as a "text" (`field`) + "keyword" (`field.keyword`) multi-field.
318
+ In this case it's recommended to use `KEEP field` if the text and keyword values are identical and there is only one sub-field, as the engine will optimize the query and retrieve the keyword; otherwise use `KEEP field.keyword | RENAME field.keyword as field`, as sketched below.
319
+
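+ For instance, a hedged sketch of both approaches (the index and field names are hypothetical):
+
+ [source, ruby]
+ # identical text/keyword values and a single sub-field: keep the parent, the engine retrieves the keyword
+ query => 'FROM my-index | KEEP message'
+ # otherwise, keep the keyword sub-field and rename it back to the parent name
+ query => 'FROM my-index | KEEP message.keyword | RENAME message.keyword AS message'
+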
320
+ To illustrate the situation with an example, assume your mapping has a `time` field with `time.min` and `time.max` sub-fields, as follows:
321
+ [source, ruby]
322
+ "properties": {
323
+ "time": { "type": "long" },
324
+ "time.min": { "type": "long" },
325
+ "time.max": { "type": "long" }
326
+ }
327
+
328
+ The {esql} result will contain all three fields, but the plugin cannot map them into a {ls} event.
329
+ To avoid this, you can use the `RENAME` keyword to rename the `time` parent field so that all three fields end up with unique names.
330
+ [source, ruby]
331
+ ...
332
+ query => 'FROM my-index | RENAME time AS time.current'
333
+ ...
334
+
335
+ For comprehensive {esql} syntax reference and best practices, see the https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-syntax.html[{esql} documentation].
336
+
106
337
  [id="plugins-{type}s-{plugin}-options"]
107
338
  ==== Elasticsearch Input configuration options
108
339
 
@@ -126,12 +357,15 @@ Please check out <<plugins-{type}s-{plugin}-obsolete-options>> for details.
126
357
  | <<plugins-{type}s-{plugin}-ecs_compatibility>> |<<string,string>>|No
127
358
  | <<plugins-{type}s-{plugin}-hosts>> |<<array,array>>|No
128
359
  | <<plugins-{type}s-{plugin}-index>> |<<string,string>>|No
360
+ | <<plugins-{type}s-{plugin}-last_run_metadata_path>> |<<string,string>>|No
129
361
  | <<plugins-{type}s-{plugin}-password>> |<<password,password>>|No
130
362
  | <<plugins-{type}s-{plugin}-proxy>> |<<uri,uri>>|No
131
363
  | <<plugins-{type}s-{plugin}-query>> |<<string,string>>|No
364
+ | <<plugins-{type}s-{plugin}-query_type>> |<<string,string>>, one of `["dsl","esql"]`|No
132
365
  | <<plugins-{type}s-{plugin}-response_type>> |<<string,string>>, one of `["hits","aggregations"]`|No
133
366
  | <<plugins-{type}s-{plugin}-request_timeout_seconds>> | <<number,number>>|No
134
367
  | <<plugins-{type}s-{plugin}-schedule>> |<<string,string>>|No
368
+ | <<plugins-{type}s-{plugin}-schedule_overlap>> |<<boolean,boolean>>|No
135
369
  | <<plugins-{type}s-{plugin}-scroll>> |<<string,string>>|No
136
370
  | <<plugins-{type}s-{plugin}-search_api>> |<<string,string>>, one of `["auto", "search_after", "scroll"]`|No
137
371
  | <<plugins-{type}s-{plugin}-size>> |<<number,number>>|No
@@ -151,6 +385,8 @@ Please check out <<plugins-{type}s-{plugin}-obsolete-options>> for details.
151
385
  | <<plugins-{type}s-{plugin}-ssl_verification_mode>> |<<string,string>>, one of `["full", "none"]`|No
152
386
  | <<plugins-{type}s-{plugin}-socket_timeout_seconds>> | <<number,number>>|No
153
387
  | <<plugins-{type}s-{plugin}-target>> | {logstash-ref}/field-references-deepdive.html[field reference] | No
388
+ | <<plugins-{type}s-{plugin}-tracking_field>> |<<string,string>>|No
389
+ | <<plugins-{type}s-{plugin}-tracking_field_seed>> |<<string,string>>|No
154
390
  | <<plugins-{type}s-{plugin}-retries>> | <<number,number>>|No
155
391
  | <<plugins-{type}s-{plugin}-user>> |<<string,string>>|No
156
392
  |=======================================================================
@@ -330,6 +566,17 @@ Check out {ref}/api-conventions.html#api-multi-index[Multi Indices
330
566
  documentation] in the Elasticsearch documentation for info on
331
567
  referencing multiple indices.
332
568
 
569
+ [id="plugins-{type}s-{plugin}-last_run_metadata_path"]
570
+ ===== `last_run_metadata_path`
571
+
572
+ * Value type is <<string,string>>
573
+ * There is no default value for this setting.
574
+
575
+ The path to store the last observed value of the tracking field, when used.
576
+ By default this file is stored as `<path.data>/plugins/inputs/elasticsearch/<pipeline_id>/last_run_value`.
577
+
578
+ This setting should point to a file, not a directory, and Logstash must have read+write access to this file.
579
+
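+ For example, a minimal sketch (host, query, and path are hypothetical) that stores the cursor value in a custom location:
+
+ [source, ruby]
+ input {
+   elasticsearch {
+     hosts => [ 'https://..']
+     query => '{ "query": { "range": { "event.ingested": { "gt": ":last_value" } } }, "sort": [ { "event.ingested": "asc" } ] }'
+     tracking_field => "[event][ingested]"
+     last_run_metadata_path => "/var/lib/logstash/tail_cursor_last_run_value"  # hypothetical path
+     schedule => "* * * * *"
+   }
+ }
+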
333
580
  [id="plugins-{type}s-{plugin}-password"]
334
581
  ===== `password`
335
582
 
@@ -356,22 +603,35 @@ environment variables e.g. `proxy => '${LS_PROXY:}'`.
356
603
  * Value type is <<string,string>>
357
604
  * Default value is `'{ "sort": [ "_doc" ] }'`
358
605
 
359
- The query to be executed. Read the {ref}/query-dsl.html[Elasticsearch query DSL
360
- documentation] for more information.
606
+ The query to be executed.
607
+ The accepted query shape is DSL or {esql} (when `query_type => 'esql'`).
608
+ Read the {ref}/query-dsl.html[{es} query DSL documentation] or {ref}/esql.html[{esql} documentation] for more information.
361
609
 
362
610
  When <<plugins-{type}s-{plugin}-search_api>> resolves to `search_after` and the query does not specify `sort`,
363
611
  the default sort `'{ "sort": { "_shard_doc": "asc" } }'` will be added to the query. Please refer to the {ref}/paginate-search-results.html#search-after[Elasticsearch search_after] parameter to know more.
364
612
 
613
+ [id="plugins-{type}s-{plugin}-query_type"]
614
+ ===== `query_type`
615
+
616
+ * Value can be `dsl` or `esql`
617
+ * Default value is `dsl`
618
+
619
+ Defines the <<plugins-{type}s-{plugin}-query>> shape.
620
+ When `dsl`, the query shape must be a valid {es} JSON-style string.
621
+ When `esql`, the query shape must be a valid {esql} string, and the `index`, `size`, `slices`, `search_api`, `docinfo`, `docinfo_target`, `docinfo_fields`, `response_type` and `tracking_field` parameters are not allowed.
622
+
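+ For instance, a minimal sketch of an {esql} configuration (the index name is hypothetical); note that the target index is selected by the `FROM` command rather than by the `index` option:
+
+ [source, ruby]
+ input {
+   elasticsearch {
+     hosts => [ 'https://..']
+     api_key => '....'
+     query_type => 'esql'
+     query => 'FROM my-index | LIMIT 100'
+   }
+ }
+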
365
623
  [id="plugins-{type}s-{plugin}-response_type"]
366
624
  ===== `response_type`
367
625
 
368
- * Value can be any of: `hits`, `aggregations`
626
+ * Value can be any of: `hits`, `aggregations`, `esql`
369
627
  * Default value is `hits`
370
628
 
371
629
  Which part of the result to transform into Logstash events when processing the
372
630
  response from the query.
631
+
373
632
  The default `hits` will generate one event per returned document (i.e. "hit").
374
- When set to `aggregations`, a single Logstash event will be generated with the
633
+
634
+ When set to `aggregations`, a single {ls} event will be generated with the
375
635
  contents of the `aggregations` object of the query's response. In this case the
376
636
  `hits` object will be ignored. The parameter `size` will always be set to
377
637
  0 regardless of the default or user-defined value set in this plugin.
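+ As an illustration, a hedged sketch (index and field names are hypothetical) that emits a single event per run containing the contents of a terms aggregation:
+
+ [source, ruby]
+ input {
+   elasticsearch {
+     hosts => [ 'https://..']
+     index => "my-logs-*"
+     response_type => "aggregations"
+     query => '{ "aggs": { "status_counts": { "terms": { "field": "status.keyword" } } } }'
+     schedule => "*/10 * * * *"
+   }
+ }
+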
@@ -410,6 +670,19 @@ for example: "* * * * *" (execute query every minute, on the minute)
410
670
  There is no schedule by default. If no schedule is given, then the statement is run
411
671
  exactly once.
412
672
 
673
+ [id="plugins-{type}s-{plugin}-schedule_overlap"]
674
+ ===== `schedule_overlap`
675
+
676
+ * Value type is <<boolean,boolean>>
677
+ * Default value is `true`
678
+
679
+ Whether to allow queuing of a scheduled run while a previous run is still in progress.
680
+ This is useful for ensuring that a new run starts immediately after the previous one finishes when there
681
+ is a lot of work to do, but because the queue is unbounded, it can lead to out-of-memory errors over long periods
682
+ if the queue keeps growing.
683
+
684
+ When in doubt, set `schedule_overlap` to `false` (it may become the default value in the future).
685
+
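+ For example, a minimal sketch (host and query are hypothetical) of a frequent schedule that skips queuing overlapping runs:
+
+ [source, ruby]
+ input {
+   elasticsearch {
+     hosts => [ 'https://..']
+     query => '{ "sort": [ "_doc" ] }'
+     schedule => "*/5 * * * *"     # every 5 minutes
+     schedule_overlap => false     # do not queue a new run while the previous one is still running
+   }
+ }
+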
413
686
  [id="plugins-{type}s-{plugin}-scroll"]
414
687
  ===== `scroll`
415
688
 
@@ -622,6 +895,28 @@ When the `target` is set to a field reference, the `_source` of the hit is place
622
895
  This option can be useful to avoid populating unknown fields when a downstream schema such as ECS is enforced.
623
896
  It is also possible to target an entry in the event's metadata, which will be available during event processing but not exported to your outputs (e.g., `target \=> "[@metadata][_source]"`).
624
897
 
898
+ [id="plugins-{type}s-{plugin}-tracking_field"]
899
+ ===== `tracking_field`
900
+
901
+ * Value type is <<string,string>>
902
+ * There is no default value for this setting.
903
+
904
+ Which field from the last event of a previous run will be used as a cursor value for the following run.
905
+ The value of this field is injected into each query if the query uses the placeholder `:last_value`.
906
+ For the first query after a pipeline is started, the value used is either read from <<plugins-{type}s-{plugin}-last_run_metadata_path>> file,
907
+ or taken from <<plugins-{type}s-{plugin}-tracking_field_seed>> setting.
908
+
909
+ Note: The tracking value is updated after each page is read and at the end of each Point in Time. In case of a crash, the last saved value will be used, so some duplication of data can occur. For this reason, using unique document IDs for each event is recommended in the downstream destination, as sketched below.
910
+
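+ Because a crash can cause some events to be processed twice, here is a hedged sketch of a downstream {es} output that deduplicates by reusing a unique field as the document id (the field and index names are hypothetical):
+
+ [source, ruby]
+ output {
+   elasticsearch {
+     hosts => [ 'https://..']
+     index => "copy-of-source"
+     document_id => "%{[transaction][id]}"  # replayed events overwrite the same document instead of duplicating it
+   }
+ }
+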
911
+ [id="plugins-{type}s-{plugin}-tracking_field_seed"]
912
+ ===== `tracking_field_seed`
913
+
914
+ * Value type is <<string,string>>
915
+ * Default value is `"1970-01-01T00:00:00.000000000Z"`
916
+
917
+ The starting value for the <<plugins-{type}s-{plugin}-tracking_field>> if there is no <<plugins-{type}s-{plugin}-last_run_metadata_path>> file yet.
918
+ This field defaults to the nanosecond-precision ISO8601 representation of the epoch, "1970-01-01T00:00:00.000000000Z", since nanosecond-precision timestamps are the
919
+ most reliable data format to use for this feature.
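+
+ For example, a hedged sketch that starts tracking from a known point in time instead of the epoch (the date is hypothetical):
+
+ [source, ruby]
+ input {
+   elasticsearch {
+     ...
+     tracking_field => "[event][ingested]"
+     tracking_field_seed => "2024-01-01T00:00:00.000000000Z"
+   }
+ }
+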
625
920
 
626
921
  [id="plugins-{type}s-{plugin}-user"]
627
922
  ===== `user`
@@ -12,14 +12,9 @@ module LogStash
12
12
  @client = client
13
13
  @plugin_params = plugin.params
14
14
 
15
+ @index = @plugin_params["index"]
15
16
  @size = @plugin_params["size"]
16
- @query = @plugin_params["query"]
17
17
  @retries = @plugin_params["retries"]
18
- @agg_options = {
19
- :index => @plugin_params["index"],
20
- :size => 0
21
- }.merge(:body => @query)
22
-
23
18
  @plugin = plugin
24
19
  end
25
20
 
@@ -33,10 +28,18 @@ module LogStash
33
28
  false
34
29
  end
35
30
 
36
- def do_run(output_queue)
31
+ def aggregation_options(query_object)
32
+ {
33
+ :index => @index,
34
+ :size => 0,
35
+ :body => query_object
36
+ }
37
+ end
38
+
39
+ def do_run(output_queue, query_object)
37
40
  logger.info("Aggregation starting")
38
41
  r = retryable(AGGREGATION_JOB) do
39
- @client.search(@agg_options)
42
+ @client.search(aggregation_options(query_object))
40
43
  end
41
44
  @plugin.push_hit(r, output_queue, 'aggregations') if r
42
45
  end
@@ -0,0 +1,58 @@
1
+ require 'fileutils'
2
+
3
+ module LogStash; module Inputs; class Elasticsearch
4
+ class CursorTracker
5
+ include LogStash::Util::Loggable
6
+
7
+ attr_reader :last_value
8
+
9
+ def initialize(last_run_metadata_path:, tracking_field:, tracking_field_seed:)
10
+ @last_run_metadata_path = last_run_metadata_path
11
+ @last_value_hashmap = Java::java.util.concurrent.ConcurrentHashMap.new
12
+ @last_value = IO.read(@last_run_metadata_path) rescue nil || tracking_field_seed
13
+ @tracking_field = tracking_field
14
+ logger.info "Starting value for cursor field \"#{@tracking_field}\": #{@last_value}"
15
+ @mutex = Mutex.new
16
+ end
17
+
18
+ def checkpoint_cursor(intermediate: true)
19
+ @mutex.synchronize do
20
+ if intermediate
21
+ # in intermediate checkpoints pick the smallest
22
+ converge_last_value {|v1, v2| v1 < v2 ? v1 : v2}
23
+ else
24
+ # in the last search of a PIT choose the largest
25
+ converge_last_value {|v1, v2| v1 > v2 ? v1 : v2}
26
+ @last_value_hashmap.clear
27
+ end
28
+ IO.write(@last_run_metadata_path, @last_value)
29
+ end
30
+ end
31
+
32
+ def converge_last_value(&block)
33
+ return if @last_value_hashmap.empty?
34
+ new_last_value = @last_value_hashmap.reduceValues(1000, &block)
35
+ logger.debug? && logger.debug("converge_last_value: got #{@last_value_hashmap.values.inspect}. won: #{new_last_value}")
36
+ return if new_last_value == @last_value
37
+ @last_value = new_last_value
38
+ logger.info "New cursor value for field \"#{@tracking_field}\" is: #{new_last_value}"
39
+ end
40
+
41
+ def record_last_value(event)
42
+ value = event.get(@tracking_field)
43
+ logger.trace? && logger.trace("storing last_value if #{@tracking_field} for #{Thread.current.object_id}: #{value}")
44
+ @last_value_hashmap.put(Thread.current.object_id, value)
45
+ end
46
+
47
+ def inject_cursor(query_json)
48
+ # ":present" means "now - 30s" to avoid grabbing partially visible data in the PIT
49
+ result = query_json.gsub(":last_value", @last_value.to_s).gsub(":present", now_minus_30s)
50
+ logger.debug("inject_cursor: injected values for ':last_value' and ':present'", :query => result)
51
+ result
52
+ end
53
+
54
+ def now_minus_30s
55
+ Java::java.time.Instant.now.minusSeconds(30).to_s
56
+ end
57
+ end
58
+ end; end; end
@@ -0,0 +1,153 @@
1
+ require 'logstash/helpers/loggable_try'
2
+
3
+ module LogStash
4
+ module Inputs
5
+ class Elasticsearch
6
+ class Esql
7
+ include LogStash::Util::Loggable
8
+
9
+ ESQL_JOB = "ES|QL job"
10
+
11
+ ESQL_PARSERS_BY_TYPE = Hash.new(lambda { |x| x }).merge(
12
+ 'date' => ->(value) { value && LogStash::Timestamp.new(value) },
13
+ )
14
+
15
+ # Initialize the ESQL query executor
16
+ # @param client [Elasticsearch::Client] The Elasticsearch client instance
17
+ # @param plugin [LogStash::Inputs::Elasticsearch] The parent plugin instance
18
+ def initialize(client, plugin)
19
+ @client = client
20
+ @event_decorator = plugin.method(:decorate_event)
21
+ @retries = plugin.params["retries"]
22
+
23
+ target_field = plugin.params["target"]
24
+ if target_field
25
+ def self.apply_target(path); "[#{target_field}][#{path}]"; end
26
+ else
27
+ def self.apply_target(path); path; end
28
+ end
29
+
30
+ @query = plugin.params["query"]
31
+ unless @query.include?('METADATA')
32
+ logger.info("`METADATA` not found the query. `_id`, `_version` and `_index` will not be available in the result", {:query => @query})
33
+ end
34
+ logger.debug("ES|QL executor initialized with", {:query => @query})
35
+ end
36
+
37
+ # Execute the ESQL query and process results
38
+ # @param output_queue [Queue] The queue to push processed events to
39
+ # @param query A query (to obey interface definition)
40
+ def do_run(output_queue, query)
41
+ logger.info("ES|QL executor has started")
42
+ response = retryable(ESQL_JOB) do
43
+ @client.esql.query({ body: { query: @query }, format: 'json', drop_null_columns: true })
44
+ end
45
+ # retryable already logged the error details
46
+ return if response == false
47
+
48
+ if response&.headers&.dig("warning")
49
+ logger.warn("ES|QL executor received warning", {:warning_message => response.headers["warning"]})
50
+ end
51
+ columns = response['columns']&.freeze
52
+ values = response['values']&.freeze
53
+ logger.debug("ES|QL query response size: #{values&.size}")
54
+
55
+ process_response(columns, values, output_queue) if columns && values
56
+ end
57
+
58
+ # Execute a retryable operation with proper error handling
59
+ # @param job_name [String] Name of the job for logging purposes
60
+ # @yield The block to execute
61
+ # @return [Boolean] true if successful, false otherwise
62
+ def retryable(job_name, &block)
63
+ stud_try = ::LogStash::Helpers::LoggableTry.new(logger, job_name)
64
+ stud_try.try((@retries + 1).times) { yield }
65
+ rescue => e
66
+ error_details = {:message => e.message, :cause => e.cause}
67
+ error_details[:backtrace] = e.backtrace if logger.debug?
68
+ logger.error("#{job_name} failed with ", error_details)
69
+ false
70
+ end
71
+
72
+ private
73
+
74
+ # Process the ESQL response and push events to the output queue
75
+ # @param columns [Array[Hash]] The ESQL query response columns
76
+ # @param values [Array[Array]] The ESQL query response hits
77
+ # @param output_queue [Queue] The queue to push processed events to
78
+ def process_response(columns, values, output_queue)
79
+ column_specs = columns.map { |column| ColumnSpec.new(column) }
80
+ sub_element_mark_map = mark_sub_elements(column_specs)
81
+ multi_fields = sub_element_mark_map.filter_map { |key, val| key.name if val == true }
82
+ logger.warn("Multi-fields found in ES|QL result and they will not be available in the event. Please use `RENAME` command if you want to include them.", { :detected_multi_fields => multi_fields }) if multi_fields.any?
83
+
84
+ values.each do |row|
85
+ event = column_specs.zip(row).each_with_object(LogStash::Event.new) do |(column, value), event|
86
+ # `unless value.nil?` is a part of `drop_null_columns` that if some of columns' values are not `nil`, `nil` values appear
87
+ # we should continuously filter out them to achieve full `drop_null_columns` on each individual row (ideal `LIMIT 1` result)
88
+ # we also exclude sub-elements of main field
89
+ if value && sub_element_mark_map[column] == false
90
+ field_reference = apply_target(column.field_reference)
91
+ event.set(field_reference, ESQL_PARSERS_BY_TYPE[column.type].call(value))
92
+ end
93
+ end
94
+ @event_decorator.call(event)
95
+ output_queue << event
96
+ rescue => e
97
+ # if event creation fails with whatever reason, inform user and tag with failure and return entry as it is
98
+ logger.warn("Event creation error, ", message: e.message, exception: e.class, data: { "columns" => columns, "values" => [row] })
99
+ failed_event = LogStash::Event.new("columns" => columns, "values" => [row], "tags" => ['_elasticsearch_input_failure'])
100
+ output_queue << failed_event
101
+ end
102
+ end
103
+
104
+ # Determines whether each column in a collection is a nested sub-element (example "user.age")
105
+ # of another column in the same collection (example "user").
106
+ #
107
+ # @param columns [Array<ColumnSpec>] An array of objects with a `name` attribute representing field paths.
108
+ # @return [Hash<ColumnSpec, Boolean>] A hash mapping each column to `true` if it is a sub-element of another field, `false` otherwise.
109
+ # Time complexity: O(N log N + N*K) where K is the maximum nesting depth of the column names
110
+ # without (`prefix_set`) memoization, it would be O(N^2)
111
+ def mark_sub_elements(columns)
112
+ # Sort columns by name length (ascending)
113
+ sorted_columns = columns.sort_by { |c| c.name.length }
114
+ prefix_set = Set.new # memoization set
115
+
116
+ sorted_columns.each_with_object({}) do |column, memo|
117
+ # Split the column name into parts (e.g., "user.profile.age" → ["user", "profile", "age"])
118
+ parts = column.name.split('.')
119
+
120
+ # Generate all possible parent prefixes (e.g., "user", "user.profile")
121
+ # and check if any parent prefix exists in the set
122
+ parent_prefixes = (0...parts.size - 1).map { |i| parts[0..i].join('.') }
123
+ memo[column] = parent_prefixes.any? { |prefix| prefix_set.include?(prefix) }
124
+ prefix_set.add(column.name)
125
+ end
126
+ end
127
+ end
128
+
129
+ # Class representing a column specification in the ESQL response['columns']
130
+ # The class's main purpose is to provide a structure for the event key
131
+ # columns is an array with `name` and `type` pair (example: `{"name"=>"@timestamp", "type"=>"date"}`)
132
+ # @attr_reader :name [String] The name of the column
133
+ # @attr_reader :type [String] The type of the column
134
+ class ColumnSpec
135
+ attr_reader :name, :type
136
+
137
+ def initialize(spec)
138
+ @name = isolate(spec.fetch('name'))
139
+ @type = isolate(spec.fetch('type'))
140
+ end
141
+
142
+ def field_reference
143
+ @_field_reference ||= '[' + name.gsub('.', '][') + ']'
144
+ end
145
+
146
+ private
147
+ def isolate(value)
148
+ value.frozen? ? value : value.clone.freeze
149
+ end
150
+ end
151
+ end
152
+ end
153
+ end