logstash-input-azure_blob_storage 0.11.5 → 0.11.6

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 3d446aed971a95e6e17a27ed1e9ec8b141f939b53697fb9c332cfb130404745a
- data.tar.gz: 4a1321f6c6a30f6787d2133642ca23840371d6f4e18102cb775d345b09eb176a
+ metadata.gz: ececd96b04d2cab60eca54a0fe2a98c9ed093da2227e3568d4feea09264912fa
+ data.tar.gz: 7bcd39bc38d26a05da1275e5fb2317e41b5c2cddc6541535c7d166a69bb3cf62
  SHA512:
- metadata.gz: b4f48a0bebcd6e3594584a4473b223838359d44e9ef591f958aa4c80c4c22953f6b0f708b19faeaf0517c66f47185bda4de75ab4e3618b23e2e7f23f71cb4bee
- data.tar.gz: 508cd39ea159a4655e590f46ad0108c3b6e6de95ed575c4456da0230bae73fb384ecb7697ed710e7afb1542fe01cbd8a62130acedcbf0ba9c3040ace1f9d76d0
+ metadata.gz: 1bcbfab30de973e9eafee295221dc816411dca0e0f747a01c62bb48ec5c46eaf4db4162fdd5283611cd79da59910daab9e7c6e234df47f5ce7f320e65f7b8c69
+ data.tar.gz: 7bbdab8694d024b9c08cc89e13bc86aa8b90a536f5615565333593e0da7c3073d7c4cf3ad3f2b4005a90541de9693a93826158a18fbf9015234bee1812b3d46c
data/CHANGELOG.md CHANGED
@@ -1,6 +1,14 @@
+ ## 0.11.6
+ - fixed the json head and tail learning to use max_results
+ - broke out the connection setup so it can be called again when connection exceptions occur
+ - deal better with skipping of empty files.
+
  ## 0.11.5
- - Added optional filename into the message
- - plumbing for emulator, start_over not learning from registry
+ - added optional addfilename to add the filename into the message
+ - NSGFLOWLOG version 2 uses 0 as value instead of NULL in src and dst values
+ - added connection exception handling to full_read of files
+ - rewrote the json header and footer learning to ignore learning from the registry
+ - plumbing for emulator

  ## 0.11.4
  - fixed listing 3 times, rather than retrying to list max 3 times
data/README.md CHANGED
@@ -1,30 +1,34 @@
- # Logstash Plugin
+ # Logstash

- This is a plugin for [Logstash](https://github.com/elastic/logstash).
+ This is a plugin for [Logstash](https://github.com/elastic/logstash). It is fully free and fully open source. The license is Apache 2.0, meaning you are pretty much free to use it however you want in whatever way. All logstash plugin documentation is placed under one [central location](http://www.elastic.co/guide/en/logstash/current/). Need generic logstash help? Try #logstash on freenode IRC or the https://discuss.elastic.co/c/logstash discussion forum.

- It is fully free and fully open source. The license is Apache 2.0, meaning you are pretty much free to use it however you want in whatever way.
+ For problems or feature requests with this specific plugin, raise a github issue at [GITHUB/janmg/logstash-input-azure_blob_storage/](https://github.com/janmg/logstash-input-azure_blob_storage). Pull requests are also welcome after discussion through an issue.

- ## Documentation
-
- All logstash plugin documentation are placed under one [central location](http://www.elastic.co/guide/en/logstash/current/).
+ ## Purpose
+ This plugin can read from Azure Storage Blobs, for instance JSON diagnostics logs for NSG flow logs or LINE based accesslogs from App Services.
+ [Azure Blob Storage](https://azure.microsoft.com/en-us/services/storage/blobs/)

- ## Need Help?
+ The plugin depends on the [Ruby library azure-storage-blob](https://rubygems.org/gems/azure-storage-blob/versions/1.1.0) from Microsoft, which depends on Faraday for the HTTPS connection to Azure.

- Need help? Try #logstash on freenode IRC or the https://discuss.elastic.co/c/logstash discussion forum. For real problems or feature requests, raise a github issue [GITHUB/janmg/logstash-input-azure_blob_storage/](https://github.com/janmg/logstash-input-azure_blob_storage). Pull requests will ionly be merged after discussion through an issue.
+ The plugin executes the following steps:
+ 1. List all the files in the azure storage account where the path of the files matches the prefix.
+ 2. Filter on path_filters to only include files that match the directory and file glob (e.g. **/*.json).
+ 3. Save the listed files in a registry of known files and filesizes (data/registry.dat on azure, or in a file on the logstash instance).
+ 4. List all the files again, compare the registry with the new filelist and put the delta in a worklist.
+ 5. Process the worklist and put all events in the logstash queue.
+ 6. If there is time left, sleep to complete the interval. If processing takes more than an interval, save the registry and continue processing.
+ 7. If logstash is stopped, a stop signal will try to finish the current file, save the registry and then quit.

- ## Purpose
- This plugin can read from Azure Storage Blobs, for instance diagnostics logs for NSG flow logs or accesslogs from App Services.
- [Azure Blob Storage](https://azure.microsoft.com/en-us/services/storage/blobs/)
- This
  ## Installation
  This plugin can be installed through logstash-plugin
  ```
- logstash-plugin install logstash-input-azure_blob_storage
+ /usr/share/logstash/bin/logstash-plugin install logstash-input-azure_blob_storage
  ```

  ## Minimal Configuration
  The minimum configuration required as input is storageaccount, access_key and container.

+ /etc/logstash/conf.d/test.conf
  ```
  input {
  azure_blob_storage {
@@ -36,27 +40,29 @@ input {
  ```

  ## Additional Configuration
- The registry_create_policy is used when the pipeline is started to either resume from the last known unprocessed file, or to start_fresh ignoring old files or start_over to process all the files from the beginning.
+ The registry keeps track of the files in the storage account, their size and how many bytes have been processed. Files can grow and the added part will be processed as a partial file. The registry is saved to disk every interval.

- interval defines the minimum time the registry should be saved to the registry file (by default 'data/registry.dat'), this is only needed in case the pipeline dies unexpectedly. During a normal shutdown the registry is also saved.
+ The registry_create_policy determines at the start of the pipeline whether processing should resume from the last known unprocessed file, start_fresh to ignore old files and only process events that arrive after the start of the pipeline, or start_over to process all the files and ignore the registry.

- When registry_local_path is set to a directory, the registry is save on the logstash server in that directory. The filename is the pipe.id
+ interval defines the minimum time the registry should be saved to the registry file (by default 'data/registry.dat'). This is only needed in case the pipeline dies unexpectedly; during a normal shutdown the registry is also saved.

- with registry_create_policy set to resume and the registry_local_path set to a directory where the registry isn't yet created, should load from the storage account and save the registry on the local server
+ When registry_local_path is set to a directory, the registry is saved on the logstash server in that directory. The filename is the pipe.id.

- During the pipeline start for JSON codec, the plugin uses one file to learn how the JSON header and tail look like, they can also be configured manually.
+ With registry_create_policy set to resume and registry_local_path set to a directory where the registry isn't yet created, the registry is loaded from the storage account and saved on the local server. This allows for a migration to local storage.
+
+ For pipelines that use the JSON codec or the JSON_LINE codec, the plugin uses one file to learn what the JSON header and tail look like; they can also be configured manually. With skip_learning the learning can be disabled.

  ## Running the pipeline
  The pipeline can be started in several ways.
  - On the commandline
  ```
- /usr/share/logstash/bin/logtash -f /etc/logstash/pipeline.d/test.yml
+ /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/test.conf
  ```
  - In the pipeline.yml
  ```
  /etc/logstash/pipeline.yml
  pipe.id = test
- pipe.path = /etc/logstash/pipeline.d/test.yml
+ pipe.path = /etc/logstash/conf.d/test.conf
  ```
  - As managed pipeline from Kibana

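The registry and worklist mechanics described above are easier to see with concrete data. Below is a minimal, self-contained Ruby sketch, not the plugin's own code; the blob names and sizes are invented, but the selection mirrors the `worklist = newreg.select {|name,file| file[:offset] < file[:length]}` line that appears further down in this diff.

```ruby
# Toy registry: each entry records how far a blob has been read (:offset)
# and how large it currently is (:length). All values are invented.
registry = {
  "nsg-a/PT1H.json" => { :offset => 2048, :length => 4096 }, # grew since last pass: partial read
  "nsg-a/PT2H.json" => { :offset => 1024, :length => 1024 }, # fully processed: skipped
  "appsvc/new.log"  => { :offset => 0,    :length => 512  }  # never seen before: full read
}

# Only files with unread bytes end up in the worklist.
worklist = registry.select { |name, file| file[:offset] < file[:length] }
worklist.each { |name, file| puts "process #{name} from byte #{file[:offset]} to #{file[:length]}" }
```

Files whose offset already equals their length stay out of the worklist until a later listing shows that they have grown.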
@@ -95,7 +101,9 @@ The log level of the plugin can be put into DEBUG through
  curl -XPUT 'localhost:9600/_node/logging?pretty' -H 'Content-Type: application/json' -d'{"logger.logstash.inputs.azureblobstorage" : "DEBUG"}'
  ```

- because debug also makes logstash chatty, there are also debug_timer and debug_until that can be used to print additional informantion on what the pipeline is doing and how long it takes. debug_until is for the number of events until debug is disabled.
+ Because logstash debug makes logstash very chatty, the option debug_until will log debug output for a number of processed events and then stop debugging. One file can easily contain thousands of events. debug_until is useful to monitor the start of the plugin and the processing of the first files.
+
+ debug_timer will show detailed information on how much time the listing of files took and how long the plugin will sleep to fill the interval before the listing and processing starts again.

  ## Other Configuration Examples
  For nsgflowlogs, a simple configuration looks like this
@@ -121,6 +129,10 @@ filter {
  }
  }

+ output {
+ stdout { }
+ }
+
  output {
  elasticsearch {
  hosts => "elasticsearch"
@@ -128,21 +140,35 @@ output {
  }
  }
  ```
-
+ A more elaborate input configuration example
  ```
  input {
  azure_blob_storage {
+ codec => "json"
  storageaccount => "yourstorageaccountname"
  access_key => "Ba5e64c0d3=="
  container => "insights-logs-networksecuritygroupflowevent"
- codec => "json"
  logtype => "nsgflowlog"
  prefix => "resourceId=/"
+ path_filters => ['**/*.json']
+ addfilename => true
  registry_create_policy => "resume"
+ registry_local_path => "/usr/share/logstash/plugin"
  interval => 300
+ debug_timer => true
+ debug_until => 100
+ }
+ }
+
+ output {
+ elasticsearch {
+ hosts => "elasticsearch"
+ index => "nsg-flow-logs-%{+xxxx.ww}"
  }
  }
  ```
+ The configuration documentation is in the first 100 lines of the code
+ [GITHUB/janmg/logstash-input-azure_blob_storage/blob/master/lib/logstash/inputs/azure_blob_storage.rb](https://github.com/janmg/logstash-input-azure_blob_storage/blob/master/lib/logstash/inputs/azure_blob_storage.rb)

  For WAD IIS and App Services the HTTP AccessLogs can be retrieved from a storage account as line based events and parsed through GROK. The date stamp can also be parsed with %{TIMESTAMP_ISO8601:log_timestamp}. For WAD IIS logfiles the container is wad-iis-logfiles. In the future grokking may happen already by the plugin.
  ```
data/lib/logstash/inputs/azure_blob_storage.rb CHANGED
@@ -61,7 +61,9 @@ config :registry_create_policy, :validate => ['resume','start_over','start_fresh
  # Z00000000000000000000000000000000 2 ]}
  config :interval, :validate => :number, :default => 60

+ # add the filename into the events
  config :addfilename, :validate => :boolean, :default => false, :required => false
+
  # debug_until will, for a maximum amount of processed messages, show 3 types of log printouts including processed filenames. This is a lightweight alternative to switching the loglevel from info to debug or even trace
  config :debug_until, :validate => :number, :default => 0, :required => false

@@ -71,6 +73,9 @@ config :debug_timer, :validate => :boolean, :default => false, :required => fals
  # WAD IIS Grok Pattern
  #config :grokpattern, :validate => :string, :required => false, :default => '%{TIMESTAMP_ISO8601:log_timestamp} %{NOTSPACE:instanceId} %{NOTSPACE:instanceId2} %{IPORHOST:ServerIP} %{WORD:httpMethod} %{URIPATH:requestUri} %{NOTSPACE:requestQuery} %{NUMBER:port} %{NOTSPACE:username} %{IPORHOST:clientIP} %{NOTSPACE:httpVersion} %{NOTSPACE:userAgent} %{NOTSPACE:cookie} %{NOTSPACE:referer} %{NOTSPACE:host} %{NUMBER:httpStatus} %{NUMBER:subresponse} %{NUMBER:win32response} %{NUMBER:sentBytes:int} %{NUMBER:receivedBytes:int} %{NUMBER:timeTaken:int}'

+ # skip learning if you use json and don't want to learn the head and tail, but use either the defaults or configure them.
+ config :skip_learning, :validate => :boolean, :default => false, :required => false
+
  # The string that starts the JSON. Only needed when the codec is JSON. When partial files are read, the result will not be valid JSON unless the start and end are put back. The file_head and file_tail are learned at startup by reading the first file in the blob_list and taking the first and last block; this works for blobs that are appended like nsgflowlogs. The configuration can be set to override the learning. In case learning fails and the option is not set, the default is to use the 'records' head as used by nsgflowlogs.
  config :file_head, :validate => :string, :required => false, :default => '{"records":['
  # The string that ends the JSON
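The file_head and file_tail options exist because a slice cut out of the middle of an appended JSON blob is not valid JSON on its own. A small self-contained Ruby sketch, not taken from the plugin: the head matches the default shown above, while the ']}' tail and the sample records are assumptions chosen to resemble an nsgflowlog layout.

```ruby
require 'json'

file_head = '{"records":['   # default from the config above
file_tail = ']}'             # assumed closing string, for illustration only

# A slice from the middle of an appended blob: two records plus the comma
# between them, which does not parse by itself.
partial_chunk = '{"rule":"DefaultRule_AllowVnetInBound"},{"rule":"DefaultRule_DenyAllInBound"}'

begin
  JSON.parse(partial_chunk)
rescue JSON::ParserError
  puts "bare chunk is not valid JSON"
end

# Putting the head and tail back produces a parseable document again.
records = JSON.parse(file_head + partial_chunk + file_tail)["records"]
puts records.length   # => 2
```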
@@ -113,34 +118,7 @@ def run(queue)
  @processed = 0
  @regsaved = @processed

- # Try in this order to access the storageaccount
- # 1. storageaccount / sas_token
- # 2. connection_string
- # 3. storageaccount / access_key
-
- unless connection_string.nil?
- conn = connection_string.value
- end
- unless sas_token.nil?
- unless sas_token.value.start_with?('?')
- conn = "BlobEndpoint=https://#{storageaccount}.#{dns_suffix};SharedAccessSignature=#{sas_token.value}"
- else
- conn = sas_token.value
- end
- end
- unless conn.nil?
- @blob_client = Azure::Storage::Blob::BlobService.create_from_connection_string(conn)
- else
- # unless use_development_storage?
- @blob_client = Azure::Storage::Blob::BlobService.create(
- storage_account_name: storageaccount,
- storage_dns_suffix: dns_suffix,
- storage_access_key: access_key.value,
- )
- # else
- # @logger.info("not yet implemented")
- # end
- end
+ connect

  @registry = Hash.new
  if registry_create_policy == "resume"
@@ -175,7 +153,7 @@ def run(queue)
  if registry_create_policy == "start_fresh"
  @registry = list_blobs(true)
  save_registry(@registry)
- @logger.info("starting fresh, writing a clean the registry to contain #{@registry.size} blobs/files")
+ @logger.info("starting fresh, writing a clean registry to contain #{@registry.size} blobs/files")
  end

  @is_json = false
@@ -188,12 +166,14 @@ def run(queue)
  @tail = ''
  # if codec=json sniff one files blocks A and Z to learn file_head and file_tail
  if @is_json
- learn_encapsulation
  if file_head
- @head = file_head
+ @head = file_head
  end
  if file_tail
- @tail = file_tail
+ @tail = file_tail
+ end
+ if file_head and file_tail and !skip_learning
+ learn_encapsulation
  end
  @logger.info("head will be: #{@head} and tail is set to #{@tail}")
  end
@@ -234,6 +214,8 @@ def run(queue)
  # size nilClass when the list doesn't grow?!
  # Worklist is the subset of files where the already read offset is smaller than the file size
  worklist.clear
+ chunk = nil
+
  worklist = newreg.select {|name,file| file[:offset] < file[:length]}
  if (worklist.size > 4) then @logger.info("worklist contains #{worklist.size} blobs") end

@@ -246,17 +228,26 @@ def run(queue)
  size = 0
  if file[:offset] == 0
  # This is where Sera4000 issue starts
- begin
- chunk = full_read(name)
- size=chunk.size
- rescue Exception => e
- @logger.error("Failed to read #{name} because of: #{e.message} .. will continue and pretend this never happened")
+ # For an append blob, reading full and crashing, retry, last_modified? ... length? ... committed? ...
+ # length and skip reg value
+ if (file[:length] > 0)
+ begin
+ chunk = full_read(name)
+ size=chunk.size
+ rescue Exception => e
+ @logger.error("Failed to read #{name} because of: #{e.message} .. will continue and pretend this never happened")
+ end
+ else
+ @logger.info("found a zero size file #{name}")
+ chunk = nil
  end
  else
  chunk = partial_read_json(name, file[:offset], file[:length])
  @logger.debug("partial file #{name} from #{file[:offset]} to #{file[:length]}")
  end
  if logtype == "nsgflowlog" && @is_json
+ # skip empty chunks
+ unless chunk.nil?
  res = resource(name)
  begin
  fingjson = JSON.parse(chunk)
@@ -265,6 +256,7 @@ def run(queue)
  rescue JSON::ParserError
  @logger.error("parse error on #{res[:nsg]} [#{res[:date]}] offset: #{file[:offset]} length: #{file[:length]}")
  end
+ end
  # TODO: Convert this to line based grokking.
  # TODO: ECS Compliance?
  elsif logtype == "wadiis" && !@is_json
@@ -272,7 +264,7 @@ def run(queue)
  else
  counter = 0
  begin
- @codec.decode(chunk) do |event|
+ @codec.decode(chunk) do |event|
  counter += 1
  if @addfilename
  event.set('filename', name)
@@ -282,6 +274,7 @@ def run(queue)
  end
  rescue Exception => e
  @logger.error("codec exception: #{e.message} .. will continue and pretend this never happened")
+ @registry.store(name, { :offset => file[:length], :length => file[:length] })
  @logger.debug("#{chunk}")
  end
  @processed += counter
@@ -321,8 +314,54 @@ end


  private
+ def connect
+ # Try in this order to access the storageaccount
+ # 1. storageaccount / sas_token
+ # 2. connection_string
+ # 3. storageaccount / access_key
+
+ unless connection_string.nil?
+ conn = connection_string.value
+ end
+ unless sas_token.nil?
+ unless sas_token.value.start_with?('?')
+ conn = "BlobEndpoint=https://#{storageaccount}.#{dns_suffix};SharedAccessSignature=#{sas_token.value}"
+ else
+ conn = sas_token.value
+ end
+ end
+ unless conn.nil?
+ @blob_client = Azure::Storage::Blob::BlobService.create_from_connection_string(conn)
+ else
+ # unless use_development_storage?
+ @blob_client = Azure::Storage::Blob::BlobService.create(
+ storage_account_name: storageaccount,
+ storage_dns_suffix: dns_suffix,
+ storage_access_key: access_key.value,
+ )
+ # else
+ # @logger.info("not yet implemented")
+ # end
+ end
+ end
+
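The net effect of the checks in connect is a precedence order: a sas_token wins over a connection_string, and the storageaccount/access_key path is only used when neither produced a connection string. A standalone Ruby sketch with placeholder values (not real credentials and not the plugin's code):

```ruby
# Placeholder inputs; in the plugin these come from the logstash configuration.
storageaccount    = "examplestorage"
dns_suffix        = "core.windows.net"
connection_string = "DefaultEndpointsProtocol=https;AccountName=examplestorage;AccountKey=EXAMPLE=="
sas_token         = "sv=2019-12-12&ss=b&srt=co&sp=rl&sig=EXAMPLE"

conn = connection_string unless connection_string.nil?
unless sas_token.nil?
  # A token that does not start with '?' is wrapped into a BlobEndpoint string,
  # overriding whatever the connection_string set.
  conn = "BlobEndpoint=https://#{storageaccount}.#{dns_suffix};SharedAccessSignature=#{sas_token}" unless sas_token.start_with?('?')
end

puts(conn.nil? ? "would fall back to storageaccount + access_key" : conn)
```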
  def full_read(filename)
- return @blob_client.get_blob(container, filename)[1]
+ tries ||= 2
+ begin
+ return @blob_client.get_blob(container, filename)[1]
+ rescue Exception => e
+ @logger.error("caught: #{e.message} for full_read")
+ if (tries -= 1) > 0
+ if e.message == "Connection reset by peer"
+ connect
+ end
+ retry
+ end
+ end
+ # final attempt after the retries are used up; this read is allowed to raise
+ return @blob_client.get_blob(container, filename)[1]
  end
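full_read above retries a failed download and rebuilds the client when the connection was reset. The same pattern, distilled into a self-contained Ruby sketch that runs without the plugin; the lambdas stand in for the blob client and for connect, and IOError with its message text is a stand-in for the real exception:

```ruby
# Retry a read once; if the error looks like a dropped connection, rebuild the
# client (here just a stub) before retrying. Note the comparison uses ==.
def fetch_with_retry(reader, reconnect)
  tries = 2
  begin
    reader.call
  rescue IOError => e
    reconnect.call if e.message == "Connection reset by peer"
    retry if (tries -= 1) > 0
    raise
  end
end

attempts = 0
result = fetch_with_retry(
  -> { (attempts += 1) < 2 ? raise(IOError, "Connection reset by peer") : "blob bytes" },
  -> { puts "reconnecting..." }
)
puts result   # => blob bytes
```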

  def partial_read_json(filename, offset, length)
@@ -473,9 +512,10 @@ end


  def learn_encapsulation
+ @logger.info("learn_encapsulation, this can be skipped by setting skip_learning => true. Or set both file_head and file_tail")
  # From one file, read first block and last block to learn head and tail
  begin
- blobs = @blob_client.list_blobs(container, { maxresults: 3, prefix: @prefix})
+ blobs = @blob_client.list_blobs(container, { max_results: 3, prefix: @prefix})
  blobs.each do |blob|
  unless blob.name == registry_path
  begin
logstash-input-azure_blob_storage.gemspec CHANGED
@@ -1,6 +1,6 @@
  Gem::Specification.new do |s|
  s.name = 'logstash-input-azure_blob_storage'
- s.version = '0.11.5'
+ s.version = '0.11.6'
  s.licenses = ['Apache-2.0']
  s.summary = 'This logstash plugin reads and parses data from Azure Storage Blobs.'
  s.description = <<-EOF
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: logstash-input-azure_blob_storage
  version: !ruby/object:Gem::Version
- version: 0.11.5
+ version: 0.11.6
  platform: ruby
  authors:
  - Jan Geertsma
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2020-12-19 00:00:00.000000000 Z
+ date: 2021-02-11 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  requirement: !ruby/object:Gem::Requirement