logstash-input-azure_blob_storage 0.12.6 → 0.12.8

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: b50189c380606c6fdb8b7f7216fe20d15c0d410f1c1f6670211baf25baa567ca
4
- data.tar.gz: 189c80c15720ec9a85b8bb223a5ae7e4666fd0ebd6a96946f201bee96cf3dafc
3
+ metadata.gz: 6226b48f09b69ea1fe5d5e65197cf87daed475a2dff3aecc1ff30b1c921d4e7e
4
+ data.tar.gz: 9ac324158bddc908f107663925a27ff289eb7b264293da88218a825d66c74d74
5
5
  SHA512:
6
- metadata.gz: 599ca22fd813634d3ffd5fbbef0361605fd7611ea4050bc85e30c06fe97dbfe6dcd879ee092573e8a94229435d25c7cef71255bc72f33ea3d4813de987600e4c
7
- data.tar.gz: 53cc0e73c25323ba891e90a820c679071516187d641ed2c5dd5810a5bbb9654c2cf67c6239d400b58d8786c4cc4737aaa54b0fc1f145b4136ebf1f6b0203a00d
6
+ metadata.gz: 6cdd2d17fd57adc43b0c8e7354cbf396243b4bf691e8ef12d757c2c9dc515f9711ecbe9c64495b0d6f50040a28af98af2b641224c03dc83c3c4db9919ef1fb77
7
+ data.tar.gz: e1a71cfbe35af0d878374dcce499096331c82de867d86fb6ea3f4c876e1cc24f8b0fb59087b112012989c358ec9f238159564d17d9433ca1899a776a1c311683
data/CHANGELOG.md CHANGED
@@ -1,7 +1,17 @@
1
- ## PROBABLY 0.12.4 is the most stable version until I sort out when and why JSON Parse errors happen
2
- Join the discussion if you have something to share!
3
- https://github.com/janmg/logstash-input-azure_blob_storage/issues/34
4
-
1
+ ## 0.12.8
2
+ - support append blob (use codec json_lines and logtype raw)
3
+ - change the default head and tail to an empty string, unless the logtype is nsgflowlog
4
+ - cleanjson configuration parameter to clean the json stream of faulty characters to prevent parse errors
5
+ - catch ContainerNotFound, print error message in log and sleep interval time.
6
+
7
+ ## 0.12.7
8
+ - rewrote partial_read, now the occasional json parse errors should be fixed by reading only committed blocks.
9
+ (This may also have been related to a second partial_read where the offset wasn't updated correctly?)
10
+ - used the new header and tail block names, should now learn the header and footer automatically again?
11
+ - added addall to the configurations to add system, mac, category, time, operation to the output
12
+ - added optional environment configuration option
13
+ - removed the date, which was always set to ---
14
+ - made a start on event rewriting for ECS compatibility
5
15
 
6
16
  ## 0.12.6
7
17
  - Fixed the 0.12.5 exception handling, it actually caused a warning to become a fatal pipeline crashing error
data/README.md CHANGED
@@ -8,6 +8,14 @@ For problems or feature requests with this specific plugin, raise a github issue
8
8
  This plugin can read from Azure Storage Blobs, for instance JSON diagnostics logs for NSG flow logs or LINE based accesslogs from App Services.
9
9
  [Azure Blob Storage](https://azure.microsoft.com/en-us/services/storage/blobs/)
10
10
 
11
+ ## Alternatives
12
+ This plugin was inspired by the Azure diagnostics tools, but should work better for larger numbers of files. The configurations are not compatible: azureblob refers to the diagnostics tools plugin, while this plugin is configured as azure_blob_storage.
13
+ https://github.com/Azure/azure-diagnostics-tools/tree/master/Logstash/logstash-input-azureblob
14
+
15
+ There is also a Filebeat plugin that may work in the future:
16
+ https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-azure-blob-storage.html
17
+
18
+ ## Inner working
11
19
  The plugin depends on the [Ruby library azure-storage-blob](https://rubygems.org/gems/azure-storage-blob/versions/1.1.0) from Microsoft, that depends on Faraday for the HTTPS connection to Azure.
12
20
 
13
21
  The plugin executes the following steps
@@ -42,9 +50,11 @@ input {
42
50
  ## Additional Configuration
43
51
  The registry keeps track of files in the storage account, their size and how many bytes have been processed. Files can grow and the added part will be processed as a partial file. The registry is saved to disk every interval.
44
52
 
53
+ The interval also defines when a new round of listing files and processing data happens. The NSGFLOWLOGs are written every minute into a new block of the hourly blob. This data can be read partially, because the plugin knows the JSON head and tail, removes the leading comma and fixes the JSON before parsing new events.
54
+
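+ As an illustration, stitching a partially read nsgflowlog chunk back into valid JSON works roughly like this (a simplified sketch with made-up sample data, not the exact plugin code):
+ ```
+ # head and tail as learned from the blob's first and last block, or configured
+ head = '{"records":['
+ tail = ']}'
+ # hypothetical chunk read from the previous offset up to the last committed block;
+ # it starts with the comma that separated it from the previous record
+ chunk = ',{"time":"2023-07-15T00:01:00Z","category":"NetworkSecurityGroupFlowEvent"}]}'
+ # strip the leading comma and put the head back so the result parses again
+ json = head + chunk.sub(/\A,/, '')
+ # if the blob grew after listing, the tail may be missing and has to be appended
+ json += tail unless json.end_with?(tail)
+ puts json
+ ```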
45
55
  The registry_create_policy determines at the start of the pipeline if processing should resume from the last known unprocessed file, or to start_fresh ignoring old files and start only processing new events that came after the start of the pipeline. Or start_over to process all the files ignoring the registry.
46
56
 
47
- interval defines the minimum time the registry should be saved to the registry file (by default to 'data/registry.dat'), this is only needed in case the pipeline dies unexpectedly. During a normal shutdown the registry is also saved.
57
+ interval defines the minimum time between saves of the registry to the registry file, by default 'data/registry.dat' in the storageaccount, but it can also be kept on the server running logstash by setting registry_local_path. The registry is also kept in memory; the registry file is only needed in case the pipeline dies unexpectedly. During a normal shutdown the registry is also saved.
48
58
 
49
59
  When registry_local_path is set to a directory, the registry is saved on the logstash server in that directory. The filename is the pipe.id
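+ For reference, the registry is essentially a hash of blob names to offsets and lengths, serialized with Ruby's Marshal. A simplified sketch (the path, pipeline id and blob name below are made up for the example):
+ ```
+ require 'fileutils'
+
+ # blob name => how far it has been processed and how big it was at listing time
+ registry = {
+   'resourceId=/example/PT1H.json' => { :offset => 2576, :length => 4888 }
+ }
+
+ # save it the way registry_local_path does, one file per pipeline id
+ registry_local_path = '/usr/share/logstash/plugin'
+ pipe_id = 'example-pipeline'
+ FileUtils.mkdir_p(registry_local_path)
+ File.write("#{registry_local_path}/#{pipe_id}", Marshal.dump(registry))
+
+ # resume later by loading it back
+ registry = Marshal.load(File.read("#{registry_local_path}/#{pipe_id}"))
+ ```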
50
60
 
@@ -66,13 +76,15 @@ The pipeline can be started in several ways.
66
76
  ```
67
77
  - As managed pipeline from Kibana
68
78
 
69
- Logstash itself (so not specific to this plugin) has a feature where multiple instances can run on the same system. The default TCP port is 9600, but if it's already in use it will use 9601 (and up). To update a config file on a running instance on the commandline you can add the argument --config.reload.automatic and if you modify the files that are in the pipeline.yml you can send a SIGHUP channel to reload the pipelines where the config was changed.
79
+ Logstash itself (so not specific to this plugin) has a feature where multiple instances can run on the same system. The default TCP port is 9600, but if it's already in use it will use 9601 (and up); this is probably no longer true from v8 onward. To reload a config file on a running instance you can add the command-line argument --config.reload.automatic, and if you modify the files listed in pipelines.yml you can send a SIGHUP signal to reload the pipelines where the config was changed.
70
80
  [https://www.elastic.co/guide/en/logstash/current/reloading-config.html](https://www.elastic.co/guide/en/logstash/current/reloading-config.html)
71
81
 
72
82
  ## Internal Working
73
83
  When the plugin is started, it will read all the filenames and sizes in the blob store, excluding the directories of files that are excluded by the "path_filters". After every interval it will write a registry to the storageaccount to save the information of how many bytes per blob (file) are read and processed. After all files are processed and at least one interval has passed, a new file list is generated and a worklist is constructed that will be processed. When a file has already been processed before, partial files are read from the offset to the filesize at the time of the file listing. If the codec is JSON, partial files will have the header and tail added; they can be configured. If logtype is nsgflowlog, the plugin will split the flows into individual tuple events. The logtype wadiis may in the future be used to process the grok formats to split into log lines. Any other format is fed into the queue as one event per file or partial file. It's then up to the filter to split and mutate the file format.
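+ Conceptually, the worklist ends up containing the blobs whose registered offset is still smaller than their listed length; an illustrative sketch, not the plugin's exact code:
+ ```
+ filelist = {
+   'new.json'      => { :offset => 0,   :length => 1000 },  # new file, read in full
+   'grown.json'    => { :offset => 800, :length => 1200 },  # grown file, partial read from 800
+   'finished.json' => { :offset => 500, :length => 500 }    # already done, skipped
+ }
+ worklist = filelist.select { |name, file| file[:offset] < file[:length] }
+ worklist.each do |name, file|
+   puts "process #{name} from #{file[:offset]} to #{file[:length]}"
+ end
+ ```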
74
84
 
75
- By default the root of the json message is named "message" so you can modify the content in the filter block
85
+ By default the root of the json message is named "message"; you can modify the content in the filter block
86
+
87
+ Additional fields can be enabled with addfilename and addall; ecs_compatibility is not yet supported.
76
88
 
77
89
  The configurations and the rest of the code are in [https://github.com/janmg/logstash-input-azure_blob_storage/tree/master/lib/logstash/inputs](lib/logstash/inputs) [https://github.com/janmg/logstash-input-azure_blob_storage/blob/master/lib/logstash/inputs/azure_blob_storage.rb#L10](azure_blob_storage.rb)
78
90
 
@@ -130,7 +142,7 @@ filter {
130
142
  }
131
143
 
132
144
  output {
133
- stdout { }
145
+ stdout { codec => rubydebug }
134
146
  }
135
147
 
136
148
  output {
@@ -139,24 +151,37 @@ output {
139
151
  index => "nsg-flow-logs-%{+xxxx.ww}"
140
152
  }
141
153
  }
154
+
155
+ output {
156
+ file {
157
+ path => "/tmp/abuse.txt"
158
+ codec => line { format => "%{decision} %{flowstate} %{src_ip} %{dst_port}"}
159
+ }
160
+ }
161
+
142
162
  ```
143
163
  A more elaborate input configuration example
144
164
  ```
145
165
  input {
146
166
  azure_blob_storage {
147
167
  codec => "json"
148
- storageaccount => "yourstorageaccountname"
149
- access_key => "Ba5e64c0d3=="
168
+ # storageaccount => "yourstorageaccountname"
169
+ # access_key => "Ba5e64c0d3=="
170
+ connection_string => "DefaultEndpointsProtocol=https;AccountName=yourstorageaccountname;AccountKey=Ba5e64c0d3==;EndpointSuffix=core.windows.net"
150
171
  container => "insights-logs-networksecuritygroupflowevent"
151
172
  logtype => "nsgflowlog"
152
173
  prefix => "resourceId=/"
153
174
  path_filters => ['**/*.json']
154
175
  addfilename => true
176
+ addall => true
177
+ environment => "dev-env"
155
178
  registry_create_policy => "resume"
156
179
  registry_local_path => "/usr/share/logstash/plugin"
157
180
  interval => 300
158
181
  debug_timer => true
159
- debug_until => 100
182
+ debug_until => 1000
183
+ # registry_create_policy => "start_over" would reprocess all files, ignoring the registry
160
185
  }
161
186
  }
162
187
 
@@ -167,6 +192,20 @@ output {
167
192
  }
168
193
  }
169
194
  ```
195
+
196
+ Another for json_lines on append_blobs
197
+ ```
198
+ input {
199
+ azure_blob_storage {
200
+ codec => json_lines {
201
+ delimiter => "\n"
202
+ charset => "UTF-8"
203
+ }
204
+ # below options are optional
205
+ logtype => "raw"
206
+ append => true
207
+ cleanjson => true
208
+ }
+ }
+ ```
170
209
  The configuration documentation is in the first 100 lines of the code
171
210
  [GITHUB/janmg/logstash-input-azure_blob_storage/blob/master/lib/logstash/inputs/azure_blob_storage.rb](https://github.com/janmg/logstash-input-azure_blob_storage/blob/master/lib/logstash/inputs/azure_blob_storage.rb)
172
211
 
@@ -211,5 +250,9 @@ filter {
211
250
  remove_field => ["timestamp"]
212
251
  }
213
252
  }
253
+
254
+ output {
255
+ stdout { codec => rubydebug }
256
+ }
214
257
  ```
215
258
 
@@ -17,14 +17,16 @@ require 'json'
17
17
  # D672f4bbd95a04209b00dc05d899e3cce 2576 json objects for 1st minute
18
18
  # D7fe0d4f275a84c32982795b0e5c7d3a1 2312 json objects for 2nd minute
19
19
  # Z00000000000000000000000000000000 2 ]}
20
-
20
+ #
21
+ # The azure-storage-ruby library connects to the storageaccount and the files are read through get_blob. For partial reads the options with start and end ranges are used.
22
+ # https://github.com/Azure/azure-storage-ruby/blob/master/blob/lib/azure/storage/blob/blob.rb#L89
23
+ #
21
24
  # A storage account has by default a globally unique name, {storageaccount}.blob.core.windows.net, which is a CNAME to Azure's blob servers blob.*.store.core.windows.net. A storage account has containers and those have a directory and blobs (like files). Blobs have one or more blocks. After writing the blocks, they can be committed. Some Azure diagnostics can send events to an EventHub that can be parsed through the plugin logstash-input-azure_event_hubs, but for the events that are only stored in a storage account, use this plugin. The original logstash-input-azureblob from azure-diagnostics-tools is great for low volumes, but it suffers from an outdated client, slow reads, lease locking issues and json parse errors.
22
25
 
23
-
24
26
  class LogStash::Inputs::AzureBlobStorage < LogStash::Inputs::Base
25
27
  config_name "azure_blob_storage"
26
28
 
27
- # If undefined, Logstash will complain, even if codec is unused. The codec for nsgflowlog is "json" and the for WADIIS and APPSERVICE is "line".
29
+ # If undefined, Logstash will complain, even if codec is unused. The codec for nsgflowlog is "json" ("json_lines" also works) and for WADIIS and APPSERVICE it is "line".
28
30
  default :codec, "json"
29
31
 
30
32
  # logtype can be nsgflowlog, wadiis, appservice or raw. The default is raw, where files are read and added as one event. If the file grows, the next interval the file is read from the offset, so that the delta is sent as another event. In raw mode, further processing has to be done in the filter block. If the logtype is specified, this plugin will split and mutate and add individual events to the queue.
@@ -66,7 +68,7 @@ class LogStash::Inputs::AzureBlobStorage < LogStash::Inputs::Base
66
68
  # when set to `start_fresh`, it will read log files that are created or appended since this start of the pipeline.
67
69
  config :registry_create_policy, :validate => ['resume','start_over','start_fresh'], :required => false, :default => 'resume'
68
70
 
69
- # The interval is used to save the registry regularly, when new events have have been processed. It is also used to wait before listing the files again and substracting the registry of already processed files to determine the worklist.
71
+ # The interval is used to save the registry regularly, when new events have been processed. It is also used to wait before listing the files again and subtracting the registry of already processed files to determine the worklist.
70
72
  # waiting time in seconds until processing the next batch. NSGFLOWLOGS append a block per minute, so use multiples of 60 seconds, 300 for 5 minutes, 600 for 10 minutes. The registry is also saved after every interval.
71
73
  # Partial reading starts from the offset and reads until the end, so the starting tag is prepended
72
74
  config :interval, :validate => :number, :default => 60
@@ -74,6 +76,12 @@ class LogStash::Inputs::AzureBlobStorage < LogStash::Inputs::Base
74
76
  # add the filename as a field into the events
75
77
  config :addfilename, :validate => :boolean, :default => false, :required => false
76
78
 
79
+ # add the configured environment value as a field to the events
80
+ config :environment, :validate => :string, :required => false
81
+
82
+ # add all resource details (time, system, mac, category, operation) to the events
83
+ config :addall, :validate => :boolean, :default => false, :required => false
84
+
77
85
  # debug_until will at the creation of the pipeline for a maximum amount of processed messages shows 3 types of log printouts including processed filenames. After a number of events, the plugin will stop logging the events and continue silently. This is a lightweight alternative to switching the loglevel from info to debug or even trace to see what the plugin is doing and how fast at the start of the plugin. A good value would be approximately 3x the amount of events per file. For instance 6000 events.
78
86
  config :debug_until, :validate => :number, :default => 0, :required => false
79
87
 
@@ -87,10 +95,14 @@ class LogStash::Inputs::AzureBlobStorage < LogStash::Inputs::Base
87
95
  config :skip_learning, :validate => :boolean, :default => false, :required => false
88
96
 
89
97
  # The string that starts the JSON. Only needed when the codec is JSON. When partial file are read, the result will not be valid JSON unless the start and end are put back. the file_head and file_tail are learned at startup, by reading the first file in the blob_list and taking the first and last block, this would work for blobs that are appended like nsgflowlogs. The configuration can be set to override the learning. In case learning fails and the option is not set, the default is to use the 'records' as set by nsgflowlogs.
90
- config :file_head, :validate => :string, :required => false, :default => '{"records":['
98
+ config :file_head, :validate => :string, :required => false, :default => ''
91
99
  # The string that ends the JSON
92
- config :file_tail, :validate => :string, :required => false, :default => ']}'
100
+ config :file_tail, :validate => :string, :required => false, :default => ''
93
101
 
102
+ # inspect the bytes and remove faulty characters
103
+ config :cleanjson, :validate => :boolean, :default => false, :required => false
104
+
105
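+ # treat the blob as an append blob and read it from the offset directly instead of listing blocks; also set automatically when an append blob is detected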
+ config :append, :validate => :boolean, :default => false, :required => false
94
106
  # By default it will watch every file in the storage container. The prefix option is a simple filter that only processes files with a path that starts with that value.
95
107
  # For NSGFLOWLOGS a path starts with "resourceId=/". This would only be needed to exclude other paths that may be written in the same container. The registry file will be excluded.
96
108
  # You may also configure multiple paths. See an example on the <<array,Logstash configuration page>>.
@@ -110,6 +122,7 @@ public
110
122
  @logger.info("If this plugin doesn't work, please raise an issue in https://github.com/janmg/logstash-input-azure_blob_storage")
111
123
  @busy_writing_registry = Mutex.new
112
124
  # TODO: consider multiple readers, so add pipeline @id or use logstash-to-logstash communication?
125
+ # For now it's difficult because the plugin would then have to synchronize the worklist
113
126
  end
114
127
 
115
128
 
@@ -120,41 +133,10 @@ public
120
133
  @regsaved = @processed
121
134
 
122
135
  connect
123
-
124
136
  @registry = Hash.new
125
- if registry_create_policy == "resume"
126
- for counter in 1..3
127
- begin
128
- if (!@registry_local_path.nil?)
129
- unless File.file?(@registry_local_path+"/"+@pipe_id)
130
- @registry = Marshal.load(@blob_client.get_blob(container, registry_path)[1])
131
- #[0] headers [1] responsebody
132
- @logger.info("migrating from remote registry #{registry_path}")
133
- else
134
- if !Dir.exist?(@registry_local_path)
135
- FileUtils.mkdir_p(@registry_local_path)
136
- end
137
- @registry = Marshal.load(File.read(@registry_local_path+"/"+@pipe_id))
138
- @logger.info("resuming from local registry #{registry_local_path+"/"+@pipe_id}")
139
- end
140
- else
141
- @registry = Marshal.load(@blob_client.get_blob(container, registry_path)[1])
142
- #[0] headers [1] responsebody
143
- @logger.info("resuming from remote registry #{registry_path}")
144
- end
145
- break
146
- rescue Exception => e
147
- @logger.error("caught: #{e.message}")
148
- @registry.clear
149
- @logger.error("loading registry failed for attempt #{counter} of 3")
150
- end
151
- end
152
- end
153
- # read filelist and set offsets to file length to mark all the old files as done
154
- if registry_create_policy == "start_fresh"
155
- @registry = list_blobs(true)
156
- save_registry()
157
- @logger.info("starting fresh, writing a clean registry to contain #{@registry.size} blobs/files")
137
+ load_registry()
138
+ @registry.each do |name, file|
139
+ @logger.info("offset: #{file[:offset]} length: #{file[:length]}")
158
140
  end
159
141
 
160
142
  @is_json = false
@@ -166,22 +148,29 @@ public
166
148
  @is_json_line = true
167
149
  end
168
150
  end
151
+
152
+
169
153
  @head = ''
170
154
  @tail = ''
171
- # if codec=json sniff one files blocks A and Z to learn file_head and file_tail
172
155
  if @is_json
156
+ # if codec=json sniff one files blocks A and Z to learn file_head and file_tail
157
+ if @logtype == 'nsgflowlog'
158
+ @head = '{"records":['
159
+ @tail = ']}'
160
+ end
173
161
  if file_head
174
162
  @head = file_head
175
163
  end
176
164
  if file_tail
177
165
  @tail = file_tail
178
166
  end
179
- if file_head and file_tail and !skip_learning
167
+ if !skip_learning
180
168
  learn_encapsulation
181
169
  end
182
- @logger.info("head will be: #{@head} and tail is set to #{@tail}")
170
+ @logger.info("head will be: '#{@head}' and tail is set to: '#{@tail}'")
183
171
  end
184
172
 
173
+
185
174
  filelist = Hash.new
186
175
  worklist = Hash.new
187
176
  @last = start = Time.now.to_i
@@ -198,24 +187,27 @@ public
198
187
  # load the registry, compare it's offsets to file list, set offset to 0 for new files, process the whole list and if finished within the interval wait for next loop,
199
188
  # TODO: sort by timestamp ?
200
189
  #filelist.sort_by(|k,v|resource(k)[:date])
201
- worklist.clear
202
190
  filelist.clear
203
191
 
204
192
  # Listing all the files
205
193
  filelist = list_blobs(false)
194
+ if (@debug_until > @processed) then
195
+ @registry.each do |name, file|
196
+ @logger.info("#{name} offset: #{file[:offset]} length: #{file[:length]}")
197
+ end
198
+ end
206
199
  filelist.each do |name, file|
207
200
  off = 0
208
201
  if @registry.key?(name) then
209
- begin
210
- off = @registry[name][:offset]
211
- rescue Exception => e
212
- @logger.error("caught: #{e.message} while reading #{name}")
213
- end
202
+ begin
203
+ off = @registry[name][:offset]
204
+ rescue Exception => e
205
+ @logger.error("caught: #{e.message} while reading #{name}")
206
+ end
214
207
  end
215
208
  @registry.store(name, { :offset => off, :length => file[:length] })
216
209
  if (@debug_until > @processed) then @logger.info("2: adding offsets: #{name} #{off} #{file[:length]}") end
217
210
  end
218
- # size nilClass when the list doesn't grow?!
219
211
 
220
212
  # clean registry of files that are not in the filelist
221
213
  @registry.each do |name,file|
@@ -234,14 +226,16 @@ public
234
226
 
235
227
  # Start of processing
236
228
  # This would be ideal for threading since it's IO intensive, would be nice with a ruby native ThreadPool
229
+ # pool = Concurrent::FixedThreadPool.new(5) # 5 threads
230
+ #pool.post do
231
+ # some parallel work
232
+ #end
237
233
  if (worklist.size > 0) then
238
234
  worklist.each do |name, file|
239
235
  start = Time.now.to_i
240
236
  if (@debug_until > @processed) then @logger.info("3: processing #{name} from #{file[:offset]} to #{file[:length]}") end
241
237
  size = 0
242
238
  if file[:offset] == 0
243
- # This is where Sera4000 issue starts
244
- # For an append blob, reading full and crashing, retry, last_modified? ... lenght? ... committed? ...
245
239
  # length and skip reg value
246
240
  if (file[:length] > 0)
247
241
  begin
@@ -260,55 +254,72 @@ public
260
254
  delta_size = 0
261
255
  end
262
256
  else
263
- chunk = partial_read_json(name, file[:offset], file[:length])
264
- delta_size = chunk.size
265
- @logger.debug("partial file #{name} from #{file[:offset]} to #{file[:length]}")
257
+ chunk = partial_read(name, file[:offset])
258
+ delta_size = chunk.size - @head.length - 1
266
259
  end
267
260
 
268
- if logtype == "nsgflowlog" && @is_json
269
- # skip empty chunks
270
- unless chunk.nil?
271
- res = resource(name)
272
- begin
273
- fingjson = JSON.parse(chunk)
274
- @processed += nsgflowlog(queue, fingjson, name)
275
- @logger.debug("Processed #{res[:nsg]} [#{res[:date]}] #{@processed} events")
276
- rescue JSON::ParserError => e
277
- @logger.error("parse error #{e.message} on #{res[:nsg]} [#{res[:date]}] offset: #{file[:offset]} length: #{file[:length]}")
278
- if (@debug_until > @processed) then @logger.info("#{chunk}") end
279
- end
280
- end
281
- # TODO: Convert this to line based grokking.
282
- # TODO: ECS Compliance?
283
- elsif logtype == "wadiis" && !@is_json
284
- @processed += wadiislog(queue, name)
285
- else
286
- # Handle JSONLines format
287
- if !@chunk.nil? && @is_json_line
288
- newline_rindex = chunk.rindex("\n")
289
- if newline_rindex.nil?
290
- # No full line in chunk, skip it without updating the registry.
291
- # Expecting that the JSON line would be filled in at a subsequent iteration.
292
- next
293
- end
294
- chunk = chunk[0..newline_rindex]
295
- delta_size = chunk.size
261
+ #
262
+ # TODO! ... split out the logtypes and use individual methods
263
+ # how does a byte array chunk from json_lines get translated to strings/json/events
264
+ # should the byte array be converted to a multiline and then split? drawback need to know characterset and linefeed characters
265
+ # how does the json_line decoder work on byte arrays?
266
+ #
267
+ # so many questions
268
+
269
+ unless chunk.nil?
270
+ counter = 0
271
+ if @is_json
272
+ if logtype == "nsgflowlog"
273
+ res = resource(name)
274
+ begin
275
+ fingjson = JSON.parse(chunk)
276
+ @processed += nsgflowlog(queue, fingjson, name)
277
+ @logger.debug("Processed #{res[:nsg]} #{@processed} events")
278
+ rescue JSON::ParserError => e
279
+ @logger.error("parse error #{e.message} on #{res[:nsg]} offset: #{file[:offset]} length: #{file[:length]}")
280
+ if (@debug_until > @processed) then @logger.info("#{chunk}") end
281
+ end
282
+ else
283
+ begin
284
+ @codec.decode(chunk) do |event|
285
+ counter += 1
286
+ if @addfilename
287
+ event.set('filename', name)
288
+ end
289
+ decorate(event)
290
+ queue << event
291
+ end
292
+ @processed += counter
293
+ rescue Exception => e
294
+ @logger.error("codec exception: #{e.message} .. continue and pretend this never happened")
295
+ end
296
+ end
297
+ end
298
+
299
+ if logtype == "wadiis" && !@is_json
300
+ # TODO: Convert this to line based grokking.
301
+ @processed += wadiislog(queue, name)
296
302
  end
297
303
 
298
- counter = 0
299
- begin
300
- @codec.decode(chunk) do |event|
301
- counter += 1
302
- if @addfilename
303
- event.set('filename', name)
304
+ if @is_json_line
305
+ # parse one line at a time and dump it in the chunk?
306
+ lines = chunk.to_s
307
+ if cleanjson
308
+ @logger.info("cleaning in progress")
309
+ lines = lines.chars.select(&:valid_encoding?).join
310
+ #lines.delete "\\"
311
+ #lines.scrub{|bytes| '<'+bytes.unpack('H*')[0]+'>' }
312
+ end
313
+ begin
314
+ @codec.decode(lines) do |event|
315
+ counter += 1
316
+ queue << event
304
317
  end
305
- decorate(event)
306
- queue << event
318
+ @processed += counter
319
+ rescue Exception => e
320
+ # todo: fix codec_lines exception: no implicit conversion of Array into String
321
+ @logger.error("json_lines codec exception: #{e.message} .. continue and pretend this never happened")
307
322
  end
308
- @processed += counter
309
- rescue Exception => e
310
- @logger.error("codec exception: #{e.message} .. will continue and pretend this never happened")
311
- @logger.debug("#{chunk}")
312
323
  end
313
324
  end
314
325
 
@@ -348,6 +359,24 @@ public
348
359
 
349
360
 
350
361
  private
362
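+ # list the blobs and refresh the registry offsets and lengths for the current listing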
+ def list_files
363
+ filelist = list_blobs(false)
364
+ filelist.each do |name, file|
365
+ off = 0
366
+ if @registry.key?(name) then
367
+ begin
368
+ off = @registry[name][:offset]
369
+ rescue Exception => e
370
+ @logger.error("caught: #{e.message} while reading #{name}")
371
+ end
372
+ end
373
+ @registry.store(name, { :offset => off, :length => file[:length] })
374
+ if (@debug_until > @processed) then @logger.info("2: adding offsets: #{name} #{off} #{file[:length]}") end
375
+ end
376
+ return filelist
377
+ end
378
+ # size nilClass when the list doesn't grow?!
379
+
351
380
  def connect
352
381
  # Try in this order to access the storageaccount
353
382
  # 1. storageaccount / sas_token
@@ -378,11 +407,48 @@ private
378
407
  # end
379
408
  end
380
409
  end
410
+ # uses @registry_create_policy, @registry_local_path, @container and registry_path
411
+ def load_registry()
412
+ if @registry_create_policy == "resume"
413
+ for counter in 1..3
414
+ begin
415
+ if (!@registry_local_path.nil?)
416
+ unless File.file?(@registry_local_path+"/"+@pipe_id)
417
+ @registry = Marshal.load(@blob_client.get_blob(@container, registry_path)[1])
418
+ #[0] headers [1] responsebody
419
+ @logger.info("migrating from remote registry #{path}")
420
+ else
421
+ if !Dir.exist?(@registry_local_path)
422
+ FileUtils.mkdir_p(@registry_local_path)
423
+ end
424
+ @registry = Marshal.load(File.read(@registry_local_path+"/"+@pipe_id))
425
+ @logger.info("resuming from local registry #{@registry_local_path+"/"+@pipe_id}")
426
+ end
427
+ else
428
+ @registry = Marshal.load(@blob_client.get_blob(@container, registry_path)[1])
429
+ #[0] headers [1] responsebody
430
+ @logger.info("resuming from remote registry #{path}")
431
+ end
432
+ break
433
+ rescue Exception => e
434
+ @logger.error("caught: #{e.message}")
435
+ @registry.clear
436
+ @logger.error("loading registry failed for attempt #{counter} of 3")
437
+ end
438
+ end
439
+ end
440
+ # read filelist and set offsets to file length to mark all the old files as done
441
+ if @registry_create_policy == "start_fresh"
442
+ @registry = list_blobs(true)
443
+ #save_registry()
444
+ @logger.info("starting fresh, with a clean registry containing #{@registry.size} blobs/files")
445
+ end
446
+ end
381
447
 
382
448
  def full_read(filename)
383
449
  tries ||= 2
384
450
  begin
385
- return @blob_client.get_blob(container, filename)[1]
451
+ return @blob_client.get_blob(@container, filename)[1]
386
452
  rescue Exception => e
387
453
  @logger.error("caught: #{e.message} for full_read")
388
454
  if (tries -= 1) > 0
@@ -393,19 +459,56 @@ private
393
459
  end
394
460
  end
395
461
  begin
396
- chuck = @blob_client.get_blob(container, filename)[1]
462
+ chuck = @blob_client.get_blob(@container, filename)[1]
397
463
  end
398
464
  return chuck
399
465
  end
400
466
 
401
- def partial_read_json(filename, offset, length)
402
- content = @blob_client.get_blob(container, filename, start_range: offset-@tail.length, end_range: length-1)[1]
403
- if content.end_with?(@tail)
404
- # the tail is part of the last block, so included in the total length of the get_blob
405
- return @head + strip_comma(content)
406
- else
407
- # when the file has grown between list_blobs and the time of partial reading, the tail will be wrong
408
- return @head + strip_comma(content[0...-@tail.length]) + @tail
467
+ def partial_read(blobname, offset)
468
+ # 1. read committed blocks, calculate length
469
+ # 2. calculate the offset to read
470
+ # 3. strip comma
471
+ # if json strip comma and fix head and tail
472
+ size = 0
473
+
474
+ begin
475
+ if @append
476
+ return @blob_client.get_blob(@container, blobname, start_range: offset-1)[1]
477
+ end
478
+ blocks = @blob_client.list_blob_blocks(@container, blobname)
479
+ blocks[:committed].each do |block|
480
+ size += block.size
481
+ end
482
+ # read the new blob blocks from the offset to the last committed size.
483
+ # if it is json, fix the head and tail
484
+ # the committed block at the end is the tail, so it must be subtracted from the read, then the comma stripped and the tail added.
485
+ # the -1 for the end_range is needed because offsets start at 0, so the last byte is at size-1
486
+
487
+ # should we first check committed, read, and then check committed again? no, only read the committed size
488
+ # should read the full content and then subtract the json tail
489
+
490
+ unless @is_json
491
+ return @blob_client.get_blob(@container, blobname, start_range: offset, end_range: size-1)[1]
492
+ else
493
+ content = @blob_client.get_blob(@container, blobname, start_range: offset-1, end_range: size-1)[1]
494
+ if content.end_with?(@tail)
495
+ return @head + strip_comma(content)
496
+ else
497
+ @logger.info("Fixed a tail! probably new committed blocks started appearing!")
498
+ # subtract the length of the tail and add the tail back, because the file grew. size was calculated at the block boundary, so replacing the last bytes with the tail should fix the problem
499
+ return @head + strip_comma(content[0...-@tail.length]) + @tail
500
+ end
501
+ end
502
+ rescue InvalidBlobType => ibt
503
+ @logger.error("caught #{ibt.message}. Setting BlobType to append")
504
+ @append = true
505
+ retry
506
+ rescue NoMethodError => nme
507
+ @logger.error("caught #{nme.message}. Setting append to true")
508
+ @append = true
509
+ retry
510
+ rescue Exception => e
511
+ @logger.error("caught #{e.message}")
409
512
  end
410
513
  end
411
514
 
@@ -422,8 +525,9 @@ private
422
525
  count=0
423
526
  begin
424
527
  json["records"].each do |record|
425
- res = resource(record["resourceId"])
426
- resource = { :subscription => res[:subscription], :resourcegroup => res[:resourcegroup], :nsg => res[:nsg] }
528
+ resource = resource(record["resourceId"])
529
+ # resource = { :subscription => res[:subscription], :resourcegroup => res[:resourcegroup], :nsg => res[:nsg] }
530
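+ # extra fields (time, system, mac, category, operation), merged into each event when addall is set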
+ extras = { :time => record["time"], :system => record["systemId"], :mac => record["macAddress"], :category => record["category"], :operation => record["operationName"] }
427
531
  @logger.trace(resource.to_s)
428
532
  record["properties"]["flows"].each do |flows|
429
533
  rule = resource.merge ({ :rule => flows["rule"]})
@@ -442,7 +546,18 @@ private
442
546
  if @addfilename
443
547
  ev.merge!( {:filename => name } )
444
548
  end
549
+ unless @environment.nil?
550
+ ev.merge!( {:environment => environment } )
551
+ end
552
+ if @addall
553
+ ev.merge!( extras )
554
+ end
555
+
556
+ # Add event to logstash queue
445
557
  event = LogStash::Event.new('message' => ev.to_json)
558
+ #if @ecs_compatibility != "disabled"
559
+ # event = ecs(event)
560
+ #end
446
561
  decorate(event)
447
562
  queue << event
448
563
  count+=1
@@ -493,26 +608,31 @@ private
493
608
  nextMarker = nil
494
609
  counter = 1
495
610
  loop do
496
- blobs = @blob_client.list_blobs(container, { marker: nextMarker, prefix: @prefix})
497
- blobs.each do |blob|
498
- # FNM_PATHNAME is required so that "**/test" can match "test" at the root folder
499
- # FNM_EXTGLOB allows you to use "test{a,b,c}" to match either "testa", "testb" or "testc" (closer to shell behavior)
500
- unless blob.name == registry_path
501
- if @path_filters.any? {|path| File.fnmatch?(path, blob.name, File::FNM_PATHNAME | File::FNM_EXTGLOB)}
502
- length = blob.properties[:content_length].to_i
503
- offset = 0
504
- if fill
505
- offset = length
611
+ begin
612
+ blobs = @blob_client.list_blobs(@container, { marker: nextMarker, prefix: @prefix})
613
+ blobs.each do |blob|
614
+ # FNM_PATHNAME is required so that "**/test" can match "test" at the root folder
615
+ # FNM_EXTGLOB allows you to use "test{a,b,c}" to match either "testa", "testb" or "testc" (closer to shell behavior)
616
+ unless blob.name == registry_path
617
+ if @path_filters.any? {|path| File.fnmatch?(path, blob.name, File::FNM_PATHNAME | File::FNM_EXTGLOB)}
618
+ length = blob.properties[:content_length].to_i
619
+ offset = 0
620
+ if fill
621
+ offset = length
622
+ end
623
+ files.store(blob.name, { :offset => offset, :length => length })
624
+ if (@debug_until > @processed) then @logger.info("1: list_blobs #{blob.name} #{offset} #{length}") end
506
625
  end
507
- files.store(blob.name, { :offset => offset, :length => length })
508
- if (@debug_until > @processed) then @logger.info("1: list_blobs #{blob.name} #{offset} #{length}") end
509
626
  end
510
627
  end
628
+ nextMarker = blobs.continuation_token
629
+ break unless nextMarker && !nextMarker.empty?
630
+ if (counter % 10 == 0) then @logger.info(" listing #{counter * 50000} files") end
631
+ counter+=1
632
+ rescue Exception => e
633
+ @logger.error("caught: #{e.message} while trying to list blobs")
634
+ return files
511
635
  end
512
- nextMarker = blobs.continuation_token
513
- break unless nextMarker && !nextMarker.empty?
514
- if (counter % 10 == 0) then @logger.info(" listing #{counter * 50000} files") end
515
- counter+=1
516
636
  end
517
637
  if @debug_timer
518
638
  @logger.info("list_blobs took #{Time.now.to_i - chrono} sec")
@@ -532,7 +652,7 @@ private
532
652
  begin
533
653
  @busy_writing_registry.lock
534
654
  unless (@registry_local_path)
535
- @blob_client.create_block_blob(container, registry_path, regdump)
655
+ @blob_client.create_block_blob(@container, registry_path, regdump)
536
656
  @logger.info("processed #{@processed} events, saving #{regsize} blobs and offsets to remote registry #{registry_path}")
537
657
  else
538
658
  File.open(@registry_local_path+"/"+@pipe_id, 'w') { |file| file.write(regdump) }
@@ -558,20 +678,20 @@ private
558
678
  @logger.info("learn_encapsulation, this can be skipped by setting skip_learning => true. Or set both head_file and tail_file")
559
679
  # From one file, read first block and last block to learn head and tail
560
680
  begin
561
- blobs = @blob_client.list_blobs(container, { max_results: 3, prefix: @prefix})
681
+ blobs = @blob_client.list_blobs(@container, { max_results: 3, prefix: @prefix})
562
682
  blobs.each do |blob|
563
683
  unless blob.name == registry_path
564
684
  begin
565
- blocks = @blob_client.list_blob_blocks(container, blob.name)[:committed]
566
- if blocks.first.name.start_with?('A00')
685
+ blocks = @blob_client.list_blob_blocks(@container, blob.name)[:committed]
686
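+ # match both the plain block id ('A00...'/'Z00...') and its Base64-encoded equivalent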
+ if ['A00000000000000000000000000000000','QTAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAw'].include?(blocks.first.name)
567
687
  @logger.debug("using #{blob.name}/#{blocks.first.name} to learn the json header")
568
- @head = @blob_client.get_blob(container, blob.name, start_range: 0, end_range: blocks.first.size-1)[1]
688
+ @head = @blob_client.get_blob(@container, blob.name, start_range: 0, end_range: blocks.first.size-1)[1]
569
689
  end
570
- if blocks.last.name.start_with?('Z00')
690
+ if ['Z00000000000000000000000000000000','WjAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAw'].include?(blocks.last.name)
571
691
  @logger.debug("using #{blob.name}/#{blocks.last.name} to learn the json footer")
572
692
  length = blob.properties[:content_length].to_i
573
693
  offset = length - blocks.last.size
574
- @tail = @blob_client.get_blob(container, blob.name, start_range: offset, end_range: length-1)[1]
694
+ @tail = @blob_client.get_blob(@container, blob.name, start_range: offset, end_range: length-1)[1]
575
695
  @logger.debug("learned tail: #{@tail}")
576
696
  end
577
697
  rescue Exception => e
@@ -586,15 +706,61 @@ private
586
706
 
587
707
  def resource(str)
588
708
  temp = str.split('/')
589
- date = '---'
590
- unless temp[9].nil?
591
- date = val(temp[9])+'/'+val(temp[10])+'/'+val(temp[11])+'-'+val(temp[12])+':00'
592
- end
593
- return {:subscription=> temp[2], :resourcegroup=>temp[4], :nsg=>temp[8], :date=>date}
709
+ #date = '---'
710
+ #unless temp[9].nil?
711
+ # date = val(temp[9])+'/'+val(temp[10])+'/'+val(temp[11])+'-'+val(temp[12])+':00'
712
+ #end
713
+ return {:subscription=> temp[2], :resourcegroup=>temp[4], :nsg=>temp[8]}
594
714
  end
595
715
 
596
716
  def val(str)
597
717
  return str.split('=')[1]
598
718
  end
599
-
600
719
  end # class LogStash::Inputs::AzureBlobStorage
720
+
721
+ # This is a start towards mapping NSG events to ECS fields ... it's complicated
722
+ =begin
723
+ def ecs(old)
724
+ # https://www.elastic.co/guide/en/ecs/current/ecs-field-reference.html
725
+ ecs = LogStash::Event.new()
726
+ ecs.set("ecs.version", "1.0.0")
727
+ ecs.set("@timestamp", old.timestamp)
728
+ ecs.set("cloud.provider", "azure")
729
+ ecs.set("cloud.account.id", old.get("[subscription]")
730
+ ecs.set("cloud.project.id", old.get("[environment]")
731
+ ecs.set("file.name", old.get("[filename]")
732
+ ecs.set("event.category", "network")
733
+ if old.get("[decision]") == "D"
734
+ ecs.set("event.type", "denied")
735
+ else
736
+ ecs.set("event.type", "allowed")
737
+ end
738
+ ecs.set("event.action", "")
739
+ ecs.set("rule.ruleset", old.get("[nsg]")
740
+ ecs.set("rule.name", old.get("[rule]")
741
+ ecs.set("trace.id", old.get("[protocol]")+"/"+old.get("[src_ip]")+":"+old.get("[src_port]")+"-"+old.get("[dst_ip]")+":"+old.get("[dst_port]")
742
+ # requires logic to match sockets and flip src/dst for outgoing.
743
+ ecs.set("host.mac", old.get("[mac]")
744
+ ecs.set("source.ip", old.get("[src_ip]")
745
+ ecs.set("source.port", old.get("[src_port]")
746
+ ecs.set("source.bytes", old.get("[srcbytes]")
747
+ ecs.set("source.packets", old.get("[src_pack]")
748
+ ecs.set("destination.ip", old.get("[dst_ip]")
749
+ ecs.set("destination.port", old.get("[dst_port]")
750
+ ecs.set("destination.bytes", old.get("[dst_bytes]")
751
+ ecs.set("destination.packets", old.get("[dst_packets]")
752
+ if old.get("[protocol]") = "U"
753
+ ecs.set("network.transport", "udp")
754
+ else
755
+ ecs.set("network.transport", "tcp")
756
+ end
757
+ if old.get("[decision]") == "I"
758
+ ecs.set("network.direction", "incoming")
759
+ else
760
+ ecs.set("network.direction", "outgoing")
761
+ end
762
+ ecs.set("network.bytes", old.get("[src_bytes]")+old.get("[dst_bytes]")
763
+ ecs.set("network.packets", old.get("[src_packets]")+old.get("[dst_packets]")
764
+ return ecs
765
+ end
766
+ =end
@@ -1,6 +1,6 @@
1
1
  Gem::Specification.new do |s|
2
2
  s.name = 'logstash-input-azure_blob_storage'
3
- s.version = '0.12.6'
3
+ s.version = '0.12.8'
4
4
  s.licenses = ['Apache-2.0']
5
5
  s.summary = 'This logstash plugin reads and parses data from Azure Storage Blobs.'
6
6
  s.description = <<-EOF
@@ -24,5 +24,5 @@ EOF
24
24
  s.add_runtime_dependency 'stud', '~> 0.0.23'
25
25
  s.add_runtime_dependency 'azure-storage-blob', '~> 2', '>= 2.0.3'
26
26
  s.add_development_dependency 'logstash-devutils', '~> 2.4'
27
- s.add_development_dependency 'rubocop', '~> 1.48'
27
+ s.add_development_dependency 'rubocop', '~> 1.50'
28
28
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: logstash-input-azure_blob_storage
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.12.6
4
+ version: 0.12.8
5
5
  platform: ruby
6
6
  authors:
7
7
  - Jan Geertsma
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2023-03-17 00:00:00.000000000 Z
11
+ date: 2023-07-15 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  requirement: !ruby/object:Gem::Requirement
@@ -77,7 +77,7 @@ dependencies:
77
77
  requirements:
78
78
  - - "~>"
79
79
  - !ruby/object:Gem::Version
80
- version: '1.48'
80
+ version: '1.50'
81
81
  name: rubocop
82
82
  prerelease: false
83
83
  type: :development
@@ -85,7 +85,7 @@ dependencies:
85
85
  requirements:
86
86
  - - "~>"
87
87
  - !ruby/object:Gem::Version
88
- version: '1.48'
88
+ version: '1.50'
89
89
  description: " This gem is a Logstash plugin. It reads and parses data from Azure\
90
90
  \ Storage Blobs. The azure_blob_storage is a reimplementation to replace azureblob\
91
91
  \ from azure-diagnostics-tools/Logstash. It can deal with larger volumes and partial\