logstash-input-azure_blob_storage 0.12.3 → 0.12.5

@@ -1,573 +1,598 @@
1
1
  # encoding: utf-8
2
2
  require 'logstash/inputs/base'
3
+ #require 'logstash/namespace'
3
4
  require 'stud/interval'
4
5
  require 'azure/storage/blob'
5
6
  require 'json'
6
7
 
7
- # This is a logstash input plugin for files in Azure Blob Storage. There is a storage explorer in the portal and an application with the same name https://storageexplorer.com. A storage account has by default a globally unique name, {storageaccount}.blob.core.windows.net which is a CNAME to Azures blob servers blob.*.store.core.windows.net. A storageaccount has an container and those have a directory and blobs (like files). Blobs have one or more blocks. After writing the blocks, they can be committed. Some Azure diagnostics can send events to an EventHub that can be parse through the plugin logstash-input-azure_event_hubs, but for the events that are only stored in an storage account, use this plugin. The original logstash-input-azureblob from azure-diagnostics-tools is great for low volumes, but it suffers from outdated client, slow reads, lease locking issues and json parse errors.
8
- # https://azure.microsoft.com/en-us/services/storage/blobs/
8
+ # This is a logstash input plugin for files in Azure Storage Accounts. There is a storage explorer in the portal and an application with the same name https://storageexplorer.com.
9
+
10
+ # https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction
11
+ # The hierarchy of Azure blob storage is
12
+ # Tenant > Subscription > ResourceGroup > StorageAccount > Container > FileBlobs > Blocks
13
+ # A storage account can store blobs, file shares, queues and tables. This plugin uses the Azure Storage Ruby library to fetch blobs and process the data in the blocks; it deals with blobs growing over time and ignores archive blobs.
14
+ #
15
+ # block-id bytes content
16
+ # A00000000000000000000000000000000 12 {"records":[
17
+ # D672f4bbd95a04209b00dc05d899e3cce 2576 json objects for 1st minute
18
+ # D7fe0d4f275a84c32982795b0e5c7d3a1 2312 json objects for 2nd minute
19
+ # Z00000000000000000000000000000000 2 ]}
20
+
21
+ # A storage account has by default a globally unique name, {storageaccount}.blob.core.windows.net, which is a CNAME to Azure's blob servers blob.*.store.core.windows.net. A storage account has containers and those have a directory and blobs (like files). Blobs have one or more blocks. After writing the blocks, they can be committed. Some Azure diagnostics can send events to an EventHub that can be parsed by the plugin logstash-input-azure_event_hubs, but for events that are only stored in a storage account, use this plugin. The original logstash-input-azureblob from azure-diagnostics-tools is great for low volumes, but it suffers from an outdated client, slow reads, lease locking issues and json parse errors.
22
+
23
+
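# A minimal pipeline configuration sketch (illustrative only, not shipped with this gem; all values are placeholders):
#
# input {
#     azure_blob_storage {
#         storageaccount => "yourstorageaccountname"
#         access_key => "Ba5e64c0d3=="
#         container => "insights-logs-networksecuritygroupflowevent"
#     }
# }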
9
24
  class LogStash::Inputs::AzureBlobStorage < LogStash::Inputs::Base
10
- config_name "azure_blob_storage"
25
+ config_name "azure_blob_storage"
11
26
 
12
- # If undefined, Logstash will complain, even if codec is unused. The codec for nsgflowlog is "json" and the for WADIIS and APPSERVICE is "line".
13
- default :codec, "json"
27
+ # If undefined, Logstash will complain, even if codec is unused. The codec for nsgflowlog is "json" and for WADIIS and APPSERVICE it is "line".
28
+ default :codec, "json"
14
29
 
15
- # logtype can be nsgflowlog, wadiis, appservice or raw. The default is raw, where files are read and added as one event. If the file grows, the next interval the file is read from the offset, so that the delta is sent as another event. In raw mode, further processing has to be done in the filter block. If the logtype is specified, this plugin will split and mutate and add individual events to the queue.
16
- config :logtype, :validate => ['nsgflowlog','wadiis','appservice','raw'], :default => 'raw'
30
+ # logtype can be nsgflowlog, wadiis, appservice or raw. The default is raw, where files are read and added as one event. If the file grows, the next interval the file is read from the offset, so that the delta is sent as another event. In raw mode, further processing has to be done in the filter block. If the logtype is specified, this plugin will split and mutate and add individual events to the queue.
31
+ config :logtype, :validate => ['nsgflowlog','wadiis','appservice','raw'], :default => 'raw'
17
32
 
18
- # The storage account is accessed through Azure::Storage::Blob::BlobService, it needs either a sas_token, connection string or a storageaccount/access_key pair.
19
- # https://github.com/Azure/azure-storage-ruby/blob/master/blob/lib/azure/storage/blob/blob_service.rb#L42
20
- config :connection_string, :validate => :password, :required => false
33
+ # The storage account is accessed through Azure::Storage::Blob::BlobService, it needs either a sas_token, connection string or a storageaccount/access_key pair.
34
+ # https://github.com/Azure/azure-storage-ruby/blob/master/blob/lib/azure/storage/blob/blob_service.rb#L42
35
+ config :connection_string, :validate => :password, :required => false
21
36
 
22
- # The storage account name for the azure storage account.
23
- config :storageaccount, :validate => :string, :required => false
37
+ # The storage account name for the azure storage account.
38
+ config :storageaccount, :validate => :string, :required => false
24
39
 
25
- # DNS Suffix other then blob.core.windows.net
26
- config :dns_suffix, :validate => :string, :required => false, :default => 'core.windows.net'
40
+ # The (primary or secondary) Access Key for the storage account. The key can be found in portal.azure.com or through the azure api StorageAccounts/ListKeys, for example with the PowerShell command Get-AzStorageAccountKey.
41
+ config :access_key, :validate => :password, :required => false
27
42
 
28
- # For development this can be used to emulate an accountstorage when not available from azure
29
- #config :use_development_storage, :validate => :boolean, :required => false
43
+ # SAS is a Shared Access Signature that provides restricted access rights. If the sas_token is absent, the access_key is used instead.
44
+ config :sas_token, :validate => :password, :required => false
30
45
 
31
- # The (primary or secondary) Access Key for the the storage account. The key can be found in the portal.azure.com or through the azure api StorageAccounts/ListKeys. For example the PowerShell command Get-AzStorageAccountKey.
32
- config :access_key, :validate => :password, :required => false
46
+ # The container of the blobs.
47
+ config :container, :validate => :string, :default => 'insights-logs-networksecuritygroupflowevent'
33
48
 
34
- # SAS is the Shared Access Signature, that provides restricted access rights. If the sas_token is absent, the access_key is used instead.
35
- config :sas_token, :validate => :password, :required => false
49
+ # DNS Suffix other than blob.core.windows.net, needed for example for government clouds.
50
+ config :dns_suffix, :validate => :string, :required => false, :default => 'core.windows.net'
36
51
 
37
- # The container of the blobs.
38
- config :container, :validate => :string, :default => 'insights-logs-networksecuritygroupflowevent'
52
+ # For development this can be used to emulate a storage account when one is not available from azure
53
+ #config :use_development_storage, :validate => :boolean, :required => false
39
54
 
40
- # The registry file keeps track of the files that have been processed and until which offset in bytes. It's similar in function
41
- #
42
- # The default, `data/registry`, it contains a Ruby Marshal Serialized Hash of the filename the offset read sofar and the filelength the list time a filelisting was done.
43
- config :registry_path, :validate => :string, :required => false, :default => 'data/registry.dat'
55
+ # The registry keeps track of the files that were already processed.
56
+ # The registry file keeps track of the files that have been processed and up to which offset in bytes, similar in function to the sincedb of the file input plugin.
57
+ #
58
+ # The default is `data/registry.dat`; it contains a Ruby Marshal serialized Hash of the filename, the offset read so far and the file length from the last time a file listing was done.
59
+ config :registry_path, :validate => :string, :required => false, :default => 'data/registry.dat'
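# Illustrative sketch (not part of the plugin) of the registry structure and how it is serialized with Marshal;
# the blob name used as key is a made-up placeholder:
#   registry = { "resourceId=/.../PT1H.json" => { :offset => 12345, :length => 23456 } }
#   serialized = Marshal.dump(registry)   # written to the registry blob or to registry_local_path
#   registry = Marshal.load(serialized)   # read back when the pipeline resumes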
44
60
 
45
- # If registry_local_path is set to a directory on the local server, the registry is save there instead of the remote blob_storage
46
- config :registry_local_path, :validate => :string, :required => false
61
+ # If registry_local_path is set to a directory on the local server, the registry is saved there instead of on the remote blob storage
62
+ config :registry_local_path, :validate => :string, :required => false
47
63
 
48
- # The default, `resume`, will load the registry offsets and will start processing files from the offsets.
49
- # When set to `start_over`, all log files are processed from begining.
50
- # when set to `start_fresh`, it will read log files that are created or appended since this start of the pipeline.
51
- config :registry_create_policy, :validate => ['resume','start_over','start_fresh'], :required => false, :default => 'resume'
64
+ # The default, `resume`, will load the registry offsets and will start processing files from the offsets.
65
+ # When set to `start_over`, all log files are processed from the beginning.
66
+ # When set to `start_fresh`, it will only read log files that are created or appended after the start of the pipeline.
67
+ config :registry_create_policy, :validate => ['resume','start_over','start_fresh'], :required => false, :default => 'resume'
52
68
 
53
- # The registry keeps track of the files that where already procesed. The interval is used to save the registry regularly, when new events have have been processed. It is also used to wait before listing the files again and substraciting the registry of already processed files to determine the worklist.
54
- #
55
- # waiting time in seconds until processing the next batch. NSGFLOWLOGS append a block per minute, so use multiples of 60 seconds, 300 for 5 minutes, 600 for 10 minutes. The registry is also saved after every interval.
56
- # Partial reading starts from the offset and reads until the end, so the starting tag is prepended
57
- #
58
- # A00000000000000000000000000000000 12 {"records":[
59
- # D672f4bbd95a04209b00dc05d899e3cce 2576 json objects for 1st minute
60
- # D7fe0d4f275a84c32982795b0e5c7d3a1 2312 json objects for 2nd minute
61
- # Z00000000000000000000000000000000 2 ]}
62
- config :interval, :validate => :number, :default => 60
69
+ # The interval is used to save the registry regularly when new events have been processed. It is also used to wait before listing the files again and subtracting the registry of already processed files to determine the worklist.
70
+ # Waiting time in seconds until processing the next batch. NSG flow logs append a block per minute, so use multiples of 60 seconds, 300 for 5 minutes, 600 for 10 minutes. The registry is also saved after every interval.
71
+ # Partial reading starts from the offset and reads until the end, so the starting tag is prepended
72
+ config :interval, :validate => :number, :default => 60
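# Worked example of the sleep calculation used at the end of the main loop, sleeptime = interval - ((now - start) % interval):
# with interval => 60 and a batch that took 75 seconds, sleeptime = 60 - (75 % 60) = 45 seconds.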
63
73
 
64
- # add the filename into the events
65
- config :addfilename, :validate => :boolean, :default => false, :required => false
74
+ # add the filename as a field into the events
75
+ config :addfilename, :validate => :boolean, :default => false, :required => false
66
76
 
67
- # debug_until will for a maximum amount of processed messages shows 3 types of log printouts including processed filenames. This is a lightweight alternative to switching the loglevel from info to debug or even trace
68
- config :debug_until, :validate => :number, :default => 0, :required => false
77
+ # debug_until shows, for up to this number of processed messages after the pipeline starts, 3 types of log printouts including processed filenames. After that number of events, the plugin stops logging them and continues silently. This is a lightweight alternative to switching the loglevel from info to debug or even trace to see what the plugin is doing and how fast it is at startup. A good value is approximately 3x the number of events per file, for instance 6000 events.
78
+ config :debug_until, :validate => :number, :default => 0, :required => false
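# The 3 printout types referred to above, as logged further down in this file, look like:
#   "1: list_blobs <blobname> <offset> <length>"
#   "2: adding offsets: <blobname> <offset> <length>"
#   "3: processing <blobname> from <offset> to <length>"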
69
79
 
70
- # debug_timer show time spent on activities
71
- config :debug_timer, :validate => :boolean, :default => false, :required => false
80
+ # debug_timer shows in the logs the time spent on activities
81
+ config :debug_timer, :validate => :boolean, :default => false, :required => false
72
82
 
73
- # WAD IIS Grok Pattern
74
- #config :grokpattern, :validate => :string, :required => false, :default => '%{TIMESTAMP_ISO8601:log_timestamp} %{NOTSPACE:instanceId} %{NOTSPACE:instanceId2} %{IPORHOST:ServerIP} %{WORD:httpMethod} %{URIPATH:requestUri} %{NOTSPACE:requestQuery} %{NUMBER:port} %{NOTSPACE:username} %{IPORHOST:clientIP} %{NOTSPACE:httpVersion} %{NOTSPACE:userAgent} %{NOTSPACE:cookie} %{NOTSPACE:referer} %{NOTSPACE:host} %{NUMBER:httpStatus} %{NUMBER:subresponse} %{NUMBER:win32response} %{NUMBER:sentBytes:int} %{NUMBER:receivedBytes:int} %{NUMBER:timeTaken:int}'
83
+ # WAD IIS Grok Pattern
84
+ #config :grokpattern, :validate => :string, :required => false, :default => '%{TIMESTAMP_ISO8601:log_timestamp} %{NOTSPACE:instanceId} %{NOTSPACE:instanceId2} %{IPORHOST:ServerIP} %{WORD:httpMethod} %{URIPATH:requestUri} %{NOTSPACE:requestQuery} %{NUMBER:port} %{NOTSPACE:username} %{IPORHOST:clientIP} %{NOTSPACE:httpVersion} %{NOTSPACE:userAgent} %{NOTSPACE:cookie} %{NOTSPACE:referer} %{NOTSPACE:host} %{NUMBER:httpStatus} %{NUMBER:subresponse} %{NUMBER:win32response} %{NUMBER:sentBytes:int} %{NUMBER:receivedBytes:int} %{NUMBER:timeTaken:int}'
75
85
 
76
- # skip learning if you use json and don't want to learn the head and tail, but use either the defaults or configure them.
77
- config :skip_learning, :validate => :boolean, :default => false, :required => false
86
+ # Skip learning if you use json and don't want to learn the head and tail; use either the defaults or configure them.
87
+ config :skip_learning, :validate => :boolean, :default => false, :required => false
78
88
 
79
- # The string that starts the JSON. Only needed when the codec is JSON. When partial file are read, the result will not be valid JSON unless the start and end are put back. the file_head and file_tail are learned at startup, by reading the first file in the blob_list and taking the first and last block, this would work for blobs that are appended like nsgflowlogs. The configuration can be set to override the learning. In case learning fails and the option is not set, the default is to use the 'records' as set by nsgflowlogs.
80
- config :file_head, :validate => :string, :required => false, :default => '{"records":['
81
- # The string that ends the JSON
82
- config :file_tail, :validate => :string, :required => false, :default => ']}'
89
+ # The string that starts the JSON. Only needed when the codec is JSON. When partial files are read, the result will not be valid JSON unless the start and end are put back. The file_head and file_tail are learned at startup by reading the first file in the blob list and taking the first and last block; this works for blobs that are appended like nsgflowlogs. The configuration can be set to override the learning. In case learning fails and the option is not set, the default is to use the 'records' wrapper as used by nsgflowlogs.
90
+ config :file_head, :validate => :string, :required => false, :default => '{"records":['
91
+ # The string that ends the JSON
92
+ config :file_tail, :validate => :string, :required => false, :default => ']}'
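# Illustrative sketch (not part of the plugin) of how a partial read is stitched back into valid JSON,
# mirroring what partial_read_json and strip_comma do further down:
#   head  = '{"records":['
#   tail  = ']}'
#   delta = ',{"second":"event"}]}'       # bytes read from the last known offset to the end of the blob
#   json  = head + delta.sub(/\A,/, '')   # => '{"records":[{"second":"event"}]}'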
83
93
 
84
- # The path(s) to the file(s) to use as an input. By default it will
85
- # watch every files in the storage container.
86
- # You can use filename patterns here, such as `logs/*.log`.
87
- # If you use a pattern like `logs/**/*.log`, a recursive search
88
- # of `logs` will be done for all `*.log` files.
89
- # Do not include a leading `/`, as Azure path look like this:
90
- # `path/to/blob/file.txt`
91
- #
92
- # You may also configure multiple paths. See an example
93
- # on the <<array,Logstash configuration page>>.
94
- # For NSGFLOWLOGS a path starts with "resourceId=/", but this would only be needed to exclude other files that may be written in the same container.
95
- config :prefix, :validate => :string, :required => false
94
+ # By default it will watch every file in the storage container. The prefix option is a simple filter that only processes files with a path that starts with that value.
95
+ # For NSGFLOWLOGS a path starts with "resourceId=/". This would only be needed to exclude other paths that may be written in the same container. The registry file will be excluded.
96
+ # You may also configure multiple paths. See an example on the <<array,Logstash configuration page>>.
97
+ # Do not include a leading `/`, as Azure paths look like this: `path/to/blob/file.txt`
98
+ config :prefix, :validate => :string, :required => false
96
99
 
97
- config :path_filters, :validate => :array, :default => ['**/*'], :required => false
100
+ # For filtering on filenames, you can use filename patterns, such as `logs/*.log`. If you use a pattern like `logs/**/*.log`, a recursive search of `logs` will be done for all `*.log` files in the logs directory.
101
+ # For the full pattern syntax see https://www.rubydoc.info/stdlib/core/File.fnmatch
102
+ config :path_filters, :validate => :array, :default => ['**/*'], :required => false
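# Illustrative sketch (not part of the plugin) of how these globs behave with the flags used in try_list_blobs:
#   File.fnmatch?('**/*', 'dir/sub/file.json', File::FNM_PATHNAME | File::FNM_EXTGLOB)          # => true
#   File.fnmatch?('logs/*.log', 'logs/2021/app.log', File::FNM_PATHNAME | File::FNM_EXTGLOB)    # => false, '*' does not cross '/'
#   File.fnmatch?('logs/**/*.log', 'logs/2021/app.log', File::FNM_PATHNAME | File::FNM_EXTGLOB) # => true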
98
103
 
99
104
 
100
105
 
101
106
  public
102
- def register
103
- @pipe_id = Thread.current[:name].split("[").last.split("]").first
104
- @logger.info("=== #{config_name} #{Gem.loaded_specs["logstash-input-"+config_name].version.to_s} / #{@pipe_id} / #{@id[0,6]} / ruby #{ RUBY_VERSION }p#{ RUBY_PATCHLEVEL } ===")
105
- @logger.info("If this plugin doesn't work, please raise an issue in https://github.com/janmg/logstash-input-azure_blob_storage")
106
- @busy_writing_registry = Mutex.new
107
- # TODO: consider multiple readers, so add pipeline @id or use logstash-to-logstash communication?
108
- # TODO: Implement retry ... Error: Connection refused - Failed to open TCP connection to
109
- end
110
-
111
-
112
-
113
- def run(queue)
114
- # counter for all processed events since the start of this pipeline
115
- @processed = 0
116
- @regsaved = @processed
117
-
118
- connect
119
-
120
- @registry = Hash.new
121
- if registry_create_policy == "resume"
122
- for counter in 1..3
123
- begin
124
- if (!@registry_local_path.nil?)
125
- unless File.file?(@registry_local_path+"/"+@pipe_id)
126
- @registry = Marshal.load(@blob_client.get_blob(container, registry_path)[1])
127
- #[0] headers [1] responsebody
128
- @logger.info("migrating from remote registry #{registry_path}")
129
- else
130
- if !Dir.exist?(@registry_local_path)
131
- FileUtils.mkdir_p(@registry_local_path)
132
- end
133
- @registry = Marshal.load(File.read(@registry_local_path+"/"+@pipe_id))
134
- @logger.info("resuming from local registry #{registry_local_path+"/"+@pipe_id}")
135
- end
136
- else
137
- @registry = Marshal.load(@blob_client.get_blob(container, registry_path)[1])
138
- #[0] headers [1] responsebody
139
- @logger.info("resuming from remote registry #{registry_path}")
140
- end
141
- break
142
- rescue Exception => e
143
- @logger.error("caught: #{e.message}")
144
- @registry.clear
145
- @logger.error("loading registry failed for attempt #{counter} of 3")
146
- end
147
- end
148
- end
149
- # read filelist and set offsets to file length to mark all the old files as done
150
- if registry_create_policy == "start_fresh"
151
- @registry = list_blobs(true)
152
- save_registry()
153
- @logger.info("starting fresh, writing a clean registry to contain #{@registry.size} blobs/files")
107
+ def register
108
+ @pipe_id = Thread.current[:name].split("[").last.split("]").first
109
+ @logger.info("=== #{config_name} #{Gem.loaded_specs["logstash-input-"+config_name].version.to_s} / #{@pipe_id} / #{@id[0,6]} / ruby #{ RUBY_VERSION }p#{ RUBY_PATCHLEVEL } ===")
110
+ @logger.info("If this plugin doesn't work, please raise an issue in https://github.com/janmg/logstash-input-azure_blob_storage")
111
+ @busy_writing_registry = Mutex.new
112
+ # TODO: consider multiple readers, so add pipeline @id or use logstash-to-logstash communication?
154
113
  end
155
114
 
156
- @is_json = false
157
- begin
158
- if @codec.class.name.eql?("LogStash::Codecs::JSON")
159
- @is_json = true
160
- end
161
- end
162
- @head = ''
163
- @tail = ''
164
- # if codec=json sniff one files blocks A and Z to learn file_head and file_tail
165
- if @is_json
166
- if file_head
167
- @head = file_head
168
- end
169
- if file_tail
170
- @tail = file_tail
115
+
116
+
117
+ def run(queue)
118
+ # counter for all processed events since the start of this pipeline
119
+ @processed = 0
120
+ @regsaved = @processed
121
+
122
+ connect
123
+
124
+ @registry = Hash.new
125
+ if registry_create_policy == "resume"
126
+ for counter in 1..3
127
+ begin
128
+ if (!@registry_local_path.nil?)
129
+ unless File.file?(@registry_local_path+"/"+@pipe_id)
130
+ @registry = Marshal.load(@blob_client.get_blob(container, registry_path)[1])
131
+ #[0] headers [1] responsebody
132
+ @logger.info("migrating from remote registry #{registry_path}")
133
+ else
134
+ if !Dir.exist?(@registry_local_path)
135
+ FileUtils.mkdir_p(@registry_local_path)
136
+ end
137
+ @registry = Marshal.load(File.read(@registry_local_path+"/"+@pipe_id))
138
+ @logger.info("resuming from local registry #{registry_local_path+"/"+@pipe_id}")
139
+ end
140
+ else
141
+ @registry = Marshal.load(@blob_client.get_blob(container, registry_path)[1])
142
+ #[0] headers [1] responsebody
143
+ @logger.info("resuming from remote registry #{registry_path}")
144
+ end
145
+ break
146
+ rescue Exception => e
147
+ @logger.error("caught: #{e.message}")
148
+ @registry.clear
149
+ @logger.error("loading registry failed for attempt #{counter} of 3")
150
+ end
151
+ end
171
152
  end
172
- if file_head and file_tail and !skip_learning
173
- learn_encapsulation
153
+ # read filelist and set offsets to file length to mark all the old files as done
154
+ if registry_create_policy == "start_fresh"
155
+ @registry = list_blobs(true)
156
+ save_registry()
157
+ @logger.info("starting fresh, writing a clean registry to contain #{@registry.size} blobs/files")
174
158
  end
175
- @logger.info("head will be: #{@head} and tail is set to #{@tail}")
176
- end
177
159
 
178
- filelist = Hash.new
179
- worklist = Hash.new
180
- @last = start = Time.now.to_i
181
-
182
- # This is the main loop, it
183
- # 1. Lists all the files in the remote storage account that match the path prefix
184
- # 2. Filters on path_filters to only include files that match the directory and file glob (**/*.json)
185
- # 3. Save the listed files in a registry of known files and filesizes.
186
- # 4. List all the files again and compare the registry with the new filelist and put the delta in a worklist
187
- # 5. Process the worklist and put all events in the logstash queue.
188
- # 6. if there is time left, sleep to complete the interval. If processing takes more than an inteval, save the registry and continue.
189
- # 7. If stop signal comes, finish the current file, save the registry and quit
190
- while !stop?
191
- # load the registry, compare it's offsets to file list, set offset to 0 for new files, process the whole list and if finished within the interval wait for next loop,
192
- # TODO: sort by timestamp ?
193
- #filelist.sort_by(|k,v|resource(k)[:date])
194
- worklist.clear
195
- filelist.clear
196
-
197
- # Listing all the files
198
- filelist = list_blobs(false)
199
- filelist.each do |name, file|
200
- off = 0
201
- begin
202
- off = @registry[name][:offset]
203
- rescue
204
- off = 0
160
+ @is_json = false
161
+ @is_json_line = false
162
+ begin
163
+ if @codec.class.name.eql?("LogStash::Codecs::JSON")
164
+ @is_json = true
165
+ elsif @codec.class.name.eql?("LogStash::Codecs::JSONLines")
166
+ @is_json_line = true
205
167
  end
206
- @registry.store(name, { :offset => off, :length => file[:length] })
207
- if (@debug_until > @processed) then @logger.info("2: adding offsets: #{name} #{off} #{file[:length]}") end
208
- end
209
- # size nilClass when the list doesn't grow?!
210
-
211
- # clean registry of files that are not in the filelist
212
- @registry.each do |name,file|
213
- unless filelist.include?(name)
214
- @registry.delete(name)
215
- if (@debug_until > @processed) then @logger.info("purging #{name}") end
168
+ end
169
+ @head = ''
170
+ @tail = ''
171
+ # if codec=json sniff one files blocks A and Z to learn file_head and file_tail
172
+ if @is_json
173
+ if file_head
174
+ @head = file_head
175
+ end
176
+ if file_tail
177
+ @tail = file_tail
216
178
  end
179
+ if file_head and file_tail and !skip_learning
180
+ learn_encapsulation
181
+ end
182
+ @logger.info("head will be: #{@head} and tail is set to #{@tail}")
217
183
  end
218
184
 
219
- # Worklist is the subset of files where the already read offset is smaller than the file size
220
- worklist.clear
221
- chunk = nil
222
-
223
- worklist = @registry.select {|name,file| file[:offset] < file[:length]}
224
- if (worklist.size > 4) then @logger.info("worklist contains #{worklist.size} blobs") end
225
-
226
- # Start of processing
227
- # This would be ideal for threading since it's IO intensive, would be nice with a ruby native ThreadPool
228
- if (worklist.size > 0) then
229
- worklist.each do |name, file|
230
- start = Time.now.to_i
231
- if (@debug_until > @processed) then @logger.info("3: processing #{name} from #{file[:offset]} to #{file[:length]}") end
232
- size = 0
233
- if file[:offset] == 0
234
- # This is where Sera4000 issue starts
235
- # For an append blob, reading full and crashing, retry, last_modified? ... lenght? ... committed? ...
236
- # length and skip reg value
237
- if (file[:length] > 0)
238
- begin
239
- chunk = full_read(name)
240
- size=chunk.size
241
- rescue Exception => e
242
- # Azure::Core::Http::HTTPError / undefined method `message='
243
- @logger.error("Failed to read #{name} ... will continue, set file as read and pretend this never happened")
244
- @logger.error("#{size} size and #{file[:length]} file length")
245
- size = file[:length]
246
- end
247
- else
248
- @logger.info("found a zero size file #{name}")
249
- chunk = nil
185
+ filelist = Hash.new
186
+ worklist = Hash.new
187
+ @last = start = Time.now.to_i
188
+
189
+ # This is the main loop, it
190
+ # 1. Lists all the files in the remote storage account that match the path prefix
191
+ # 2. Filters on path_filters to only include files that match the directory and file glob (**/*.json)
192
+ # 3. Save the listed files in a registry of known files and filesizes.
193
+ # 4. List all the files again and compare the registry with the new filelist and put the delta in a worklist
194
+ # 5. Process the worklist and put all events in the logstash queue.
195
+ # 6. if there is time left, sleep to complete the interval. If processing takes more than an interval, save the registry and continue.
196
+ # 7. If stop signal comes, finish the current file, save the registry and quit
197
+ while !stop?
198
+ # load the registry, compare its offsets to the file list, set offset to 0 for new files, process the whole list and if finished within the interval wait for the next loop,
199
+ # TODO: sort by timestamp ?
200
+ #filelist.sort_by(|k,v|resource(k)[:date])
201
+ worklist.clear
202
+ filelist.clear
203
+
204
+ # Listing all the files
205
+ filelist = list_blobs(false)
206
+ filelist.each do |name, file|
207
+ off = 0
208
+ begin
209
+ off = @registry[name][:offset]
210
+ rescue Exception => e
211
+ @logger.error("caught: #{e.message} while reading #{name}")
250
212
  end
251
- else
252
- chunk = partial_read_json(name, file[:offset], file[:length])
253
- @logger.debug("partial file #{name} from #{file[:offset]} to #{file[:length]}")
213
+ @registry.store(name, { :offset => off, :length => file[:length] })
214
+ if (@debug_until > @processed) then @logger.info("2: adding offsets: #{name} #{off} #{file[:length]}") end
254
215
  end
255
- if logtype == "nsgflowlog" && @is_json
256
- # skip empty chunks
257
- unless chunk.nil?
258
- res = resource(name)
259
- begin
260
- fingjson = JSON.parse(chunk)
261
- @processed += nsgflowlog(queue, fingjson, name)
262
- @logger.debug("Processed #{res[:nsg]} [#{res[:date]}] #{@processed} events")
263
- rescue JSON::ParserError
264
- @logger.error("parse error on #{res[:nsg]} [#{res[:date]}] offset: #{file[:offset]} length: #{file[:length]}")
216
+ # size nilClass when the list doesn't grow?!
217
+
218
+ # clean registry of files that are not in the filelist
219
+ @registry.each do |name,file|
220
+ unless filelist.include?(name)
221
+ @registry.delete(name)
222
+ if (@debug_until > @processed) then @logger.info("purging #{name}") end
265
223
  end
266
- end
267
- # TODO: Convert this to line based grokking.
268
- # TODO: ECS Compliance?
269
- elsif logtype == "wadiis" && !@is_json
270
- @processed += wadiislog(queue, name)
271
- else
272
- counter = 0
273
- begin
274
- @codec.decode(chunk) do |event|
275
- counter += 1
276
- if @addfilename
277
- event.set('filename', name)
224
+ end
225
+
226
+ # Worklist is the subset of files where the already read offset is smaller than the file size
227
+ worklist.clear
228
+ chunk = nil
229
+
230
+ worklist = @registry.select {|name,file| file[:offset] < file[:length]}
231
+ if (worklist.size > 4) then @logger.info("worklist contains #{worklist.size} blobs") end
232
+
233
+ # Start of processing
234
+ # This would be ideal for threading since it's IO intensive, would be nice with a ruby native ThreadPool
235
+ if (worklist.size > 0) then
236
+ worklist.each do |name, file|
237
+ start = Time.now.to_i
238
+ if (@debug_until > @processed) then @logger.info("3: processing #{name} from #{file[:offset]} to #{file[:length]}") end
239
+ size = 0
240
+ if file[:offset] == 0
241
+ # This is where Sera4000 issue starts
242
+ # For an append blob, reading full and crashing, retry, last_modified? ... lenght? ... committed? ...
243
+ # length and skip reg value
244
+ if (file[:length] > 0)
245
+ begin
246
+ chunk = full_read(name)
247
+ delta_size = chunk.size
248
+ rescue Exception => e
249
+ # Azure::Core::Http::HTTPError / undefined method `message='
250
+ @logger.error("Failed to read #{name} ... will continue, set file as read and pretend this never happened")
251
+ @logger.error("#{size} size and #{file[:length]} file length")
252
+ chunk = nil
253
+ delta_size = file[:length]
254
+ end
255
+ else
256
+ @logger.info("found a zero size file #{name}")
257
+ chunk = nil
258
+ delta_size = 0
259
+ end
260
+ else
261
+ chunk = partial_read_json(name, file[:offset], file[:length])
262
+ delta_size = chunk.size
263
+ @logger.debug("partial file #{name} from #{file[:offset]} to #{file[:length]}")
264
+ end
265
+
266
+ if logtype == "nsgflowlog" && @is_json
267
+ # skip empty chunks
268
+ unless chunk.nil?
269
+ res = resource(name)
270
+ begin
271
+ fingjson = JSON.parse(chunk)
272
+ @processed += nsgflowlog(queue, fingjson, name)
273
+ @logger.debug("Processed #{res[:nsg]} [#{res[:date]}] #{@processed} events")
274
+ rescue JSON::ParserError => e
275
+ @logger.error("parse error #{e.message} on #{res[:nsg]} [#{res[:date]}] offset: #{file[:offset]} length: #{file[:length]}")
276
+ @logger.debug("#{chunk}")
277
+ end
278
+ end
279
+ # TODO: Convert this to line based grokking.
280
+ # TODO: ECS Compliance?
281
+ elsif logtype == "wadiis" && !@is_json
282
+ @processed += wadiislog(queue, name)
283
+ else
284
+ # Handle JSONLines format
285
+ if !chunk.nil? && @is_json_line
286
+ newline_rindex = chunk.rindex("\n")
287
+ if newline_rindex.nil?
288
+ # No full line in chunk, skip it without updating the registry.
289
+ # Expecting that the JSON line would be filled in at a subsequent iteration.
290
+ next
291
+ end
292
+ chunk = chunk[0..newline_rindex]
293
+ delta_size = chunk.size
294
+ end
295
+
296
+ counter = 0
297
+ begin
298
+ @codec.decode(chunk) do |event|
299
+ counter += 1
300
+ if @addfilename
301
+ event.set('filename', name)
302
+ end
303
+ decorate(event)
304
+ queue << event
305
+ end
306
+ @processed += counter
307
+ rescue Exception => e
308
+ @logger.error("codec exception: #{e.message} .. will continue and pretend this never happened")
309
+ @logger.debug("#{chunk}")
310
+ end
311
+ end
312
+
313
+ # Update the size
314
+ size = file[:offset] + delta_size
315
+ @registry.store(name, { :offset => size, :length => file[:length] })
316
+
317
+ #@logger.info("name #{name} size #{size} len #{file[:length]}")
318
+ # if stop? good moment to stop what we're doing
319
+ if stop?
320
+ return
321
+ end
322
+ if ((Time.now.to_i - @last) > @interval)
323
+ save_registry()
278
324
  end
279
- decorate(event)
280
- queue << event
281
- end
282
- @processed += counter
283
- rescue Exception => e
284
- @logger.error("codec exception: #{e.message} .. will continue and pretend this never happened")
285
- @registry.store(name, { :offset => file[:length], :length => file[:length] })
286
- @logger.debug("#{chunk}")
287
325
  end
288
326
  end
289
- @registry.store(name, { :offset => size, :length => file[:length] })
290
- # TODO add input plugin option to prevent connection cache
291
- @blob_client.client.reset_agents!
292
- #@logger.info("name #{name} size #{size} len #{file[:length]}")
293
- # if stop? good moment to stop what we're doing
294
- if stop?
295
- return
296
- end
297
- if ((Time.now.to_i - @last) > @interval)
327
+ # The files that got processed after the last registry save need to be saved too, in case the worklist is empty for some intervals.
328
+ now = Time.now.to_i
329
+ if ((now - @last) > @interval)
298
330
  save_registry()
299
331
  end
300
- end
301
- end
302
- # The files that got processed after the last registry save need to be saved too, in case the worklist is empty for some intervals.
303
- now = Time.now.to_i
304
- if ((now - @last) > @interval)
305
- save_registry()
306
- end
307
- sleeptime = interval - ((now - start) % interval)
308
- if @debug_timer
309
- @logger.info("going to sleep for #{sleeptime} seconds")
332
+ sleeptime = interval - ((now - start) % interval)
333
+ if @debug_timer
334
+ @logger.info("going to sleep for #{sleeptime} seconds")
335
+ end
336
+ Stud.stoppable_sleep(sleeptime) { stop? }
310
337
  end
311
- Stud.stoppable_sleep(sleeptime) { stop? }
312
338
  end
313
- end
314
339
 
315
- def stop
316
- save_registry()
317
- end
318
- def close
319
- save_registry()
320
- end
340
+ def stop
341
+ save_registry()
342
+ end
343
+ def close
344
+ save_registry()
345
+ end
321
346
 
322
347
 
323
348
  private
324
- def connect
325
- # Try in this order to access the storageaccount
326
- # 1. storageaccount / sas_token
327
- # 2. connection_string
328
- # 3. storageaccount / access_key
329
-
330
- unless connection_string.nil?
331
- conn = connection_string.value
332
- end
333
- unless sas_token.nil?
334
- unless sas_token.value.start_with?('?')
335
- conn = "BlobEndpoint=https://#{storageaccount}.#{dns_suffix};SharedAccessSignature=#{sas_token.value}"
349
+ def connect
350
+ # Try in this order to access the storageaccount
351
+ # 1. storageaccount / sas_token
352
+ # 2. connection_string
353
+ # 3. storageaccount / access_key
354
+
355
+ unless connection_string.nil?
356
+ conn = connection_string.value
357
+ end
358
+ unless sas_token.nil?
359
+ unless sas_token.value.start_with?('?')
360
+ conn = "BlobEndpoint=https://#{storageaccount}.#{dns_suffix};SharedAccessSignature=#{sas_token.value}"
361
+ else
362
+ conn = sas_token.value
363
+ end
364
+ end
365
+ unless conn.nil?
366
+ @blob_client = Azure::Storage::Blob::BlobService.create_from_connection_string(conn)
336
367
  else
337
- conn = sas_token.value
368
+ # unless use_development_storage?
369
+ @blob_client = Azure::Storage::Blob::BlobService.create(
370
+ storage_account_name: storageaccount,
371
+ storage_dns_suffix: dns_suffix,
372
+ storage_access_key: access_key.value,
373
+ )
374
+ # else
375
+ # @logger.info("development storage emulator not yet implemented")
376
+ # end
338
377
  end
339
378
  end
340
- unless conn.nil?
341
- @blob_client = Azure::Storage::Blob::BlobService.create_from_connection_string(conn)
342
- else
343
- # unless use_development_storage?
344
- @blob_client = Azure::Storage::Blob::BlobService.create(
345
- storage_account_name: storageaccount,
346
- storage_dns_suffix: dns_suffix,
347
- storage_access_key: access_key.value,
348
- )
349
- # else
350
- # @logger.info("not yet implemented")
351
- # end
352
- end
353
- end
354
-
355
- def full_read(filename)
356
- tries ||= 2
357
- begin
358
- return @blob_client.get_blob(container, filename)[1]
359
- rescue Exception => e
360
- @logger.error("caught: #{e.message} for full_read")
361
- if (tries -= 1) > 0
362
- if e.message = "Connection reset by peer"
363
- connect
364
- end
365
- retry
379
+
380
+ def full_read(filename)
381
+ tries ||= 2
382
+ begin
383
+ return @blob_client.get_blob(container, filename)[1]
384
+ rescue Exception => e
385
+ @logger.error("caught: #{e.message} for full_read")
386
+ if (tries -= 1) > 0
387
+ if e.message == "Connection reset by peer"
388
+ connect
389
+ end
390
+ retry
391
+ end
366
392
  end
393
+ begin
394
+ chunk = @blob_client.get_blob(container, filename)[1]
395
+ end
396
+ return chunk
367
397
  end
368
- begin
369
- chuck = @blob_client.get_blob(container, filename)[1]
370
- end
371
- return chuck
372
- end
373
-
374
- def partial_read_json(filename, offset, length)
375
- content = @blob_client.get_blob(container, filename, start_range: offset-@tail.length, end_range: length-1)[1]
376
- if content.end_with?(@tail)
377
- # the tail is part of the last block, so included in the total length of the get_blob
378
- return @head + strip_comma(content)
379
- else
380
- # when the file has grown between list_blobs and the time of partial reading, the tail will be wrong
381
- return @head + strip_comma(content[0...-@tail.length]) + @tail
398
+
399
+ def partial_read_json(filename, offset, length)
400
+ content = @blob_client.get_blob(container, filename, start_range: offset-@tail.length, end_range: length-1)[1]
401
+ if content.end_with?(@tail)
402
+ # the tail is part of the last block, so included in the total length of the get_blob
403
+ return @head + strip_comma(content)
404
+ else
405
+ # when the file has grown between list_blobs and the time of partial reading, the tail will be wrong
406
+ return @head + strip_comma(content[0...-@tail.length]) + @tail
407
+ end
382
408
  end
383
- end
384
409
 
385
- def strip_comma(str)
386
- # when skipping over the first blocks the json will start with a comma that needs to be stripped. there should not be a trailing comma, but it gets stripped too
387
- if str.start_with?(',')
388
- str[0] = ''
410
+ def strip_comma(str)
411
+ # when skipping over the first blocks the json will start with a comma that needs to be stripped. there should not be a trailing comma, but it gets stripped too
412
+ if str.start_with?(',')
413
+ str[0] = ''
414
+ end
415
+ str.nil? ? nil : str.chomp(",")
389
416
  end
390
- str.nil? ? nil : str.chomp(",")
391
- end
392
-
393
-
394
- def nsgflowlog(queue, json, name)
395
- count=0
396
- begin
397
- json["records"].each do |record|
398
- res = resource(record["resourceId"])
399
- resource = { :subscription => res[:subscription], :resourcegroup => res[:resourcegroup], :nsg => res[:nsg] }
400
- @logger.trace(resource.to_s)
401
- record["properties"]["flows"].each do |flows|
402
- rule = resource.merge ({ :rule => flows["rule"]})
403
- flows["flows"].each do |flowx|
404
- flowx["flowTuples"].each do |tup|
405
- tups = tup.split(',')
406
- ev = rule.merge({:unixtimestamp => tups[0], :src_ip => tups[1], :dst_ip => tups[2], :src_port => tups[3], :dst_port => tups[4], :protocol => tups[5], :direction => tups[6], :decision => tups[7]})
407
- if (record["properties"]["Version"]==2)
408
- tups[9] = 0 if tups[9].nil?
409
- tups[10] = 0 if tups[10].nil?
410
- tups[11] = 0 if tups[11].nil?
411
- tups[12] = 0 if tups[12].nil?
412
- ev.merge!( {:flowstate => tups[8], :src_pack => tups[9], :src_bytes => tups[10], :dst_pack => tups[11], :dst_bytes => tups[12]} )
413
- end
414
- @logger.trace(ev.to_s)
415
- if @addfilename
416
- ev.merge!( {:filename => name } )
417
- end
418
- event = LogStash::Event.new('message' => ev.to_json)
419
- decorate(event)
420
- queue << event
421
- count+=1
422
- end
423
- end
424
- end
417
+
418
+
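# Illustrative sketch (not part of the plugin): a made-up version 2 flowTuple and the fields the code below maps it to:
#   "1542110377,10.0.0.4,13.67.143.118,44931,443,T,O,A,B,1,66,1,66"
#   => unixtimestamp, src_ip, dst_ip, src_port, dst_port, protocol, direction, decision,
#      flowstate, src_pack, src_bytes, dst_pack, dst_bytes (the last five only for Version 2 records)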
419
+ def nsgflowlog(queue, json, name)
420
+ count=0
421
+ begin
422
+ json["records"].each do |record|
423
+ res = resource(record["resourceId"])
424
+ resource = { :subscription => res[:subscription], :resourcegroup => res[:resourcegroup], :nsg => res[:nsg] }
425
+ @logger.trace(resource.to_s)
426
+ record["properties"]["flows"].each do |flows|
427
+ rule = resource.merge ({ :rule => flows["rule"]})
428
+ flows["flows"].each do |flowx|
429
+ flowx["flowTuples"].each do |tup|
430
+ tups = tup.split(',')
431
+ ev = rule.merge({:unixtimestamp => tups[0], :src_ip => tups[1], :dst_ip => tups[2], :src_port => tups[3], :dst_port => tups[4], :protocol => tups[5], :direction => tups[6], :decision => tups[7]})
432
+ if (record["properties"]["Version"]==2)
433
+ tups[9] = 0 if tups[9].nil?
434
+ tups[10] = 0 if tups[10].nil?
435
+ tups[11] = 0 if tups[11].nil?
436
+ tups[12] = 0 if tups[12].nil?
437
+ ev.merge!( {:flowstate => tups[8], :src_pack => tups[9], :src_bytes => tups[10], :dst_pack => tups[11], :dst_bytes => tups[12]} )
438
+ end
439
+ @logger.trace(ev.to_s)
440
+ if @addfilename
441
+ ev.merge!( {:filename => name } )
442
+ end
443
+ event = LogStash::Event.new('message' => ev.to_json)
444
+ decorate(event)
445
+ queue << event
446
+ count+=1
447
+ end
448
+ end
449
+ end
450
+ end
451
+ rescue Exception => e
452
+ @logger.error("NSG Flowlog problem for #{name} and error message #{e.message}")
453
+ end
454
+ return count
425
455
  end
426
- rescue Exception => e
427
- @logger.error("NSG Flowlog problem for #{name} and error message #{e.message}")
456
+
457
+ def wadiislog(lines)
458
+ count=0
459
+ lines.each do |line|
460
+ unless line.start_with?('#')
461
+ queue << LogStash::Event.new('message' => ev.to_json)
462
+ count+=1
463
+ end
464
+ end
465
+ return count
466
+ # date {
467
+ # match => [ "log_timestamp", "YYYY-MM-dd HH:mm:ss" ]
468
+ # target => "@timestamp"
469
+ # remove_field => ["log_timestamp"]
470
+ # }
428
471
  end
429
- return count
430
- end
431
-
432
- def wadiislog(lines)
433
- count=0
434
- lines.each do |line|
435
- unless line.start_with?('#')
436
- queue << LogStash::Event.new('message' => ev.to_json)
437
- count+=1
438
- end
439
- end
440
- return count
441
- # date {
442
- # match => [ "log_timestamp", "YYYY-MM-dd HH:mm:ss" ]
443
- # target => "@timestamp"
444
- # remove_field => ["log_timestamp"]
445
- # }
446
- end
447
-
448
- # list all blobs in the blobstore, set the offsets from the registry and return the filelist
449
- # inspired by: https://github.com/Azure-Samples/storage-blobs-ruby-quickstart/blob/master/example.rb
450
- def list_blobs(fill)
451
- tries ||= 3
452
- begin
453
- return try_list_blobs(fill)
454
- rescue Exception => e
455
- @logger.error("caught: #{e.message} for list_blobs retries left #{tries}")
456
- if (tries -= 1) > 0
457
- retry
472
+
473
+ # list all blobs in the blobstore, set the offsets from the registry and return the filelist
474
+ # inspired by: https://github.com/Azure-Samples/storage-blobs-ruby-quickstart/blob/master/example.rb
475
+ def list_blobs(fill)
476
+ tries ||= 3
477
+ begin
478
+ return try_list_blobs(fill)
479
+ rescue Exception => e
480
+ @logger.error("caught: #{e.message} for list_blobs retries left #{tries}")
481
+ if (tries -= 1) > 0
482
+ retry
483
+ end
458
484
  end
459
485
  end
460
- end
461
-
462
- def try_list_blobs(fill)
463
- # inspired by: http://blog.mirthlab.com/2012/05/25/cleanly-retrying-blocks-of-code-after-an-exception-in-ruby/
464
- chrono = Time.now.to_i
465
- files = Hash.new
466
- nextMarker = nil
467
- counter = 1
468
- loop do
469
- blobs = @blob_client.list_blobs(container, { marker: nextMarker, prefix: @prefix})
470
- blobs.each do |blob|
471
- # FNM_PATHNAME is required so that "**/test" can match "test" at the root folder
472
- # FNM_EXTGLOB allows you to use "test{a,b,c}" to match either "testa", "testb" or "testc" (closer to shell behavior)
473
- unless blob.name == registry_path
474
- if @path_filters.any? {|path| File.fnmatch?(path, blob.name, File::FNM_PATHNAME | File::FNM_EXTGLOB)}
475
- length = blob.properties[:content_length].to_i
476
- offset = 0
477
- if fill
478
- offset = length
479
- end
480
- files.store(blob.name, { :offset => offset, :length => length })
481
- if (@debug_until > @processed) then @logger.info("1: list_blobs #{blob.name} #{offset} #{length}") end
482
- end
483
- end
484
- end
485
- nextMarker = blobs.continuation_token
486
- break unless nextMarker && !nextMarker.empty?
487
- if (counter % 10 == 0) then @logger.info(" listing #{counter * 50000} files") end
488
- counter+=1
486
+
487
+ def try_list_blobs(fill)
488
+ # inspired by: http://blog.mirthlab.com/2012/05/25/cleanly-retrying-blocks-of-code-after-an-exception-in-ruby/
489
+ chrono = Time.now.to_i
490
+ files = Hash.new
491
+ nextMarker = nil
492
+ counter = 1
493
+ loop do
494
+ blobs = @blob_client.list_blobs(container, { marker: nextMarker, prefix: @prefix})
495
+ blobs.each do |blob|
496
+ # FNM_PATHNAME is required so that "**/test" can match "test" at the root folder
497
+ # FNM_EXTGLOB allows you to use "test{a,b,c}" to match either "testa", "testb" or "testc" (closer to shell behavior)
498
+ unless blob.name == registry_path
499
+ if @path_filters.any? {|path| File.fnmatch?(path, blob.name, File::FNM_PATHNAME | File::FNM_EXTGLOB)}
500
+ length = blob.properties[:content_length].to_i
501
+ offset = 0
502
+ if fill
503
+ offset = length
504
+ end
505
+ files.store(blob.name, { :offset => offset, :length => length })
506
+ if (@debug_until > @processed) then @logger.info("1: list_blobs #{blob.name} #{offset} #{length}") end
507
+ end
508
+ end
509
+ end
510
+ nextMarker = blobs.continuation_token
511
+ break unless nextMarker && !nextMarker.empty?
512
+ if (counter % 10 == 0) then @logger.info(" listing #{counter * 50000} files") end
513
+ counter+=1
489
514
  end
490
515
  if @debug_timer
491
516
  @logger.info("list_blobs took #{Time.now.to_i - chrono} sec")
492
517
  end
493
- return files
494
- end
495
-
496
- # When events were processed after the last registry save, start a thread to update the registry file.
497
- def save_registry()
498
- unless @processed == @regsaved
499
- unless (@busy_writing_registry.locked?)
500
- # deep_copy hash, to save the registry independant from the variable for thread safety
501
- # if deep_clone uses Marshall to do a copy,
502
- regdump = Marshal.dump(@registry)
503
- regsize = @registry.size
504
- Thread.new {
505
- begin
506
- @busy_writing_registry.lock
507
- unless (@registry_local_path)
508
- @blob_client.create_block_blob(container, registry_path, regdump)
509
- @logger.info("processed #{@processed} events, saving #{regsize} blobs and offsets to remote registry #{registry_path}")
510
- else
511
- File.open(@registry_local_path+"/"+@pipe_id, 'w') { |file| file.write(regdump) }
512
- @logger.info("processed #{@processed} events, saving #{regsize} blobs and offsets to local registry #{registry_local_path+"/"+@pipe_id}")
513
- end
514
- @last = Time.now.to_i
515
- @regsaved = @processed
516
- rescue Exception => e
517
- @logger.error("Oh my, registry write failed")
518
- @logger.error("#{e.message}")
519
- ensure
520
- @busy_writing_registry.unlock
521
- end
522
- }
523
- else
524
- @logger.info("Skipped writing the registry because previous write still in progress, it just takes long or may be hanging!")
525
- end
518
+ return files
526
519
  end
527
- end
528
-
529
-
530
- def learn_encapsulation
531
- @logger.info("learn_encapsulation, this can be skipped by setting skip_learning => true. Or set both head_file and tail_file")
532
- # From one file, read first block and last block to learn head and tail
533
- begin
534
- blobs = @blob_client.list_blobs(container, { max_results: 3, prefix: @prefix})
535
- blobs.each do |blob|
536
- unless blob.name == registry_path
537
- begin
538
- blocks = @blob_client.list_blob_blocks(container, blob.name)[:committed]
539
- if blocks.first.name.start_with?('A00')
540
- @logger.debug("using #{blob.name}/#{blocks.first.name} to learn the json header")
541
- @head = @blob_client.get_blob(container, blob.name, start_range: 0, end_range: blocks.first.size-1)[1]
542
- end
543
- if blocks.last.name.start_with?('Z00')
544
- @logger.debug("using #{blob.name}/#{blocks.last.name} to learn the json footer")
545
- length = blob.properties[:content_length].to_i
546
- offset = length - blocks.last.size
547
- @tail = @blob_client.get_blob(container, blob.name, start_range: offset, end_range: length-1)[1]
548
- @logger.debug("learned tail: #{@tail}")
520
+
521
+ # When events were processed after the last registry save, start a thread to update the registry file.
522
+ def save_registry()
523
+ unless @processed == @regsaved
524
+ unless (@busy_writing_registry.locked?)
525
+ # deep copy the hash, to save the registry independent of the variable for thread safety
526
+ # if deep_clone uses Marshal to do a copy,
527
+ regdump = Marshal.dump(@registry)
528
+ regsize = @registry.size
529
+ Thread.new {
530
+ begin
531
+ @busy_writing_registry.lock
532
+ unless (@registry_local_path)
533
+ @blob_client.create_block_blob(container, registry_path, regdump)
534
+ @logger.info("processed #{@processed} events, saving #{regsize} blobs and offsets to remote registry #{registry_path}")
535
+ else
536
+ File.open(@registry_local_path+"/"+@pipe_id, 'w') { |file| file.write(regdump) }
537
+ @logger.info("processed #{@processed} events, saving #{regsize} blobs and offsets to local registry #{registry_local_path+"/"+@pipe_id}")
538
+ end
539
+ @last = Time.now.to_i
540
+ @regsaved = @processed
541
+ rescue Exception => e
542
+ @logger.error("Oh my, registry write failed")
543
+ @logger.error("#{e.message}")
544
+ ensure
545
+ @busy_writing_registry.unlock
546
+ end
547
+ }
548
+ else
549
+ @logger.info("Skipped writing the registry because previous write still in progress, it just takes long or may be hanging!")
550
+ end
551
+ end
552
+ end
553
+
554
+
555
+ def learn_encapsulation
556
+ @logger.info("learn_encapsulation, this can be skipped by setting skip_learning => true. Or set both head_file and tail_file")
557
+ # From one file, read first block and last block to learn head and tail
558
+ begin
559
+ blobs = @blob_client.list_blobs(container, { max_results: 3, prefix: @prefix})
560
+ blobs.each do |blob|
561
+ unless blob.name == registry_path
562
+ begin
563
+ blocks = @blob_client.list_blob_blocks(container, blob.name)[:committed]
564
+ if blocks.first.name.start_with?('A00')
565
+ @logger.debug("using #{blob.name}/#{blocks.first.name} to learn the json header")
566
+ @head = @blob_client.get_blob(container, blob.name, start_range: 0, end_range: blocks.first.size-1)[1]
567
+ end
568
+ if blocks.last.name.start_with?('Z00')
569
+ @logger.debug("using #{blob.name}/#{blocks.last.name} to learn the json footer")
570
+ length = blob.properties[:content_length].to_i
571
+ offset = length - blocks.last.size
572
+ @tail = @blob_client.get_blob(container, blob.name, start_range: offset, end_range: length-1)[1]
573
+ @logger.debug("learned tail: #{@tail}")
574
+ end
575
+ rescue Exception => e
576
+ @logger.info("learn json one of the attempts failed #{e.message}")
577
+ end
549
578
  end
550
- rescue Exception => e
551
- @logger.info("learn json one of the attempts failed #{e.message}")
552
- end
553
579
  end
580
+ rescue Exception => e
581
+ @logger.info("learn json header and footer failed because #{e.message}")
554
582
  end
555
- rescue Exception => e
556
- @logger.info("learn json header and footer failed because #{e.message}")
557
583
  end
558
- end
559
-
560
- def resource(str)
561
- temp = str.split('/')
562
- date = '---'
563
- unless temp[9].nil?
564
- date = val(temp[9])+'/'+val(temp[10])+'/'+val(temp[11])+'-'+val(temp[12])+':00'
565
- end
566
- return {:subscription=> temp[2], :resourcegroup=>temp[4], :nsg=>temp[8], :date=>date}
567
- end
568
-
569
- def val(str)
570
- return str.split('=')[1]
571
- end
584
+
585
+ def resource(str)
586
+ temp = str.split('/')
587
+ date = '---'
588
+ unless temp[9].nil?
589
+ date = val(temp[9])+'/'+val(temp[10])+'/'+val(temp[11])+'-'+val(temp[12])+':00'
590
+ end
591
+ return {:subscription=> temp[2], :resourcegroup=>temp[4], :nsg=>temp[8], :date=>date}
592
+ end
593
+
594
+ def val(str)
595
+ return str.split('=')[1]
596
+ end
572
597
 
573
598
  end # class LogStash::Inputs::AzureBlobStorage