logstash-input-azure_blob_storage 0.12.6 → 0.12.8
- checksums.yaml +4 -4
- data/CHANGELOG.md +14 -4
- data/README.md +50 -7
- data/lib/logstash/inputs/azure_blob_storage.rb +302 -136
- data/logstash-input-azure_blob_storage.gemspec +2 -2
- metadata +4 -4
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 6226b48f09b69ea1fe5d5e65197cf87daed475a2dff3aecc1ff30b1c921d4e7e
+  data.tar.gz: 9ac324158bddc908f107663925a27ff289eb7b264293da88218a825d66c74d74
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 6cdd2d17fd57adc43b0c8e7354cbf396243b4bf691e8ef12d757c2c9dc515f9711ecbe9c64495b0d6f50040a28af98af2b641224c03dc83c3c4db9919ef1fb77
+  data.tar.gz: e1a71cfbe35af0d878374dcce499096331c82de867d86fb6ea3f4c876e1cc24f8b0fb59087b112012989c358ec9f238159564d17d9433ca1899a776a1c311683
data/CHANGELOG.md
CHANGED
@@ -1,7 +1,17 @@
-## 
-
-
-
+## 0.12.8
+ - support append blob (use codec json_lines and logtype raw)
+ - change the default head and tail to an empty string, unless the logtype is nsgflowlog
+ - cleanjson configuration parameter to clean the json stream of faulty characters to prevent parse errors
+ - catch ContainerNotFound, print an error message in the log and sleep for the interval time
+
+## 0.12.7
+ - rewrote partial_read, the occasional json parse errors should now be fixed by reading only committed blocks
+   (this may also have been related to reading a second partial_read, where the offset wasn't updated correctly?)
+ - used the new header and tail block names, should now learn the header and footer automatically again
+ - added addall to the configuration to add system, mac, category, time and operation to the output
+ - added an optional environment configuration option
+ - removed the date field, which was always set to ---
+ - made a start on event rewriting to make it ECS compatible
 
 ## 0.12.6
 - Fixed the 0.12.5 exception handling, it actually caused a warning to become a fatal pipeline crashing error
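As an aside, the append-blob support listed under 0.12.8 above can be exercised with a pipeline along these lines. This is only a minimal sketch, not a file from the gem: the connection string and container name are placeholders, and the options (codec json_lines, logtype raw, append, cleanjson) are the ones this release introduces and documents in its README.

```
input {
  azure_blob_storage {
    # placeholder connection details, replace with your own storage account and container
    connection_string => "DefaultEndpointsProtocol=https;AccountName=examplestorage;AccountKey=Ba5e64c0d3==;EndpointSuffix=core.windows.net"
    container => "example-append-blob-container"
    # read append blobs line by line, as suggested by the 0.12.8 changelog entry
    codec => json_lines { delimiter => "\n" charset => "UTF-8" }
    logtype => "raw"
    append => true
    cleanjson => true
  }
}
output {
  stdout { codec => rubydebug }
}
```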
data/README.md
CHANGED
@@ -8,6 +8,14 @@ For problems or feature requests with this specific plugin, raise a github issue
 This plugin can read from Azure Storage Blobs, for instance JSON diagnostics logs for NSG flow logs or LINE based accesslogs from App Services.
 [Azure Blob Storage](https://azure.microsoft.com/en-us/services/storage/blobs/)
 
+## Alternatives
+This plugin was inspired by the Azure diagnostics tools, but should work better for bigger amounts of files. The configurations are not compatible: the configuration name azureblob refers to the diagnostics tools plugin, while this plugin uses azure_blob_storage.
+https://github.com/Azure/azure-diagnostics-tools/tree/master/Logstash/logstash-input-azureblob
+
+There is a Filebeat plugin that may work in the future
+https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-azure-blob-storage.html
+
+## Innerworking
 The plugin depends on the [Ruby library azure-storage-blob](https://rubygems.org/gems/azure-storage-blob/versions/1.1.0) from Microsoft, that depends on Faraday for the HTTPS connection to Azure.
 
 The plugin executes the following steps
@@ -42,9 +50,11 @@ input {
 ## Additional Configuration
 The registry keeps track of files in the storage account, their size and how many bytes have been processed. Files can grow and the added part will be processed as a partial file. The registry is saved todisk every interval.
 
+The interval also defines when a new round of listing files and processing data should happen. The NSGFLOWLOGs are written every minute into a new block of the hourly blob. This data can be partially read, because the plugin knows the JSON head and tail and removes the leading comma and fixes the JSON before parsing new events.
+
 The registry_create_policy determines at the start of the pipeline if processing should resume from the last known unprocessed file, or to start_fresh ignoring old files and start only processing new events that came after the start of the pipeline. Or start_over to process all the files ignoring the registry.
 
-interval defines the minimum time the registry should be saved to the registry file
+interval defines the minimum time the registry should be saved to the registry file. By default this is 'data/registry.dat' in the storageaccount, but it can also be kept on the server running logstash by setting registry_local_path. The registry is also kept in memory; the registry file is only needed in case the pipeline dies unexpectedly. During a normal shutdown the registry is also saved.
 
 When registry_local_path is set to a directory, the registry is saved on the logstash server in that directory. The filename is the pipe.id
 
@@ -66,13 +76,15 @@ The pipeline can be started in several ways.
 ```
 - As managed pipeline from Kibana
 
-Logstash itself (so not specific to this plugin) has a feature where multiple instances can run on the same system. The default TCP port is 9600, but if it's already in use it will use 9601 (and up). To update a config file on a running instance on the commandline you can add the argument --config.reload.automatic and if you modify the files that are in the pipeline.yml you can send a SIGHUP channel to reload the pipelines where the config was changed.
+Logstash itself (so not specific to this plugin) has a feature where multiple instances can run on the same system. The default TCP port is 9600, but if it's already in use it will use 9601 (and up), this is probably not true anymore from v8. To update a config file on a running instance on the commandline you can add the argument --config.reload.automatic and if you modify the files that are in the pipeline.yml you can send a SIGHUP channel to reload the pipelines where the config was changed.
 [https://www.elastic.co/guide/en/logstash/current/reloading-config.html](https://www.elastic.co/guide/en/logstash/current/reloading-config.html)
 
 ## Internal Working
 When the plugin is started, it will read all the filenames and sizes in the blob store excluding the directies of files that are excluded by the "path_filters". After every interval it will write a registry to the storageaccount to save the information of how many bytes per blob (file) are read and processed. After all files are processed and at least one interval has passed a new file list is generated and a worklist is constructed that will be processed. When a file has already been processed before, partial files are read from the offset to the filesize at the time of the file listing. If the codec is JSON partial files will be have the header and tail will be added. They can be configured. If logtype is nsgflowlog, the plugin will process the splitting into individual tuple events. The logtype wadiis may in the future be used to process the grok formats to split into log lines. Any other format is fed into the queue as one event per file or partial file. It's then up to the filter to split and mutate the file format.
 
-By default the root of the json message is named "message"
+By default the root of the json message is named "message", you can modify the content in the filter block
+
+Additional fields can be enabled with addfilename and addall, ecs_compatibility is not yet supported.
 
 The configurations and the rest of the code are in [https://github.com/janmg/logstash-input-azure_blob_storage/tree/master/lib/logstash/inputs](lib/logstash/inputs) [https://github.com/janmg/logstash-input-azure_blob_storage/blob/master/lib/logstash/inputs/azure_blob_storage.rb#L10](azure_blob_storage.rb)
 
@@ -130,7 +142,7 @@ filter {
 }
 
 output {
-    stdout { }
+    stdout { codec => rubydebug }
 }
 
 output {
@@ -139,24 +151,37 @@
     index => "nsg-flow-logs-%{+xxxx.ww}"
   }
 }
+
+output {
+  file {
+    path => /tmp/abuse.txt
+    codec => line { format => "%{decision} %{flowstate} %{src_ip} ${dst_port}"}
+  }
+}
+
 ```
 A more elaborate input configuration example
 ```
 input {
   azure_blob_storage {
     codec => "json"
-    storageaccount => "yourstorageaccountname"
-    access_key => "Ba5e64c0d3=="
+    # storageaccount => "yourstorageaccountname"
+    # access_key => "Ba5e64c0d3=="
+    connection_string => "DefaultEndpointsProtocol=https;AccountName=yourstorageaccountname;AccountKey=Ba5e64c0d3==;EndpointSuffix=core.windows.net"
     container => "insights-logs-networksecuritygroupflowevent"
     logtype => "nsgflowlog"
    prefix => "resourceId=/"
    path_filters => ['**/*.json']
    addfilename => true
+    addall => true
+    environment => "dev-env"
    registry_create_policy => "resume"
    registry_local_path => "/usr/share/logstash/plugin"
    interval => 300
    debug_timer => true
-    debug_until => 
+    debug_until => 1000
+    addall => true
+    registry_create_policy => "start_over"
  }
 }
 
@@ -167,6 +192,20 @@ output {
   }
 }
 ```
+
+Another for json_lines on append_blobs
+```
+input {
+    azure_blob_storage {
+        codec => json_lines {
+            delimiter => "\n"
+            charset => "UTF-8"
+        }
+        # below options are optional
+        logtype => "raw"
+        append => true
+        cleanjson => true
+```
 The configuration documentation is in the first 100 lines of the code
 [GITHUB/janmg/logstash-input-azure_blob_storage/blob/master/lib/logstash/inputs/azure_blob_storage.rb](https://github.com/janmg/logstash-input-azure_blob_storage/blob/master/lib/logstash/inputs/azure_blob_storage.rb)
 
@@ -211,5 +250,9 @@ filter {
     remove_field => ["timestamp"]
   }
 }
+
+output {
+  stdout { codec => rubydebug }
+}
 ```
 
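The README changes above note that the plugin writes the raw json into a field named "message" and leaves reshaping to the filter block. A minimal filter sketch for that note, using the standard Logstash json and mutate filters, could look like the following; the "message" field name comes from the README note, everything else is illustrative and not part of the gem.

```
filter {
  # parse the "message" string written by the plugin into top-level fields
  json {
    source => "message"
  }
  # optionally drop the original raw string afterwards
  mutate {
    remove_field => ["message"]
  }
}
```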
data/lib/logstash/inputs/azure_blob_storage.rb
CHANGED
@@ -17,14 +17,16 @@ require 'json'
 # D672f4bbd95a04209b00dc05d899e3cce 2576 json objects for 1st minute
 # D7fe0d4f275a84c32982795b0e5c7d3a1 2312 json objects for 2nd minute
 # Z00000000000000000000000000000000 2 ]}
-
+#
+# The azure-storage-ruby connects to the storageaccount and the files are read through get_blob. For partial read the options with start and end ar used.
+# https://github.com/Azure/azure-storage-ruby/blob/master/blob/lib/azure/storage/blob/blob.rb#L89
+#
 # A storage account has by default a globally unique name, {storageaccount}.blob.core.windows.net which is a CNAME to Azures blob servers blob.*.store.core.windows.net. A storageaccount has an container and those have a directory and blobs (like files). Blobs have one or more blocks. After writing the blocks, they can be committed. Some Azure diagnostics can send events to an EventHub that can be parse through the plugin logstash-input-azure_event_hubs, but for the events that are only stored in an storage account, use this plugin. The original logstash-input-azureblob from azure-diagnostics-tools is great for low volumes, but it suffers from outdated client, slow reads, lease locking issues and json parse errors.
 
-
 class LogStash::Inputs::AzureBlobStorage < LogStash::Inputs::Base
   config_name "azure_blob_storage"
 
-  # If undefined, Logstash will complain, even if codec is unused. The codec for nsgflowlog is "json" and the for WADIIS and APPSERVICE is "line".
+  # If undefined, Logstash will complain, even if codec is unused. The codec for nsgflowlog is "json", "json_line" works and the for WADIIS and APPSERVICE is "line".
   default :codec, "json"
 
   # logtype can be nsgflowlog, wadiis, appservice or raw. The default is raw, where files are read and added as one event. If the file grows, the next interval the file is read from the offset, so that the delta is sent as another event. In raw mode, further processing has to be done in the filter block. If the logtype is specified, this plugin will split and mutate and add individual events to the queue.
@@ -66,7 +68,7 @@ class LogStash::Inputs::AzureBlobStorage < LogStash::Inputs::Base
   # when set to `start_fresh`, it will read log files that are created or appended since this start of the pipeline.
   config :registry_create_policy, :validate => ['resume','start_over','start_fresh'], :required => false, :default => 'resume'
 
-
+  # The interval is used to save the registry regularly, when new events have have been processed. It is also used to wait before listing the files again and substracting the registry of already processed files to determine the worklist.
   # waiting time in seconds until processing the next batch. NSGFLOWLOGS append a block per minute, so use multiples of 60 seconds, 300 for 5 minutes, 600 for 10 minutes. The registry is also saved after every interval.
   # Partial reading starts from the offset and reads until the end, so the starting tag is prepended
   config :interval, :validate => :number, :default => 60
@@ -74,6 +76,12 @@ class LogStash::Inputs::AzureBlobStorage < LogStash::Inputs::Base
   # add the filename as a field into the events
   config :addfilename, :validate => :boolean, :default => false, :required => false
 
+  # add environment
+  config :environment, :validate => :string, :required => false
+
+  # add all resource details
+  config :addall, :validate => :boolean, :default => false, :required => false
+
   # debug_until will at the creation of the pipeline for a maximum amount of processed messages shows 3 types of log printouts including processed filenames. After a number of events, the plugin will stop logging the events and continue silently. This is a lightweight alternative to switching the loglevel from info to debug or even trace to see what the plugin is doing and how fast at the start of the plugin. A good value would be approximately 3x the amount of events per file. For instance 6000 events.
   config :debug_until, :validate => :number, :default => 0, :required => false
 
@@ -87,10 +95,14 @@ class LogStash::Inputs::AzureBlobStorage < LogStash::Inputs::Base
   config :skip_learning, :validate => :boolean, :default => false, :required => false
 
   # The string that starts the JSON. Only needed when the codec is JSON. When partial file are read, the result will not be valid JSON unless the start and end are put back. the file_head and file_tail are learned at startup, by reading the first file in the blob_list and taking the first and last block, this would work for blobs that are appended like nsgflowlogs. The configuration can be set to override the learning. In case learning fails and the option is not set, the default is to use the 'records' as set by nsgflowlogs.
-  config :file_head, :validate => :string, :required => false, :default => '
+  config :file_head, :validate => :string, :required => false, :default => ''
   # The string that ends the JSON
-  config :file_tail, :validate => :string, :required => false, :default => '
+  config :file_tail, :validate => :string, :required => false, :default => ''
 
+  # inspect the bytes and remove faulty characters
+  config :cleanjson, :validate => :boolean, :default => false, :required => false
+
+  config :append, :validate => :boolean, :default => false, :required => false
   # By default it will watch every file in the storage container. The prefix option is a simple filter that only processes files with a path that starts with that value.
   # For NSGFLOWLOGS a path starts with "resourceId=/". This would only be needed to exclude other paths that may be written in the same container. The registry file will be excluded.
   # You may also configure multiple paths. See an example on the <<array,Logstash configuration page>>.
@@ -110,6 +122,7 @@ public
     @logger.info("If this plugin doesn't work, please raise an issue in https://github.com/janmg/logstash-input-azure_blob_storage")
     @busy_writing_registry = Mutex.new
     # TODO: consider multiple readers, so add pipeline @id or use logstash-to-logstash communication?
+    # For now it's difficult because the plugin would then have to synchronize the worklist
   end
 
 
@@ -120,41 +133,10 @@ public
     @regsaved = @processed
 
     connect
-
     @registry = Hash.new
-
-
-
-    if (!@registry_local_path.nil?)
-      unless File.file?(@registry_local_path+"/"+@pipe_id)
-        @registry = Marshal.load(@blob_client.get_blob(container, registry_path)[1])
-        #[0] headers [1] responsebody
-        @logger.info("migrating from remote registry #{registry_path}")
-      else
-        if !Dir.exist?(@registry_local_path)
-          FileUtils.mkdir_p(@registry_local_path)
-        end
-        @registry = Marshal.load(File.read(@registry_local_path+"/"+@pipe_id))
-        @logger.info("resuming from local registry #{registry_local_path+"/"+@pipe_id}")
-      end
-    else
-      @registry = Marshal.load(@blob_client.get_blob(container, registry_path)[1])
-      #[0] headers [1] responsebody
-      @logger.info("resuming from remote registry #{registry_path}")
-    end
-    break
-    rescue Exception => e
-      @logger.error("caught: #{e.message}")
-      @registry.clear
-      @logger.error("loading registry failed for attempt #{counter} of 3")
-    end
-    end
-    end
-    # read filelist and set offsets to file length to mark all the old files as done
-    if registry_create_policy == "start_fresh"
-      @registry = list_blobs(true)
-      save_registry()
-      @logger.info("starting fresh, writing a clean registry to contain #{@registry.size} blobs/files")
+    load_registry()
+    @registry.each do |name, file|
+      @logger.info("offset: #{file[:offset]} length: #{file[:length]}")
     end
 
     @is_json = false
@@ -166,22 +148,29 @@ public
       @is_json_line = true
     end
   end
+
+
   @head = ''
   @tail = ''
-  # if codec=json sniff one files blocks A and Z to learn file_head and file_tail
   if @is_json
+    # if codec=json sniff one files blocks A and Z to learn file_head and file_tail
+    if @logtype == 'nsgflowlog'
+      @head = '{"records":['
+      @tail = ']}'
+    end
     if file_head
      @head = file_head
    end
    if file_tail
      @tail = file_tail
    end
-    if
+    if !skip_learning
      learn_encapsulation
    end
-    @logger.info("head will be: #{@head} and tail is set to #{@tail}")
+    @logger.info("head will be: '#{@head}' and tail is set to: '#{@tail}'")
  end
 
+
  filelist = Hash.new
  worklist = Hash.new
  @last = start = Time.now.to_i
@@ -198,24 +187,27 @@ public
       # load the registry, compare it's offsets to file list, set offset to 0 for new files, process the whole list and if finished within the interval wait for next loop,
       # TODO: sort by timestamp ?
       #filelist.sort_by(|k,v|resource(k)[:date])
-      worklist.clear
       filelist.clear
 
       # Listing all the files
       filelist = list_blobs(false)
+      if (@debug_until > @processed) then
+        @registry.each do |name, file|
+          @logger.info("#{name} offset: #{file[:offset]} length: #{file[:length]}")
+        end
+      end
       filelist.each do |name, file|
         off = 0
         if @registry.key?(name) then
-
-
-
-
-
+          begin
+            off = @registry[name][:offset]
+          rescue Exception => e
+            @logger.error("caught: #{e.message} while reading #{name}")
+          end
         end
         @registry.store(name, { :offset => off, :length => file[:length] })
         if (@debug_until > @processed) then @logger.info("2: adding offsets: #{name} #{off} #{file[:length]}") end
       end
-      # size nilClass when the list doesn't grow?!
 
       # clean registry of files that are not in the filelist
       @registry.each do |name,file|
@@ -234,14 +226,16 @@ public
 
       # Start of processing
       # This would be ideal for threading since it's IO intensive, would be nice with a ruby native ThreadPool
+      # pool = Concurrent::FixedThreadPool.new(5) # 5 threads
+      #pool.post do
+      # some parallel work
+      #end
       if (worklist.size > 0) then
         worklist.each do |name, file|
           start = Time.now.to_i
           if (@debug_until > @processed) then @logger.info("3: processing #{name} from #{file[:offset]} to #{file[:length]}") end
           size = 0
           if file[:offset] == 0
-            # This is where Sera4000 issue starts
-            # For an append blob, reading full and crashing, retry, last_modified? ... lenght? ... committed? ...
             # length and skip reg value
             if (file[:length] > 0)
               begin
@@ -260,55 +254,72 @@ public
               delta_size = 0
             end
           else
-            chunk = 
-            delta_size = chunk.size
-            @logger.debug("partial file #{name} from #{file[:offset]} to #{file[:length]}")
+            chunk = partial_read(name, file[:offset])
+            delta_size = chunk.size - @head.length - 1
           end
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+            #
+            # TODO! ... split out the logtypes and use individual methods
+            # how does a byte array chuck from json_lines get translated to strings/json/events
+            # should the byte array be converted to a multiline and then split? drawback need to know characterset and linefeed characters
+            # how does the json_line decoder work on byte arrays?
+            #
+            # so many questions
+
+            unless chunk.nil?
+              counter = 0
+              if @is_json
+                if logtype == "nsgflowlog"
+                  res = resource(name)
+                  begin
+                    fingjson = JSON.parse(chunk)
+                    @processed += nsgflowlog(queue, fingjson, name)
+                    @logger.debug("Processed #{res[:nsg]} #{@processed} events")
+                  rescue JSON::ParserError => e
+                    @logger.error("parse error #{e.message} on #{res[:nsg]} offset: #{file[:offset]} length: #{file[:length]}")
+                    if (@debug_until > @processed) then @logger.info("#{chunk}") end
+                  end
+                else
+                  begin
+                    @codec.decode(chunk) do |event|
+                      counter += 1
+                      if @addfilename
+                        event.set('filename', name)
+                      end
+                      decorate(event)
+                      queue << event
+                    end
+                    @processed += counter
+                  rescue Exception => e
+                    @logger.error("codec exception: #{e.message} .. continue and pretend this never happened")
+                  end
+                end
+              end
+
+              if logtype == "wadiis" && !@is_json
+                # TODO: Convert this to line based grokking.
+                @processed += wadiislog(queue, name)
               end
 
-
-
-
-
-
-
+              if @is_json_line
+                # parse one line at a time and dump it in the chunk?
+                lines = chunk.to_s
+                if cleanjson
+                  @logger.info("cleaning in progress")
+                  lines.chars.select(&:valid_encoding?).join
+                  #lines.delete "\\"
+                  #lines.scrub{|bytes| '<'+bytes.unpack('H*')[0]+'>' }
+                end
+                begin
+                  @codec.decode(lines) do |event|
+                    counter += 1
+                    queue << event
                   end
-
-
+                  @processed += counter
+                rescue Exception => e
+                  # todo: fix codec_lines exception: no implicit conversion of Array into String
+                  @logger.error("json_lines codec exception: #{e.message} .. continue and pretend this never happened")
                 end
-              @processed += counter
-              rescue Exception => e
-                @logger.error("codec exception: #{e.message} .. will continue and pretend this never happened")
-                @logger.debug("#{chunk}")
               end
             end
 
@@ -348,6 +359,24 @@ public
 
 
 private
+def list_files
+    filelist = list_blobs(false)
+    filelist.each do |name, file|
+        off = 0
+        if @registry.key?(name) then
+            begin
+                off = @registry[name][:offset]
+            rescue Exception => e
+                @logger.error("caught: #{e.message} while reading #{name}")
+            end
+        end
+        @registry.store(name, { :offset => off, :length => file[:length] })
+        if (@debug_until > @processed) then @logger.info("2: adding offsets: #{name} #{off} #{file[:length]}") end
+    end
+    return filelist
+end
+# size nilClass when the list doesn't grow?!
+
 def connect
     # Try in this order to access the storageaccount
     # 1. storageaccount / sas_token
@@ -378,11 +407,48 @@ private
 #    end
     end
 end
+# @registry_create_policy,@registry_local_path,@container,@registry_path
+def load_registry()
+    if @registry_create_policy == "resume"
+        for counter in 1..3
+            begin
+                if (!@registry_local_path.nil?)
+                    unless File.file?(@registry_local_path+"/"+@pipe_id)
+                        @registry = Marshal.load(@blob_client.get_blob(@container, path)[1])
+                        #[0] headers [1] responsebody
+                        @logger.info("migrating from remote registry #{path}")
+                    else
+                        if !Dir.exist?(@registry_local_path)
+                            FileUtils.mkdir_p(@registry_local_path)
+                        end
+                        @registry = Marshal.load(File.read(@registry_local_path+"/"+@pipe_id))
+                        @logger.info("resuming from local registry #{@registry_local_path+"/"+@pipe_id}")
+                    end
+                else
+                    @registry = Marshal.load(@blob_client.get_blob(container, path)[1])
+                    #[0] headers [1] responsebody
+                    @logger.info("resuming from remote registry #{path}")
+                end
+                break
+            rescue Exception => e
+                @logger.error("caught: #{e.message}")
+                @registry.clear
+                @logger.error("loading registry failed for attempt #{counter} of 3")
+            end
+        end
+    end
+    # read filelist and set offsets to file length to mark all the old files as done
+    if @registry_create_policy == "start_fresh"
+        @registry = list_blobs(true)
+        #save_registry()
+        @logger.info("starting fresh, with a clean registry containing #{@registry.size} blobs/files")
+    end
+end
 
 def full_read(filename)
     tries ||= 2
     begin
-        return @blob_client.get_blob(container, filename)[1]
+        return @blob_client.get_blob(@container, filename)[1]
     rescue Exception => e
         @logger.error("caught: #{e.message} for full_read")
         if (tries -= 1) > 0
@@ -393,19 +459,56 @@ private
         end
     end
     begin
-        chuck = @blob_client.get_blob(container, filename)[1]
+        chuck = @blob_client.get_blob(@container, filename)[1]
     end
     return chuck
 end
 
-def 
-
-
-
-
-
-
-
+def partial_read(blobname, offset)
+    # 1. read committed blocks, calculate length
+    # 2. calculate the offset to read
+    # 3. strip comma
+    # if json strip comma and fix head and tail
+    size = 0
+
+    begin
+        if @append
+            return @blob_client.get_blob(@container, blobname, start_range: offset-1)[1]
+        end
+        blocks = @blob_client.list_blob_blocks(@container, blobname)
+        blocks[:committed].each do |block|
+            size += block.size
+        end
+        # read the new blob blocks from the offset to the last committed size.
+        # if it is json, fix the head and tail
+        # crap committed block at the end is the tail, so must be substracted from the read and then comma stripped and tail added.
+        # but why did I need a -1 for the length?? probably the offset starts at 0 and ends at size-1
+
+        # should first check commit, read and the check committed again? no, only read the commited size
+        # should read the full content and then substract json tail
+
+        unless @is_json
+            return @blob_client.get_blob(@container, blobname, start_range: offset, end_range: size-1)[1]
+        else
+            content = @blob_client.get_blob(@container, blobname, start_range: offset-1, end_range: size-1)[1]
+            if content.end_with?(@tail)
+                return @head + strip_comma(content)
+            else
+                @logger.info("Fixed a tail! probably new committed blocks started appearing!")
+                # substract the length of the tail and add the tail, because the file grew.size was calculated as the block boundary, so replacing the last bytes with the tail should fix the problem
+                return @head + strip_comma(content[0...-@tail.length]) + @tail
+            end
+        end
+    rescue InvalidBlobType => ibt
+        @logger.error("caught #{ibt.message}. Setting BlobType to append")
+        @append = true
+        retry
+    rescue NoMethodError => nme
+        @logger.error("caught #{nme.message}. Setting append to true")
+        @append = true
+        retry
+    rescue Exception => e
+        @logger.error("caught #{e.message}")
     end
 end
 
@@ -422,8 +525,9 @@ private
     count=0
     begin
         json["records"].each do |record|
-
-            resource = { :subscription => res[:subscription], :resourcegroup => res[:resourcegroup], :nsg => res[:nsg] }
+            resource = resource(record["resourceId"])
+            # resource = { :subscription => res[:subscription], :resourcegroup => res[:resourcegroup], :nsg => res[:nsg] }
+            extras = { :time => record["time"], :system => record["systemId"], :mac => record["macAddress"], :category => record["category"], :operation => record["operationName"] }
             @logger.trace(resource.to_s)
             record["properties"]["flows"].each do |flows|
                 rule = resource.merge ({ :rule => flows["rule"]})
@@ -442,7 +546,18 @@ private
                     if @addfilename
                         ev.merge!( {:filename => name } )
                     end
+                    unless @environment.nil?
+                        ev.merge!( {:environment => environment } )
+                    end
+                    if @addall
+                        ev.merge!( extras )
+                    end
+
+                    # Add event to logstash queue
                     event = LogStash::Event.new('message' => ev.to_json)
+                    #if @ecs_compatibility != "disabled"
+                    #    event = ecs(event)
+                    #end
                     decorate(event)
                     queue << event
                     count+=1
@@ -493,26 +608,31 @@ private
     nextMarker = nil
     counter = 1
     loop do
-
-
-
-
-
-
-
-
-
-
+        begin
+            blobs = @blob_client.list_blobs(@container, { marker: nextMarker, prefix: @prefix})
+            blobs.each do |blob|
+                # FNM_PATHNAME is required so that "**/test" can match "test" at the root folder
+                # FNM_EXTGLOB allows you to use "test{a,b,c}" to match either "testa", "testb" or "testc" (closer to shell behavior)
+                unless blob.name == registry_path
+                    if @path_filters.any? {|path| File.fnmatch?(path, blob.name, File::FNM_PATHNAME | File::FNM_EXTGLOB)}
+                        length = blob.properties[:content_length].to_i
+                        offset = 0
+                        if fill
+                            offset = length
+                        end
+                        files.store(blob.name, { :offset => offset, :length => length })
+                        if (@debug_until > @processed) then @logger.info("1: list_blobs #{blob.name} #{offset} #{length}") end
                     end
-                    files.store(blob.name, { :offset => offset, :length => length })
-                    if (@debug_until > @processed) then @logger.info("1: list_blobs #{blob.name} #{offset} #{length}") end
                 end
             end
+            nextMarker = blobs.continuation_token
+            break unless nextMarker && !nextMarker.empty?
+            if (counter % 10 == 0) then @logger.info(" listing #{counter * 50000} files") end
+            counter+=1
+        rescue Exception => e
+            @logger.error("caught: #{e.message} while trying to list blobs")
+            return files
         end
-        nextMarker = blobs.continuation_token
-        break unless nextMarker && !nextMarker.empty?
-        if (counter % 10 == 0) then @logger.info(" listing #{counter * 50000} files") end
-        counter+=1
     end
     if @debug_timer
         @logger.info("list_blobs took #{Time.now.to_i - chrono} sec")
@@ -532,7 +652,7 @@ private
     begin
         @busy_writing_registry.lock
         unless (@registry_local_path)
-            @blob_client.create_block_blob(container, registry_path, regdump)
+            @blob_client.create_block_blob(@container, registry_path, regdump)
             @logger.info("processed #{@processed} events, saving #{regsize} blobs and offsets to remote registry #{registry_path}")
         else
             File.open(@registry_local_path+"/"+@pipe_id, 'w') { |file| file.write(regdump) }
@@ -558,20 +678,20 @@ private
     @logger.info("learn_encapsulation, this can be skipped by setting skip_learning => true. Or set both head_file and tail_file")
     # From one file, read first block and last block to learn head and tail
     begin
-        blobs = @blob_client.list_blobs(container, { max_results: 3, prefix: @prefix})
+        blobs = @blob_client.list_blobs(@container, { max_results: 3, prefix: @prefix})
         blobs.each do |blob|
             unless blob.name == registry_path
                 begin
-                    blocks = @blob_client.list_blob_blocks(container, blob.name)[:committed]
-                    if blocks.first.name
+                    blocks = @blob_client.list_blob_blocks(@container, blob.name)[:committed]
+                    if ['A00000000000000000000000000000000','QTAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAw'].include?(blocks.first.name)
                         @logger.debug("using #{blob.name}/#{blocks.first.name} to learn the json header")
-                        @head = @blob_client.get_blob(container, blob.name, start_range: 0, end_range: blocks.first.size-1)[1]
+                        @head = @blob_client.get_blob(@container, blob.name, start_range: 0, end_range: blocks.first.size-1)[1]
                     end
-                    if blocks.last.name
+                    if ['Z00000000000000000000000000000000','WjAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAw'].include?(blocks.last.name)
                         @logger.debug("using #{blob.name}/#{blocks.last.name} to learn the json footer")
                         length = blob.properties[:content_length].to_i
                         offset = length - blocks.last.size
-                        @tail = @blob_client.get_blob(container, blob.name, start_range: offset, end_range: length-1)[1]
+                        @tail = @blob_client.get_blob(@container, blob.name, start_range: offset, end_range: length-1)[1]
                         @logger.debug("learned tail: #{@tail}")
                     end
                 rescue Exception => e
@@ -586,15 +706,61 @@ private
 
 def resource(str)
     temp = str.split('/')
-    date = '---'
-    unless temp[9].nil?
-
-    end
-    return {:subscription=> temp[2], :resourcegroup=>temp[4], :nsg=>temp[8]
+    #date = '---'
+    #unless temp[9].nil?
+    #    date = val(temp[9])+'/'+val(temp[10])+'/'+val(temp[11])+'-'+val(temp[12])+':00'
+    #end
+    return {:subscription=> temp[2], :resourcegroup=>temp[4], :nsg=>temp[8]}
 end
 
 def val(str)
     return str.split('=')[1]
 end
-
 end # class LogStash::Inputs::AzureBlobStorage
+
+# This is a start towards mapping NSG events to ECS fields ... it's complicated
+=begin
+def ecs(old)
+    # https://www.elastic.co/guide/en/ecs/current/ecs-field-reference.html
+    ecs = LogStash::Event.new()
+    ecs.set("ecs.version", "1.0.0")
+    ecs.set("@timestamp", old.timestamp)
+    ecs.set("cloud.provider", "azure")
+    ecs.set("cloud.account.id", old.get("[subscription]")
+    ecs.set("cloud.project.id", old.get("[environment]")
+    ecs.set("file.name", old.get("[filename]")
+    ecs.set("event.category", "network")
+    if old.get("[decision]") == "D"
+        ecs.set("event.type", "denied")
+    else
+        ecs.set("event.type", "allowed")
+    end
+    ecs.set("event.action", "")
+    ecs.set("rule.ruleset", old.get("[nsg]")
+    ecs.set("rule.name", old.get("[rule]")
+    ecs.set("trace.id", old.get("[protocol]")+"/"+old.get("[src_ip]")+":"+old.get("[src_port]")+"-"+old.get("[dst_ip]")+":"+old.get("[dst_port]")
+    # requires logic to match sockets and flip src/dst for outgoing.
+    ecs.set("host.mac", old.get("[mac]")
+    ecs.set("source.ip", old.get("[src_ip]")
+    ecs.set("source.port", old.get("[src_port]")
+    ecs.set("source.bytes", old.get("[srcbytes]")
+    ecs.set("source.packets", old.get("[src_pack]")
+    ecs.set("destination.ip", old.get("[dst_ip]")
+    ecs.set("destination.port", old.get("[dst_port]")
+    ecs.set("destination.bytes", old.get("[dst_bytes]")
+    ecs.set("destination.packets", old.get("[dst_packets]")
+    if old.get("[protocol]") = "U"
+        ecs.set("network.transport", "udp")
+    else
+        ecs.set("network.transport", "tcp")
+    end
+    if old.get("[decision]") == "I"
+        ecs.set("network.direction", "incoming")
+    else
+        ecs.set("network.direction", "outgoing")
+    end
+    ecs.set("network.bytes", old.get("[src_bytes]")+old.get("[dst_bytes]")
+    ecs.set("network.packets", old.get("[src_packets]")+old.get("[dst_packets]")
+    return ecs
+end
+=end
data/logstash-input-azure_blob_storage.gemspec
CHANGED
@@ -1,6 +1,6 @@
 Gem::Specification.new do |s|
   s.name = 'logstash-input-azure_blob_storage'
-  s.version = '0.12.6'
+  s.version = '0.12.8'
   s.licenses = ['Apache-2.0']
   s.summary = 'This logstash plugin reads and parses data from Azure Storage Blobs.'
   s.description = <<-EOF
@@ -24,5 +24,5 @@ EOF
   s.add_runtime_dependency 'stud', '~> 0.0.23'
   s.add_runtime_dependency 'azure-storage-blob', '~> 2', '>= 2.0.3'
   s.add_development_dependency 'logstash-devutils', '~> 2.4'
-  s.add_development_dependency 'rubocop', '~> 1.
+  s.add_development_dependency 'rubocop', '~> 1.50'
 end
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: logstash-input-azure_blob_storage
 version: !ruby/object:Gem::Version
-  version: 0.12.6
+  version: 0.12.8
 platform: ruby
 authors:
 - Jan Geertsma
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2023-
+date: 2023-07-15 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   requirement: !ruby/object:Gem::Requirement
@@ -77,7 +77,7 @@ dependencies:
   requirements:
   - - "~>"
     - !ruby/object:Gem::Version
-      version: '1.
+      version: '1.50'
   name: rubocop
   prerelease: false
   type: :development
@@ -85,7 +85,7 @@ dependencies:
   requirements:
   - - "~>"
     - !ruby/object:Gem::Version
-      version: '1.
+      version: '1.50'
 description: " This gem is a Logstash plugin. It reads and parses data from Azure\
   \ Storage Blobs. The azure_blob_storage is a reimplementation to replace azureblob\
   \ from azure-diagnostics-tools/Logstash. It can deal with larger volumes and partial\
|