logstash-input-azure_blob_storage 0.11.2 → 0.11.7

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: b721a6aa74f4e9df285f62f47efa42112e540d9836391b31e74daf6544e1087d
4
- data.tar.gz: 5d22a077d53698807a51dde75ac6c7deb273f0fe68d7ea05a46651b5e0c9e577
3
+ metadata.gz: 0dd48413c8fc381dc144c1a4a58c82906533e4f94b36d631c597ecc766aa8edf
4
+ data.tar.gz: 3fb23ac270d539092ca52d73418c710e6d12816635b066da004728bdaef3cc9b
5
5
  SHA512:
6
- metadata.gz: fb35924d7f18579977fa8257a722aa136ca3d9d6a48cb1aecc3aa9f768a4d4b682d5a86c455b634b19d40c6dad9359a54d5b4906ef6952fff8ebc7166c90a808
7
- data.tar.gz: abddf838e31d981dc2da2b84bf825cb7981c610da1331a7aba1cf13f2de6e9ce7c644649f586ad6ef9c2630888900f3f8620fec7b47cddbc6b91c927e44c9b72
6
+ metadata.gz: 84c46edd2afbfe316c2fd5a3b8601f8841308270c3050425a62def1305065aae7868f8be14a98fcc1a7b98f3ccfeee3c9f8a9f9652bf7e799f8d8c5ba1016334
7
+ data.tar.gz: f2a2a27e068f8d384be829abde10935d0cc323da3eb943264ce05e11d9ee1f383fcfac47cbad51a909623638477b0bef5323d2a5b2def1d8666444d08511aa24
data/CHANGELOG.md CHANGED
@@ -1,34 +1,63 @@
+ ## 0.11.6
+ - fix max_results in json head and tail learning
+ - broke out connection setup in order to call it again when connection exceptions occur
+ - deal better with skipping of empty files.
+
+ ## 0.11.5
+ - added optional addfilename to add the filename in the message
+ - NSGFLOWLOG version 2 uses 0 as value instead of NULL in src and dst values
+ - added connection exception handling for full_read of files
+ - rewritten json header footer learning to ignore learning from registry
+ - plumbing for emulator
+
+ ## 0.11.4
+ - fixed listing 3 times, rather than retrying to list max 3 times
+ - added option to migrate/save to using a local registry
+ - rewrote interval timing
+ - reduced saving of registry to maximum once per interval, protect against duplicate simultaneous writes
+ - added debug_timer for better tracing of how long operations take
+ - removing pipeline name from logfiles, logstash 7.6 and up have this in the log4j2 by default now
+ - moved initialization from register to run, should make logs more readable
+
+ ## 0.11.3
+ - don't crash on failed codec, e.g. gzip_lines could sometimes have a corrupted file?
+ - fix nextmarker loop so that more than 5000 files (or 15000 if faraday doesn't crash) can be listed
+
+ ## 0.11.2
+ - implemented path_filters to use path filtering like this **/*.log
+ - implemented debug_until to debug only at the start of a pipeline until it has processed enough messages
+
  ## 0.11.1
  - copied changes from irnc fork (danke!)
- - Fixed trying to load the registry, three time is the charm
+ - fixed trying to load the registry, three times is the charm
  - logs are less chatty, changed info to debug
 
  ## 0.11.0
- - Implemented start_fresh to skip all previous logs and start monitoring new entries
- - Fixed the timer, now properly sleep the interval and check again
- - Work around for a Faraday Middleware v.s. Azure Storage Account bug in follow_redirect
+ - implemented start_fresh to skip all previous logs and start monitoring new entries
+ - fixed the timer, now properly sleep the interval and check again
+ - work around for a Faraday Middleware v.s. Azure Storage Account bug in follow_redirect
 
  ## 0.10.6
- - Fixed the rootcause of the checking the codec. Now compare the classname.
+ - fixed the root cause of the codec check, now compare the classname
 
  ## 0.10.5
- - Previous fix broke codec = "line"
+ - previous fix broke codec = "line"
 
  ## 0.10.4
- - Fixed JSON parsing error for partial files because somehow (logstash 7?) @codec.is_a? doesn't work anymore
+ - fixed JSON parsing error for partial files because somehow (logstash 7?) @codec.is_a? doesn't work anymore
 
  ## 0.10.3
- - Fixed issue-1 where iplookup confguration was removed, but still used
+ - fixed issue-1 where iplookup configuration was removed, but still used
  - iplookup is now done by a separate plugin named logstash-filter-weblookup
 
  ## 0.10.2
  - moved iplookup to own plugin logstash-filter-lookup
 
  ## 0.10.1
- - Implemented iplookup
- - Fixed sas tokens (maybe)
- - Introduced dns_suffix
+ - implemented iplookup
+ - fixed sas tokens (maybe)
+ - introduced dns_suffix
 
  ## 0.10.0
- - Plugin created with the logstash plugin generator
- - Reimplemented logstash-input-azureblob with incompatible config and data/registry
+ - plugin created with the logstash plugin generator
+ - reimplemented logstash-input-azureblob with incompatible config and data/registry
data/README.md CHANGED
@@ -1,29 +1,81 @@
- # Logstash Plugin
+ # Logstash
 
- This is a plugin for [Logstash](https://github.com/elastic/logstash).
+ This is a plugin for [Logstash](https://github.com/elastic/logstash). It is fully free and fully open source. The license is Apache 2.0, meaning you are pretty much free to use it however you want in whatever way. All logstash plugin documentation is placed under one [central location](http://www.elastic.co/guide/en/logstash/current/). Need generic logstash help? Try #logstash on freenode IRC or the https://discuss.elastic.co/c/logstash discussion forum.
 
- It is fully free and fully open source. The license is Apache 2.0, meaning you are pretty much free to use it however you want in whatever way.
-
- ## Documentation
-
- All plugin documentation are placed under one [central location](http://www.elastic.co/guide/en/logstash/current/).
-
- ## Need Help?
-
- Need help? Try #logstash on freenode IRC or the https://discuss.elastic.co/c/logstash discussion forum. For real problems or feature requests, raise a github issue [GITHUB/janmg/logstash-input-azure_blob_storage/](https://github.com/janmg/logstash-input-azure_blob_storage). Pull requests will ionly be merged after discussion through an issue.
+ For problems or feature requests with this specific plugin, raise a github issue at [GITHUB/janmg/logstash-input-azure_blob_storage/](https://github.com/janmg/logstash-input-azure_blob_storage). Pull requests are also welcome after discussion through an issue.
 
  ## Purpose
- This plugin can read from Azure Storage Blobs, for instance diagnostics logs for NSG flow logs or accesslogs from App Services.
+ This plugin can read from Azure Storage Blobs, for instance JSON diagnostics logs for NSG flow logs or LINE based access logs from App Services.
  [Azure Blob Storage](https://azure.microsoft.com/en-us/services/storage/blobs/)
 
- After every interval it will write a registry to the storageaccount to save the information of how many bytes per blob (file) are read and processed. After all files are processed and at least one interval has passed a new file list is generated and a worklist is constructed that will be processed. When a file has already been processed before, partial files are read from the offset to the filesize at the time of the file listing. If the codec is JSON partial files will be have the header and tail will be added. They can be configured. If logtype is nsgflowlog, the plugin will process the splitting into individual tuple events. The logtype wadiis may in the future be used to process the grok formats to split into log lines. Any other format is fed into the queue as one event per file or partial file. It's then up to the filter to split and mutate the file format. use source => message in the filter {} block.
+ The plugin depends on the [Ruby library azure-storage-blob](https://rubygems.org/gems/azure-storage-blob/versions/1.1.0) from Microsoft, which depends on Faraday for the HTTPS connection to Azure.
+
+ The plugin executes the following steps:
+ 1. Lists all the files in the azure storage account where the path of the files matches the prefix
+ 2. Filters on path_filters to only include files that match the directory and file glob (e.g. **/*.json); a matching sketch is shown after this list
+ 3. Saves the listed files in a registry of known files and file sizes (data/registry.dat on azure, or in a file on the logstash instance)
+ 4. Lists all the files again, compares the registry with the new filelist and puts the delta in a worklist
+ 5. Processes the worklist and puts all events in the logstash queue.
+ 6. If there is time left, sleeps to complete the interval. If processing takes more than an interval, saves the registry and continues processing.
+ 7. If logstash is stopped, a stop signal will try to finish the current file, save the registry and then quit
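To make step 2 concrete, here is a small standalone Ruby sketch of the same File.fnmatch call the plugin uses for path_filters; the blob names below are made up purely for illustration.
```
# Sketch of the path_filters matching used in step 2, runnable outside logstash.
# FNM_PATHNAME is required so that "**/test" can also match "test" in the root folder,
# FNM_EXTGLOB allows "test{a,b,c}" to match "testa", "testb" or "testc".
path_filters = ['**/*.json']

# Hypothetical blob names, purely for illustration.
blob_names = [
  'resourceId=/SUBSCRIPTIONS/1234/NSG/y=2021/m=05/d=17/h=12/m=00/macAddress=00AA/PT1H.json',
  'data/registry.dat',
  'wad-iis-logfiles/2021/05/17/accesslog.log'
]

blob_names.each do |name|
  keep = path_filters.any? do |pattern|
    File.fnmatch?(pattern, name, File::FNM_PATHNAME | File::FNM_EXTGLOB)
  end
  puts "#{keep ? 'process' : 'skip'}: #{name}"
end
```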
 
  ## Installation
  This plugin can be installed through logstash-plugin
  ```
- logstash-plugin install logstash-input-azure_blob_storage
+ /usr/share/logstash/bin/logstash-plugin install logstash-input-azure_blob_storage
+ ```
+
+ ## Minimal Configuration
+ The minimum configuration required as input is storageaccount, access_key and container.
+
+ /etc/logstash/conf.d/test.conf
+ ```
+ input {
+     azure_blob_storage {
+         storageaccount => "yourstorageaccountname"
+         access_key => "Ba5e64c0d3=="
+         container => "insights-logs-networksecuritygroupflowevent"
+     }
+ }
  ```
 
+ ## Additional Configuration
+ The registry keeps track of the files in the storage account, their size and how many bytes have been processed. Files can grow and the added part will be processed as a partial file. The registry is saved to disk every interval.
+
+ The registry_create_policy determines at the start of the pipeline whether processing should resume from the last known unprocessed file, start_fresh to ignore old files and only process new events that arrive after the start of the pipeline, or start_over to process all the files while ignoring the registry.
+
+ interval defines the minimum time between saves of the registry to the registry file (by default 'data/registry.dat'); this is only needed in case the pipeline dies unexpectedly. During a normal shutdown the registry is also saved.
+
+ When registry_local_path is set to a directory, the registry is saved on the logstash server in that directory. The filename is the pipe.id.
+
+ With registry_create_policy set to resume and registry_local_path set to a directory where the registry isn't yet created, the plugin will load the registry from the storage account and save it on the local server. This allows for a migration to local storage. A sketch for inspecting such a local registry follows below.
+
+ For pipelines that use the JSON codec or the JSON_LINES codec, the plugin uses one file to learn what the JSON header and tail look like; they can also be configured manually. Learning can be disabled with skip_learning.
+
+ ## Running the pipeline
+ The pipeline can be started in several ways.
+ - On the commandline
+ ```
+ /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/test.conf
+ ```
+ - In the pipelines.yml
+ ```
+ /etc/logstash/pipelines.yml
+ - pipeline.id: test
+   path.config: "/etc/logstash/conf.d/test.conf"
+ ```
+ - As managed pipeline from Kibana
+
+ Logstash itself (so not specific to this plugin) has a feature where multiple instances can run on the same system. The default TCP port is 9600, but if it's already in use it will use 9601 (and up). To update a config file on a running instance started on the commandline you can add the argument --config.reload.automatic, and if you modify files that are referenced in the pipelines.yml you can send a SIGHUP signal to reload the pipelines whose config was changed.
+ [https://www.elastic.co/guide/en/logstash/current/reloading-config.html](https://www.elastic.co/guide/en/logstash/current/reloading-config.html)
+
+ ## Internal Working
+ When the plugin is started, it will read all the filenames and sizes in the blob store, excluding the directories of files that are excluded by the "path_filters". After every interval it will write a registry to the storageaccount to save the information of how many bytes per blob (file) are read and processed. After all files are processed and at least one interval has passed, a new file list is generated and a worklist is constructed that will be processed. When a file has already been processed before, partial files are read from the offset to the filesize at the time of the file listing. If the codec is JSON, the header and tail will be added to partial files; they can be configured. If logtype is nsgflowlog, the plugin will split the log into individual tuple events (see the sketch after this section). The logtype wadiis may in the future be used to process the grok formats to split into log lines. Any other format is fed into the queue as one event per file or partial file. It's then up to the filter to split and mutate the file format.
+
+ By default the root of the json message is named "message", so you can modify the content in the filter block.
+
+ The configurations and the rest of the code are in [lib/logstash/inputs](https://github.com/janmg/logstash-input-azure_blob_storage/tree/master/lib/logstash/inputs) and [azure_blob_storage.rb](https://github.com/janmg/logstash-input-azure_blob_storage/blob/master/lib/logstash/inputs/azure_blob_storage.rb#L10)
+
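As a rough illustration of the nsgflowlog splitting mentioned above, the sketch below turns one flow tuple string into the same kind of hash the plugin emits; the sample tuple is fabricated and the field order follows the nsgflowlog method in azure_blob_storage.rb.
```
# Hedged sketch of NSG flow tuple parsing (sample data is made up).
require 'json'

tup = '1621241096,10.0.0.4,52.174.10.10,44931,443,T,O,A,E,25,4231,30,4569'
tups = tup.split(',')

# Fields common to version 1 and version 2 flow logs.
ev = { :unixtimestamp => tups[0], :src_ip => tups[1], :dst_ip => tups[2],
       :src_port => tups[3], :dst_port => tups[4], :protocol => tups[5],
       :direction => tups[6], :decision => tups[7] }

# Version 2 adds flow state and packet/byte counters; missing values become 0.
if tups.size > 8
  ev.merge!({ :flowstate => tups[8],
              :src_pack => tups[9] || 0, :src_bytes => tups[10] || 0,
              :dst_pack => tups[11] || 0, :dst_bytes => tups[12] || 0 })
end

puts ev.to_json
```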
  ## Enabling NSG Flowlogs
  1. Enable Network Watcher in your regions
  2. Create Storage account per region
@@ -39,7 +91,6 @@ logstash-plugin install logstash-input-azure_blob_storage
  - Access key (key1 or key2)
 
  ## Troubleshooting
-
  The default loglevel can be changed in global logstash.yml. On the info level, the plugin saves offsets to the registry every interval and logs statistics of processed events; the plugin will print for each pipeline the first 6 characters of the ID. In DEBUG, the log level shows details of the number of events per (partial) file that is read.
  ```
  log.level
@@ -50,10 +101,11 @@ The log level of the plugin can be put into DEBUG through
  curl -XPUT 'localhost:9600/_node/logging?pretty' -H 'Content-Type: application/json' -d'{"logger.logstash.inputs.azureblobstorage" : "DEBUG"}'
  ```
 
+ Because logstash debug makes logstash very chatty, the option debug_until will log debug information for a number of processed events and then stop. One file can easily contain thousands of events. debug_until is useful to monitor the start of the plugin and the processing of the first files; a small sketch of this counter gate follows below.
 
- ## Configuration Examples
- The minimum configuration required as input is storageaccount, access_key and container.
+ debug_timer will show detailed information on how much time the listing of files took and how long the plugin will sleep to fill the interval before the listing and processing starts again.
 
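For clarity, debug_until is only a counter gate around the extra log lines; a trivial, self-contained sketch of the same pattern (numbers made up):
```
# Trivial sketch of the debug_until gate: extra logging only for the first N events.
debug_until = 5
processed = 0

10.times do |i|
  puts "verbose: processing event #{i}" if debug_until > processed
  processed += 1
end
```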
+ ## Other Configuration Examples
  For nsgflowlogs, a simple configuration looks like this
  ```
  input {
@@ -77,6 +129,10 @@ filter {
  }
  }
 
+ output {
+ stdout { }
+ }
+
  output {
  elasticsearch {
  hosts => "elasticsearch"
@@ -84,22 +140,35 @@ output {
  }
  }
  ```
-
- It's possible to specify the optional parameters to overwrite the defaults. The iplookup, use_redis and iplist parameters are used for additional information about the source and destination ip address. Redis can be used for caching the results and iplist is to configure an array of ip addresses.
+ A more elaborate input configuration example:
  ```
  input {
  azure_blob_storage {
+ codec => "json"
  storageaccount => "yourstorageaccountname"
  access_key => "Ba5e64c0d3=="
  container => "insights-logs-networksecuritygroupflowevent"
- codec => "json"
  logtype => "nsgflowlog"
  prefix => "resourceId=/"
+ path_filters => ['**/*.json']
+ addfilename => true
  registry_create_policy => "resume"
+ registry_local_path => "/usr/share/logstash/plugin"
  interval => 300
+ debug_timer => true
+ debug_until => 100
+ }
+ }
+
+ output {
+ elasticsearch {
+ hosts => "elasticsearch"
+ index => "nsg-flow-logs-%{+xxxx.ww}"
  }
  }
  ```
+ The configuration documentation is in the first 100 lines of the code:
+ [GITHUB/janmg/logstash-input-azure_blob_storage/blob/master/lib/logstash/inputs/azure_blob_storage.rb](https://github.com/janmg/logstash-input-azure_blob_storage/blob/master/lib/logstash/inputs/azure_blob_storage.rb)
 
  For WAD IIS and App Services the HTTP AccessLogs can be retrieved from a storage account as line based events and parsed through GROK. The date stamp can also be parsed with %{TIMESTAMP_ISO8601:log_timestamp}. For WAD IIS logfiles the container is wad-iis-logfiles. In the future grokking may happen already by the plugin.
  ```
@@ -138,7 +207,7 @@ filter {
  remove_field => ["subresponse"]
  remove_field => ["username"]
  remove_field => ["clientPort"]
- remove_field => ["port"]
+ remove_field => ["port"]
  remove_field => ["timestamp"]
  }
  }
data/lib/logstash/inputs/azure_blob_storage.rb CHANGED
@@ -25,6 +25,9 @@ config :storageaccount, :validate => :string, :required => false
25
25
  # DNS Suffix other then blob.core.windows.net
26
26
  config :dns_suffix, :validate => :string, :required => false, :default => 'core.windows.net'
27
27
 
28
+ # For development this can be used to emulate a storage account when it is not available from azure
29
+ #config :use_development_storage, :validate => :boolean, :required => false
30
+
28
31
  # The (primary or secondary) Access Key for the the storage account. The key can be found in the portal.azure.com or through the azure api StorageAccounts/ListKeys. For example the PowerShell command Get-AzStorageAccountKey.
29
32
  config :access_key, :validate => :password, :required => false
30
33
 
@@ -39,6 +42,9 @@ config :container, :validate => :string, :default => 'insights-logs-networksecur
39
42
  # The default, `data/registry`, it contains a Ruby Marshal Serialized Hash of the filename the offset read sofar and the filelength the list time a filelisting was done.
40
43
  config :registry_path, :validate => :string, :required => false, :default => 'data/registry.dat'
41
44
 
45
+ # If registry_local_path is set to a directory on the local server, the registry is saved there instead of in the remote blob_storage
46
+ config :registry_local_path, :validate => :string, :required => false
47
+
42
48
  # The default, `resume`, will load the registry offsets and will start processing files from the offsets.
43
49
  # When set to `start_over`, all log files are processed from begining.
44
50
  # when set to `start_fresh`, it will read log files that are created or appended since this start of the pipeline.
@@ -55,12 +61,21 @@ config :registry_create_policy, :validate => ['resume','start_over','start_fresh
55
61
  # Z00000000000000000000000000000000 2 ]}
56
62
  config :interval, :validate => :number, :default => 60
57
63
 
64
+ # add the filename into the events
65
+ config :addfilename, :validate => :boolean, :default => false, :required => false
66
+
58
67
  # debug_until will for a maximum amount of processed messages shows 3 types of log printouts including processed filenames. This is a lightweight alternative to switching the loglevel from info to debug or even trace
59
68
  config :debug_until, :validate => :number, :default => 0, :required => false
60
69
 
70
+ # debug_timer shows time spent on activities
71
+ config :debug_timer, :validate => :boolean, :default => false, :required => false
72
+
61
73
  # WAD IIS Grok Pattern
62
74
  #config :grokpattern, :validate => :string, :required => false, :default => '%{TIMESTAMP_ISO8601:log_timestamp} %{NOTSPACE:instanceId} %{NOTSPACE:instanceId2} %{IPORHOST:ServerIP} %{WORD:httpMethod} %{URIPATH:requestUri} %{NOTSPACE:requestQuery} %{NUMBER:port} %{NOTSPACE:username} %{IPORHOST:clientIP} %{NOTSPACE:httpVersion} %{NOTSPACE:userAgent} %{NOTSPACE:cookie} %{NOTSPACE:referer} %{NOTSPACE:host} %{NUMBER:httpStatus} %{NUMBER:subresponse} %{NUMBER:win32response} %{NUMBER:sentBytes:int} %{NUMBER:receivedBytes:int} %{NUMBER:timeTaken:int}'
63
75
 
76
+ # skip learning if you use json and don't want to learn the head and tail, but use either the defaults or configure them.
77
+ config :skip_learning, :validate => :boolean, :default => false, :required => false
78
+
64
79
  # The string that starts the JSON. Only needed when the codec is JSON. When partial file are read, the result will not be valid JSON unless the start and end are put back. the file_head and file_tail are learned at startup, by reading the first file in the blob_list and taking the first and last block, this would work for blobs that are appended like nsgflowlogs. The configuration can be set to override the learning. In case learning fails and the option is not set, the default is to use the 'records' as set by nsgflowlogs.
65
80
  config :file_head, :validate => :string, :required => false, :default => '{"records":['
66
81
  # The string that ends the JSON
@@ -90,59 +105,55 @@ config :path_filters, :validate => :array, :default => ['**/*'], :required => fa
90
105
  public
91
106
  def register
92
107
  @pipe_id = Thread.current[:name].split("[").last.split("]").first
93
- @logger.info("=== "+config_name+" / "+@pipe_id+" / "+@id[0,6]+" ===")
94
- #@logger.info("ruby #{ RUBY_VERSION }p#{ RUBY_PATCHLEVEL } / #{Gem.loaded_specs[config_name].version.to_s}")
108
+ @logger.info("=== #{config_name} #{Gem.loaded_specs["logstash-input-"+config_name].version.to_s} / #{@pipe_id} / #{@id[0,6]} / ruby #{ RUBY_VERSION }p#{ RUBY_PATCHLEVEL } ===")
95
109
  @logger.info("If this plugin doesn't work, please raise an issue in https://github.com/janmg/logstash-input-azure_blob_storage")
96
110
  # TODO: consider multiple readers, so add pipeline @id or use logstash-to-logstash communication?
97
111
  # TODO: Implement retry ... Error: Connection refused - Failed to open TCP connection to
112
+ end
98
113
 
114
+
115
+
116
+ def run(queue)
99
117
  # counter for all processed events since the start of this pipeline
100
118
  @processed = 0
101
119
  @regsaved = @processed
102
120
 
103
- # Try in this order to access the storageaccount
104
- # 1. storageaccount / sas_token
105
- # 2. connection_string
106
- # 3. storageaccount / access_key
107
-
108
- unless connection_string.nil?
109
- conn = connection_string.value
110
- end
111
- unless sas_token.nil?
112
- unless sas_token.value.start_with?('?')
113
- conn = "BlobEndpoint=https://#{storageaccount}.#{dns_suffix};SharedAccessSignature=#{sas_token.value}"
114
- else
115
- conn = sas_token.value
116
- end
117
- end
118
- unless conn.nil?
119
- @blob_client = Azure::Storage::Blob::BlobService.create_from_connection_string(conn)
120
- else
121
- @blob_client = Azure::Storage::Blob::BlobService.create(
122
- storage_account_name: storageaccount,
123
- storage_dns_suffix: dns_suffix,
124
- storage_access_key: access_key.value,
125
- )
126
- end
121
+ connect
127
122
 
128
123
  @registry = Hash.new
129
124
  if registry_create_policy == "resume"
130
- @logger.info(@pipe_id+" resuming from registry")
131
125
  for counter in 1..3
132
126
  begin
133
- @registry = Marshal.load(@blob_client.get_blob(container, registry_path)[1])
134
- #[0] headers [1] responsebody
127
+ if (!@registry_local_path.nil?)
128
+ unless File.file?(@registry_local_path+"/"+@pipe_id)
129
+ @registry = Marshal.load(@blob_client.get_blob(container, registry_path)[1])
130
+ #[0] headers [1] responsebody
131
+ @logger.info("migrating from remote registry #{registry_path}")
132
+ else
133
+ if !Dir.exist?(@registry_local_path)
134
+ FileUtils.mkdir_p(@registry_local_path)
135
+ end
136
+ @registry = Marshal.load(File.read(@registry_local_path+"/"+@pipe_id))
137
+ @logger.info("resuming from local registry #{registry_local_path+"/"+@pipe_id}")
138
+ end
139
+ else
140
+ @registry = Marshal.load(@blob_client.get_blob(container, registry_path)[1])
141
+ #[0] headers [1] responsebody
142
+ @logger.info("resuming from remote registry #{registry_path}")
143
+ end
144
+ break
135
145
  rescue Exception => e
136
- @logger.error(@pipe_id+" caught: #{e.message}")
146
+ @logger.error("caught: #{e.message}")
137
147
  @registry.clear
138
- @logger.error(@pipe_id+" loading registry failed for attempt #{counter} of 3")
148
+ @logger.error("loading registry failed for attempt #{counter} of 3")
139
149
  end
140
150
  end
141
151
  end
142
152
  # read filelist and set offsets to file length to mark all the old files as done
143
153
  if registry_create_policy == "start_fresh"
144
- @logger.info(@pipe_id+" starting fresh")
145
154
  @registry = list_blobs(true)
155
+ save_registry(@registry)
156
+ @logger.info("starting fresh, writing a clean registry to contain #{@registry.size} blobs/files")
146
157
  end
147
158
 
148
159
  @is_json = false
@@ -155,34 +166,41 @@ def register
155
166
  @tail = ''
156
167
  # if codec=json sniff one files blocks A and Z to learn file_head and file_tail
157
168
  if @is_json
158
- learn_encapsulation
159
169
  if file_head
160
- @head = file_head
170
+ @head = file_head
161
171
  end
162
172
  if file_tail
163
- @tail = file_tail
173
+ @tail = file_tail
174
+ end
175
+ if file_head and file_tail and !skip_learning
176
+ learn_encapsulation
164
177
  end
165
- @logger.info(@pipe_id+" head will be: #{@head} and tail is set to #{@tail}")
178
+ @logger.info("head will be: #{@head} and tail is set to #{@tail}")
166
179
  end
167
- end # def register
168
-
169
180
 
170
-
171
- def run(queue)
172
181
  newreg = Hash.new
173
182
  filelist = Hash.new
174
183
  worklist = Hash.new
175
- # we can abort the loop if stop? becomes true
184
+ @last = start = Time.now.to_i
185
+
186
+ # This is the main loop, it
187
+ # 1. Lists all the files in the remote storage account that match the path prefix
188
+ # 2. Filters on path_filters to only include files that match the directory and file glob (**/*.json)
189
+ # 3. Save the listed files in a registry of known files and filesizes.
190
+ # 4. List all the files again and compare the registry with the new filelist and put the delta in a worklist
191
+ # 5. Process the worklist and put all events in the logstash queue.
192
+ # 6. If there is time left, sleep to complete the interval. If processing takes more than an interval, save the registry and continue.
193
+ # 7. If stop signal comes, finish the current file, save the registry and quit
176
194
  while !stop?
177
- chrono = Time.now.to_i
178
195
  # load the registry, compare it's offsets to file list, set offset to 0 for new files, process the whole list and if finished within the interval wait for next loop,
179
196
  # TODO: sort by timestamp ?
180
197
  #filelist.sort_by(|k,v|resource(k)[:date])
181
198
  worklist.clear
182
199
  filelist.clear
183
200
  newreg.clear
201
+
202
+ # Listing all the files
184
203
  filelist = list_blobs(false)
185
- # registry.merge(filelist) {|key, :offset, :length| :offset.merge :length }
186
204
  filelist.each do |name, file|
187
205
  off = 0
188
206
  begin
@@ -193,62 +211,98 @@ def run(queue)
193
211
  newreg.store(name, { :offset => off, :length => file[:length] })
194
212
  if (@debug_until > @processed) then @logger.info("2: adding offsets: #{name} #{off} #{file[:length]}") end
195
213
  end
196
-
214
+ # size nilClass when the list doesn't grow?!
197
215
  # Worklist is the subset of files where the already read offset is smaller than the file size
198
216
  worklist.clear
217
+ chunk = nil
218
+
199
219
  worklist = newreg.select {|name,file| file[:offset] < file[:length]}
200
- # This would be ideal for threading since it's IO intensive, would be nice with a ruby native ThreadPool
201
- worklist.each do |name, file|
202
- #res = resource(name)
220
+ if (worklist.size > 4) then @logger.info("worklist contains #{worklist.size} blobs") end
221
+
222
+ # Start of processing
223
+ # This would be ideal for threading since it's IO intensive, would be nice with a ruby native ThreadPool
224
+ if (worklist.size > 0) then
225
+ worklist.each do |name, file|
226
+ start = Time.now.to_i
203
227
  if (@debug_until > @processed) then @logger.info("3: processing #{name} from #{file[:offset]} to #{file[:length]}") end
204
228
  size = 0
205
229
  if file[:offset] == 0
206
- chunk = full_read(name)
207
- size=chunk.size
230
+ # This is where Sera4000 issue starts
231
+ # For an append blob, reading full and crashing, retry, last_modified? ... length? ... committed? ...
232
+ # length and skip reg value
233
+ if (file[:length] > 0)
234
+ begin
235
+ chunk = full_read(name)
236
+ size=chunk.size
237
+ rescue Exception => e
238
+ @logger.error("Failed to read #{name} because of: #{e.message} .. will continue, set file as read and pretend this never happened")
239
+ @logger.error("#{size} size and #{file[:length]} file length")
240
+ size = file[:length]
241
+ end
242
+ else
243
+ @logger.info("found a zero size file #{name}")
244
+ chunk = nil
245
+ end
208
246
  else
209
247
  chunk = partial_read_json(name, file[:offset], file[:length])
210
- @logger.info(@pipe_id+" partial file #{name} from #{file[:offset]} to #{file[:length]}")
248
+ @logger.debug("partial file #{name} from #{file[:offset]} to #{file[:length]}")
211
249
  end
212
250
  if logtype == "nsgflowlog" && @is_json
251
+ # skip empty chunks
252
+ unless chunk.nil?
213
253
  res = resource(name)
214
254
  begin
215
255
  fingjson = JSON.parse(chunk)
216
- @processed += nsgflowlog(queue, fingjson)
217
- @logger.debug(@pipe_id+" Processed #{res[:nsg]} [#{res[:date]}] #{@processed} events")
256
+ @processed += nsgflowlog(queue, fingjson, name)
257
+ @logger.debug("Processed #{res[:nsg]} [#{res[:date]}] #{@processed} events")
218
258
  rescue JSON::ParserError
219
- @logger.error(@pipe_id+" parse error on #{res[:nsg]} [#{res[:date]}] offset: #{file[:offset]} length: #{file[:length]}")
259
+ @logger.error("parse error on #{res[:nsg]} [#{res[:date]}] offset: #{file[:offset]} length: #{file[:length]}")
220
260
  end
261
+ end
221
262
  # TODO: Convert this to line based grokking.
222
263
  # TODO: ECS Compliance?
223
264
  elsif logtype == "wadiis" && !@is_json
224
265
  @processed += wadiislog(queue, name)
225
266
  else
226
267
  counter = 0
227
- @codec.decode(chunk) do |event|
268
+ begin
269
+ @codec.decode(chunk) do |event|
228
270
  counter += 1
271
+ if @addfilename
272
+ event.set('filename', name)
273
+ end
229
274
  decorate(event)
230
275
  queue << event
276
+ end
277
+ rescue Exception => e
278
+ @logger.error("codec exception: #{e.message} .. will continue and pretend this never happened")
279
+ @registry.store(name, { :offset => file[:length], :length => file[:length] })
280
+ @logger.debug("#{chunk}")
231
281
  end
232
282
  @processed += counter
233
283
  end
234
284
  @registry.store(name, { :offset => size, :length => file[:length] })
235
285
  # TODO add input plugin option to prevent connection cache
236
286
  @blob_client.client.reset_agents!
237
- #@logger.info(@pipe_id+" name #{name} size #{size} len #{file[:length]}")
287
+ #@logger.info("name #{name} size #{size} len #{file[:length]}")
238
288
  # if stop? good moment to stop what we're doing
239
289
  if stop?
240
290
  return
241
291
  end
242
- # save the registry past the regular intervals
243
- now = Time.now.to_i
244
- if ((now - chrono) > interval)
292
+ if ((Time.now.to_i - @last) > @interval)
245
293
  save_registry(@registry)
246
- chrono += interval
247
294
  end
295
+ end
296
+ end
297
+ # The files that got processed after the last registry save need to be saved too, in case the worklist is empty for some intervals.
298
+ now = Time.now.to_i
299
+ if ((now - @last) > @interval)
300
+ save_registry(@registry)
301
+ end
302
+ sleeptime = interval - ((now - start) % interval)
303
+ if @debug_timer
304
+ @logger.info("going to sleep for #{sleeptime} seconds")
248
305
  end
249
- # Save the registry and sleep until the remaining polling interval is over
250
- save_registry(@registry)
251
- sleeptime = interval - (Time.now.to_i - chrono)
252
306
  Stud.stoppable_sleep(sleeptime) { stop? }
253
307
  end
254
308
  end
@@ -262,8 +316,54 @@ end
262
316
 
263
317
 
264
318
  private
319
+ def connect
320
+ # Try in this order to access the storageaccount
321
+ # 1. storageaccount / sas_token
322
+ # 2. connection_string
323
+ # 3. storageaccount / access_key
324
+
325
+ unless connection_string.nil?
326
+ conn = connection_string.value
327
+ end
328
+ unless sas_token.nil?
329
+ unless sas_token.value.start_with?('?')
330
+ conn = "BlobEndpoint=https://#{storageaccount}.#{dns_suffix};SharedAccessSignature=#{sas_token.value}"
331
+ else
332
+ conn = sas_token.value
333
+ end
334
+ end
335
+ unless conn.nil?
336
+ @blob_client = Azure::Storage::Blob::BlobService.create_from_connection_string(conn)
337
+ else
338
+ # unless use_development_storage?
339
+ @blob_client = Azure::Storage::Blob::BlobService.create(
340
+ storage_account_name: storageaccount,
341
+ storage_dns_suffix: dns_suffix,
342
+ storage_access_key: access_key.value,
343
+ )
344
+ # else
345
+ # @logger.info("not yet implemented")
346
+ # end
347
+ end
348
+ end
349
+
265
350
  def full_read(filename)
266
- return @blob_client.get_blob(container, filename)[1]
351
+ tries ||= 2
352
+ begin
353
+ return @blob_client.get_blob(container, filename)[1]
354
+ rescue Exception => e
355
+ @logger.error("caught: #{e.message} for full_read")
356
+ if (tries -= 1) > 0
357
+ if e.message == "Connection reset by peer"
358
+ connect
359
+ end
360
+ retry
361
+ end
362
+ end
363
+ begin
364
+ chunk = @blob_client.get_blob(container, filename)[1]
365
+ end
366
+ return chunk
267
367
  end
268
368
 
269
369
  def partial_read_json(filename, offset, length)
@@ -286,8 +386,7 @@ def strip_comma(str)
286
386
  end
287
387
 
288
388
 
289
-
290
- def nsgflowlog(queue, json)
389
+ def nsgflowlog(queue, json, name)
291
390
  count=0
292
391
  json["records"].each do |record|
293
392
  res = resource(record["resourceId"])
@@ -300,9 +399,16 @@ def nsgflowlog(queue, json)
300
399
  tups = tup.split(',')
301
400
  ev = rule.merge({:unixtimestamp => tups[0], :src_ip => tups[1], :dst_ip => tups[2], :src_port => tups[3], :dst_port => tups[4], :protocol => tups[5], :direction => tups[6], :decision => tups[7]})
302
401
  if (record["properties"]["Version"]==2)
402
+ tups[9] = 0 if tups[9].nil?
403
+ tups[10] = 0 if tups[10].nil?
404
+ tups[11] = 0 if tups[11].nil?
405
+ tups[12] = 0 if tups[12].nil?
303
406
  ev.merge!( {:flowstate => tups[8], :src_pack => tups[9], :src_bytes => tups[10], :dst_pack => tups[11], :dst_bytes => tups[12]} )
304
407
  end
305
408
  @logger.trace(ev.to_s)
409
+ if @addfilename
410
+ ev.merge!( {:filename => name } )
411
+ end
306
412
  event = LogStash::Event.new('message' => ev.to_json)
307
413
  decorate(event)
308
414
  queue << event
@@ -333,66 +439,108 @@ end
333
439
  # list all blobs in the blobstore, set the offsets from the registry and return the filelist
334
440
  # inspired by: https://github.com/Azure-Samples/storage-blobs-ruby-quickstart/blob/master/example.rb
335
441
  def list_blobs(fill)
336
- files = Hash.new
337
- nextMarker = nil
338
- for counter in 1..3
339
- begin
442
+ tries ||= 3
443
+ begin
444
+ return try_list_blobs(fill)
445
+ rescue Exception => e
446
+ @logger.error("caught: #{e.message} for list_blobs retries left #{tries}")
447
+ if (tries -= 1) > 0
448
+ retry
449
+ end
450
+ end
451
+ end
452
+
453
+ def try_list_blobs(fill)
454
+ # inspired by: http://blog.mirthlab.com/2012/05/25/cleanly-retrying-blocks-of-code-after-an-exception-in-ruby/
455
+ chrono = Time.now.to_i
456
+ files = Hash.new
457
+ nextMarker = nil
458
+ counter = 1
459
+ loop do
340
460
  blobs = @blob_client.list_blobs(container, { marker: nextMarker, prefix: @prefix})
341
461
  blobs.each do |blob|
342
462
  # FNM_PATHNAME is required so that "**/test" can match "test" at the root folder
343
463
  # FNM_EXTGLOB allows you to use "test{a,b,c}" to match either "testa", "testb" or "testc" (closer to shell behavior)
344
464
  unless blob.name == registry_path
345
- if @path_filters.any? {|path| File.fnmatch?(path, blob.name, File::FNM_PATHNAME | File::FNM_EXTGLOB)}
465
+ if @path_filters.any? {|path| File.fnmatch?(path, blob.name, File::FNM_PATHNAME | File::FNM_EXTGLOB)}
346
466
  length = blob.properties[:content_length].to_i
347
467
  offset = 0
348
468
  if fill
349
469
  offset = length
350
470
  end
351
471
  files.store(blob.name, { :offset => offset, :length => length })
352
- if (@debug_until > @processed) then @logger.info("1: list_blobs #{blob.name} #{offset} #{length}") end
472
+ if (@debug_until > @processed) then @logger.info("1: list_blobs #{blob.name} #{offset} #{length}") end
353
473
  end
354
474
  end
355
475
  end
356
476
  nextMarker = blobs.continuation_token
357
477
  break unless nextMarker && !nextMarker.empty?
358
- rescue Exception => e
359
- @logger.error(@pipe_id+" caught: #{e.message} for attempt #{counter} of 3")
360
- counter += 1
361
- end
362
- end
478
+ if (counter % 10 == 0) then @logger.info(" listing #{counter * 50000} files") end
479
+ counter+=1
480
+ end
481
+ if @debug_timer
482
+ @logger.info("list_blobs took #{Time.now.to_i - chrono} sec")
483
+ end
363
484
  return files
364
485
  end
365
486
 
366
487
  # When events were processed after the last registry save, start a thread to update the registry file.
367
488
  def save_registry(filelist)
368
- # TODO because of threading, processed values and regsaved are not thread safe, they can change as instance variable @!
489
+ # Because of threading, processed values and regsaved are not thread safe, they can change as instance variable @! Most of the time this is fine because the registry is the last resort, but be careful about corner cases!
369
490
  unless @processed == @regsaved
370
491
  @regsaved = @processed
371
- @logger.info(@pipe_id+" processed #{@processed} events, saving #{filelist.size} blobs and offsets to registry #{registry_path}")
372
- Thread.new {
492
+ unless (@busy_writing_registry)
493
+ Thread.new {
373
494
  begin
374
- @blob_client.create_block_blob(container, registry_path, Marshal.dump(filelist))
495
+ @busy_writing_registry = true
496
+ unless (@registry_local_path)
497
+ @blob_client.create_block_blob(container, registry_path, Marshal.dump(filelist))
498
+ @logger.info("processed #{@processed} events, saving #{filelist.size} blobs and offsets to remote registry #{registry_path}")
499
+ else
500
+ File.open(@registry_local_path+"/"+@pipe_id, 'w') { |file| file.write(Marshal.dump(filelist)) }
501
+ @logger.info("processed #{@processed} events, saving #{filelist.size} blobs and offsets to local registry #{registry_local_path+"/"+@pipe_id}")
502
+ end
503
+ @busy_writing_registry = false
504
+ @last = Time.now.to_i
375
505
  rescue
376
- @logger.error(@pipe_id+" Oh my, registry write failed, do you have write access?")
506
+ @logger.error("Oh my, registry write failed, do you have write access?")
377
507
  end
378
508
  }
509
+ else
510
+ @logger.info("Skipped writing the registry because previous write still in progress, it just takes long or may be hanging!")
511
+ end
379
512
  end
380
513
  end
381
514
 
515
+
382
516
  def learn_encapsulation
517
+ @logger.info("learn_encapsulation, this can be skipped by setting skip_learning => true. Or set both head_file and tail_file")
383
518
  # From one file, read first block and last block to learn head and tail
384
- # If the blobstorage can't be found, an error from farraday middleware will come with the text
385
- # org.jruby.ext.set.RubySet cannot be cast to class org.jruby.RubyFixnum
386
- blob = @blob_client.list_blobs(container, { maxresults: 1, prefix: @prefix }).first
387
- return if blob.nil?
388
- blocks = @blob_client.list_blob_blocks(container, blob.name)[:committed]
389
- @logger.debug(@pipe_id+" using #{blob.name} to learn the json header and tail")
390
- @head = @blob_client.get_blob(container, blob.name, start_range: 0, end_range: blocks.first.size-1)[1]
391
- @logger.debug(@pipe_id+" learned header: #{@head}")
392
- length = blob.properties[:content_length].to_i
393
- offset = length - blocks.last.size
394
- @tail = @blob_client.get_blob(container, blob.name, start_range: offset, end_range: length-1)[1]
395
- @logger.debug(@pipe_id+" learned tail: #{@tail}")
519
+ begin
520
+ blobs = @blob_client.list_blobs(container, { max_results: 3, prefix: @prefix})
521
+ blobs.each do |blob|
522
+ unless blob.name == registry_path
523
+ begin
524
+ blocks = @blob_client.list_blob_blocks(container, blob.name)[:committed]
525
+ if blocks.first.name.start_with?('A00')
526
+ @logger.debug("using #{blob.name}/#{blocks.first.name} to learn the json header")
527
+ @head = @blob_client.get_blob(container, blob.name, start_range: 0, end_range: blocks.first.size-1)[1]
528
+ end
529
+ if blocks.last.name.start_with?('Z00')
530
+ @logger.debug("using #{blob.name}/#{blocks.last.name} to learn the json footer")
531
+ length = blob.properties[:content_length].to_i
532
+ offset = length - blocks.last.size
533
+ @tail = @blob_client.get_blob(container, blob.name, start_range: offset, end_range: length-1)[1]
534
+ @logger.debug("learned tail: #{@tail}")
535
+ end
536
+ rescue Exception => e
537
+ @logger.info("learn json one of the attempts failed #{e.message}")
538
+ end
539
+ end
540
+ end
541
+ rescue Exception => e
542
+ @logger.info("learn json header and footer failed because #{e.message}")
543
+ end
396
544
  end
397
545
 
398
546
  def resource(str)
logstash-input-azure_blob_storage.gemspec CHANGED
@@ -1,6 +1,6 @@
1
1
  Gem::Specification.new do |s|
2
2
  s.name = 'logstash-input-azure_blob_storage'
3
- s.version = '0.11.2'
3
+ s.version = '0.11.7'
4
4
  s.licenses = ['Apache-2.0']
5
5
  s.summary = 'This logstash plugin reads and parses data from Azure Storage Blobs.'
6
6
  s.description = <<-EOF
@@ -22,6 +22,6 @@ EOF
22
22
  # Gem dependencies
23
23
  s.add_runtime_dependency 'logstash-core-plugin-api', '~> 2.1'
24
24
  s.add_runtime_dependency 'stud', '~> 0.0.23'
25
- s.add_runtime_dependency 'azure-storage-blob', '~> 1.0'
26
- s.add_development_dependency 'logstash-devutils', '~> 1.0', '>= 1.0.0'
25
+ s.add_runtime_dependency 'azure-storage-blob', '~> 1.1'
26
+ #s.add_development_dependency 'logstash-devutils', '~> 2'
27
27
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: logstash-input-azure_blob_storage
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.11.2
4
+ version: 0.11.7
5
5
  platform: ruby
6
6
  authors:
7
7
  - Jan Geertsma
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2019-12-20 00:00:00.000000000 Z
11
+ date: 2021-05-17 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  requirement: !ruby/object:Gem::Requirement
@@ -17,8 +17,8 @@ dependencies:
17
17
  - !ruby/object:Gem::Version
18
18
  version: '2.1'
19
19
  name: logstash-core-plugin-api
20
- prerelease: false
21
20
  type: :runtime
21
+ prerelease: false
22
22
  version_requirements: !ruby/object:Gem::Requirement
23
23
  requirements:
24
24
  - - "~>"
@@ -31,8 +31,8 @@ dependencies:
31
31
  - !ruby/object:Gem::Version
32
32
  version: 0.0.23
33
33
  name: stud
34
- prerelease: false
35
34
  type: :runtime
35
+ prerelease: false
36
36
  version_requirements: !ruby/object:Gem::Requirement
37
37
  requirements:
38
38
  - - "~>"
@@ -43,35 +43,15 @@ dependencies:
43
43
  requirements:
44
44
  - - "~>"
45
45
  - !ruby/object:Gem::Version
46
- version: '1.0'
46
+ version: '1.1'
47
47
  name: azure-storage-blob
48
- prerelease: false
49
48
  type: :runtime
50
- version_requirements: !ruby/object:Gem::Requirement
51
- requirements:
52
- - - "~>"
53
- - !ruby/object:Gem::Version
54
- version: '1.0'
55
- - !ruby/object:Gem::Dependency
56
- requirement: !ruby/object:Gem::Requirement
57
- requirements:
58
- - - ">="
59
- - !ruby/object:Gem::Version
60
- version: 1.0.0
61
- - - "~>"
62
- - !ruby/object:Gem::Version
63
- version: '1.0'
64
- name: logstash-devutils
65
49
  prerelease: false
66
- type: :development
67
50
  version_requirements: !ruby/object:Gem::Requirement
68
51
  requirements:
69
- - - ">="
70
- - !ruby/object:Gem::Version
71
- version: 1.0.0
72
52
  - - "~>"
73
53
  - !ruby/object:Gem::Version
74
- version: '1.0'
54
+ version: '1.1'
75
55
  description: " This gem is a Logstash plugin. It reads and parses data from Azure\
76
56
  \ Storage Blobs. The azure_blob_storage is a reimplementation to replace azureblob\
77
57
  \ from azure-diagnostics-tools/Logstash. It can deal with larger volumes and partial\
@@ -112,8 +92,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
112
92
  - !ruby/object:Gem::Version
113
93
  version: '0'
114
94
  requirements: []
115
- rubyforge_project:
116
- rubygems_version: 2.7.9
95
+ rubygems_version: 3.0.6
117
96
  signing_key:
118
97
  specification_version: 4
119
98
  summary: This logstash plugin reads and parses data from Azure Storage Blobs.