logstash-input-azure_blob_storage 0.11.1 → 0.11.6

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 5fb68f13f46e7a0455fe4ffd3f6c9e04b136611e01504310bd739bbc6813c6f6
4
- data.tar.gz: 3f818813b0b45acac96edb34a4948d01c234946fb2580eefe5ece8e43240c0c1
3
+ metadata.gz: ececd96b04d2cab60eca54a0fe2a98c9ed093da2227e3568d4feea09264912fa
4
+ data.tar.gz: 7bcd39bc38d26a05da1275e5fb2317e41b5c2cddc6541535c7d166a69bb3cf62
5
5
  SHA512:
6
- metadata.gz: b596bbfc6a1e3400c33e54bbfa4adb753ea1c6593ae647da221368a089b25cd650856d4abb78c5f39ae39df67387b3de938962d63a90e06d2b54164599ced0a9
7
- data.tar.gz: 6c0eb3959fa0f393f63c0697f26d49b01280604e2443b8c0a17342d768f4c1a9402e4c8f658462cdd1d76cc11fa7a654bb36052ff05fdbec775050ad33539a1c
6
+ metadata.gz: 1bcbfab30de973e9eafee295221dc816411dca0e0f747a01c62bb48ec5c46eaf4db4162fdd5283611cd79da59910daab9e7c6e234df47f5ce7f320e65f7b8c69
7
+ data.tar.gz: 7bbdab8694d024b9c08cc89e13bc86aa8b90a536f5615565333593e0da7c3073d7c4cf3ad3f2b4005a90541de9693a93826158a18fbf9015234bee1812b3d46c
data/CHANGELOG.md CHANGED
@@ -1,29 +1,63 @@
1
+ ## 0.11.6
2
+ - fixed the max_results parameter in the json head and tail learning
3
+ - broke out the connection setup so it can be called again when connection exceptions occur
4
+ - deal better with skipping empty files.
5
+
6
+ ## 0.11.5
7
+ - added optional addfilename to add the filename to the message
8
+ - NSGFLOWLOG version 2 uses 0 as value instead of NULL in src and dst values
9
+ - added connection exception handling when doing a full_read of files
10
+ - rewrote the json header and footer learning to ignore the registry file when learning
11
+ - plumbing for emulator
12
+
13
+ ## 0.11.4
14
+ - fixed listing 3 times, rather than retrying to list max 3 times
15
+ - added option to migrate/save to using local registry
16
+ - rewrote interval timing
17
+ - reduced saving of registry to maximum once per interval, protect against duplicate simultaneous writes
18
+ - added debug_timer for better tracing how long operations take
19
+ - removing pipeline name from logfiles, logstash 7.6 and up have this in the log4j2 by default now
20
+ - moved initialization from register to run. should make logs more readable
21
+
22
+ ## 0.11.3
23
+ - don't crash on failed codec, e.g. gzip_lines could sometimes have a corrupted file?
24
+ - fix nextmarker loop so that more than 5000 files (or 15000 if faraday doesn't crash) can be listed
25
+
26
+ ## 0.11.2
27
+ - implemented path_filters to use path filtering like this **/*.log
28
+ - implemented debug_until to debug only at the start of a pipeline until it has processed enough messages
29
+
30
+ ## 0.11.1
31
+ - copied changes from irnc fork (danke!)
32
+ - fixed trying to load the registry, third time is the charm
33
+ - logs are less chatty, changed info to debug
34
+
1
35
  ## 0.11.0
2
- - Implemented start_fresh to skip all previous logs and start monitoring new entries
3
- - Fixed the timer, now properly sleep the interval and check again
4
- - Work around for a Faraday Middleware v.s. Azure Storage Account bug in follow_redirect
36
+ - implemented start_fresh to skip all previous logs and start monitoring new entries
37
+ - fixed the timer, now properly sleep the interval and check again
38
+ - work around for a Faraday Middleware v.s. Azure Storage Account bug in follow_redirect
5
39
 
6
40
  ## 0.10.6
7
- - Fixed the rootcause of the checking the codec. Now compare the classname.
41
+ - fixed the rootcause of the checking the codec. Now compare the classname.
8
42
 
9
43
  ## 0.10.5
10
- - Previous fix broke codec = "line"
44
+ - previous fix broke codec = "line"
11
45
 
12
46
  ## 0.10.4
13
- - Fixed JSON parsing error for partial files because somehow (logstash 7?) @codec.is_a? doesn't work anymore
47
+ - fixed JSON parsing error for partial files because somehow (logstash 7?) @codec.is_a? doesn't work anymore
14
48
 
15
49
  ## 0.10.3
16
- - Fixed issue-1 where iplookup confguration was removed, but still used
50
+ - fixed issue-1 where iplookup confguration was removed, but still used
17
51
  - iplookup is now done by a separate plugin named logstash-filter-weblookup
18
52
 
19
53
  ## 0.10.2
20
54
  - moved iplookup to own plugin logstash-filter-lookup
21
55
 
22
56
  ## 0.10.1
23
- - Implemented iplookup
24
- - Fixed sas tokens (maybe)
25
- - Introduced dns_suffix
57
+ - implemented iplookup
58
+ - fixed sas tokens (maybe)
59
+ - introduced dns_suffix
26
60
 
27
61
  ## 0.10.0
28
- - Plugin created with the logstash plugin generator
29
- - Reimplemented logstash-input-azureblob with incompatible config and data/registry
62
+ - plugin created with the logstash plugin generator
63
+ - reimplemented logstash-input-azureblob with incompatible config and data/registry
data/README.md CHANGED
@@ -1,29 +1,81 @@
1
- # Logstash Plugin
1
+ # Logstash
2
2
 
3
- This is a plugin for [Logstash](https://github.com/elastic/logstash).
3
+ This is a plugin for [Logstash](https://github.com/elastic/logstash). It is fully free and fully open source. The license is Apache 2.0, meaning you are pretty much free to use it however you want in whatever way. All logstash plugin documentation is placed under one [central location](http://www.elastic.co/guide/en/logstash/current/). Need generic logstash help? Try #logstash on freenode IRC or the https://discuss.elastic.co/c/logstash discussion forum.
4
4
 
5
- It is fully free and fully open source. The license is Apache 2.0, meaning you are pretty much free to use it however you want in whatever way.
6
-
7
- ## Documentation
8
-
9
- All plugin documentation are placed under one [central location](http://www.elastic.co/guide/en/logstash/current/).
10
-
11
- ## Need Help?
12
-
13
- Need help? Try #logstash on freenode IRC or the https://discuss.elastic.co/c/logstash discussion forum. For real problems or feature requests, raise a github issue [GITHUB/janmg/logstash-input-azure_blob_storage/](https://github.com/janmg/logstash-input-azure_blob_storage). Pull requests will ionly be merged after discussion through an issue.
5
+ For problems or feature requests with this specific plugin, raise a github issue [GITHUB/janmg/logstash-input-azure_blob_storage/](https://github.com/janmg/logstash-input-azure_blob_storage). Pull requests will also be welcomed after discussion through an issue.
14
6
 
15
7
  ## Purpose
16
- This plugin can read from Azure Storage Blobs, for instance diagnostics logs for NSG flow logs or accesslogs from App Services.
8
+ This plugin can read from Azure Storage Blobs, for instance JSON diagnostics logs for NSG flow logs or LINE based accesslogs from App Services.
17
9
  [Azure Blob Storage](https://azure.microsoft.com/en-us/services/storage/blobs/)
18
10
 
19
- After every interval it will write a registry to the storageaccount to save the information of how many bytes per blob (file) are read and processed. After all files are processed and at least one interval has passed a new file list is generated and a worklist is constructed that will be processed. When a file has already been processed before, partial files are read from the offset to the filesize at the time of the file listing. If the codec is JSON partial files will be have the header and tail will be added. They can be configured. If logtype is nsgflowlog, the plugin will process the splitting into individual tuple events. The logtype wadiis may in the future be used to process the grok formats to split into log lines. Any other format is fed into the queue as one event per file or partial file. It's then up to the filter to split and mutate the file format. use source => message in the filter {} block.
11
+ The plugin depends on the [Ruby library azure-storage-blob](https://rubygems.org/gems/azure-storage-blob/versions/1.1.0) from Microsoft, which depends on Faraday for the HTTPS connection to Azure.
12
+
13
+ The plugin executes the following steps (a simplified sketch of step 4 follows the list):
14
+ 1. Lists all the files in the azure storage account whose path matches the pathprefix
15
+ 2. Filters on path_filters to only include files that match the directory and file glob (e.g. **/*.json)
16
+ 3. Saves the listed files in a registry of known files and filesizes (data/registry.dat on azure, or in a file on the logstash instance)
17
+ 4. Lists all the files again, compares the registry with the new filelist and puts the delta in a worklist
18
+ 5. Processes the worklist and puts all events in the logstash queue.
19
+ 6. If there is time left, sleeps to complete the interval. If processing takes more than an interval, saves the registry and continues processing.
20
+ 7. If logstash is stopped, a stop signal will try to finish the current file, save the registry and then quit
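The sketch below illustrates step 4 only: comparing the registry with a fresh file listing and building the worklist that step 5 processes. The filenames and sizes are made up for illustration; the real listing comes from the storage account and the real code lives in lib/logstash/inputs/azure_blob_storage.rb.
```
# step 4 as a plain Hash comparison (illustrative data, not the plugin's exact code)
registry = {
  "nsg-a/PT1H.json" => { :offset => 1000, :length => 1000 },  # fully processed
  "nsg-b/PT1H.json" => { :offset =>  400, :length =>  900 }   # partially processed
}
filelist = {
  "nsg-a/PT1H.json" => { :offset => 0, :length => 1200 },     # grew since the last pass
  "nsg-c/PT1H.json" => { :offset => 0, :length =>  300 }      # new file
}
# carry over known offsets, new files start at offset 0
newreg = filelist.map do |name, file|
  off = registry.key?(name) ? registry[name][:offset] : 0
  [name, { :offset => off, :length => file[:length] }]
end.to_h
# only files with unread bytes end up on the worklist
worklist = newreg.select { |name, file| file[:offset] < file[:length] }
puts worklist.keys.inspect  # => ["nsg-a/PT1H.json", "nsg-c/PT1H.json"]
```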
20
21
 
21
22
  ## Installation
22
23
  This plugin can be installed through logstash-plugin
23
24
  ```
24
- logstash-plugin install logstash-input-azure_blob_storage
25
+ /usr/share/logstash/bin/logstash-plugin install logstash-input-azure_blob_storage
26
+ ```
27
+
28
+ ## Minimal Configuration
29
+ The minimum configuration required as input is storageaccount, access_key and container.
30
+
31
+ /etc/logstash/conf.d/test.conf
32
+ ```
33
+ input {
34
+ azure_blob_storage {
35
+ storageaccount => "yourstorageaccountname"
36
+ access_key => "Ba5e64c0d3=="
37
+ container => "insights-logs-networksecuritygroupflowevent"
38
+ }
39
+ }
25
40
  ```
26
41
 
42
+ ## Additional Configuration
43
+ The registry keeps track of files in the storage account, their size and how many bytes have been processed. Files can grow and the added part will be processed as a partial file. The registry is saved to disk every interval.
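As an illustration (not the plugin's exact code), the registry is essentially a Ruby Hash of blob name to offset and length, serialized with Marshal; the blob name and local path below are just examples.
```
# rough sketch of the registry contents and its Marshal serialization;
# the plugin writes it either as the blob data/registry.dat or, when
# registry_local_path is set, as a local file named after the pipeline id
registry = {
  "resourceId=/example/PT1H.json" => { :offset => 12345, :length => 45678 }
}
File.binwrite("/tmp/example-registry.dat", Marshal.dump(registry))
restored = Marshal.load(File.binread("/tmp/example-registry.dat"))
puts restored.inspect
```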
44
+
45
+ The registry_create_policy determines at the start of the pipeline whether processing should resume from the last known unprocessed file, start_fresh by ignoring old files and only processing new events that arrive after the start of the pipeline, or start_over to process all the files while ignoring the registry.
46
+
47
+ interval defines the minimum time between saves of the registry to the registry file (by default 'data/registry.dat'); this is only needed in case the pipeline dies unexpectedly, since during a normal shutdown the registry is also saved.
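The interval also drives how long the plugin sleeps at the end of each loop. A small sketch of the calculation used in the run loop (the values are made up; the formula matches the plugin's `interval - ((now - start) % interval)`):
```
# illustrative sleeptime calculation on the interval grid
interval = 300
start    = Time.now.to_i - 432        # pretend processing took 432 seconds
sleeptime = interval - ((Time.now.to_i - start) % interval)
puts sleeptime                        # => 168, the next listing starts on the next interval boundary
```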
48
+
49
+ When registry_local_path is set to a directory, the registry is saved on the logstash server in that directory. The filename is the pipeline id.
50
+
51
+ With registry_create_policy set to resume and registry_local_path set to a directory where the registry isn't yet created, the plugin loads the registry from the storage account and saves it on the local server. This allows for a migration to local storage.
52
+
53
+ For pipelines that use the JSON codec or the JSON_LINES codec, the plugin uses one file to learn what the JSON header and tail look like; they can also be configured manually. The learning can be disabled with skip_learning.
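For example, a chunk read from the middle of an nsgflowlog blob is not valid JSON on its own until the learned (or configured) file_head and file_tail are put back around it. A minimal sketch with a made-up record:
```
require 'json'

file_head = '{"records":['   # default head, override with file_head
file_tail = ']}'             # default tail, override with file_tail
# a partial read from the middle of the blob, e.g. one appended record
partial = '{"time":"2021-02-11T10:00:00Z","category":"NetworkSecurityGroupFlowEvent"}'

parsed = JSON.parse(file_head + partial + file_tail)
puts parsed["records"].length   # => 1
```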
54
+
55
+ ## Running the pipeline
56
+ The pipeline can be started in several ways.
57
+ - On the commandline
58
+ ```
59
+ /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/test.conf
60
+ ```
61
+ - In pipelines.yml
62
+ ```
63
+ /etc/logstash/pipelines.yml
64
+ - pipeline.id: test
65
+   path.config: /etc/logstash/conf.d/test.conf
66
+ ```
67
+ - As managed pipeline from Kibana
68
+
69
+ Logstash itself (so not specific to this plugin) has a feature where multiple instances can run on the same system. The default TCP port is 9600, but if it's already in use it will use 9601 (and up). To make a running instance pick up changes to a config file, you can start it with the argument --config.reload.automatic, and if you modify files that are referenced in pipelines.yml you can send a SIGHUP signal to reload the pipelines whose config was changed.
70
+ [https://www.elastic.co/guide/en/logstash/current/reloading-config.html](https://www.elastic.co/guide/en/logstash/current/reloading-config.html)
71
+
72
+ ## Internal Working
73
+ When the plugin is started, it will read all the filenames and sizes in the blob store, excluding the directories and files that are excluded by the "path_filters". After every interval it will write a registry to the storageaccount to save the information of how many bytes per blob (file) have been read and processed. After all files are processed and at least one interval has passed, a new file list is generated and a worklist is constructed that will be processed. When a file has already been processed before, partial files are read from the offset to the filesize at the time of the file listing. If the codec is JSON, the header and tail will be added back to partial files; they can be configured. If logtype is nsgflowlog, the plugin will split the records into individual tuple events. The logtype wadiis may in the future be used to process the grok formats to split into log lines. Any other format is fed into the queue as one event per file or partial file. It's then up to the filter to split and mutate the file format.
74
+
75
+ By default the root of the json message is named "message", so you can modify the content in the filter block.
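To illustrate the nsgflowlog splitting mentioned above, each flow tuple is a comma separated string that is mapped to named fields and emitted as JSON under "message". This sketch mirrors the field names used by the plugin; the sample version 2 tuple itself is made up.
```
require 'json'

# one NSG flow tuple (version 2), comma separated:
# unixtimestamp,src_ip,dst_ip,src_port,dst_port,protocol,direction,decision,
# flowstate,src_pack,src_bytes,dst_pack,dst_bytes
tup = "1613038800,10.0.0.4,10.0.0.5,44563,443,T,O,A,E,10,1200,12,5000".split(',')
ev = {
  :unixtimestamp => tup[0], :src_ip => tup[1], :dst_ip => tup[2],
  :src_port => tup[3], :dst_port => tup[4], :protocol => tup[5],
  :direction => tup[6], :decision => tup[7], :flowstate => tup[8],
  :src_pack => tup[9], :src_bytes => tup[10], :dst_pack => tup[11], :dst_bytes => tup[12]
}
# the plugin emits this hash as JSON in the "message" field of a Logstash event
puts ev.to_json
```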
76
+
77
+ The configurations and the rest of the code are in [lib/logstash/inputs](https://github.com/janmg/logstash-input-azure_blob_storage/tree/master/lib/logstash/inputs) and [azure_blob_storage.rb](https://github.com/janmg/logstash-input-azure_blob_storage/blob/master/lib/logstash/inputs/azure_blob_storage.rb#L10)
78
+
27
79
  ## Enabling NSG Flowlogs
28
80
  1. Enable Network Watcher in your regions
29
81
  2. Create Storage account per region
@@ -39,7 +91,6 @@ logstash-plugin install logstash-input-azure_blob_storage
39
91
  - Access key (key1 or key2)
40
92
 
41
93
  ## Troubleshooting
42
-
43
94
  The default loglevel can be changed in the global logstash.yml. On the info level, the plugin saves offsets to the registry every interval and logs statistics of processed events; for each pipeline it prints the first 6 characters of the pipeline ID. Setting the log level to debug shows details of the number of events per (partial) file that is read.
44
95
  ```
45
96
  log.level
@@ -50,10 +101,11 @@ The log level of the plugin can be put into DEBUG through
50
101
  curl -XPUT 'localhost:9600/_node/logging?pretty' -H 'Content-Type: application/json' -d'{"logger.logstash.inputs.azureblobstorage" : "DEBUG"}'
51
102
  ```
52
103
 
104
+ Because the debug level makes logstash very chatty, the option debug_until will log extra information for a number of processed events and then stop. One file can easily contain thousands of events. debug_until is useful to monitor the start of the plugin and the processing of the first files.
53
105
 
54
- ## Configuration Examples
55
- The minimum configuration required as input is storageaccount, access_key and container.
106
+ debug_timer will show detailed information on how much time the listing of files took and how long the plugin will sleep to fill the interval before the listing and processing starts again.
56
107
 
108
+ ## Other Configuration Examples
57
109
  For nsgflowlogs, a simple configuration looks like this
58
110
  ```
59
111
  input {
@@ -77,6 +129,10 @@ filter {
77
129
  }
78
130
  }
79
131
 
132
+ output {
133
+ stdout { }
134
+ }
135
+
80
136
  output {
81
137
  elasticsearch {
82
138
  hosts => "elasticsearch"
@@ -84,22 +140,35 @@ output {
84
140
  }
85
141
  }
86
142
  ```
87
-
88
- It's possible to specify the optional parameters to overwrite the defaults. The iplookup, use_redis and iplist parameters are used for additional information about the source and destination ip address. Redis can be used for caching the results and iplist is to configure an array of ip addresses.
143
+ A more elaborate input configuration example
89
144
  ```
90
145
  input {
91
146
  azure_blob_storage {
147
+ codec => "json"
92
148
  storageaccount => "yourstorageaccountname"
93
149
  access_key => "Ba5e64c0d3=="
94
150
  container => "insights-logs-networksecuritygroupflowevent"
95
- codec => "json"
96
151
  logtype => "nsgflowlog"
97
152
  prefix => "resourceId=/"
153
+ path_filters => ['**/*.json']
154
+ addfilename => true
98
155
  registry_create_policy => "resume"
156
+ registry_local_path => "/usr/share/logstash/plugin"
99
157
  interval => 300
158
+ debug_timer => true
159
+ debug_until => 100
160
+ }
161
+ }
162
+
163
+ output {
164
+ elasticsearch {
165
+ hosts => "elasticsearch"
166
+ index => "nsg-flow-logs-%{+xxxx.ww}"
100
167
  }
101
168
  }
102
169
  ```
170
+ The configuration documentation is in the first 100 lines of the code
171
+ [GITHUB/janmg/logstash-input-azure_blob_storage/blob/master/lib/logstash/inputs/azure_blob_storage.rb](https://github.com/janmg/logstash-input-azure_blob_storage/blob/master/lib/logstash/inputs/azure_blob_storage.rb)
103
172
 
104
173
  For WAD IIS and App Services the HTTP AccessLogs can be retrieved from a storage account as line based events and parsed through GROK. The date stamp can also be parsed with %{TIMESTAMP_ISO8601:log_timestamp}. For WAD IIS logfiles the container is wad-iis-logfiles. In the future grokking may happen already by the plugin.
105
174
  ```
@@ -138,7 +207,7 @@ filter {
138
207
  remove_field => ["subresponse"]
139
208
  remove_field => ["username"]
140
209
  remove_field => ["clientPort"]
141
- remove_field => ["port"]
210
+ remove_field => ["port"]
142
211
  remove_field => ["timestamp"]
143
212
  }
144
213
  }
@@ -25,6 +25,9 @@ config :storageaccount, :validate => :string, :required => false
25
25
  # DNS Suffix other than blob.core.windows.net
26
26
  config :dns_suffix, :validate => :string, :required => false, :default => 'core.windows.net'
27
27
 
28
+ # For development this can be used to emulate a storage account when not available from azure
29
+ #config :use_development_storage, :validate => :boolean, :required => false
30
+
28
31
  # The (primary or secondary) Access Key for the storage account. The key can be found in the portal.azure.com or through the azure api StorageAccounts/ListKeys. For example the PowerShell command Get-AzStorageAccountKey.
29
32
  config :access_key, :validate => :password, :required => false
30
33
 
@@ -39,6 +42,9 @@ config :container, :validate => :string, :default => 'insights-logs-networksecur
39
42
  # The default, `data/registry.dat`, contains a Ruby Marshal serialized Hash of the filename, the offset read so far and the filelength at the last time a filelisting was done.
40
43
  config :registry_path, :validate => :string, :required => false, :default => 'data/registry.dat'
41
44
 
45
+ # If registry_local_path is set to a directory on the local server, the registry is saved there instead of on the remote blob_storage
46
+ config :registry_local_path, :validate => :string, :required => false
47
+
42
48
  # The default, `resume`, will load the registry offsets and will start processing files from the offsets.
43
49
  # When set to `start_over`, all log files are processed from the beginning.
44
50
  # When set to `start_fresh`, it will only read log files that are created or appended since the start of the pipeline.
@@ -55,9 +61,21 @@ config :registry_create_policy, :validate => ['resume','start_over','start_fresh
55
61
  # Z00000000000000000000000000000000 2 ]}
56
62
  config :interval, :validate => :number, :default => 60
57
63
 
64
+ # add the filename into the events
65
+ config :addfilename, :validate => :boolean, :default => false, :required => false
66
+
67
+ # debug_until will, for a maximum number of processed messages, show 3 types of log printouts including processed filenames. This is a lightweight alternative to switching the loglevel from info to debug or even trace
68
+ config :debug_until, :validate => :number, :default => 0, :required => false
69
+
70
+ # debug_timer shows time spent on activities
71
+ config :debug_timer, :validate => :boolean, :default => false, :required => false
72
+
58
73
  # WAD IIS Grok Pattern
59
74
  #config :grokpattern, :validate => :string, :required => false, :default => '%{TIMESTAMP_ISO8601:log_timestamp} %{NOTSPACE:instanceId} %{NOTSPACE:instanceId2} %{IPORHOST:ServerIP} %{WORD:httpMethod} %{URIPATH:requestUri} %{NOTSPACE:requestQuery} %{NUMBER:port} %{NOTSPACE:username} %{IPORHOST:clientIP} %{NOTSPACE:httpVersion} %{NOTSPACE:userAgent} %{NOTSPACE:cookie} %{NOTSPACE:referer} %{NOTSPACE:host} %{NUMBER:httpStatus} %{NUMBER:subresponse} %{NUMBER:win32response} %{NUMBER:sentBytes:int} %{NUMBER:receivedBytes:int} %{NUMBER:timeTaken:int}'
60
75
 
76
+ # skip learning if you use json and don't want to learn the head and tail, but use either the defaults or configure them.
77
+ config :skip_learning, :validate => :boolean, :default => false, :required => false
78
+
61
79
  # The string that starts the JSON. Only needed when the codec is JSON. When partial files are read, the result will not be valid JSON unless the start and end are put back. The file_head and file_tail are learned at startup by reading the first file in the blob list and taking the first and last block; this works for blobs that are appended, like nsgflowlogs. The configuration can be set to override the learning. In case learning fails and the option is not set, the default is to use 'records' as set by nsgflowlogs.
62
80
  config :file_head, :validate => :string, :required => false, :default => '{"records":['
63
81
  # The string that ends the JSON
@@ -76,64 +94,66 @@ config :file_tail, :validate => :string, :required => false, :default => ']}'
76
94
  # For NSGFLOWLOGS a path starts with "resourceId=/", but this would only be needed to exclude other files that may be written in the same container.
77
95
  config :prefix, :validate => :string, :required => false
78
96
 
97
+ config :path_filters, :validate => :array, :default => ['**/*'], :required => false
98
+
99
+ # TODO: Other feature requests
100
+ # show file path in logger
101
+ # add filepath as part of log message
102
+ # option to keep registry on local disk
79
103
 
80
104
 
81
105
  public
82
106
  def register
83
107
  @pipe_id = Thread.current[:name].split("[").last.split("]").first
84
- @logger.info("=== "+config_name+" / "+@pipe_id+" / "+@id[0,6]+" ===")
85
- #@logger.info("ruby #{ RUBY_VERSION }p#{ RUBY_PATCHLEVEL } / #{Gem.loaded_specs[config_name].version.to_s}")
108
+ @logger.info("=== #{config_name} #{Gem.loaded_specs["logstash-input-"+config_name].version.to_s} / #{@pipe_id} / #{@id[0,6]} / ruby #{ RUBY_VERSION }p#{ RUBY_PATCHLEVEL } ===")
86
109
  @logger.info("If this plugin doesn't work, please raise an issue in https://github.com/janmg/logstash-input-azure_blob_storage")
87
110
  # TODO: consider multiple readers, so add pipeline @id or use logstash-to-logstash communication?
88
111
  # TODO: Implement retry ... Error: Connection refused - Failed to open TCP connection to
112
+ end
113
+
89
114
 
115
+
116
+ def run(queue)
90
117
  # counter for all processed events since the start of this pipeline
91
118
  @processed = 0
92
119
  @regsaved = @processed
93
120
 
94
- # Try in this order to access the storageaccount
95
- # 1. storageaccount / sas_token
96
- # 2. connection_string
97
- # 3. storageaccount / access_key
98
-
99
- unless connection_string.nil?
100
- conn = connection_string.value
101
- end
102
- unless sas_token.nil?
103
- unless sas_token.value.start_with?('?')
104
- conn = "BlobEndpoint=https://#{storageaccount}.#{dns_suffix};SharedAccessSignature=#{sas_token.value}"
105
- else
106
- conn = sas_token.value
107
- end
108
- end
109
- unless conn.nil?
110
- @blob_client = Azure::Storage::Blob::BlobService.create_from_connection_string(conn)
111
- else
112
- @blob_client = Azure::Storage::Blob::BlobService.create(
113
- storage_account_name: storageaccount,
114
- storage_dns_suffix: dns_suffix,
115
- storage_access_key: access_key.value,
116
- )
117
- end
121
+ connect
118
122
 
119
123
  @registry = Hash.new
120
124
  if registry_create_policy == "resume"
121
- @logger.info(@pipe_id+" resuming from registry")
122
- for counter in 0..3
125
+ for counter in 1..3
123
126
  begin
124
- @registry = Marshal.load(@blob_client.get_blob(container, registry_path)[1])
125
- #[0] headers [1] responsebody
127
+ if (!@registry_local_path.nil?)
128
+ unless File.file?(@registry_local_path+"/"+@pipe_id)
129
+ @registry = Marshal.load(@blob_client.get_blob(container, registry_path)[1])
130
+ #[0] headers [1] responsebody
131
+ @logger.info("migrating from remote registry #{registry_path}")
132
+ else
133
+ if !Dir.exist?(@registry_local_path)
134
+ FileUtils.mkdir_p(@registry_local_path)
135
+ end
136
+ @registry = Marshal.load(File.read(@registry_local_path+"/"+@pipe_id))
137
+ @logger.info("resuming from local registry #{registry_local_path+"/"+@pipe_id}")
138
+ end
139
+ else
140
+ @registry = Marshal.load(@blob_client.get_blob(container, registry_path)[1])
141
+ #[0] headers [1] responsebody
142
+ @logger.info("resuming from remote registry #{registry_path}")
143
+ end
144
+ break
126
145
  rescue Exception => e
127
- @logger.error(@pipe_id+" caught: #{e.message}")
146
+ @logger.error("caught: #{e.message}")
128
147
  @registry.clear
129
- @logger.error(@pipe_id+" loading registry failed, starting over")
148
+ @logger.error("loading registry failed for attempt #{counter} of 3")
130
149
  end
131
150
  end
132
151
  end
133
152
  # read filelist and set offsets to file length to mark all the old files as done
134
153
  if registry_create_policy == "start_fresh"
135
- @logger.info(@pipe_id+" starting fresh")
136
154
  @registry = list_blobs(true)
155
+ save_registry(@registry)
156
+ @logger.info("starting fresh, writing a clean registry to contain #{@registry.size} blobs/files")
137
157
  end
138
158
 
139
159
  @is_json = false
@@ -146,34 +166,41 @@ def register
146
166
  @tail = ''
147
167
  # if codec=json sniff one files blocks A and Z to learn file_head and file_tail
148
168
  if @is_json
149
- learn_encapsulation
150
169
  if file_head
151
- @head = file_head
170
+ @head = file_head
152
171
  end
153
172
  if file_tail
154
- @tail = file_tail
173
+ @tail = file_tail
155
174
  end
156
- @logger.info(@pipe_id+" head will be: #{@head} and tail is set to #{@tail}")
175
+ if file_head and file_tail and !skip_learning
176
+ learn_encapsulation
177
+ end
178
+ @logger.info("head will be: #{@head} and tail is set to #{@tail}")
157
179
  end
158
- end # def register
159
-
160
-
161
180
 
162
- def run(queue)
163
181
  newreg = Hash.new
164
182
  filelist = Hash.new
165
183
  worklist = Hash.new
166
- # we can abort the loop if stop? becomes true
184
+ @last = start = Time.now.to_i
185
+
186
+ # This is the main loop, it
187
+ # 1. Lists all the files in the remote storage account that match the path prefix
188
+ # 2. Filters on path_filters to only include files that match the directory and file glob (**/*.json)
189
+ # 3. Save the listed files in a registry of known files and filesizes.
190
+ # 4. List all the files again and compare the registry with the new filelist and put the delta in a worklist
191
+ # 5. Process the worklist and put all events in the logstash queue.
192
+ # 6. if there is time left, sleep to complete the interval. If processing takes more than an inteval, save the registry and continue.
193
+ # 7. If stop signal comes, finish the current file, save the registry and quit
167
194
  while !stop?
168
- chrono = Time.now.to_i
169
195
  # load the registry, compare it's offsets to file list, set offset to 0 for new files, process the whole list and if finished within the interval wait for next loop,
170
196
  # TODO: sort by timestamp ?
171
197
  #filelist.sort_by(|k,v|resource(k)[:date])
172
198
  worklist.clear
173
199
  filelist.clear
174
200
  newreg.clear
201
+
202
+ # Listing all the files
175
203
  filelist = list_blobs(false)
176
- # registry.merge(filelist) {|key, :offset, :length| :offset.merge :length }
177
204
  filelist.each do |name, file|
178
205
  off = 0
179
206
  begin
@@ -182,63 +209,98 @@ def run(queue)
182
209
  off = 0
183
210
  end
184
211
  newreg.store(name, { :offset => off, :length => file[:length] })
212
+ if (@debug_until > @processed) then @logger.info("2: adding offsets: #{name} #{off} #{file[:length]}") end
185
213
  end
186
-
214
+ # size nilClass when the list doesn't grow?!
187
215
  # Worklist is the subset of files where the already read offset is smaller than the file size
188
216
  worklist.clear
217
+ chunk = nil
218
+
189
219
  worklist = newreg.select {|name,file| file[:offset] < file[:length]}
190
- # This would be ideal for threading since it's IO intensive, would be nice with a ruby native ThreadPool
191
- worklist.each do |name, file|
192
- #res = resource(name)
193
- @logger.debug(@pipe_id+" processing #{name} from #{file[:offset]} to #{file[:length]}")
220
+ if (worklist.size > 4) then @logger.info("worklist contains #{worklist.size} blobs") end
221
+
222
+ # Start of processing
223
+ # This would be ideal for threading since it's IO intensive, would be nice with a ruby native ThreadPool
224
+ if (worklist.size > 0) then
225
+ worklist.each do |name, file|
226
+ start = Time.now.to_i
227
+ if (@debug_until > @processed) then @logger.info("3: processing #{name} from #{file[:offset]} to #{file[:length]}") end
194
228
  size = 0
195
229
  if file[:offset] == 0
196
- chunk = full_read(name)
197
- size=chunk.size
230
+ # This is where Sera4000 issue starts
231
+ # For an append blob, reading full and crashing, retry, last_modified? ... length? ... committed? ...
232
+ # length and skip reg value
233
+ if (file[:length] > 0)
234
+ begin
235
+ chunk = full_read(name)
236
+ size=chunk.size
237
+ rescue Exception => e
238
+ @logger.error("Failed to read #{name} because of: #{e.message} .. will continue and pretend this never happened")
239
+ end
240
+ else
241
+ @logger.info("found a zero size file #{name}")
242
+ chunk = nil
243
+ end
198
244
  else
199
245
  chunk = partial_read_json(name, file[:offset], file[:length])
200
- @logger.debug(@pipe_id+" partial file #{name} from #{file[:offset]} to #{file[:length]}")
246
+ @logger.debug("partial file #{name} from #{file[:offset]} to #{file[:length]}")
201
247
  end
202
248
  if logtype == "nsgflowlog" && @is_json
249
+ # skip empty chunks
250
+ unless chunk.nil?
203
251
  res = resource(name)
204
252
  begin
205
253
  fingjson = JSON.parse(chunk)
206
- @processed += nsgflowlog(queue, fingjson)
207
- @logger.debug(@pipe_id+" Processed #{res[:nsg]} [#{res[:date]}] #{@processed} events")
254
+ @processed += nsgflowlog(queue, fingjson, name)
255
+ @logger.debug("Processed #{res[:nsg]} [#{res[:date]}] #{@processed} events")
208
256
  rescue JSON::ParserError
209
- @logger.error(@pipe_id+" parse error on #{res[:nsg]} [#{res[:date]}] offset: #{file[:offset]} length: #{file[:length]}")
257
+ @logger.error("parse error on #{res[:nsg]} [#{res[:date]}] offset: #{file[:offset]} length: #{file[:length]}")
210
258
  end
259
+ end
211
260
  # TODO: Convert this to line based grokking.
212
261
  # TODO: ECS Compliance?
213
262
  elsif logtype == "wadiis" && !@is_json
214
263
  @processed += wadiislog(queue, name)
215
264
  else
216
265
  counter = 0
217
- @codec.decode(chunk) do |event|
266
+ begin
267
+ @codec.decode(chunk) do |event|
218
268
  counter += 1
269
+ if @addfilename
270
+ event.set('filename', name)
271
+ end
219
272
  decorate(event)
220
273
  queue << event
274
+ end
275
+ rescue Exception => e
276
+ @logger.error("codec exception: #{e.message} .. will continue and pretend this never happened")
277
+ @registry.store(name, { :offset => file[:length], :length => file[:length] })
278
+ @logger.debug("#{chunk}")
221
279
  end
222
280
  @processed += counter
223
281
  end
224
282
  @registry.store(name, { :offset => size, :length => file[:length] })
225
283
  # TODO add input plugin option to prevent connection cache
226
284
  @blob_client.client.reset_agents!
227
- #@logger.info(@pipe_id+" name #{name} size #{size} len #{file[:length]}")
285
+ #@logger.info("name #{name} size #{size} len #{file[:length]}")
228
286
  # if stop? good moment to stop what we're doing
229
287
  if stop?
230
288
  return
231
289
  end
232
- # save the registry past the regular intervals
233
- now = Time.now.to_i
234
- if ((now - chrono) > interval)
290
+ if ((Time.now.to_i - @last) > @interval)
235
291
  save_registry(@registry)
236
- chrono += interval
237
292
  end
293
+ end
294
+ end
295
+ # The files that got processed after the last registry save need to be saved too, in case the worklist is empty for some intervals.
296
+ now = Time.now.to_i
297
+ if ((now - @last) > @interval)
298
+ save_registry(@registry)
299
+ end
300
+ sleeptime = interval - ((now - start) % interval)
301
+ if @debug_timer
302
+ @logger.info("going to sleep for #{sleeptime} seconds")
238
303
  end
239
- # Save the registry and sleep until the remaining polling interval is over
240
- save_registry(@registry)
241
- sleeptime = interval - (Time.now.to_i - chrono)
242
304
  Stud.stoppable_sleep(sleeptime) { stop? }
243
305
  end
244
306
  end
@@ -252,8 +314,54 @@ end
252
314
 
253
315
 
254
316
  private
317
+ def connect
318
+ # Try in this order to access the storageaccount
319
+ # 1. storageaccount / sas_token
320
+ # 2. connection_string
321
+ # 3. storageaccount / access_key
322
+
323
+ unless connection_string.nil?
324
+ conn = connection_string.value
325
+ end
326
+ unless sas_token.nil?
327
+ unless sas_token.value.start_with?('?')
328
+ conn = "BlobEndpoint=https://#{storageaccount}.#{dns_suffix};SharedAccessSignature=#{sas_token.value}"
329
+ else
330
+ conn = sas_token.value
331
+ end
332
+ end
333
+ unless conn.nil?
334
+ @blob_client = Azure::Storage::Blob::BlobService.create_from_connection_string(conn)
335
+ else
336
+ # unless use_development_storage?
337
+ @blob_client = Azure::Storage::Blob::BlobService.create(
338
+ storage_account_name: storageaccount,
339
+ storage_dns_suffix: dns_suffix,
340
+ storage_access_key: access_key.value,
341
+ )
342
+ # else
343
+ # @logger.info("not yet implemented")
344
+ # end
345
+ end
346
+ end
347
+
255
348
  def full_read(filename)
256
- return @blob_client.get_blob(container, filename)[1]
349
+ tries ||= 2
350
+ begin
351
+ return @blob_client.get_blob(container, filename)[1]
352
+ rescue Exception => e
353
+ @logger.error("caught: #{e.message} for full_read")
354
+ if (tries -= 1) > 0
355
+ if e.message == "Connection reset by peer"
356
+ connect
357
+ end
358
+ retry
359
+ end
360
+ end
361
+ begin
362
+ chunk = @blob_client.get_blob(container, filename)[1]
363
+ end
364
+ return chunk
257
365
  end
258
366
 
259
367
  def partial_read_json(filename, offset, length)
@@ -276,8 +384,7 @@ def strip_comma(str)
276
384
  end
277
385
 
278
386
 
279
-
280
- def nsgflowlog(queue, json)
387
+ def nsgflowlog(queue, json, name)
281
388
  count=0
282
389
  json["records"].each do |record|
283
390
  res = resource(record["resourceId"])
@@ -290,9 +397,16 @@ def nsgflowlog(queue, json)
290
397
  tups = tup.split(',')
291
398
  ev = rule.merge({:unixtimestamp => tups[0], :src_ip => tups[1], :dst_ip => tups[2], :src_port => tups[3], :dst_port => tups[4], :protocol => tups[5], :direction => tups[6], :decision => tups[7]})
292
399
  if (record["properties"]["Version"]==2)
400
+ tups[9] = 0 if tups[9].nil?
401
+ tups[10] = 0 if tups[10].nil?
402
+ tups[11] = 0 if tups[11].nil?
403
+ tups[12] = 0 if tups[12].nil?
293
404
  ev.merge!( {:flowstate => tups[8], :src_pack => tups[9], :src_bytes => tups[10], :dst_pack => tups[11], :dst_bytes => tups[12]} )
294
405
  end
295
406
  @logger.trace(ev.to_s)
407
+ if @addfilename
408
+ ev.merge!( {:filename => name } )
409
+ end
296
410
  event = LogStash::Event.new('message' => ev.to_json)
297
411
  decorate(event)
298
412
  queue << event
@@ -323,66 +437,108 @@ end
323
437
  # list all blobs in the blobstore, set the offsets from the registry and return the filelist
324
438
  # inspired by: https://github.com/Azure-Samples/storage-blobs-ruby-quickstart/blob/master/example.rb
325
439
  def list_blobs(fill)
326
- files = Hash.new
327
- nextMarker = nil
328
- counter = 0
329
- loop do
330
- begin
331
- if (counter > 10)
332
- @logger.error(@pipe_id+" lets try again for the 10th time, why don't faraday and azure storage accounts not play nice together? it has something to do with follow_redirect and a missing authorization header?")
333
- end
440
+ tries ||= 3
441
+ begin
442
+ return try_list_blobs(fill)
443
+ rescue Exception => e
444
+ @logger.error("caught: #{e.message} for list_blobs retries left #{tries}")
445
+ if (tries -= 1) > 0
446
+ retry
447
+ end
448
+ end
449
+ end
450
+
451
+ def try_list_blobs(fill)
452
+ # inspired by: http://blog.mirthlab.com/2012/05/25/cleanly-retrying-blocks-of-code-after-an-exception-in-ruby/
453
+ chrono = Time.now.to_i
454
+ files = Hash.new
455
+ nextMarker = nil
456
+ counter = 1
457
+ loop do
334
458
  blobs = @blob_client.list_blobs(container, { marker: nextMarker, prefix: @prefix})
335
459
  blobs.each do |blob|
336
- # exclude the registry itself
337
- unless blob.name == registry_path
460
+ # FNM_PATHNAME is required so that "**/test" can match "test" at the root folder
461
+ # FNM_EXTGLOB allows you to use "test{a,b,c}" to match either "testa", "testb" or "testc" (closer to shell behavior)
462
+ unless blob.name == registry_path
463
+ if @path_filters.any? {|path| File.fnmatch?(path, blob.name, File::FNM_PATHNAME | File::FNM_EXTGLOB)}
338
464
  length = blob.properties[:content_length].to_i
339
- offset = 0
465
+ offset = 0
340
466
  if fill
341
467
  offset = length
342
- end
468
+ end
343
469
  files.store(blob.name, { :offset => offset, :length => length })
470
+ if (@debug_until > @processed) then @logger.info("1: list_blobs #{blob.name} #{offset} #{length}") end
344
471
  end
472
+ end
345
473
  end
346
474
  nextMarker = blobs.continuation_token
347
475
  break unless nextMarker && !nextMarker.empty?
348
- rescue Exception => e
349
- @logger.error(@pipe_id+" caught: #{e.message}")
350
- counter += 1
351
- end
352
- end
476
+ if (counter % 10 == 0) then @logger.info(" listing #{counter * 50000} files") end
477
+ counter+=1
478
+ end
479
+ if @debug_timer
480
+ @logger.info("list_blobs took #{Time.now.to_i - chrono} sec")
481
+ end
353
482
  return files
354
483
  end
355
484
 
356
485
  # When events were processed after the last registry save, start a thread to update the registry file.
357
486
  def save_registry(filelist)
358
- # TODO because of threading, processed values and regsaved are not thread safe, they can change as instance variable @!
487
+ # Because of threading, processed values and regsaved are not thread safe; as instance variables (@) they can change underneath us. Most of the time this is fine because the registry is the last resort, but be careful about corner cases!
359
488
  unless @processed == @regsaved
360
489
  @regsaved = @processed
361
- @logger.info(@pipe_id+" processed #{@processed} events, saving #{filelist.size} blobs and offsets to registry #{registry_path}")
362
- Thread.new {
490
+ unless (@busy_writing_registry)
491
+ Thread.new {
363
492
  begin
364
- @blob_client.create_block_blob(container, registry_path, Marshal.dump(filelist))
493
+ @busy_writing_registry = true
494
+ unless (@registry_local_path)
495
+ @blob_client.create_block_blob(container, registry_path, Marshal.dump(filelist))
496
+ @logger.info("processed #{@processed} events, saving #{filelist.size} blobs and offsets to remote registry #{registry_path}")
497
+ else
498
+ File.open(@registry_local_path+"/"+@pipe_id, 'w') { |file| file.write(Marshal.dump(filelist)) }
499
+ @logger.info("processed #{@processed} events, saving #{filelist.size} blobs and offsets to local registry #{registry_local_path+"/"+@pipe_id}")
500
+ end
501
+ @busy_writing_registry = false
502
+ @last = Time.now.to_i
365
503
  rescue
366
- @logger.error(@pipe_id+" Oh my, registry write failed, do you have write access?")
504
+ @logger.error("Oh my, registry write failed, do you have write access?")
367
505
  end
368
506
  }
507
+ else
508
+ @logger.info("Skipped writing the registry because previous write still in progress, it just takes long or may be hanging!")
509
+ end
369
510
  end
370
511
  end
371
512
 
513
+
372
514
  def learn_encapsulation
515
+ @logger.info("learn_encapsulation, this can be skipped by setting skip_learning => true. Or set both head_file and tail_file")
373
516
  # From one file, read first block and last block to learn head and tail
374
- # If the blobstorage can't be found, an error from farraday middleware will come with the text
375
- # org.jruby.ext.set.RubySet cannot be cast to class org.jruby.RubyFixnum
376
- blob = @blob_client.list_blobs(container, { maxresults: 1, prefix: @prefix }).first
377
- return if blob.nil?
378
- blocks = @blob_client.list_blob_blocks(container, blob.name)[:committed]
379
- @logger.debug(@pipe_id+" using #{blob.name} to learn the json header and tail")
380
- @head = @blob_client.get_blob(container, blob.name, start_range: 0, end_range: blocks.first.size-1)[1]
381
- @logger.debug(@pipe_id+" learned header: #{@head}")
382
- length = blob.properties[:content_length].to_i
383
- offset = length - blocks.last.size
384
- @tail = @blob_client.get_blob(container, blob.name, start_range: offset, end_range: length-1)[1]
385
- @logger.debug(@pipe_id+" learned tail: #{@tail}")
517
+ begin
518
+ blobs = @blob_client.list_blobs(container, { max_results: 3, prefix: @prefix})
519
+ blobs.each do |blob|
520
+ unless blob.name == registry_path
521
+ begin
522
+ blocks = @blob_client.list_blob_blocks(container, blob.name)[:committed]
523
+ if blocks.first.name.start_with?('A00')
524
+ @logger.debug("using #{blob.name}/#{blocks.first.name} to learn the json header")
525
+ @head = @blob_client.get_blob(container, blob.name, start_range: 0, end_range: blocks.first.size-1)[1]
526
+ end
527
+ if blocks.last.name.start_with?('Z00')
528
+ @logger.debug("using #{blob.name}/#{blocks.last.name} to learn the json footer")
529
+ length = blob.properties[:content_length].to_i
530
+ offset = length - blocks.last.size
531
+ @tail = @blob_client.get_blob(container, blob.name, start_range: offset, end_range: length-1)[1]
532
+ @logger.debug("learned tail: #{@tail}")
533
+ end
534
+ rescue Exception => e
535
+ @logger.info("learn json one of the attempts failed #{e.message}")
536
+ end
537
+ end
538
+ end
539
+ rescue Exception => e
540
+ @logger.info("learn json header and footer failed because #{e.message}")
541
+ end
386
542
  end
387
543
 
388
544
  def resource(str)
@@ -1,6 +1,6 @@
1
1
  Gem::Specification.new do |s|
2
2
  s.name = 'logstash-input-azure_blob_storage'
3
- s.version = '0.11.1'
3
+ s.version = '0.11.6'
4
4
  s.licenses = ['Apache-2.0']
5
5
  s.summary = 'This logstash plugin reads and parses data from Azure Storage Blobs.'
6
6
  s.description = <<-EOF
@@ -22,6 +22,6 @@ EOF
22
22
  # Gem dependencies
23
23
  s.add_runtime_dependency 'logstash-core-plugin-api', '~> 2.1'
24
24
  s.add_runtime_dependency 'stud', '~> 0.0.23'
25
- s.add_runtime_dependency 'azure-storage-blob', '~> 1.0'
26
- s.add_development_dependency 'logstash-devutils', '~> 1.0', '>= 1.0.0'
25
+ s.add_runtime_dependency 'azure-storage-blob', '~> 1.1'
26
+ #s.add_development_dependency 'logstash-devutils', '~> 2'
27
27
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: logstash-input-azure_blob_storage
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.11.1
4
+ version: 0.11.6
5
5
  platform: ruby
6
6
  authors:
7
7
  - Jan Geertsma
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2019-11-18 00:00:00.000000000 Z
11
+ date: 2021-02-11 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  requirement: !ruby/object:Gem::Requirement
@@ -17,8 +17,8 @@ dependencies:
17
17
  - !ruby/object:Gem::Version
18
18
  version: '2.1'
19
19
  name: logstash-core-plugin-api
20
- prerelease: false
21
20
  type: :runtime
21
+ prerelease: false
22
22
  version_requirements: !ruby/object:Gem::Requirement
23
23
  requirements:
24
24
  - - "~>"
@@ -31,8 +31,8 @@ dependencies:
31
31
  - !ruby/object:Gem::Version
32
32
  version: 0.0.23
33
33
  name: stud
34
- prerelease: false
35
34
  type: :runtime
35
+ prerelease: false
36
36
  version_requirements: !ruby/object:Gem::Requirement
37
37
  requirements:
38
38
  - - "~>"
@@ -43,35 +43,15 @@ dependencies:
43
43
  requirements:
44
44
  - - "~>"
45
45
  - !ruby/object:Gem::Version
46
- version: '1.0'
46
+ version: '1.1'
47
47
  name: azure-storage-blob
48
- prerelease: false
49
48
  type: :runtime
50
- version_requirements: !ruby/object:Gem::Requirement
51
- requirements:
52
- - - "~>"
53
- - !ruby/object:Gem::Version
54
- version: '1.0'
55
- - !ruby/object:Gem::Dependency
56
- requirement: !ruby/object:Gem::Requirement
57
- requirements:
58
- - - ">="
59
- - !ruby/object:Gem::Version
60
- version: 1.0.0
61
- - - "~>"
62
- - !ruby/object:Gem::Version
63
- version: '1.0'
64
- name: logstash-devutils
65
49
  prerelease: false
66
- type: :development
67
50
  version_requirements: !ruby/object:Gem::Requirement
68
51
  requirements:
69
- - - ">="
70
- - !ruby/object:Gem::Version
71
- version: 1.0.0
72
52
  - - "~>"
73
53
  - !ruby/object:Gem::Version
74
- version: '1.0'
54
+ version: '1.1'
75
55
  description: " This gem is a Logstash plugin. It reads and parses data from Azure\
76
56
  \ Storage Blobs. The azure_blob_storage is a reimplementation to replace azureblob\
77
57
  \ from azure-diagnostics-tools/Logstash. It can deal with larger volumes and partial\