logstash-input-azure_blob_storage 0.12.6 → 0.12.8

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: b50189c380606c6fdb8b7f7216fe20d15c0d410f1c1f6670211baf25baa567ca
4
- data.tar.gz: 189c80c15720ec9a85b8bb223a5ae7e4666fd0ebd6a96946f201bee96cf3dafc
3
+ metadata.gz: 6226b48f09b69ea1fe5d5e65197cf87daed475a2dff3aecc1ff30b1c921d4e7e
4
+ data.tar.gz: 9ac324158bddc908f107663925a27ff289eb7b264293da88218a825d66c74d74
5
5
  SHA512:
6
- metadata.gz: 599ca22fd813634d3ffd5fbbef0361605fd7611ea4050bc85e30c06fe97dbfe6dcd879ee092573e8a94229435d25c7cef71255bc72f33ea3d4813de987600e4c
7
- data.tar.gz: 53cc0e73c25323ba891e90a820c679071516187d641ed2c5dd5810a5bbb9654c2cf67c6239d400b58d8786c4cc4737aaa54b0fc1f145b4136ebf1f6b0203a00d
6
+ metadata.gz: 6cdd2d17fd57adc43b0c8e7354cbf396243b4bf691e8ef12d757c2c9dc515f9711ecbe9c64495b0d6f50040a28af98af2b641224c03dc83c3c4db9919ef1fb77
7
+ data.tar.gz: e1a71cfbe35af0d878374dcce499096331c82de867d86fb6ea3f4c876e1cc24f8b0fb59087b112012989c358ec9f238159564d17d9433ca1899a776a1c311683
data/CHANGELOG.md CHANGED
@@ -1,7 +1,17 @@
1
- ## PROBABLY 0.12.4 is the most stable version until I sort out when and why JSON Parse errors happen
2
- Join the discussion if you have something to share!
3
- https://github.com/janmg/logstash-input-azure_blob_storage/issues/34
4
-
1
+ ## 0.12.8
2
+ - support append blob (use codec json_lines and logtype raw)
3
+ - change the default head and tail to an empty string, unless the logtype is nsgflowlog
4
+ - cleanjson configuration parameter to clean the json stream of faulty characters to prevent parse errors
5
+ - catch ContainerNotFound, print error message in log and sleep interval time.
6
+
7
+ ## 0.12.7
8
+ - rewrote partial_read, now the occasional json parse errors should be fixed by reading only committed blocks.
9
+ (This may also have been related to a second partial_read where the offset wasn't updated correctly?)
10
+ - used the new header and tail block names, should now learn the header and footer automatically again?
11
+ - added addall to the configurations to add system, mac, category, time, operation to the output
12
+ - added optional environment configuration option
13
+ - removed the date, which was always set to ---
14
+ - made a start on event rewriting for ECS compatibility
5
15
 
6
16
  ## 0.12.6
7
17
  - Fixed the 0.12.5 exception handling, it actually caused a warning to become a fatal pipeline crashing error
data/README.md CHANGED
@@ -8,6 +8,14 @@ For problems or feature requests with this specific plugin, raise a github issue
8
8
  This plugin can read from Azure Storage Blobs, for instance JSON diagnostics logs for NSG flow logs or LINE based accesslogs from App Services.
9
9
  [Azure Blob Storage](https://azure.microsoft.com/en-us/services/storage/blobs/)
10
10
 
11
+ ## Alternatives
12
+ This plugin was inspired by the Azure diagnostics tools, but should work better for larger numbers of files. The configurations are not compatible: azureblob refers to the diagnostics tools plugin, while this plugin is configured as azure_blob_storage.
13
+ https://github.com/Azure/azure-diagnostics-tools/tree/master/Logstash/logstash-input-azureblob
14
+
15
+ There is also a Filebeat plugin that may work in the future:
16
+ https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-azure-blob-storage.html
17
+
18
+ ## Inner working
11
19
  The plugin depends on the [Ruby library azure-storage-blob](https://rubygems.org/gems/azure-storage-blob/versions/1.1.0) from Microsoft, that depends on Faraday for the HTTPS connection to Azure.
12
20
 
13
21
  The plugin executes the following steps
@@ -42,9 +50,11 @@ input {
42
50
  ## Additional Configuration
43
51
  The registry keeps track of files in the storage account, their size and how many bytes have been processed. Files can grow and the added part will be processed as a partial file. The registry is saved to disk every interval.
44
52
 
53
+ The interval also defines when a new round of listing files and processing data happens. The NSGFLOWLOGs are written every minute into a new block of the hourly blob. This data can be read partially, because the plugin knows the JSON head and tail, removes the leading comma and fixes the JSON before parsing new events.
54
+
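+ As an illustration, stitching a partially read nsgflowlog chunk back into valid JSON works roughly like this (a simplified sketch with made-up sample data, not the exact plugin code):
+ ```
+ # head and tail as learned from the blob's first and last block, or configured
+ head = '{"records":['
+ tail = ']}'
+ # hypothetical chunk read from the previous offset up to the last committed block;
+ # it starts with the comma that separated it from the previous record
+ chunk = ',{"time":"2023-07-15T00:01:00Z","category":"NetworkSecurityGroupFlowEvent"}]}'
+ # strip the leading comma and put the head back so the result parses again
+ json = head + chunk.sub(/\A,/, '')
+ # if the blob grew after listing, the tail may be missing and has to be appended
+ json += tail unless json.end_with?(tail)
+ puts json
+ ```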
45
55
  The registry_create_policy determines at the start of the pipeline if processing should resume from the last known unprocessed file, or to start_fresh ignoring old files and start only processing new events that came after the start of the pipeline. Or start_over to process all the files ignoring the registry.
46
56
 
47
- interval defines the minimum time the registry should be saved to the registry file (by default to 'data/registry.dat'), this is only needed in case the pipeline dies unexpectedly. During a normal shutdown the registry is also saved.
57
+ interval defines the minimum time between saves of the registry to the registry file, by default 'data/registry.dat' in the storageaccount, but it can also be kept on the server running logstash by setting registry_local_path. The registry is also kept in memory; the registry file is only needed in case the pipeline dies unexpectedly. During a normal shutdown the registry is also saved.
48
58
 
49
59
  When registry_local_path is set to a directory, the registry is saved on the logstash server in that directory. The filename is the pipe.id
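+ For reference, the registry is essentially a hash of blob names to offsets and lengths, serialized with Ruby's Marshal. A simplified sketch (the path, pipeline id and blob name below are made up for the example):
+ ```
+ require 'fileutils'
+
+ # blob name => how far it has been processed and how big it was at listing time
+ registry = {
+   'resourceId=/example/PT1H.json' => { :offset => 2576, :length => 4888 }
+ }
+
+ # save it the way registry_local_path does, one file per pipeline id
+ registry_local_path = '/usr/share/logstash/plugin'
+ pipe_id = 'example-pipeline'
+ FileUtils.mkdir_p(registry_local_path)
+ File.write("#{registry_local_path}/#{pipe_id}", Marshal.dump(registry))
+
+ # resume later by loading it back
+ registry = Marshal.load(File.read("#{registry_local_path}/#{pipe_id}"))
+ ```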
50
60
 
@@ -66,13 +76,15 @@ The pipeline can be started in several ways.
66
76
  ```
67
77
  - As managed pipeline from Kibana
68
78
 
69
- Logstash itself (so not specific to this plugin) has a feature where multiple instances can run on the same system. The default TCP port is 9600, but if it's already in use it will use 9601 (and up). To update a config file on a running instance on the commandline you can add the argument --config.reload.automatic and if you modify the files that are in the pipeline.yml you can send a SIGHUP channel to reload the pipelines where the config was changed.
79
+ Logstash itself (so not specific to this plugin) has a feature where multiple instances can run on the same system. The default TCP port is 9600, but if it's already in use it will use 9601 (and up); this is probably no longer true from v8 onward. To reload a config file on a running instance you can add the command-line argument --config.reload.automatic, and if you modify the files listed in pipelines.yml you can send a SIGHUP signal to reload the pipelines where the config was changed.
70
80
  [https://www.elastic.co/guide/en/logstash/current/reloading-config.html](https://www.elastic.co/guide/en/logstash/current/reloading-config.html)
71
81
 
72
82
  ## Internal Working
73
83
  When the plugin is started, it will read all the filenames and sizes in the blob store, excluding the directories of files that are excluded by the "path_filters". After every interval it will write a registry to the storageaccount to save the information of how many bytes per blob (file) are read and processed. After all files are processed and at least one interval has passed, a new file list is generated and a worklist is constructed that will be processed. When a file has already been processed before, partial files are read from the offset to the filesize at the time of the file listing. If the codec is JSON, partial files will have the header and tail added; they can be configured. If logtype is nsgflowlog, the plugin will split the flows into individual tuple events. The logtype wadiis may in the future be used to process the grok formats to split into log lines. Any other format is fed into the queue as one event per file or partial file. It's then up to the filter to split and mutate the file format.
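+ Conceptually, the worklist ends up containing the blobs whose registered offset is still smaller than their listed length; an illustrative sketch, not the plugin's exact code:
+ ```
+ filelist = {
+   'new.json'      => { :offset => 0,   :length => 1000 },  # new file, read in full
+   'grown.json'    => { :offset => 800, :length => 1200 },  # grown file, partial read from 800
+   'finished.json' => { :offset => 500, :length => 500 }    # already done, skipped
+ }
+ worklist = filelist.select { |name, file| file[:offset] < file[:length] }
+ worklist.each do |name, file|
+   puts "process #{name} from #{file[:offset]} to #{file[:length]}"
+ end
+ ```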
74
84
 
75
- By default the root of the json message is named "message" so you can modify the content in the filter block
85
+ By default the root of the json message is named "message"; you can modify the content in the filter block
86
+
87
+ Additional fields can be enabled with addfilename and addall; ecs_compatibility is not yet supported.
76
88
 
77
89
  The configurations and the rest of the code are in [https://github.com/janmg/logstash-input-azure_blob_storage/tree/master/lib/logstash/inputs](lib/logstash/inputs) [https://github.com/janmg/logstash-input-azure_blob_storage/blob/master/lib/logstash/inputs/azure_blob_storage.rb#L10](azure_blob_storage.rb)
78
90
 
@@ -130,7 +142,7 @@ filter {
130
142
  }
131
143
 
132
144
  output {
133
- stdout { }
145
+ stdout { codec => rubydebug }
134
146
  }
135
147
 
136
148
  output {
@@ -139,24 +151,37 @@ output {
139
151
  index => "nsg-flow-logs-%{+xxxx.ww}"
140
152
  }
141
153
  }
154
+
155
+ output {
156
+ file {
157
+ path => "/tmp/abuse.txt"
158
+ codec => line { format => "%{decision} %{flowstate} %{src_ip} %{dst_port}"}
159
+ }
160
+ }
161
+
142
162
  ```
143
163
  A more elaborate input configuration example
144
164
  ```
145
165
  input {
146
166
  azure_blob_storage {
147
167
  codec => "json"
148
- storageaccount => "yourstorageaccountname"
149
- access_key => "Ba5e64c0d3=="
168
+ # storageaccount => "yourstorageaccountname"
169
+ # access_key => "Ba5e64c0d3=="
170
+ connection_string => "DefaultEndpointsProtocol=https;AccountName=yourstorageaccountname;AccountKey=Ba5e64c0d3==;EndpointSuffix=core.windows.net"
150
171
  container => "insights-logs-networksecuritygroupflowevent"
151
172
  logtype => "nsgflowlog"
152
173
  prefix => "resourceId=/"
153
174
  path_filters => ['**/*.json']
154
175
  addfilename => true
176
+ addall => true
177
+ environment => "dev-env"
155
178
  registry_create_policy => "resume"
156
179
  registry_local_path => "/usr/share/logstash/plugin"
157
180
  interval => 300
158
181
  debug_timer => true
159
- debug_until => 100
182
+ debug_until => 1000
183
+ # registry_create_policy => "start_over" would reprocess all files, ignoring the registry
160
185
  }
161
186
  }
162
187
 
@@ -167,6 +192,20 @@ output {
167
192
  }
168
193
  }
169
194
  ```
195
+
196
+ Another for json_lines on append_blobs
197
+ ```
198
+ input {
199
+ azure_blob_storage {
200
+ codec => json_lines {
201
+ delimiter => "\n"
202
+ charset => "UTF-8"
203
+ }
204
+ # below options are optional
205
+ logtype => "raw"
206
+ append => true
207
+ cleanjson => true
208
+ }
+ }
+ ```
170
209
  The configuration documentation is in the first 100 lines of the code
171
210
  [GITHUB/janmg/logstash-input-azure_blob_storage/blob/master/lib/logstash/inputs/azure_blob_storage.rb](https://github.com/janmg/logstash-input-azure_blob_storage/blob/master/lib/logstash/inputs/azure_blob_storage.rb)
172
211
 
@@ -211,5 +250,9 @@ filter {
211
250
  remove_field => ["timestamp"]
212
251
  }
213
252
  }
253
+
254
+ output {
255
+ stdout { codec => rubydebug }
256
+ }
214
257
  ```
215
258
 
@@ -17,14 +17,16 @@ require 'json'
17
17
  # D672f4bbd95a04209b00dc05d899e3cce 2576 json objects for 1st minute
18
18
  # D7fe0d4f275a84c32982795b0e5c7d3a1 2312 json objects for 2nd minute
19
19
  # Z00000000000000000000000000000000 2 ]}
20
-
20
+ #
21
+ # The azure-storage-ruby library connects to the storageaccount and the files are read through get_blob. For partial reads the options with start and end ranges are used.
22
+ # https://github.com/Azure/azure-storage-ruby/blob/master/blob/lib/azure/storage/blob/blob.rb#L89
23
+ #
21
24
  # A storage account has by default a globally unique name, {storageaccount}.blob.core.windows.net, which is a CNAME to Azure's blob servers blob.*.store.core.windows.net. A storage account has containers and those have a directory and blobs (like files). Blobs have one or more blocks. After writing the blocks, they can be committed. Some Azure diagnostics can send events to an EventHub that can be parsed through the plugin logstash-input-azure_event_hubs, but for the events that are only stored in a storage account, use this plugin. The original logstash-input-azureblob from azure-diagnostics-tools is great for low volumes, but it suffers from an outdated client, slow reads, lease locking issues and json parse errors.
22
25
 
23
-
24
26
  class LogStash::Inputs::AzureBlobStorage < LogStash::Inputs::Base
25
27
  config_name "azure_blob_storage"
26
28
 
27
- # If undefined, Logstash will complain, even if codec is unused. The codec for nsgflowlog is "json" and the for WADIIS and APPSERVICE is "line".
29
+ # If undefined, Logstash will complain, even if codec is unused. The codec for nsgflowlog is "json" ("json_lines" also works) and for WADIIS and APPSERVICE it is "line".
28
30
  default :codec, "json"
29
31
 
30
32
  # logtype can be nsgflowlog, wadiis, appservice or raw. The default is raw, where files are read and added as one event. If the file grows, the next interval the file is read from the offset, so that the delta is sent as another event. In raw mode, further processing has to be done in the filter block. If the logtype is specified, this plugin will split and mutate and add individual events to the queue.
@@ -66,7 +68,7 @@ class LogStash::Inputs::AzureBlobStorage < LogStash::Inputs::Base
66
68
  # when set to `start_fresh`, it will read log files that are created or appended since this start of the pipeline.
67
69
  config :registry_create_policy, :validate => ['resume','start_over','start_fresh'], :required => false, :default => 'resume'
68
70
 
69
- # The interval is used to save the registry regularly, when new events have have been processed. It is also used to wait before listing the files again and substracting the registry of already processed files to determine the worklist.
71
+ # The interval is used to save the registry regularly, when new events have been processed. It is also used to wait before listing the files again and subtracting the registry of already processed files to determine the worklist.
70
72
  # waiting time in seconds until processing the next batch. NSGFLOWLOGS append a block per minute, so use multiples of 60 seconds, 300 for 5 minutes, 600 for 10 minutes. The registry is also saved after every interval.
71
73
  # Partial reading starts from the offset and reads until the end, so the starting tag is prepended
72
74
  config :interval, :validate => :number, :default => 60
@@ -74,6 +76,12 @@ class LogStash::Inputs::AzureBlobStorage < LogStash::Inputs::Base
74
76
  # add the filename as a field into the events
75
77
  config :addfilename, :validate => :boolean, :default => false, :required => false
76
78
 
79
+ # add the configured environment value as a field to the events
80
+ config :environment, :validate => :string, :required => false
81
+
82
+ # add all resource details (time, system, mac, category, operation) to the events
83
+ config :addall, :validate => :boolean, :default => false, :required => false
84
+
77
85
  # debug_until will at the creation of the pipeline for a maximum amount of processed messages shows 3 types of log printouts including processed filenames. After a number of events, the plugin will stop logging the events and continue silently. This is a lightweight alternative to switching the loglevel from info to debug or even trace to see what the plugin is doing and how fast at the start of the plugin. A good value would be approximately 3x the amount of events per file. For instance 6000 events.
78
86
  config :debug_until, :validate => :number, :default => 0, :required => false
79
87
 
@@ -87,10 +95,14 @@ class LogStash::Inputs::AzureBlobStorage < LogStash::Inputs::Base
87
95
  config :skip_learning, :validate => :boolean, :default => false, :required => false
88
96
 
89
97
  # The string that starts the JSON. Only needed when the codec is JSON. When partial file are read, the result will not be valid JSON unless the start and end are put back. the file_head and file_tail are learned at startup, by reading the first file in the blob_list and taking the first and last block, this would work for blobs that are appended like nsgflowlogs. The configuration can be set to override the learning. In case learning fails and the option is not set, the default is to use the 'records' as set by nsgflowlogs.
90
- config :file_head, :validate => :string, :required => false, :default => '{"records":['
98
+ config :file_head, :validate => :string, :required => false, :default => ''
91
99
  # The string that ends the JSON
92
- config :file_tail, :validate => :string, :required => false, :default => ']}'
100
+ config :file_tail, :validate => :string, :required => false, :default => ''
93
101
 
102
+ # inspect the bytes and remove faulty characters
103
+ config :cleanjson, :validate => :boolean, :default => false, :required => false
104
+
105
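+ # treat the blob as an append blob and read it from the offset directly instead of listing blocks; also set automatically when an append blob is detected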
+ config :append, :validate => :boolean, :default => false, :required => false
94
106
  # By default it will watch every file in the storage container. The prefix option is a simple filter that only processes files with a path that starts with that value.
95
107
  # For NSGFLOWLOGS a path starts with "resourceId=/". This would only be needed to exclude other paths that may be written in the same container. The registry file will be excluded.
96
108
  # You may also configure multiple paths. See an example on the <<array,Logstash configuration page>>.
@@ -110,6 +122,7 @@ public
110
122
  @logger.info("If this plugin doesn't work, please raise an issue in https://github.com/janmg/logstash-input-azure_blob_storage")
111
123
  @busy_writing_registry = Mutex.new
112
124
  # TODO: consider multiple readers, so add pipeline @id or use logstash-to-logstash communication?
125
+ # For now it's difficult because the plugin would then have to synchronize the worklist
113
126
  end
114
127
 
115
128
 
@@ -120,41 +133,10 @@ public
120
133
  @regsaved = @processed
121
134
 
122
135
  connect
123
-
124
136
  @registry = Hash.new
125
- if registry_create_policy == "resume"
126
- for counter in 1..3
127
- begin
128
- if (!@registry_local_path.nil?)
129
- unless File.file?(@registry_local_path+"/"+@pipe_id)
130
- @registry = Marshal.load(@blob_client.get_blob(container, registry_path)[1])
131
- #[0] headers [1] responsebody
132
- @logger.info("migrating from remote registry #{registry_path}")
133
- else
134
- if !Dir.exist?(@registry_local_path)
135
- FileUtils.mkdir_p(@registry_local_path)
136
- end
137
- @registry = Marshal.load(File.read(@registry_local_path+"/"+@pipe_id))
138
- @logger.info("resuming from local registry #{registry_local_path+"/"+@pipe_id}")
139
- end
140
- else
141
- @registry = Marshal.load(@blob_client.get_blob(container, registry_path)[1])
142
- #[0] headers [1] responsebody
143
- @logger.info("resuming from remote registry #{registry_path}")
144
- end
145
- break
146
- rescue Exception => e
147
- @logger.error("caught: #{e.message}")
148
- @registry.clear
149
- @logger.error("loading registry failed for attempt #{counter} of 3")
150
- end
151
- end
152
- end
153
- # read filelist and set offsets to file length to mark all the old files as done
154
- if registry_create_policy == "start_fresh"
155
- @registry = list_blobs(true)
156
- save_registry()
157
- @logger.info("starting fresh, writing a clean registry to contain #{@registry.size} blobs/files")
137
+ load_registry()
138
+ @registry.each do |name, file|
139
+ @logger.info("offset: #{file[:offset]} length: #{file[:length]}")
158
140
  end
159
141
 
160
142
  @is_json = false
@@ -166,22 +148,29 @@ public
166
148
  @is_json_line = true
167
149
  end
168
150
  end
151
+
152
+
169
153
  @head = ''
170
154
  @tail = ''
171
- # if codec=json sniff one files blocks A and Z to learn file_head and file_tail
172
155
  if @is_json
156
+ # if codec=json sniff one files blocks A and Z to learn file_head and file_tail
157
+ if @logtype == 'nsgflowlog'
158
+ @head = '{"records":['
159
+ @tail = ']}'
160
+ end
173
161
  if file_head
174
162
  @head = file_head
175
163
  end
176
164
  if file_tail
177
165
  @tail = file_tail
178
166
  end
179
- if file_head and file_tail and !skip_learning
167
+ if !skip_learning
180
168
  learn_encapsulation
181
169
  end
182
- @logger.info("head will be: #{@head} and tail is set to #{@tail}")
170
+ @logger.info("head will be: '#{@head}' and tail is set to: '#{@tail}'")
183
171
  end
184
172
 
173
+
185
174
  filelist = Hash.new
186
175
  worklist = Hash.new
187
176
  @last = start = Time.now.to_i
@@ -198,24 +187,27 @@ public
198
187
  # load the registry, compare it's offsets to file list, set offset to 0 for new files, process the whole list and if finished within the interval wait for next loop,
199
188
  # TODO: sort by timestamp ?
200
189
  #filelist.sort_by(|k,v|resource(k)[:date])
201
- worklist.clear
202
190
  filelist.clear
203
191
 
204
192
  # Listing all the files
205
193
  filelist = list_blobs(false)
194
+ if (@debug_until > @processed) then
195
+ @registry.each do |name, file|
196
+ @logger.info("#{name} offset: #{file[:offset]} length: #{file[:length]}")
197
+ end
198
+ end
206
199
  filelist.each do |name, file|
207
200
  off = 0
208
201
  if @registry.key?(name) then
209
- begin
210
- off = @registry[name][:offset]
211
- rescue Exception => e
212
- @logger.error("caught: #{e.message} while reading #{name}")
213
- end
202
+ begin
203
+ off = @registry[name][:offset]
204
+ rescue Exception => e
205
+ @logger.error("caught: #{e.message} while reading #{name}")
206
+ end
214
207
  end
215
208
  @registry.store(name, { :offset => off, :length => file[:length] })
216
209
  if (@debug_until > @processed) then @logger.info("2: adding offsets: #{name} #{off} #{file[:length]}") end
217
210
  end
218
- # size nilClass when the list doesn't grow?!
219
211
 
220
212
  # clean registry of files that are not in the filelist
221
213
  @registry.each do |name,file|
@@ -234,14 +226,16 @@ public
234
226
 
235
227
  # Start of processing
236
228
  # This would be ideal for threading since it's IO intensive, would be nice with a ruby native ThreadPool
229
+ # pool = Concurrent::FixedThreadPool.new(5) # 5 threads
230
+ #pool.post do
231
+ # some parallel work
232
+ #end
237
233
  if (worklist.size > 0) then
238
234
  worklist.each do |name, file|
239
235
  start = Time.now.to_i
240
236
  if (@debug_until > @processed) then @logger.info("3: processing #{name} from #{file[:offset]} to #{file[:length]}") end
241
237
  size = 0
242
238
  if file[:offset] == 0
243
- # This is where Sera4000 issue starts
244
- # For an append blob, reading full and crashing, retry, last_modified? ... lenght? ... committed? ...
245
239
  # length and skip reg value
246
240
  if (file[:length] > 0)
247
241
  begin
@@ -260,55 +254,72 @@ public
260
254
  delta_size = 0
261
255
  end
262
256
  else
263
- chunk = partial_read_json(name, file[:offset], file[:length])
264
- delta_size = chunk.size
265
- @logger.debug("partial file #{name} from #{file[:offset]} to #{file[:length]}")
257
+ chunk = partial_read(name, file[:offset])
258
+ delta_size = chunk.size - @head.length - 1
266
259
  end
267
260
 
268
- if logtype == "nsgflowlog" && @is_json
269
- # skip empty chunks
270
- unless chunk.nil?
271
- res = resource(name)
272
- begin
273
- fingjson = JSON.parse(chunk)
274
- @processed += nsgflowlog(queue, fingjson, name)
275
- @logger.debug("Processed #{res[:nsg]} [#{res[:date]}] #{@processed} events")
276
- rescue JSON::ParserError => e
277
- @logger.error("parse error #{e.message} on #{res[:nsg]} [#{res[:date]}] offset: #{file[:offset]} length: #{file[:length]}")
278
- if (@debug_until > @processed) then @logger.info("#{chunk}") end
279
- end
280
- end
281
- # TODO: Convert this to line based grokking.
282
- # TODO: ECS Compliance?
283
- elsif logtype == "wadiis" && !@is_json
284
- @processed += wadiislog(queue, name)
285
- else
286
- # Handle JSONLines format
287
- if !@chunk.nil? && @is_json_line
288
- newline_rindex = chunk.rindex("\n")
289
- if newline_rindex.nil?
290
- # No full line in chunk, skip it without updating the registry.
291
- # Expecting that the JSON line would be filled in at a subsequent iteration.
292
- next
293
- end
294
- chunk = chunk[0..newline_rindex]
295
- delta_size = chunk.size
261
+ #
262
+ # TODO! ... split out the logtypes and use individual methods
263
+ # how does a byte array chunk from json_lines get translated to strings/json/events
264
+ # should the byte array be converted to a multiline and then split? drawback need to know characterset and linefeed characters
265
+ # how does the json_line decoder work on byte arrays?
266
+ #
267
+ # so many questions
268
+
269
+ unless chunk.nil?
270
+ counter = 0
271
+ if @is_json
272
+ if logtype == "nsgflowlog"
273
+ res = resource(name)
274
+ begin
275
+ fingjson = JSON.parse(chunk)
276
+ @processed += nsgflowlog(queue, fingjson, name)
277
+ @logger.debug("Processed #{res[:nsg]} #{@processed} events")
278
+ rescue JSON::ParserError => e
279
+ @logger.error("parse error #{e.message} on #{res[:nsg]} offset: #{file[:offset]} length: #{file[:length]}")
280
+ if (@debug_until > @processed) then @logger.info("#{chunk}") end
281
+ end
282
+ else
283
+ begin
284
+ @codec.decode(chunk) do |event|
285
+ counter += 1
286
+ if @addfilename
287
+ event.set('filename', name)
288
+ end
289
+ decorate(event)
290
+ queue << event
291
+ end
292
+ @processed += counter
293
+ rescue Exception => e
294
+ @logger.error("codec exception: #{e.message} .. continue and pretend this never happened")
295
+ end
296
+ end
297
+ end
298
+
299
+ if logtype == "wadiis" && !@is_json
300
+ # TODO: Convert this to line based grokking.
301
+ @processed += wadiislog(queue, name)
296
302
  end
297
303
 
298
- counter = 0
299
- begin
300
- @codec.decode(chunk) do |event|
301
- counter += 1
302
- if @addfilename
303
- event.set('filename', name)
304
+ if @is_json_line
305
+ # parse one line at a time and dump it in the chunk?
306
+ lines = chunk.to_s
307
+ if cleanjson
308
+ @logger.info("cleaning in progress")
309
+ lines = lines.chars.select(&:valid_encoding?).join
310
+ #lines.delete "\\"
311
+ #lines.scrub{|bytes| '<'+bytes.unpack('H*')[0]+'>' }
312
+ end
313
+ begin
314
+ @codec.decode(lines) do |event|
315
+ counter += 1
316
+ queue << event
304
317
  end
305
- decorate(event)
306
- queue << event
318
+ @processed += counter
319
+ rescue Exception => e
320
+ # todo: fix codec_lines exception: no implicit conversion of Array into String
321
+ @logger.error("json_lines codec exception: #{e.message} .. continue and pretend this never happened")
307
322
  end
308
- @processed += counter
309
- rescue Exception => e
310
- @logger.error("codec exception: #{e.message} .. will continue and pretend this never happened")
311
- @logger.debug("#{chunk}")
312
323
  end
313
324
  end
314
325
 
@@ -348,6 +359,24 @@ public
348
359
 
349
360
 
350
361
  private
362
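+ # list the blobs and refresh the registry offsets and lengths for the current listing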
+ def list_files
363
+ filelist = list_blobs(false)
364
+ filelist.each do |name, file|
365
+ off = 0
366
+ if @registry.key?(name) then
367
+ begin
368
+ off = @registry[name][:offset]
369
+ rescue Exception => e
370
+ @logger.error("caught: #{e.message} while reading #{name}")
371
+ end
372
+ end
373
+ @registry.store(name, { :offset => off, :length => file[:length] })
374
+ if (@debug_until > @processed) then @logger.info("2: adding offsets: #{name} #{off} #{file[:length]}") end
375
+ end
376
+ return filelist
377
+ end
378
+ # size nilClass when the list doesn't grow?!
379
+
351
380
  def connect
352
381
  # Try in this order to access the storageaccount
353
382
  # 1. storageaccount / sas_token
@@ -378,11 +407,48 @@ private
378
407
  # end
379
408
  end
380
409
  end
410
+ # uses @registry_create_policy, @registry_local_path, @container and registry_path
411
+ def load_registry()
412
+ if @registry_create_policy == "resume"
413
+ for counter in 1..3
414
+ begin
415
+ if (!@registry_local_path.nil?)
416
+ unless File.file?(@registry_local_path+"/"+@pipe_id)
417
+ @registry = Marshal.load(@blob_client.get_blob(@container, registry_path)[1])
418
+ #[0] headers [1] responsebody
419
+ @logger.info("migrating from remote registry #{path}")
420
+ else
421
+ if !Dir.exist?(@registry_local_path)
422
+ FileUtils.mkdir_p(@registry_local_path)
423
+ end
424
+ @registry = Marshal.load(File.read(@registry_local_path+"/"+@pipe_id))
425
+ @logger.info("resuming from local registry #{@registry_local_path+"/"+@pipe_id}")
426
+ end
427
+ else
428
+ @registry = Marshal.load(@blob_client.get_blob(@container, registry_path)[1])
429
+ #[0] headers [1] responsebody
430
+ @logger.info("resuming from remote registry #{path}")
431
+ end
432
+ break
433
+ rescue Exception => e
434
+ @logger.error("caught: #{e.message}")
435
+ @registry.clear
436
+ @logger.error("loading registry failed for attempt #{counter} of 3")
437
+ end
438
+ end
439
+ end
440
+ # read filelist and set offsets to file length to mark all the old files as done
441
+ if @registry_create_policy == "start_fresh"
442
+ @registry = list_blobs(true)
443
+ #save_registry()
444
+ @logger.info("starting fresh, with a clean registry containing #{@registry.size} blobs/files")
445
+ end
446
+ end
381
447
 
382
448
  def full_read(filename)
383
449
  tries ||= 2
384
450
  begin
385
- return @blob_client.get_blob(container, filename)[1]
451
+ return @blob_client.get_blob(@container, filename)[1]
386
452
  rescue Exception => e
387
453
  @logger.error("caught: #{e.message} for full_read")
388
454
  if (tries -= 1) > 0
@@ -393,19 +459,56 @@ private
393
459
  end
394
460
  end
395
461
  begin
396
- chuck = @blob_client.get_blob(container, filename)[1]
462
+ chuck = @blob_client.get_blob(@container, filename)[1]
397
463
  end
398
464
  return chuck
399
465
  end
400
466
 
401
- def partial_read_json(filename, offset, length)
402
- content = @blob_client.get_blob(container, filename, start_range: offset-@tail.length, end_range: length-1)[1]
403
- if content.end_with?(@tail)
404
- # the tail is part of the last block, so included in the total length of the get_blob
405
- return @head + strip_comma(content)
406
- else
407
- # when the file has grown between list_blobs and the time of partial reading, the tail will be wrong
408
- return @head + strip_comma(content[0...-@tail.length]) + @tail
467
+ def partial_read(blobname, offset)
468
+ # 1. read committed blocks, calculate length
469
+ # 2. calculate the offset to read
470
+ # 3. strip comma
471
+ # if json strip comma and fix head and tail
472
+ size = 0
473
+
474
+ begin
475
+ if @append
476
+ return @blob_client.get_blob(@container, blobname, start_range: offset-1)[1]
477
+ end
478
+ blocks = @blob_client.list_blob_blocks(@container, blobname)
479
+ blocks[:committed].each do |block|
480
+ size += block.size
481
+ end
482
+ # read the new blob blocks from the offset to the last committed size.
483
+ # if it is json, fix the head and tail
484
+ # the committed block at the end is the tail, so it must be subtracted from the read, then the comma stripped and the tail added.
485
+ # the -1 for the end_range is needed because offsets start at 0, so the last byte is at size-1
486
+
487
+ # should we first check committed, read, and then check committed again? no, only read the committed size
488
+ # should read the full content and then subtract the json tail
489
+
490
+ unless @is_json
491
+ return @blob_client.get_blob(@container, blobname, start_range: offset, end_range: size-1)[1]
492
+ else
493
+ content = @blob_client.get_blob(@container, blobname, start_range: offset-1, end_range: size-1)[1]
494
+ if content.end_with?(@tail)
495
+ return @head + strip_comma(content)
496
+ else
497
+ @logger.info("Fixed a tail! probably new committed blocks started appearing!")
498
+ # subtract the length of the tail and add the tail back, because the file grew. size was calculated at the block boundary, so replacing the last bytes with the tail should fix the problem
499
+ return @head + strip_comma(content[0...-@tail.length]) + @tail
500
+ end
501
+ end
502
+ rescue InvalidBlobType => ibt
503
+ @logger.error("caught #{ibt.message}. Setting BlobType to append")
504
+ @append = true
505
+ retry
506
+ rescue NoMethodError => nme
507
+ @logger.error("caught #{nme.message}. Setting append to true")
508
+ @append = true
509
+ retry
510
+ rescue Exception => e
511
+ @logger.error("caught #{e.message}")
409
512
  end
410
513
  end
411
514
 
@@ -422,8 +525,9 @@ private
422
525
  count=0
423
526
  begin
424
527
  json["records"].each do |record|
425
- res = resource(record["resourceId"])
426
- resource = { :subscription => res[:subscription], :resourcegroup => res[:resourcegroup], :nsg => res[:nsg] }
528
+ resource = resource(record["resourceId"])
529
+ # resource = { :subscription => res[:subscription], :resourcegroup => res[:resourcegroup], :nsg => res[:nsg] }
530
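+ # extra fields (time, system, mac, category, operation), merged into each event when addall is set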
+ extras = { :time => record["time"], :system => record["systemId"], :mac => record["macAddress"], :category => record["category"], :operation => record["operationName"] }
427
531
  @logger.trace(resource.to_s)
428
532
  record["properties"]["flows"].each do |flows|
429
533
  rule = resource.merge ({ :rule => flows["rule"]})
@@ -442,7 +546,18 @@ private
442
546
  if @addfilename
443
547
  ev.merge!( {:filename => name } )
444
548
  end
549
+ unless @environment.nil?
550
+ ev.merge!( {:environment => environment } )
551
+ end
552
+ if @addall
553
+ ev.merge!( extras )
554
+ end
555
+
556
+ # Add event to logstash queue
445
557
  event = LogStash::Event.new('message' => ev.to_json)
558
+ #if @ecs_compatibility != "disabled"
559
+ # event = ecs(event)
560
+ #end
446
561
  decorate(event)
447
562
  queue << event
448
563
  count+=1
@@ -493,26 +608,31 @@ private
493
608
  nextMarker = nil
494
609
  counter = 1
495
610
  loop do
496
- blobs = @blob_client.list_blobs(container, { marker: nextMarker, prefix: @prefix})
497
- blobs.each do |blob|
498
- # FNM_PATHNAME is required so that "**/test" can match "test" at the root folder
499
- # FNM_EXTGLOB allows you to use "test{a,b,c}" to match either "testa", "testb" or "testc" (closer to shell behavior)
500
- unless blob.name == registry_path
501
- if @path_filters.any? {|path| File.fnmatch?(path, blob.name, File::FNM_PATHNAME | File::FNM_EXTGLOB)}
502
- length = blob.properties[:content_length].to_i
503
- offset = 0
504
- if fill
505
- offset = length
611
+ begin
612
+ blobs = @blob_client.list_blobs(@container, { marker: nextMarker, prefix: @prefix})
613
+ blobs.each do |blob|
614
+ # FNM_PATHNAME is required so that "**/test" can match "test" at the root folder
615
+ # FNM_EXTGLOB allows you to use "test{a,b,c}" to match either "testa", "testb" or "testc" (closer to shell behavior)
616
+ unless blob.name == registry_path
617
+ if @path_filters.any? {|path| File.fnmatch?(path, blob.name, File::FNM_PATHNAME | File::FNM_EXTGLOB)}
618
+ length = blob.properties[:content_length].to_i
619
+ offset = 0
620
+ if fill
621
+ offset = length
622
+ end
623
+ files.store(blob.name, { :offset => offset, :length => length })
624
+ if (@debug_until > @processed) then @logger.info("1: list_blobs #{blob.name} #{offset} #{length}") end
506
625
  end
507
- files.store(blob.name, { :offset => offset, :length => length })
508
- if (@debug_until > @processed) then @logger.info("1: list_blobs #{blob.name} #{offset} #{length}") end
509
626
  end
510
627
  end
628
+ nextMarker = blobs.continuation_token
629
+ break unless nextMarker && !nextMarker.empty?
630
+ if (counter % 10 == 0) then @logger.info(" listing #{counter * 50000} files") end
631
+ counter+=1
632
+ rescue Exception => e
633
+ @logger.error("caught: #{e.message} while trying to list blobs")
634
+ return files
511
635
  end
512
- nextMarker = blobs.continuation_token
513
- break unless nextMarker && !nextMarker.empty?
514
- if (counter % 10 == 0) then @logger.info(" listing #{counter * 50000} files") end
515
- counter+=1
516
636
  end
517
637
  if @debug_timer
518
638
  @logger.info("list_blobs took #{Time.now.to_i - chrono} sec")
@@ -532,7 +652,7 @@ private
532
652
  begin
533
653
  @busy_writing_registry.lock
534
654
  unless (@registry_local_path)
535
- @blob_client.create_block_blob(container, registry_path, regdump)
655
+ @blob_client.create_block_blob(@container, registry_path, regdump)
536
656
  @logger.info("processed #{@processed} events, saving #{regsize} blobs and offsets to remote registry #{registry_path}")
537
657
  else
538
658
  File.open(@registry_local_path+"/"+@pipe_id, 'w') { |file| file.write(regdump) }
@@ -558,20 +678,20 @@ private
558
678
  @logger.info("learn_encapsulation, this can be skipped by setting skip_learning => true. Or set both head_file and tail_file")
559
679
  # From one file, read first block and last block to learn head and tail
560
680
  begin
561
- blobs = @blob_client.list_blobs(container, { max_results: 3, prefix: @prefix})
681
+ blobs = @blob_client.list_blobs(@container, { max_results: 3, prefix: @prefix})
562
682
  blobs.each do |blob|
563
683
  unless blob.name == registry_path
564
684
  begin
565
- blocks = @blob_client.list_blob_blocks(container, blob.name)[:committed]
566
- if blocks.first.name.start_with?('A00')
685
+ blocks = @blob_client.list_blob_blocks(@container, blob.name)[:committed]
686
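+ # match both the plain block id ('A00...'/'Z00...') and its Base64-encoded equivalent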
+ if ['A00000000000000000000000000000000','QTAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAw'].include?(blocks.first.name)
567
687
  @logger.debug("using #{blob.name}/#{blocks.first.name} to learn the json header")
568
- @head = @blob_client.get_blob(container, blob.name, start_range: 0, end_range: blocks.first.size-1)[1]
688
+ @head = @blob_client.get_blob(@container, blob.name, start_range: 0, end_range: blocks.first.size-1)[1]
569
689
  end
570
- if blocks.last.name.start_with?('Z00')
690
+ if ['Z00000000000000000000000000000000','WjAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAw'].include?(blocks.last.name)
571
691
  @logger.debug("using #{blob.name}/#{blocks.last.name} to learn the json footer")
572
692
  length = blob.properties[:content_length].to_i
573
693
  offset = length - blocks.last.size
574
- @tail = @blob_client.get_blob(container, blob.name, start_range: offset, end_range: length-1)[1]
694
+ @tail = @blob_client.get_blob(@container, blob.name, start_range: offset, end_range: length-1)[1]
575
695
  @logger.debug("learned tail: #{@tail}")
576
696
  end
577
697
  rescue Exception => e
@@ -586,15 +706,61 @@ private
586
706
 
587
707
  def resource(str)
588
708
  temp = str.split('/')
589
- date = '---'
590
- unless temp[9].nil?
591
- date = val(temp[9])+'/'+val(temp[10])+'/'+val(temp[11])+'-'+val(temp[12])+':00'
592
- end
593
- return {:subscription=> temp[2], :resourcegroup=>temp[4], :nsg=>temp[8], :date=>date}
709
+ #date = '---'
710
+ #unless temp[9].nil?
711
+ # date = val(temp[9])+'/'+val(temp[10])+'/'+val(temp[11])+'-'+val(temp[12])+':00'
712
+ #end
713
+ return {:subscription=> temp[2], :resourcegroup=>temp[4], :nsg=>temp[8]}
594
714
  end
595
715
 
596
716
  def val(str)
597
717
  return str.split('=')[1]
598
718
  end
599
-
600
719
  end # class LogStash::Inputs::AzureBlobStorage
720
+
721
+ # This is a start towards mapping NSG events to ECS fields ... it's complicated
722
+ =begin
723
+ def ecs(old)
724
+ # https://www.elastic.co/guide/en/ecs/current/ecs-field-reference.html
725
+ ecs = LogStash::Event.new()
726
+ ecs.set("ecs.version", "1.0.0")
727
+ ecs.set("@timestamp", old.timestamp)
728
+ ecs.set("cloud.provider", "azure")
729
+ ecs.set("cloud.account.id", old.get("[subscription]")
730
+ ecs.set("cloud.project.id", old.get("[environment]")
731
+ ecs.set("file.name", old.get("[filename]")
732
+ ecs.set("event.category", "network")
733
+ if old.get("[decision]") == "D"
734
+ ecs.set("event.type", "denied")
735
+ else
736
+ ecs.set("event.type", "allowed")
737
+ end
738
+ ecs.set("event.action", "")
739
+ ecs.set("rule.ruleset", old.get("[nsg]")
740
+ ecs.set("rule.name", old.get("[rule]")
741
+ ecs.set("trace.id", old.get("[protocol]")+"/"+old.get("[src_ip]")+":"+old.get("[src_port]")+"-"+old.get("[dst_ip]")+":"+old.get("[dst_port]")
742
+ # requires logic to match sockets and flip src/dst for outgoing.
743
+ ecs.set("host.mac", old.get("[mac]")
744
+ ecs.set("source.ip", old.get("[src_ip]")
745
+ ecs.set("source.port", old.get("[src_port]")
746
+ ecs.set("source.bytes", old.get("[srcbytes]")
747
+ ecs.set("source.packets", old.get("[src_pack]")
748
+ ecs.set("destination.ip", old.get("[dst_ip]")
749
+ ecs.set("destination.port", old.get("[dst_port]")
750
+ ecs.set("destination.bytes", old.get("[dst_bytes]")
751
+ ecs.set("destination.packets", old.get("[dst_packets]")
752
+ if old.get("[protocol]") = "U"
753
+ ecs.set("network.transport", "udp")
754
+ else
755
+ ecs.set("network.transport", "tcp")
756
+ end
757
+ if old.get("[decision]") == "I"
758
+ ecs.set("network.direction", "incoming")
759
+ else
760
+ ecs.set("network.direction", "outgoing")
761
+ end
762
+ ecs.set("network.bytes", old.get("[src_bytes]")+old.get("[dst_bytes]")
763
+ ecs.set("network.packets", old.get("[src_packets]")+old.get("[dst_packets]")
764
+ return ecs
765
+ end
766
+ =end
@@ -1,6 +1,6 @@
1
1
  Gem::Specification.new do |s|
2
2
  s.name = 'logstash-input-azure_blob_storage'
3
- s.version = '0.12.6'
3
+ s.version = '0.12.8'
4
4
  s.licenses = ['Apache-2.0']
5
5
  s.summary = 'This logstash plugin reads and parses data from Azure Storage Blobs.'
6
6
  s.description = <<-EOF
@@ -24,5 +24,5 @@ EOF
24
24
  s.add_runtime_dependency 'stud', '~> 0.0.23'
25
25
  s.add_runtime_dependency 'azure-storage-blob', '~> 2', '>= 2.0.3'
26
26
  s.add_development_dependency 'logstash-devutils', '~> 2.4'
27
- s.add_development_dependency 'rubocop', '~> 1.48'
27
+ s.add_development_dependency 'rubocop', '~> 1.50'
28
28
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: logstash-input-azure_blob_storage
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.12.6
4
+ version: 0.12.8
5
5
  platform: ruby
6
6
  authors:
7
7
  - Jan Geertsma
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2023-03-17 00:00:00.000000000 Z
11
+ date: 2023-07-15 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  requirement: !ruby/object:Gem::Requirement
@@ -77,7 +77,7 @@ dependencies:
77
77
  requirements:
78
78
  - - "~>"
79
79
  - !ruby/object:Gem::Version
80
- version: '1.48'
80
+ version: '1.50'
81
81
  name: rubocop
82
82
  prerelease: false
83
83
  type: :development
@@ -85,7 +85,7 @@ dependencies:
85
85
  requirements:
86
86
  - - "~>"
87
87
  - !ruby/object:Gem::Version
88
- version: '1.48'
88
+ version: '1.50'
89
89
  description: " This gem is a Logstash plugin. It reads and parses data from Azure\
90
90
  \ Storage Blobs. The azure_blob_storage is a reimplementation to replace azureblob\
91
91
  \ from azure-diagnostics-tools/Logstash. It can deal with larger volumes and partial\