logstash-input-azure_blob_storage 0.12.5 → 0.12.7

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 00e66bdb4eda73c6d9a4219034d6a33bbf4d8a1c8206f7f2e6ec39b414dd9d63
- data.tar.gz: 953fa1cc28b60e5a44d7575ead5428e6fd62c309ad1c9630a2f2cd73dac1ffc3
+ metadata.gz: 6bc1a46c4c6ae533e05c83f0e7cb90715cad7390a5cedb9b6e023c46f2e620d1
+ data.tar.gz: 520d7b5131a6b00b6de066a12cd93a99082c7af0bb7184df9f2bc9c8ca64babd
  SHA512:
- metadata.gz: a19ff34ae098f9bf115789b43781c4073268934c34fd69a2ae119ea844deffcd30d853aefc29e00fe4495858cbf336ec1c6dc0f2113c26239a47fcacfb73bb87
- data.tar.gz: 93ff0a91bfc54f8b159c80c9e0156064b3805166c0d83ae51fae09a8944e6f6b852ed72d269d5b1dd6e8561007a80042583fe2f7d7be20b09122c414c13ff94b
+ metadata.gz: 3c069008cfef9b08c4b9793b24538c9c8bdc217b64285626d3c9564a57584b237bfef90f4382e4b68366c2555b1b9a6e91d897951bbcc336b355eaefb310ce00
+ data.tar.gz: ccb7ba1d556cec586872ebe1c94237b3223f484902218d3bff899993467b741519c521b9c08075f98328536cf31274cd2aa386f64458097b025bbef2841c486d
data/CHANGELOG.md CHANGED
@@ -1,3 +1,21 @@
+
+ ## 0.12.7
+ - rewrote partial_read, the occasional json parse errors should now be fixed by reading only committed blocks.
+   (This may also have been related to reading a second partial_read, where the offset wasn't updated correctly?)
+ - used the new header and tail block names, so the header and footer should be learned automatically again
+ - added addall to the configuration to add system, mac, category, time and operation to the output
+ - added optional environment configuration option
+ - removed the date field, which was always set to ---
+ - made a start on event rewriting for ECS compatibility
+
+ ## 0.12.6
+ - Fixed the 0.12.5 exception handling, it actually caused a warning to become a fatal pipeline crashing error
+ - The chunk that failed to process is printed in debug mode; for testing use debug_until => 10000
+ - Now check if the registry entry exists before loading the offsets, to avoid caught: undefined method `[]' for nil:NilClass
+
+ ## 0.12.5
+ - Added exception message on json parse errors
+
  ## 0.12.4
  - Connection Cache reset removed, since agents are cached per host
  - Explicit handling of json_lines and respecting line boundaries (thanks nttoshev)
data/README.md CHANGED
@@ -42,9 +42,11 @@ input {
  ## Additional Configuration
  The registry keeps track of files in the storage account, their size and how many bytes have been processed. Files can grow and the added part will be processed as a partial file. The registry is saved to disk every interval.
 
+ The interval also defines when a new round of listing files and processing data should happen. The NSG flow logs are written every minute into a new block of the hourly blob. This data can be read partially, because the plugin knows the JSON head and tail; it removes the leading comma and fixes the JSON before parsing new events.
+
  The registry_create_policy determines at the start of the pipeline if processing should resume from the last known unprocessed file, or start_fresh, ignoring old files and only processing new events that came after the start of the pipeline. Or start_over to process all the files, ignoring the registry.
 
- interval defines the minimum time the registry should be saved to the registry file (by default to 'data/registry.dat'), this is only needed in case the pipeline dies unexpectedly. During a normal shutdown the registry is also saved.
+ interval defines the minimum time the registry should be saved to the registry file. By default it is saved to 'data/registry.dat' in the storageaccount, but it can also be kept on the server running logstash by setting registry_local_path. The registry is also kept in memory; the registry file is only needed in case the pipeline dies unexpectedly. During a normal shutdown the registry is also saved.
 
  When registry_local_path is set to a directory, the registry is saved on the logstash server in that directory. The filename is the pipe.id
 
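The head and tail repair described above can be sketched in a few lines of Ruby. This is a minimal illustration with an invented head, tail and chunk and a hypothetical fix_partial_json helper, not the plugin's exact partial_read implementation.

```ruby
# Sketch of repairing a partially read NSG flow log blob.
# The JSON head ('{"records":[') and tail (']}') are assumed to have been
# learned earlier; names and strings here are illustrative only.
def fix_partial_json(chunk, head, tail)
  chunk = chunk.strip
  chunk = chunk[1..-1] if chunk.start_with?(',')    # drop the leading comma of the appended block
  chunk = chunk + tail unless chunk.end_with?(tail) # re-add the tail if the blob grew after listing
  head + chunk
end

head = '{"records":['
tail = ']}'
partial = ',{"time":"2023-04-02T10:01:00Z","category":"NetworkSecurityGroupFlowEvent"}]}'
puts fix_partial_json(partial, head, tail)
# => {"records":[{"time":"2023-04-02T10:01:00Z","category":"NetworkSecurityGroupFlowEvent"}]}
```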
@@ -66,13 +68,15 @@ The pipeline can be started in several ways.
  ```
  - As managed pipeline from Kibana
 
- Logstash itself (so not specific to this plugin) has a feature where multiple instances can run on the same system. The default TCP port is 9600, but if it's already in use it will use 9601 (and up). To update a config file on a running instance on the commandline you can add the argument --config.reload.automatic and if you modify the files that are in the pipeline.yml you can send a SIGHUP channel to reload the pipelines where the config was changed.
+ Logstash itself (so not specific to this plugin) has a feature where multiple instances can run on the same system. The default TCP port is 9600, but if it's already in use it will use 9601 (and up); this is probably no longer true from v8. To update a config file on a running instance from the commandline you can add the argument --config.reload.automatic, and if you modify the files that are in pipelines.yml you can send a SIGHUP signal to reload the pipelines where the config was changed.
  [https://www.elastic.co/guide/en/logstash/current/reloading-config.html](https://www.elastic.co/guide/en/logstash/current/reloading-config.html)
 
  ## Internal Working
  When the plugin is started, it will read all the filenames and sizes in the blob store, excluding the directories of files that are excluded by the "path_filters". After every interval it will write a registry to the storageaccount to save the information of how many bytes per blob (file) are read and processed. After all files are processed and at least one interval has passed, a new file list is generated and a worklist is constructed that will be processed. When a file has already been processed before, partial files are read from the offset to the filesize at the time of the file listing. If the codec is JSON, partial files will have the header and tail added; they can be configured. If logtype is nsgflowlog, the plugin will split the records into individual tuple events. The logtype wadiis may in the future be used to process the grok formats to split into log lines. Any other format is fed into the queue as one event per file or partial file. It's then up to the filter to split and mutate the file format.
 
- By default the root of the json message is named "message" so you can modify the content in the filter block
+ By default the root of the json message is named "message", so you can modify the content in the filter block.
+
+ Additional fields can be enabled with addfilename and addall; ecs_compatibility is not yet supported.
 
  The configurations and the rest of the code are in [https://github.com/janmg/logstash-input-azure_blob_storage/tree/master/lib/logstash/inputs](lib/logstash/inputs) [https://github.com/janmg/logstash-input-azure_blob_storage/blob/master/lib/logstash/inputs/azure_blob_storage.rb#L10](azure_blob_storage.rb)
 
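For logtype nsgflowlog, the splitting into individual tuple events mentioned above works on Azure's comma separated flow tuples. A minimal Ruby sketch, assuming the documented version 2 tuple order; the split_tuple helper and the exact key names are illustrative and may differ from the fields the plugin actually emits:

```ruby
# Sketch: split one NSG flow log version 2 tuple into a hash.
# Tuple order follows Azure's documented format; key names are illustrative.
def split_tuple(tuple)
  time, src_ip, dst_ip, src_port, dst_port, protocol, direction,
    decision, flowstate, src_pack, src_bytes, dst_pack, dst_bytes = tuple.split(',')
  { :unixtimestamp => time, :src_ip => src_ip, :dst_ip => dst_ip,
    :src_port => src_port, :dst_port => dst_port, :protocol => protocol,
    :direction => direction, :decision => decision, :flowstate => flowstate,
    :src_pack => src_pack, :srcbytes => src_bytes,
    :dst_packets => dst_pack, :dst_bytes => dst_bytes }
end

puts split_tuple("1680429600,10.0.0.4,10.0.0.5,44325,443,T,O,A,E,25,3012,30,21511").inspect
```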
@@ -130,7 +134,7 @@ filter {
  }
 
  output {
- stdout { }
+ stdout { codec => rubydebug }
  }
 
  output {
@@ -139,24 +143,37 @@ output {
  index => "nsg-flow-logs-%{+xxxx.ww}"
  }
  }
+
+ output {
+ file {
+ path => "/tmp/abuse.txt"
+ codec => line { format => "%{decision} %{flowstate} %{src_ip} %{dst_port}"}
+ }
+ }
+
  ```
  A more elaborate input configuration example
  ```
  input {
  azure_blob_storage {
  codec => "json"
- storageaccount => "yourstorageaccountname"
- access_key => "Ba5e64c0d3=="
+ # storageaccount => "yourstorageaccountname"
+ # access_key => "Ba5e64c0d3=="
+ connection_string => "DefaultEndpointsProtocol=https;AccountName=yourstorageaccountname;AccountKey=Ba5e64c0d3==;EndpointSuffix=core.windows.net"
  container => "insights-logs-networksecuritygroupflowevent"
  logtype => "nsgflowlog"
  prefix => "resourceId=/"
  path_filters => ['**/*.json']
  addfilename => true
+ addall => true
+ environment => "dev-env"
  registry_create_policy => "resume"
  registry_local_path => "/usr/share/logstash/plugin"
  interval => 300
  debug_timer => true
- debug_until => 100
+ debug_until => 1000
  }
  }
 
data/lib/logstash/inputs/azure_blob_storage.rb CHANGED
@@ -17,10 +17,12 @@ require 'json'
  # D672f4bbd95a04209b00dc05d899e3cce 2576 json objects for 1st minute
  # D7fe0d4f275a84c32982795b0e5c7d3a1 2312 json objects for 2nd minute
  # Z00000000000000000000000000000000 2 ]}
-
+ #
+ # The azure-storage-ruby gem connects to the storageaccount and the files are read through get_blob. For a partial read the options with a start and end range are used.
+ # https://github.com/Azure/azure-storage-ruby/blob/master/blob/lib/azure/storage/blob/blob.rb#L89
+ #
  # A storage account has by default a globally unique name, {storageaccount}.blob.core.windows.net which is a CNAME to Azures blob servers blob.*.store.core.windows.net. A storageaccount has a container and those have a directory and blobs (like files). Blobs have one or more blocks. After writing the blocks, they can be committed. Some Azure diagnostics can send events to an EventHub that can be parsed with the plugin logstash-input-azure_event_hubs, but for the events that are only stored in a storage account, use this plugin. The original logstash-input-azureblob from azure-diagnostics-tools is great for low volumes, but it suffers from an outdated client, slow reads, lease locking issues and json parse errors.
 
-
  class LogStash::Inputs::AzureBlobStorage < LogStash::Inputs::Base
  config_name "azure_blob_storage"
 
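The range read referred to in the comment above can be sketched as follows. This is a minimal sketch with placeholder account, container and blob names, using the azure-storage-blob calls that appear elsewhere in this plugin (list_blob_blocks, get_blob with start_range/end_range):

```ruby
require 'azure/storage/blob'

# Placeholder credentials and names; not a real account.
client = Azure::Storage::Blob::BlobService.create(
  storage_account_name: 'yourstorageaccountname',
  storage_access_key:   'Ba5e64c0d3==')
container = 'insights-logs-networksecuritygroupflowevent'
blobname  = 'resourceId=/SUBSCRIPTIONS/0A1B2C3D-0000-0000-0000-000000000000/.../PT1H.json'

offset = 5000                                 # bytes already processed (from the registry)
blocks = client.list_blob_blocks(container, blobname)
size   = blocks[:committed].sum(&:size)       # only count committed blocks

# get_blob returns [blob, content]; start_range/end_range select a byte range
blob, content = client.get_blob(container, blobname, start_range: offset, end_range: size - 1)
puts "read #{content.bytesize} new bytes"
```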
@@ -74,6 +76,12 @@ class LogStash::Inputs::AzureBlobStorage < LogStash::Inputs::Base
  # add the filename as a field into the events
  config :addfilename, :validate => :boolean, :default => false, :required => false
 
+ # add an environment field to the events
+ config :environment, :validate => :string, :required => false
+
+ # add all resource details (system, mac, category, time, operation) to the events
+ config :addall, :validate => :boolean, :default => false, :required => false
+
  # debug_until will, at the creation of the pipeline and for a maximum amount of processed messages, show 3 types of log printouts including processed filenames. After that number of events, the plugin will stop logging the events and continue silently. This is a lightweight alternative to switching the loglevel from info to debug or even trace to see what the plugin is doing and how fast at the start of the plugin. A good value would be approximately 3x the amount of events per file. For instance 6000 events.
  config :debug_until, :validate => :number, :default => 0, :required => false
 
@@ -205,10 +213,12 @@ public
  filelist = list_blobs(false)
  filelist.each do |name, file|
  off = 0
- begin
+ if @registry.key?(name) then
+ begin
  off = @registry[name][:offset]
- rescue Exception => e
+ rescue Exception => e
  @logger.error("caught: #{e.message} while reading #{name}")
+ end
  end
  @registry.store(name, { :offset => off, :length => file[:length] })
  if (@debug_until > @processed) then @logger.info("2: adding offsets: #{name} #{off} #{file[:length]}") end
@@ -258,9 +268,8 @@ public
  delta_size = 0
  end
  else
- chunk = partial_read_json(name, file[:offset], file[:length])
- delta_size = chunk.size
- @logger.debug("partial file #{name} from #{file[:offset]} to #{file[:length]}")
+ chunk = partial_read(name, file[:offset])
+ delta_size = chunk.size - @head.length - 1
  end
 
  if logtype == "nsgflowlog" && @is_json
@@ -270,14 +279,13 @@ public
  begin
  fingjson = JSON.parse(chunk)
  @processed += nsgflowlog(queue, fingjson, name)
- @logger.debug("Processed #{res[:nsg]} [#{res[:date]}] #{@processed} events")
- rescue JSON::ParserError
- @logger.error("parse error #{e.message} on #{res[:nsg]} [#{res[:date]}] offset: #{file[:offset]} length: #{file[:length]}")
- @logger.debug("#{chunk}")
+ @logger.debug("Processed #{res[:nsg]} #{@processed} events")
+ rescue JSON::ParserError => e
+ @logger.error("parse error #{e.message} on #{res[:nsg]} offset: #{file[:offset]} length: #{file[:length]}")
+ if (@debug_until > @processed) then @logger.info("#{chunk}") end
  end
  end
  # TODO: Convert this to line based grokking.
- # TODO: ECS Compliance?
  elsif logtype == "wadiis" && !@is_json
  @processed += wadiislog(queue, name)
  else
@@ -396,14 +404,35 @@ private
  return chuck
  end
 
- def partial_read_json(filename, offset, length)
- content = @blob_client.get_blob(container, filename, start_range: offset-@tail.length, end_range: length-1)[1]
- if content.end_with?(@tail)
- # the tail is part of the last block, so included in the total length of the get_blob
- return @head + strip_comma(content)
+ def partial_read(blobname, offset)
+ # 1. read the committed blocks and calculate the committed length
+ # 2. calculate the range to read
+ # 3. if the codec is json, strip the leading comma and fix the head and tail
+ size = 0
+ blocks = @blob_client.list_blob_blocks(container, blobname)
+ blocks[:committed].each do |block|
+ size += block.size
+ end
+ # read the new blob blocks from the offset to the last committed size.
+ # if it is json, fix the head and tail
+ # the committed block at the end is the tail, so it must be subtracted from the read, then the comma stripped and the tail added.
+ # the -1 is probably needed because the offset starts at 0 and the range ends at size-1
+
+ # should it first check committed blocks, read and then check committed blocks again? no, only the committed size is read
+ # alternatively the full content could be read and then the json tail subtracted
+
+ if @is_json
+ content = @blob_client.get_blob(container, blobname, start_range: offset-1, end_range: size-1)[1]
+ if content.end_with?(@tail)
+ return @head + strip_comma(content)
+ else
+ @logger.info("Fixed a tail! probably new committed blocks started appearing!")
+ # subtract the length of the tail and add the tail, because the file grew. size was calculated at the block boundary, so replacing the last bytes with the tail should fix the problem
+ return @head + strip_comma(content[0...-@tail.length]) + @tail
+ end
  else
- # when the file has grown between list_blobs and the time of partial reading, the tail will be wrong
- return @head + strip_comma(content[0...-@tail.length]) + @tail
+ content = @blob_client.get_blob(container, blobname, start_range: offset, end_range: size-1)[1]
  end
  end
 
@@ -420,8 +449,9 @@ private
  count=0
  begin
  json["records"].each do |record|
- res = resource(record["resourceId"])
- resource = { :subscription => res[:subscription], :resourcegroup => res[:resourcegroup], :nsg => res[:nsg] }
+ resource = resource(record["resourceId"])
+ # resource = { :subscription => res[:subscription], :resourcegroup => res[:resourcegroup], :nsg => res[:nsg] }
+ extras = { :time => record["time"], :system => record["systemId"], :mac => record["macAddress"], :category => record["category"], :operation => record["operationName"] }
  @logger.trace(resource.to_s)
  record["properties"]["flows"].each do |flows|
  rule = resource.merge ({ :rule => flows["rule"]})
@@ -440,7 +470,18 @@ private
  if @addfilename
  ev.merge!( {:filename => name } )
  end
+ unless @environment.nil?
+ ev.merge!( {:environment => environment } )
+ end
+ if @addall
+ ev.merge!( extras )
+ end
+
+ # Add event to logstash queue
  event = LogStash::Event.new('message' => ev.to_json)
+ #if @ecs_compatibility != "disabled"
+ # event = ecs(event)
+ #end
  decorate(event)
  queue << event
  count+=1
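For reference, a hedged sketch of the event hash that nsgflowlog builds per tuple when addfilename, addall and environment are enabled; the keys shown are a subset and all values are invented:

```ruby
# Sketch of the hash that ends up in LogStash::Event.new('message' => ev.to_json).
# Values are invented; the tuple keys depend on the flow log version.
ev = {
  :subscription  => '0a1b2c3d-0000-0000-0000-000000000000',  # from resource()
  :resourcegroup => 'my-rg',
  :nsg           => 'my-nsg',
  :rule          => 'DefaultRule_DenyAllInBound',
  :src_ip        => '10.0.0.4',
  :dst_port      => '443',
  :decision      => 'D',
  :flowstate     => 'E',
  :filename      => 'resourceId=/SUBSCRIPTIONS/0A1B2C3D-0000-0000-0000-000000000000/.../PT1H.json',
  # merged from extras when addall => true:
  :time      => '2023-04-02T10:01:00Z',
  :system    => 'a7e3d2c1-0000-0000-0000-000000000000',
  :mac       => '000D3AF87856',
  :category  => 'NetworkSecurityGroupFlowEvent',
  :operation => 'NetworkSecurityGroupFlowEvents',
  :environment => 'dev-env'                                  # when environment is set
}
puts ev.to_json
```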
@@ -561,11 +602,11 @@ private
  unless blob.name == registry_path
  begin
  blocks = @blob_client.list_blob_blocks(container, blob.name)[:committed]
- if blocks.first.name.start_with?('A00')
+ if ['A00000000000000000000000000000000','QTAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAw'].include?(blocks.first.name)
  @logger.debug("using #{blob.name}/#{blocks.first.name} to learn the json header")
  @head = @blob_client.get_blob(container, blob.name, start_range: 0, end_range: blocks.first.size-1)[1]
  end
- if blocks.last.name.start_with?('Z00')
+ if ['Z00000000000000000000000000000000','WjAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAw'].include?(blocks.last.name)
  @logger.debug("using #{blob.name}/#{blocks.last.name} to learn the json footer")
  length = blob.properties[:content_length].to_i
  offset = length - blocks.last.size
@@ -573,7 +614,7 @@
  @logger.debug("learned tail: #{@tail}")
  end
  rescue Exception => e
- @logger.info("learn json one of the attempts failed #{e.message}")
+ @logger.info("learn json one of the attempts failed")
  end
  end
  end
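The second entry in each accepted list is simply the Base64 encoding of the first block id; a quick Ruby check confirms the equivalence:

```ruby
require 'base64'

# The header block id 'A' + 32 zeros encodes to the second accepted name,
# and the footer block id 'Z' + 32 zeros likewise.
puts Base64.strict_encode64('A00000000000000000000000000000000')
# => QTAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAw
puts Base64.strict_encode64('Z00000000000000000000000000000000')
# => WjAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAw
```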
@@ -584,15 +625,60 @@ private
 
  def resource(str)
  temp = str.split('/')
- date = '---'
- unless temp[9].nil?
- date = val(temp[9])+'/'+val(temp[10])+'/'+val(temp[11])+'-'+val(temp[12])+':00'
- end
- return {:subscription=> temp[2], :resourcegroup=>temp[4], :nsg=>temp[8], :date=>date}
+ #date = '---'
+ #unless temp[9].nil?
+ # date = val(temp[9])+'/'+val(temp[10])+'/'+val(temp[11])+'-'+val(temp[12])+':00'
+ #end
+ return {:subscription=> temp[2], :resourcegroup=>temp[4], :nsg=>temp[8]}
  end
 
  def val(str)
  return str.split('=')[1]
  end
 
+ =begin
+ def ecs(old)
+ # https://www.elastic.co/guide/en/ecs/current/ecs-field-reference.html
+ ecs = LogStash::Event.new()
+ ecs.set("ecs.version", "1.0.0")
+ ecs.set("@timestamp", old.timestamp)
+ ecs.set("cloud.provider", "azure")
+ ecs.set("cloud.account.id", old.get("[subscription]"))
+ ecs.set("cloud.project.id", old.get("[environment]"))
+ ecs.set("file.name", old.get("[filename]"))
+ ecs.set("event.category", "network")
+ if old.get("[decision]") == "D"
+ ecs.set("event.type", "denied")
+ else
+ ecs.set("event.type", "allowed")
+ end
+ ecs.set("event.action", "")
+ ecs.set("rule.ruleset", old.get("[nsg]"))
+ ecs.set("rule.name", old.get("[rule]"))
+ ecs.set("trace.id", old.get("[protocol]")+"/"+old.get("[src_ip]")+":"+old.get("[src_port]")+"-"+old.get("[dst_ip]")+":"+old.get("[dst_port]"))
+ # requires logic to match sockets and flip src/dst for outgoing.
+ ecs.set("host.mac", old.get("[mac]"))
+ ecs.set("source.ip", old.get("[src_ip]"))
+ ecs.set("source.port", old.get("[src_port]"))
+ ecs.set("source.bytes", old.get("[srcbytes]"))
+ ecs.set("source.packets", old.get("[src_pack]"))
+ ecs.set("destination.ip", old.get("[dst_ip]"))
+ ecs.set("destination.port", old.get("[dst_port]"))
+ ecs.set("destination.bytes", old.get("[dst_bytes]"))
+ ecs.set("destination.packets", old.get("[dst_packets]"))
+ if old.get("[protocol]") == "U"
+ ecs.set("network.transport", "udp")
+ else
+ ecs.set("network.transport", "tcp")
+ end
+ if old.get("[decision]") == "I"
+ ecs.set("network.direction", "incoming")
+ else
+ ecs.set("network.direction", "outgoing")
+ end
+ ecs.set("network.bytes", old.get("[src_bytes]")+old.get("[dst_bytes]"))
+ ecs.set("network.packets", old.get("[src_packets]")+old.get("[dst_packets]"))
+ return ecs
+ end
+ =end
  end # class LogStash::Inputs::AzureBlobStorage
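As a worked example of the resource() split above, positions 2, 4 and 8 of the slash-separated resourceId carry the subscription, resource group and NSG name; the resourceId value here is invented:

```ruby
# Worked example of the resource() split; the resourceId is invented.
resource_id = '/SUBSCRIPTIONS/0A1B2C3D-0000-0000-0000-000000000000/RESOURCEGROUPS/MY-RG/PROVIDERS/MICROSOFT.NETWORK/NETWORKSECURITYGROUPS/MY-NSG'
temp = resource_id.split('/')
puts({ :subscription => temp[2], :resourcegroup => temp[4], :nsg => temp[8] }.inspect)
# => {:subscription=>"0A1B2C3D-0000-0000-0000-000000000000", :resourcegroup=>"MY-RG", :nsg=>"MY-NSG"}
```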
data/logstash-input-azure_blob_storage.gemspec CHANGED
@@ -1,6 +1,6 @@
  Gem::Specification.new do |s|
  s.name = 'logstash-input-azure_blob_storage'
- s.version = '0.12.5'
+ s.version = '0.12.7'
  s.licenses = ['Apache-2.0']
  s.summary = 'This logstash plugin reads and parses data from Azure Storage Blobs.'
  s.description = <<-EOF
@@ -23,7 +23,6 @@ EOF
  s.add_runtime_dependency 'logstash-core-plugin-api', '~> 2.0'
  s.add_runtime_dependency 'stud', '~> 0.0.23'
  s.add_runtime_dependency 'azure-storage-blob', '~> 2', '>= 2.0.3'
-
- s.add_development_dependency 'logstash-devutils'
- s.add_development_dependency 'rubocop'
+ s.add_development_dependency 'logstash-devutils', '~> 2.4'
+ s.add_development_dependency 'rubocop', '~> 1.48'
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: logstash-input-azure_blob_storage
  version: !ruby/object:Gem::Version
- version: 0.12.5
+ version: 0.12.7
  platform: ruby
  authors:
  - Jan Geertsma
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2023-03-08 00:00:00.000000000 Z
+ date: 2023-04-02 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  requirement: !ruby/object:Gem::Requirement
@@ -61,31 +61,31 @@ dependencies:
  - !ruby/object:Gem::Dependency
  requirement: !ruby/object:Gem::Requirement
  requirements:
- - - ">="
+ - - "~>"
  - !ruby/object:Gem::Version
- version: '0'
+ version: '2.4'
  name: logstash-devutils
  prerelease: false
  type: :development
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
- - - ">="
+ - - "~>"
  - !ruby/object:Gem::Version
- version: '0'
+ version: '2.4'
  - !ruby/object:Gem::Dependency
  requirement: !ruby/object:Gem::Requirement
  requirements:
- - - ">="
+ - - "~>"
  - !ruby/object:Gem::Version
- version: '0'
+ version: '1.48'
  name: rubocop
  prerelease: false
  type: :development
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
- - - ">="
+ - - "~>"
  - !ruby/object:Gem::Version
- version: '0'
+ version: '1.48'
  description: " This gem is a Logstash plugin. It reads and parses data from Azure\
  \ Storage Blobs. The azure_blob_storage is a reimplementation to replace azureblob\
  \ from azure-diagnostics-tools/Logstash. It can deal with larger volumes and partial\