RubyGems - fluent-plugin-influxdb-deduplication - Versions diffs - 0.1.1 → 0.2.0 - Mend

fluent-plugin-influxdb-deduplication 0.1.1 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (7)

checksums.yaml +4 -4
data/ChangeLog +4 -0
data/README.md +125 -25
data/VERSION +1 -1
data/lib/fluent/plugin/filter_influxdb_deduplication.rb +73 -21
data/test/test_filter_influxdb_deduplication.rb +242 -17
metadata +2 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: e2c3760445648631a72cd9fa02f5f18ec0391b3e5e9172f49448a62dc5cc895e
-  data.tar.gz: c354954939bc6e3209a71e173e34e796a0073bf345276701ad0aba386dd3a662
+  metadata.gz: 79e774ac8c57b020efdd13befa8820f8cb89c32eb2b42fc94a181a5aef462a65
+  data.tar.gz: 3dce1619fd1fa37e508d8e61e849c4ebcad9b1f635c05024411e19832c8daf46
 SHA512:
-  metadata.gz: e7e54a5bc97b6ba35b29582fb1ac4487fa486c2503c3b0f174591dd522c9ab124678350fe87c5fbecedd3c17da5ee85ed9231a5c04fb365a7293b4eec151ce11
-  data.tar.gz: 0d0e5c0853ffdf0f661f23102fe6e6ef5ce71087d3ffc1897353df4ed03fcb96def70a9040812d030ea616c6034bd590c5d23f56e07fd368f5404ec5366d4c2a
+  metadata.gz: d97b6a3ba2c9676c3d34d9893ed82d842c5a59448e42a438d132343f56325a2b3d6fffe3f2b7494d1d71e211c4ff1b8891c7bf04371660680105320d973075cc
+  data.tar.gz: 023dbdb42131021fa610e0cf8e6ef29707497cc2714c5b1e7efd5c3dd1e26a8952abafae69dcae348462c8998c1e6b92487f8237457a14187dc6462d0dc272e8

data/ChangeLog CHANGED

@@ -1,3 +1,7 @@
+Release 0.2.0 - 2020/03/02
+  * Deduplicate points using a sequence tag
 Release 0.1.0 - 2020/02/23
   * First release

data/README.md CHANGED

@@ -1,7 +1,7 @@
 # [Fluentd](https://www.fluentd.org/) filter plugin to deduplicate records for InfluxDB
-A filter plugin that implements the deduplication techniques described in the [InfluxDB doc](https://docs.influxdata.com/influxdb/v2.0/write-data/best-practices/duplicate-points/).
+A filter plugin that implements the deduplication techniques described in
+the [InfluxDB doc](https://docs.influxdata.com/influxdb/v2.0/write-data/best-practices/duplicate-points/).
 ## Installation
@@ -11,43 +11,52 @@ Using RubyGems:
 fluent-gem install fluent-plugin-influxdb-deduplication
 ```
 ## Configuration
 ### Deduplicate by incrementing the timestamp
-The filter plugin reads the fluentd record event time with a precision to the second, and stores it in the `time_key` field.
-Any following record with the same timestamp has a `time_key` incremented by 1 nanosecond.
+Each data point is assigned a unique timestamp. The filter plugin reads the fluentd record event time with a precision
+to the second, and stores it in a field with a precision to the nanosecond. Within a run of records sharing the same
+event time, each record after the first has its timestamp incremented by 1 nanosecond relative to the previous one.
     <filter pattern>
       @type influxdb_deduplication
-      # field to store the deduplicated timestamp
-      time_key my_key_field
+      <time>
+        # field to store the deduplicated timestamp
+        key my_key_field
+      </time>
     </filter>
 For example, the following input records:
-    1613910640 { "k1" => 0, "k2" => "value0" }
-    1613910640 { "k1" => 1, "k2" => "value1" }
-    1613910640 { "k1" => 2, "k2" => "value2" }
-    1613910641 { "k1" => 3, "k3" => "value3" }
+| Fluentd Event Time | Record |
+|---|---|
+| 1613910640 | { "k1" => 0, "k2" => "value0" } |
+| 1613910640 | { "k1" => 1, "k2" => "value1" } |
+| 1613910640 | { "k1" => 2, "k2" => "value2" } |
+| 1613910641 | { "k1" => 3, "k3" => "value3" } |
-Would create on output:
+Would yield the following output:
-    1613910640 { "k1" => 0, "k2" => "value0", "my_key_field" => 1613910640000000000 }
-    1613910640 { "k1" => 1, "k2" => "value1", "my_key_field" => 1613910640000000001 }
-    1613910640 { "k1" => 2, "k2" => "value2", "my_key_field" => 1613910640000000002 }
-    1613910641 { "k1" => 3, "k3" => "value3", "my_key_field" => 1613910643000000000 }
+| Fluentd Event Time | Record |
+|---|---|
+| 1613910640 | { "k1" => 0, "k2" => "value0", "my_key_field" => 1613910640000000000 } |
+| 1613910640 | { "k1" => 1, "k2" => "value1", "my_key_field" => 1613910640000000001 } |
+| 1613910640 | { "k1" => 2, "k2" => "value2", "my_key_field" => 1613910640000000002 } |
+| 1613910641 | { "k1" => 3, "k3" => "value3", "my_key_field" => 1613910641000000000 } |
-The time key field can then be passed as is to the [fluent-plugin-influxdb-v2](https://github.com/influxdata/influxdb-plugin-fluent).
-Example configuration on nginx logs:
+The time key field can then be passed as is to
+the [fluent-plugin-influxdb-v2](https://github.com/influxdata/influxdb-plugin-fluent). Example configuration on nginx
+logs:
     <filter nginx.access>
       @type influxdb_deduplication
-      # field to store the deduplicated timestamp
-      time_key my_key_field
+      <time>
+        # field to store the deduplicated timestamp
+        key my_key_field
+      </time>
     </filter>
     <match nginx.access>
@@ -59,7 +68,7 @@ Example configuration on nginx logs:
         bucket          my-bucket
         org             my-org
-        # the influxdb2 timekey must be set to the same value as the influxdb_deduplication time_key
+        # the influxdb2 time_key must be set to the same value as the influxdb_deduplication time.key
         time_key my_key_field
         # the timestamp precision must be set to ns
@@ -74,13 +83,104 @@ The data can then be queried as a table and viewed in [Grafana](https://grafana.
     from(bucket: "my-bucket")
       |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
       |> pivot(
-        rowKey:["_time"],
+        rowKey: ["_time"],
         columnKey: ["_field"],
         valueColumn: "_value"
       )
       |> keep(columns: ["_time", "request_method", "status", "remote_addr", "request_uri"])
 ### Deduplicate by adding a sequence tag
-TODO
+Each record is assigned a sequence number, so the output record can be uniquely identified by the pair
+(fluentd_event_time, sequence_number). The event time is untouched, so no time precision is lost.
+    <filter pattern>
+      @type influxdb_deduplication
+      <tag>
+        # field to store the sequence number
+        key my_key_field
+      </tag>
+    </filter>
+For example, the following input records:
+| Fluentd Event Time | Record |
+|---|---|
+| 1613910640 | { "k1" => 0, "k2" => "value0" } |
+| 1613910640 | { "k1" => 1, "k2" => "value1" } |
+| 1613910640 | { "k1" => 2, "k2" => "value2" } |
+| 1613910641 | { "k1" => 3, "k3" => "value3" } |
+Would yield the following output:
+| Fluentd Event Time | Record |
+|---|---|
+| 1613910640 | { "k1" => 0, "k2" => "value0", "my_key_field" => 0 } |
+| 1613910640 | { "k1" => 1, "k2" => "value1", "my_key_field" => 1 } |
+| 1613910640 | { "k1" => 2, "k2" => "value2", "my_key_field" => 2 } |
+| 1613910641 | { "k1" => 3, "k3" => "value3", "my_key_field" => 0 } |
+The sequence tag should be passed in the `tag_keys` parameter
+of [fluent-plugin-influxdb-v2](https://github.com/influxdata/influxdb-plugin-fluent). Example configuration on nginx
+logs:
+    <filter nginx.access>
+      @type influxdb_deduplication
+      <tag>
+        # field to store the sequence number
+        key my_key_field
+      </tag>
+    </filter>
+    <match nginx.access>
+        @type influxdb2
+        # setup the access to your InfluxDB v2 instance
+        url             https://localhost:8086
+        token           my-token
+        bucket          my-bucket
+        org             my-org
+        # the influxdb2 time_key is not specified so the fluentd event time is used
+        # time_key
+        # there is no requirement on the time_precision value this time
+        # time_precision ns
+        # "my_key_field" must be passed to influxdb's tag_keys
+        tag_keys ["request_method", "status", "my_key_field"]
+        field_keys ["remote_addr", "request_uri"]
+    </match>
+The data can then be queried as a table and viewed in [Grafana](https://grafana.com/) for example with the flux query:
+    from(bucket: "my-bucket")
+      |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
+      |> pivot(
+        rowKey: ["_time", "my_key_field"],
+        columnKey: ["_field"],
+        valueColumn: "_value"
+      )
+      |> keep(columns: ["_time", "request_method", "status", "remote_addr", "request_uri"])
+### Detecting out of order records
+This filter plugin expects the fluentd event timestamps of the incoming records to increase and never decrease.
+Optionally, an order key can be added to indicate whether the record arrived in order. For example, with this config:
+    <filter pattern>
+      @type influxdb_deduplication
+      order_key order_field
+      <time>
+        # field to store the deduplicated timestamp
+        key my_key_field
+      </time>
+    </filter>
+Without an order key, out of order records are dropped to avoid overwriting previous data points. With an order key,
+out of order records are still pushed, but with `order_field = false`. Out of order records are not deduplicated, but
+they will be apparent in influxdb.
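
The timestamp-increment scheme described in the README above can be modeled without Fluentd. The following is a minimal sketch under stated assumptions, not the plugin's code: `TimeDeduplicator` and its `call` method are hypothetical names, event times are assumed to be whole seconds, and out of order events are simply dropped (the behaviour without an order key).

```ruby
# Hypothetical stand-alone model of time-based deduplication (not the plugin class).
class TimeDeduplicator
  NANOS_PER_SECOND = 1_000_000_000

  def initialize
    @last_second = nil # event time (in seconds) of the most recent in-order record
    @sequence = 0      # nanosecond offset used within the current second
  end

  # Returns a unique nanosecond timestamp for an event time given in seconds,
  # or nil when the event is out of order or the second is exhausted.
  def call(second)
    return nil if @last_second && second < @last_second

    if second == @last_second
      # Same second as the previous record: bump the nanosecond offset,
      # refusing to spill over into the next second.
      return nil if @sequence >= NANOS_PER_SECOND - 1
      @sequence += 1
    else
      # A new second starts: restart the offset at zero.
      @last_second = second
      @sequence = 0
    end
    second * NANOS_PER_SECOND + @sequence
  end
end

dedup = TimeDeduplicator.new
stamps = [1613910640, 1613910640, 1613910640, 1613910641].map { |s| dedup.call(s) }
# => [1613910640000000000, 1613910640000000001, 1613910640000000002, 1613910641000000000]
```

Feeding the four event times from the README table yields four distinct nanosecond timestamps, so an InfluxDB write keyed on that field would not collapse them into one point.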

data/VERSION CHANGED

@@ -1 +1 @@
-0.1.1
+0.2.0

data/lib/fluent/plugin/filter_influxdb_deduplication.rb CHANGED

@@ -4,21 +4,30 @@ module Fluent
   class Plugin::InfluxdbDeduplicationFilter < Plugin::Filter
     Fluent::Plugin.register_filter('influxdb_deduplication', self)
-    config_param :time_key, :string, default: nil,
-                 desc: <<-DESC
-The output time key to use.
-    DESC
+    desc "If not nil, the corresponding field takes the value true if the record arrived in order and false otherwise."
+    config_param :order_key, :string, default: nil
-    config_param :out_of_order, :string, default: nil,
-                 desc: <<-DESC
-If not nil, the field takes the value true if the record arrives in order and false otherwise
-    DESC
+    config_section :time, param_name: :time, multi: false, required: false do
+      desc "The output time key to use."
+      config_param :key, :string
+    end
+    config_section :tag, param_name: :tag, multi: false, required: false do
+      desc "The output sequence tag to use."
+      config_param :key, :string
+    end
     def configure(conf)
       super
-      unless @time_key
-        raise Fluent::ConfigError, "a time key must be set"
+      if @time == nil and @tag == nil
+        raise Fluent::ConfigError, "one of tag or time deduplication needs to be set."
+      elsif @time != nil and @tag != nil
+        raise Fluent::ConfigError, "tag and time deduplication are mutually exclusive."
+      elsif @time != nil and (@time.key == nil or @time.key == "")
+        raise Fluent::ConfigError, "an output 'key' field is required for time deduplication"
+      elsif @tag != nil and (@tag.key == nil or @tag.key == "")
+        raise Fluent::ConfigError, "an output 'key' field is required for tag deduplication"
       end
     end
@@ -30,6 +39,14 @@ If not nil, the field takes the value true if the record arrives in order and fa
     end
     def filter(tag, time, record)
+      if @time
+        time_deduplication(time, record)
+      else
+        tag_deduplication(time, record)
+      end
+    end
+    def time_deduplication(time, record)
       if time.is_a?(Integer)
         input_time = Fluent::EventTime.new(time)
       elsif time.is_a?(Fluent::EventTime)
@@ -43,33 +60,68 @@ If not nil, the field takes the value true if the record arrives in order and fa
       if input_time.sec < @last_timestamp
         @log.debug("out of sequence timestamp")
-        if @out_of_order
-          record[@out_of_order] = true
-          record[@time_key] = nano_time
+        if @order_key
+          record[@order_key] = false
+          record[@time.key] = nano_time
         else
           @log.debug("out of order record dropped")
           return nil
         end
-      elsif input_time.sec == @last_timestamp && @sequence < 999999999
+      elsif input_time.sec == @last_timestamp and @sequence < 999999999
         @sequence = @sequence + 1
-        record[@time_key] = nano_time + @sequence
-        if @out_of_order
-          record[@out_of_order] = false
+        record[@time.key] = nano_time + @sequence
+        if @order_key
+          record[@order_key] = true
         end
-      elsif input_time.sec == @last_timestamp && @sequence == 999999999
+      elsif input_time.sec == @last_timestamp and @sequence == 999999999
         @log.error("received more than 999999999 records in a second")
         return nil
       else
         @sequence = 0
         @last_timestamp = input_time.sec
-        record[@time_key] = nano_time
-        if @out_of_order
-          record[@out_of_order] = false
+        record[@time.key] = nano_time
+        if @order_key
+          record[@order_key] = true
         end
       end
       record
     end
+    def tag_deduplication(time, record)
+      if time.is_a?(Integer)
+        input_time = time
+      elsif time.is_a?(Fluent::EventTime)
+        input_time = time.sec * 1000000000 + time.nsec
+      else
+        @log.error("unreadable time")
+        return nil
+      end
+      if input_time < @last_timestamp
+        @log.debug("out of sequence timestamp")
+        if @order_key
+          record[@order_key] = false
+        else
+          @log.debug("out of order record dropped")
+          return nil
+        end
+      elsif input_time == @last_timestamp
+        @sequence = @sequence + 1
+        record[@tag.key] = @sequence
+        if @order_key
+          record[@order_key] = true
+        end
+      else
+        @sequence = 0
+        @last_timestamp = input_time
+        record[@tag.key] = 0
+        if @order_key
+          record[@order_key] = true
+        end
+      end
+      record
+    end
   end
 end
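
The control flow of `tag_deduplication` above can be exercised outside Fluentd. This is a simplified sketch under stated assumptions, not the plugin itself: `SequenceTagger` and `call` are hypothetical names, times are plain integers, and the record is copied rather than mutated in place.

```ruby
# Hypothetical stand-alone model of sequence-tag deduplication (not the plugin class).
class SequenceTagger
  def initialize(tag_key, order_key = nil)
    @tag_key = tag_key     # field that receives the sequence number
    @order_key = order_key # optional field flagging in-order arrival
    @last_time = nil       # event time of the most recent in-order record
    @sequence = 0          # counter within the current event time
  end

  # Returns a copy of the record with the sequence tag added,
  # or nil when an out of order record is dropped.
  def call(time, record)
    out = record.dup
    if @last_time && time < @last_time
      # Out of order: drop it, or flag it when an order key is configured.
      return nil unless @order_key
      out[@order_key] = false
      return out
    end

    if time == @last_time
      @sequence += 1 # same event time: next sequence number
    else
      @last_time = time
      @sequence = 0  # new event time: restart the counter
    end
    out[@tag_key] = @sequence
    out[@order_key] = true if @order_key
    out
  end
end

tagger = SequenceTagger.new("seq", "in_order")
events = [
  [1613910640, { "k1" => 0 }],
  [1613910643, { "k1" => 1 }],
  [1613910640, { "k1" => 2 }], # arrives late
  [1613910643, { "k1" => 3 }]
]
tagged = events.map { |time, record| tagger.call(time, record) }
```

As in `test_tag_order_field`, the late record is passed through with the order field set to `false` and without a sequence tag, while the in-order records keep counting from where they left off.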

data/test/test_filter_influxdb_deduplication.rb CHANGED

@@ -14,22 +14,76 @@ class InfluxdbDeduplicationFilterTest < Test::Unit::TestCase
     Fluent::Test::Driver::Filter.new(Fluent::Plugin::InfluxdbDeduplicationFilter).configure(conf)
   end
-  def test_configure
-    d = create_driver %[
-      time_key my_time_key
+  def test_configure_time
+    create_driver %[
+      <time>
+        key my_time_key
+      </time>
     ]
-    time_key = d.instance.instance_variable_get(:@time_key)
-    assert time_key == "my_time_key"
+    assert_raises Fluent::ConfigError do
+      create_driver %[
+        <time>
+        </time>
+      ]
+    end
+    assert_raises Fluent::ConfigError do
+      create_driver %[
+        <time>
+          key
+        </time>
+      ]
+    end
+  end
+  def test_configure_tag
+    create_driver %[
+      <tag>
+        key my_tag_key
+      </tag>
+    ]
+    assert_raises Fluent::ConfigError do
+      create_driver %[
+        <tag>
+        </tag>
+      ]
+    end
+    assert_raises Fluent::ConfigError do
+      create_driver %[
+        <tag>
+          key
+        </tag>
+      ]
+    end
+  end
+  def test_configuration_needed
     assert_raises Fluent::ConfigError do
       create_driver ""
     end
   end
-  def test_in_sequence
+  def test_time_and_tag_exclusivity
+    assert_raises Fluent::ConfigError do
+      create_driver %[
+        <time>
+          key my_time_key
+        </time>
+        <tag>
+          key my_tag_key
+        </tag>
+      ]
+    end
+  end
+  def test_time_in_sequence
     d = create_driver %[
-      time_key time_key
+      <time>
+        key time_key
+      </time>
     ]
     time0 = Fluent::EventTime.new(1613910640)
@@ -52,9 +106,40 @@ class InfluxdbDeduplicationFilterTest < Test::Unit::TestCase
                  ], d.filtered
   end
-  def test_out_of_sequence_dropped
+  def test_time_in_sequence_integer_time
+    d = create_driver %[
+      <time>
+        key time_key
+      </time>
+    ]
+    time0 = 1613910640
+    time1 = 1613910643
+    d.run(default_tag: @tag) do
+      d.feed(time0, { "k1" => 0 })
+      d.feed(time0, { "k1" => 1 })
+      d.feed(time0, { "k1" => 2 })
+      d.feed(time1, { "k1" => 3 })
+      d.feed(time1, { "k1" => 4 })
+    end
+    assert_equal d.instance.instance_variable_get(:@last_timestamp), 1613910643
+    assert_equal [
+                   [time0, { "k1" => 0, "time_key" => 1613910640000000000 }],
+                   [time0, { "k1" => 1, "time_key" => 1613910640000000001 }],
+                   [time0, { "k1" => 2, "time_key" => 1613910640000000002 }],
+                   [time1, { "k1" => 3, "time_key" => 1613910643000000000 }],
+                   [time1, { "k1" => 4, "time_key" => 1613910643000000001 }]
+                 ], d.filtered
+  end
+  def test_time_out_of_sequence_dropped
     d = create_driver %[
-      time_key time_key
+      <time>
+        key time_key
+      </time>
     ]
     time0 = Fluent::EventTime.new(1613910640)
@@ -76,10 +161,12 @@ class InfluxdbDeduplicationFilterTest < Test::Unit::TestCase
                  ], d.filtered
   end
-  def test_out_of_sequence_field
+  def test_time_order_field
     d = create_driver %[
-      time_key time_key
-      out_of_order ooo_field
+      order_key order_field
+      <time>
+        key time_key
+      </time>
     ]
     time0 = Fluent::EventTime.new(1613910640)
@@ -94,12 +181,150 @@ class InfluxdbDeduplicationFilterTest < Test::Unit::TestCase
     end
     assert_equal [
-                   [time0, { "k1" => 0, "time_key" => 1613910640000000000, "ooo_field" => false }],
-                   [time1, { "k1" => 1, "time_key" => 1613910643000000000, "ooo_field" => false }],
-                   [time0, { "k1" => 2, "time_key" => 1613910640000000000, "ooo_field" => true }],
-                   [time1, { "k1" => 3, "time_key" => 1613910643000000001, "ooo_field" => false }],
-                   [time1, { "k1" => 4, "time_key" => 1613910643000000002, "ooo_field" => false }]
+                   [time0, { "k1" => 0, "time_key" => 1613910640000000000, "order_field" => true }],
+                   [time1, { "k1" => 1, "time_key" => 1613910643000000000, "order_field" => true }],
+                   [time0, { "k1" => 2, "time_key" => 1613910640000000000, "order_field" => false }],
+                   [time1, { "k1" => 3, "time_key" => 1613910643000000001, "order_field" => true }],
+                   [time1, { "k1" => 4, "time_key" => 1613910643000000002, "order_field" => true }]
                  ], d.filtered
   end
+  def test_time_max_sequence
+    d = create_driver %[
+      <time>
+        key time_key
+      </time>
+    ]
+    time0 = Fluent::EventTime.new(1613910640)
+    time1 = Fluent::EventTime.new(1613910641)
+    d.run(default_tag: @tag) do
+      d.feed(time0, { "k1" => 0 })
+      d.instance.instance_variable_set(:@sequence, 999999998)
+      d.feed(time0, { "k1" => 1 })
+      d.feed(time0, { "k1" => 2 })
+      d.feed(time1, { "k1" => 3 })
+      d.feed(time1, { "k1" => 4 })
+    end
+    assert_equal [
+                   [time0, { "k1" => 0, "time_key" => 1613910640000000000 }],
+                   [time0, { "k1" => 1, "time_key" => 1613910640999999999 }],
+                   [time1, { "k1" => 3, "time_key" => 1613910641000000000 }],
+                   [time1, { "k1" => 4, "time_key" => 1613910641000000001 }]
+                 ], d.filtered
+  end
+  def test_tag_in_sequence
+    d = create_driver %[
+      <tag>
+        key tag_key
+      </tag>
+    ]
+    time0 = Fluent::EventTime.new(1613910640)
+    time1 = Fluent::EventTime.new(1613910643)
+    d.run(default_tag: @tag) do
+      d.feed(time0, { "k1" => 0 })
+      d.feed(time0, { "k1" => 1 })
+      d.feed(time0, { "k1" => 2 })
+      d.feed(time1, { "k1" => 3 })
+      d.feed(time1, { "k1" => 4 })
+    end
+    assert_equal d.instance.instance_variable_get(:@last_timestamp), 1613910643000000000
+    assert_equal [
+                   [time0, { "k1" => 0, "tag_key" => 0 }],
+                   [time0, { "k1" => 1, "tag_key" => 1 }],
+                   [time0, { "k1" => 2, "tag_key" => 2 }],
+                   [time1, { "k1" => 3, "tag_key" => 0 }],
+                   [time1, { "k1" => 4, "tag_key" => 1 }]
+                 ], d.filtered
+  end
+  def test_tag_in_sequence_integer_time
+    d = create_driver %[
+      <tag>
+        key tag_key
+      </tag>
+    ]
+    time0 = 1613910640
+    time1 = 1613910643
+    d.run(default_tag: @tag) do
+      d.feed(time0, { "k1" => 0 })
+      d.feed(time0, { "k1" => 1 })
+      d.feed(time0, { "k1" => 2 })
+      d.feed(time1, { "k1" => 3 })
+      d.feed(time1, { "k1" => 4 })
+    end
+    assert_equal d.instance.instance_variable_get(:@last_timestamp), 1613910643
+    assert_equal [
+                   [time0, { "k1" => 0, "tag_key" => 0 }],
+                   [time0, { "k1" => 1, "tag_key" => 1 }],
+                   [time0, { "k1" => 2, "tag_key" => 2 }],
+                   [time1, { "k1" => 3, "tag_key" => 0 }],
+                   [time1, { "k1" => 4, "tag_key" => 1 }]
+                 ], d.filtered
+  end
+  def test_tag_out_of_sequence_dropped
+    d = create_driver %[
+      <tag>
+        key tag_key
+      </tag>
+    ]
+    time0 = Fluent::EventTime.new(1613910640)
+    time1 = Fluent::EventTime.new(1613910643)
+    d.run(default_tag: @tag) do
+      d.feed(time0, { "k1" => 0 })
+      d.feed(time1, { "k1" => 1 })
+      d.feed(time0, { "k1" => 2 })
+      d.feed(time1, { "k1" => 3 })
+      d.feed(time1, { "k1" => 4 })
+    end
+    assert_equal [
+                   [time0, { "k1" => 0, "tag_key" => 0 }],
+                   [time1, { "k1" => 1, "tag_key" => 0 }],
+                   [time1, { "k1" => 3, "tag_key" => 1 }],
+                   [time1, { "k1" => 4, "tag_key" => 2 }]
+                 ], d.filtered
+  end
+  def test_tag_order_field
+    d = create_driver %[
+      order_key order_field
+      <tag>
+        key tag_key
+      </tag>
+    ]
+    time0 = Fluent::EventTime.new(1613910640)
+    time1 = Fluent::EventTime.new(1613910643)
+    d.run(default_tag: @tag) do
+      d.feed(time0, { "k1" => 0 })
+      d.feed(time1, { "k1" => 1 })
+      d.feed(time0, { "k1" => 2 })
+      d.feed(time1, { "k1" => 3 })
+      d.feed(time1, { "k1" => 4 })
+    end
+    assert_equal [
+                   [time0, { "k1" => 0, "tag_key" => 0, "order_field" => true }],
+                   [time1, { "k1" => 1, "tag_key" => 0, "order_field" => true }],
+                   [time0, { "k1" => 2, "order_field" => false }],
+                   [time1, { "k1" => 3, "tag_key" => 1, "order_field" => true }],
+                   [time1, { "k1" => 4, "tag_key" => 2, "order_field" => true }]
+                 ], d.filtered
+  end
 end

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: fluent-plugin-influxdb-deduplication
 version: !ruby/object:Gem::Version
-  version: 0.1.1
+  version: 0.2.0
 platform: ruby
 authors:
 - Marc Adams
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2021-02-25 00:00:00.000000000 Z
+date: 2021-03-02 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: fluentd