ruby-kafka 0.1.5 → 0.1.6
- checksums.yaml +4 -4
- data/README.md +71 -6
- data/examples/simple-producer.rb +1 -1
- data/lib/kafka.rb +1 -1
- data/lib/kafka/async_producer.rb +181 -0
- data/lib/kafka/client.rb +13 -1
- data/lib/kafka/compression.rb +23 -0
- data/lib/kafka/fetch_operation.rb +2 -2
- data/lib/kafka/gzip_codec.rb +28 -0
- data/lib/kafka/produce_operation.rb +15 -4
- data/lib/kafka/producer.rb +9 -6
- data/lib/kafka/protocol/message.rb +47 -13
- data/lib/kafka/protocol/message_set.rb +47 -5
- data/lib/kafka/protocol/produce_request.rb +5 -31
- data/lib/kafka/snappy_codec.rb +20 -0
- data/lib/kafka/version.rb +1 -1
- data/ruby-kafka.gemspec +1 -0
- metadata +20 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 50a1c28cf71285d37c57e3dbe7a5d156f891576c
+  data.tar.gz: f8a139dc4061ec8a771f86c0a5ca280692ddd5cc
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 6d3245db50893aba63b50b600903dc0baf345017c940dcaa36b0e2fbe0f50fe231bc15f9af2dc083d4969fd63ba0493781570811a3ba4ba0e7df21b06e0fd993
+  data.tar.gz: d5cddd687cc85f02b0826e95205114e2f64cffd9dad48ef692e717dbb3f7cac41ce2b0afd71ee271e8590e27b9eac42d8f67f247c71fe760dec2915191361273
data/README.md
CHANGED
@@ -24,7 +24,7 @@ Or install it yourself as:
 
 ## Usage
 
-Please see the [documentation site](http://www.rubydoc.info/gems/ruby-kafka) for detailed documentation on the latest release.
+Please see the [documentation site](http://www.rubydoc.info/gems/ruby-kafka) for detailed documentation on the latest release. Note that the documentation on GitHub may not match the version of the library you're using – many changes are still being made to the API.
 
 ### Producing Messages to Kafka
 
@@ -39,7 +39,7 @@ kafka = Kafka.new(seed_brokers: ["kafka1:9092", "kafka2:9092"])
 A producer buffers messages and sends them to the broker that is the leader of the partition a given message is assigned to.
 
 ```ruby
-producer = kafka.
+producer = kafka.producer
 ```
 
 `produce` will buffer the message in the producer but will _not_ actually send it to the Kafka cluster.
@@ -66,7 +66,7 @@ If you don't know exactly how many partitions are in the topic, or you'd rather
 producer.produce("hello4", topic: "test-messages", partition_key: "yo")
 ```
 
-`deliver_messages` will send the buffered messages to the cluster. Since messages may be destined for different partitions, this could involve writing to more than one Kafka broker. Note that a failure to send all buffered messages after the configured number of retries will result in `Kafka::
+`deliver_messages` will send the buffered messages to the cluster. Since messages may be destined for different partitions, this could involve writing to more than one Kafka broker. Note that a failure to send all buffered messages after the configured number of retries will result in `Kafka::DeliveryFailed` being raised. This can be rescued and ignored; the messages will be kept in the buffer until the next attempt.
 
 ```ruby
 producer.deliver_messages
@@ -74,6 +74,71 @@ producer.deliver_messages
 
 Read the docs for [Kafka::Producer](http://www.rubydoc.info/gems/ruby-kafka/Kafka/Producer) for more details.
 
+### Asynchronously Producing Messages
+
+A normal producer will block while `#deliver_messages` is sending messages to Kafka, possibly for tens of seconds or even minutes at a time, depending on your timeout and retry settings. Furthermore, you have to call `#deliver_messages` manually, with a frequency that balances batch size with message delay.
+
+In order to avoid blocking during message deliveries you can use the _asynchronous producer_ API. It is mostly similar to the synchronous API, with calls to `#produce` and `#deliver_messages`. The main difference is that rather than blocking, these calls will return immediately. The actual work will be done in a background thread, with the messages and operations being sent from the caller over a thread safe queue.
+
+```ruby
+# `#async_producer` will create a new asynchronous producer.
+producer = kafka.async_producer
+
+# The `#produce` API works as normal.
+producer.produce("hello", topic: "greetings")
+
+# `#deliver_messages` will return immediately.
+producer.deliver_messages
+
+# Make sure to call `#shutdown` on the producer in order to
+# avoid leaking resources.
+producer.shutdown
+```
+
+By default, the delivery policy will be the same as for a synchronous producer: only when `#deliver_messages` is called will the messages be delivered. However, the asynchronous producer offers two complementary policies for _automatic delivery_:
+
+1. Trigger a delivery once the producer's message buffer reaches a specified _threshold_. This can be used to improve efficiency by increasing the batch size when sending messages to the Kafka cluster.
+2. Trigger a delivery at a _fixed time interval_. This puts an upper bound on message delays.
+
+These policies can be used alone or in combination.
+
+```ruby
+# `#async_producer` will create a new asynchronous producer.
+producer = kafka.async_producer(
+  # Trigger a delivery once 100 messages have been buffered.
+  delivery_threshold: 100,
+
+  # Trigger a delivery every 30 seconds.
+  delivery_interval: 30,
+)
+
+producer.produce("hello", topic: "greetings")
+
+# ...
+```
+
+**Note:** if the calling thread produces messages faster than the producer can write them to Kafka, you'll eventually run into problems. The internal queue used for sending messages from the calling thread to the background worker has a size limit; once this limit is reached, a call to `#produce` will raise `Kafka::BufferOverflow`.
+
+### Serialization
+
+This library is agnostic to which serialization format you prefer. Both the value and key of a message are treated as binary strings of data. This makes it easier to use whatever serialization format you want, since you don't have to do anything special to make it work with ruby-kafka. Here's an example of encoding data with JSON:
+
+```ruby
+require "json"
+
+# ...
+
+event = {
+  "name" => "pageview",
+  "url" => "https://example.com/posts/123",
+  # ...
+}
+
+data = JSON.dump(event)
+
+producer.produce(data, topic: "events")
+```
+
 ### Partitioning
 
 Kafka topics are partitioned, with messages being assigned to a partition by the client. This allows a great deal of flexibility for the users. This section describes several strategies for partitioning and how they impact performance, data locality, etc.
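The `Kafka::BufferOverflow` behavior described in the note above suggests a simple backpressure pattern in the calling thread. A minimal sketch (not part of this diff; it assumes a `kafka` client instance):

```ruby
# Backpressure sketch: wait and retry instead of dropping the message
# when the async producer's internal queue is full.
producer = kafka.async_producer(delivery_threshold: 100)

begin
  producer.produce("hello", topic: "greetings")
rescue Kafka::BufferOverflow
  sleep 1
  retry
end
```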
@@ -137,9 +202,9 @@ producer.produce(event, topic: "events", partition: partition)
 
 The producer is designed for resilience in the face of temporary network errors, Kafka broker failovers, and other issues that prevent the client from writing messages to the destination topics. It does this by employing local, in-memory buffers. Only when messages are acknowledged by a Kafka broker will they be removed from the buffer.
 
-Typically, you'd configure the producer to retry failed attempts at sending messages, but sometimes all retries are exhausted. In that case, `Kafka::
+Typically, you'd configure the producer to retry failed attempts at sending messages, but sometimes all retries are exhausted. In that case, `Kafka::DeliveryFailed` is raised from `Kafka::Producer#deliver_messages`. If you wish to have your application be resilient to this happening (e.g. if you're logging to Kafka from a web application) you can rescue this exception. The failed messages are still retained in the buffer, so a subsequent call to `#deliver_messages` will still attempt to send them.
 
-Note that there's a maximum buffer size; pass in a different value for `max_buffer_size` when calling `#
+Note that there's a maximum buffer size; pass in a different value for `max_buffer_size` when calling `#producer` in order to configure this.
 
 A final note on buffers: local buffers give resilience against broker and network failures, and allow higher throughput due to message batching, but they also trade off consistency guarantees for higher availability and resilience. If your local process dies while messages are buffered, those messages will be lost. If you require high levels of consistency, you should call `#deliver_messages` immediately after `#produce`.
 
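The rescue pattern described above, as a minimal sketch (it assumes a `producer` built with `kafka.producer`):

```ruby
begin
  producer.deliver_messages
rescue Kafka::DeliveryFailed => e
  # Failed messages stay in the buffer, so a later call to
  # #deliver_messages will try them again.
  $stderr.puts "Delivery failed: #{e.message}"
end
```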
@@ -152,7 +217,7 @@ It's important to understand how timeouts work if you have a latency sensitive a
 * `connect_timeout` sets the number of seconds to wait while connecting to a broker for the first time. When ruby-kafka initializes, it needs to connect to at least one host in `seed_brokers` in order to discover the Kafka cluster. Each host is tried until there's one that works. Usually that means the first one, but if your entire cluster is down, or there's a network partition, you could wait up to `n * connect_timeout` seconds, where `n` is the number of seed brokers.
 * `socket_timeout` sets the number of seconds to wait when reading from or writing to a socket connection to a broker. After this timeout expires the connection will be killed. Note that some Kafka operations are by definition long-running, such as waiting for new messages to arrive in a partition, so don't set this value too low. When configuring timeouts relating to specific Kafka operations, make sure to make them shorter than this one.
 
-**Producer timeouts** can be configured when calling `#
+**Producer timeouts** can be configured when calling `#producer` on a client instance:
 
 * `ack_timeout` is a timeout executed by a broker when the client is sending messages to it. It defines the number of seconds the broker should wait for replicas to acknowledge the write before responding to the client with an error. As such, it relates to the `required_acks` setting. It should be set lower than `socket_timeout`.
 * `retry_backoff` configures the number of seconds to wait after a failed attempt to send messages to a Kafka broker before retrying. The `max_retries` setting defines the maximum number of retries to attempt, and so the total duration could be up to `max_retries * retry_backoff` seconds. The timeout can be arbitrarily long, and shouldn't be too short: if a broker goes down its partitions will be handed off to another broker, and that can take tens of seconds.
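Putting the timeout settings together, a sketch of a latency-conscious configuration. The `connect_timeout` and `socket_timeout` keywords to `Kafka.new` are an assumption based on this README; the producer options match the `Producer#initialize` signature later in this diff:

```ruby
# Assumed keywords for Kafka.new, based on this README's description.
kafka = Kafka.new(
  seed_brokers: ["kafka1:9092", "kafka2:9092"],
  connect_timeout: 10, # seconds per seed broker attempt
  socket_timeout: 30,  # seconds per socket read/write
)

producer = kafka.producer(
  ack_timeout: 5,   # broker-side wait for replica acks
  retry_backoff: 1, # seconds between retry attempts
  max_retries: 2,
)
```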
data/examples/simple-producer.rb
CHANGED
data/lib/kafka.rb
CHANGED
data/lib/kafka/async_producer.rb
ADDED
@@ -0,0 +1,181 @@
+require "thread"
+
+module Kafka
+
+  # A Kafka producer that does all its work in the background so as to not block
+  # the calling thread. Calls to {#deliver_messages} are asynchronous and return
+  # immediately.
+  #
+  # In addition to this property it's possible to define automatic delivery
+  # policies. These allow placing an upper bound on the number of buffered
+  # messages and the time between message deliveries.
+  #
+  # * If `delivery_threshold` is set to a value _n_ higher than zero, the producer
+  #   will automatically deliver its messages once its buffer size reaches _n_.
+  # * If `delivery_interval` is set to a value _n_ higher than zero, the producer
+  #   will automatically deliver its messages every _n_ seconds.
+  #
+  # By default, automatic delivery is disabled and you'll have to call
+  # {#deliver_messages} manually.
+  #
+  # The calling thread communicates with the background thread doing the actual
+  # work using a thread safe queue. While the background thread is busy delivering
+  # messages, new messages will be buffered in the queue. In order to avoid
+  # the queue growing uncontrollably in cases where the background thread gets
+  # stuck or can't follow the pace of the calling thread, there's a maximum
+  # number of messages that is allowed to be buffered. You can configure this
+  # value by setting `max_queue_size`.
+  #
+  # ## Example
+  #
+  #     producer = kafka.async_producer(
+  #       # Keep at most 1,000 messages in the buffer before delivering:
+  #       delivery_threshold: 1000,
+  #
+  #       # Deliver messages every 30 seconds:
+  #       delivery_interval: 30,
+  #     )
+  #
+  #     # There's no need to manually call #deliver_messages, it will happen
+  #     # automatically in the background.
+  #     producer.produce("hello", topic: "greetings")
+  #
+  #     # Remember to shut down the producer when you're done with it.
+  #     producer.shutdown
+  #
+  class AsyncProducer
+
+    # Initializes a new AsyncProducer.
+    #
+    # @param sync_producer [Kafka::Producer] the synchronous producer that should
+    #   be used in the background.
+    # @param max_queue_size [Integer] the maximum number of messages allowed in
+    #   the queue.
+    # @param delivery_threshold [Integer] if greater than zero, the number of
+    #   buffered messages that will automatically trigger a delivery.
+    # @param delivery_interval [Integer] if greater than zero, the number of
+    #   seconds between automatic message deliveries.
+    #
+    def initialize(sync_producer:, max_queue_size: 1000, delivery_threshold: 0, delivery_interval: 0)
+      raise ArgumentError unless max_queue_size > 0
+      raise ArgumentError unless delivery_threshold >= 0
+      raise ArgumentError unless delivery_interval >= 0
+
+      @queue = Queue.new
+      @max_queue_size = max_queue_size
+
+      @worker_thread = Thread.new do
+        worker = Worker.new(
+          queue: @queue,
+          producer: sync_producer,
+          delivery_threshold: delivery_threshold,
+        )
+
+        worker.run
+      end
+
+      @worker_thread.abort_on_exception = true
+
+      if delivery_interval > 0
+        Thread.new do
+          Timer.new(queue: @queue, interval: delivery_interval).run
+        end
+      end
+    end
+
+    # Produces a message to the specified topic.
+    #
+    # @see Kafka::Producer#produce
+    # @param (see Kafka::Producer#produce)
+    # @raise [BufferOverflow] if the message queue is full.
+    # @return [nil]
+    def produce(*args)
+      raise BufferOverflow if @queue.size >= @max_queue_size
+      @queue << [:produce, args]
+
+      nil
+    end
+
+    # Asynchronously delivers the buffered messages. This method will return
+    # immediately and the actual work will be done in the background.
+    #
+    # @see Kafka::Producer#deliver_messages
+    # @return [nil]
+    def deliver_messages
+      @queue << [:deliver_messages, nil]
+
+      nil
+    end
+
+    # Shuts down the producer, releasing the network resources used. This
+    # method will block until the buffered messages have been delivered.
+    #
+    # @see Kafka::Producer#shutdown
+    # @return [nil]
+    def shutdown
+      @queue << [:shutdown, nil]
+      @worker_thread.join
+
+      nil
+    end
+
+    class Timer
+      def initialize(interval:, queue:)
+        @queue = queue
+        @interval = interval
+      end
+
+      def run
+        loop do
+          sleep(@interval)
+          @queue << [:deliver_messages, nil]
+        end
+      end
+    end
+
+    class Worker
+      def initialize(queue:, producer:, delivery_threshold:)
+        @queue = queue
+        @producer = producer
+        @delivery_threshold = delivery_threshold
+      end
+
+      def run
+        loop do
+          operation, payload = @queue.pop
+
+          case operation
+          when :produce
+            @producer.produce(*payload)
+            deliver_messages if threshold_reached?
+          when :deliver_messages
+            deliver_messages
+          when :shutdown
+            # Deliver any pending messages first.
+            deliver_messages
+
+            # Stop the run loop.
+            break
+          else
+            raise "Unknown operation #{operation.inspect}"
+          end
+        end
+      ensure
+        @producer.shutdown
+      end
+
+      private
+
+      def deliver_messages
+        @producer.deliver_messages
+      rescue DeliveryFailed
+        # Delivery failed.
+      end
+
+      def threshold_reached?
+        @delivery_threshold > 0 &&
+          @producer.buffer_size >= @delivery_threshold
+      end
+    end
+  end
+end
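For reference, the class above is normally built via `Kafka::Client#async_producer` (added later in this diff), but it can also be constructed directly. A sketch, assuming a `kafka` client instance:

```ruby
sync_producer = kafka.producer

producer = Kafka::AsyncProducer.new(
  sync_producer: sync_producer,
  delivery_threshold: 100, # deliver once 100 messages are buffered
  delivery_interval: 30,   # ...or every 30 seconds, whichever comes first
)

producer.produce("hello", topic: "greetings")
producer.shutdown # drains the queue, delivers, and joins the worker thread
```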
data/lib/kafka/client.rb
CHANGED
@@ -1,5 +1,6 @@
 require "kafka/cluster"
 require "kafka/producer"
+require "kafka/async_producer"
 require "kafka/fetched_message"
 require "kafka/fetch_operation"
 
@@ -47,10 +48,21 @@ module Kafka
     #
     # @see Producer#initialize
     # @return [Kafka::Producer] the Kafka producer.
-    def
+    def producer(**options)
       Producer.new(cluster: @cluster, logger: @logger, **options)
     end
 
+    def async_producer(delivery_interval: 0, delivery_threshold: 0, max_queue_size: 1000, **options)
+      sync_producer = producer(**options)
+
+      AsyncProducer.new(
+        sync_producer: sync_producer,
+        delivery_interval: delivery_interval,
+        delivery_threshold: delivery_threshold,
+        max_queue_size: max_queue_size,
+      )
+    end
+
     # Fetches a batch of messages from a single partition. Note that it's possible
     # to get back empty batches.
     #
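Note how `#async_producer` forwards any remaining `**options` to `#producer`; a short sketch of that split:

```ruby
producer = kafka.async_producer(
  delivery_interval: 30,  # consumed by AsyncProducer
  max_buffer_size: 5_000, # passed through **options to Producer.new
)
```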
data/lib/kafka/compression.rb
ADDED
@@ -0,0 +1,23 @@
+require "kafka/snappy_codec"
+require "kafka/gzip_codec"
+
+module Kafka
+  module Compression
+    def self.find_codec(name)
+      case name
+      when nil then nil
+      when :snappy then SnappyCodec.new
+      when :gzip then GzipCodec.new
+      else raise "Unknown compression codec #{name}"
+      end
+    end
+
+    def self.find_codec_by_id(codec_id)
+      case codec_id
+      when 1 then GzipCodec.new
+      when 2 then SnappyCodec.new
+      else raise "Unknown codec id #{codec_id}"
+      end
+    end
+  end
+end
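A sketch of how a codec found by name round-trips data (`:gzip` uses only the stdlib; `:snappy` needs the snappy gem):

```ruby
codec = Kafka::Compression.find_codec(:gzip)

compressed = codec.compress("hello " * 100)
codec.decompress(compressed) # => "hello hello ..."
codec.codec_id               # => 1, the id stored in message attributes
```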
data/lib/kafka/fetch_operation.rb
CHANGED
@@ -69,13 +69,13 @@ module Kafka
       fetched_topic.partitions.flat_map {|fetched_partition|
         Protocol.handle_error(fetched_partition.error_code)
 
-        fetched_partition.messages.map {|
+        fetched_partition.messages.map {|message|
           FetchedMessage.new(
             value: message.value,
             key: message.key,
             topic: fetched_topic.name,
             partition: fetched_partition.partition,
-            offset: offset,
+            offset: message.offset,
           )
         }
       }
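The fix above means each `FetchedMessage` now carries the offset decoded from the message itself rather than a local variable. A usage sketch; the exact keyword arguments to `Client#fetch_messages` are an assumption based on the client.rb comment earlier in this diff:

```ruby
# Keyword names are an assumption; see Client#fetch_messages.
messages = kafka.fetch_messages(topic: "greetings", partition: 0, offset: 0)

messages.each do |message|
  puts "#{message.offset}: #{message.value}"
end
```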
data/lib/kafka/gzip_codec.rb
ADDED
@@ -0,0 +1,28 @@
+module Kafka
+  class GzipCodec
+    def initialize
+      require "zlib"
+    end
+
+    def codec_id
+      1
+    end
+
+    def compress(data)
+      buffer = StringIO.new
+      buffer.set_encoding(Encoding::BINARY)
+
+      writer = Zlib::GzipWriter.new(buffer, Zlib::DEFAULT_COMPRESSION, Zlib::DEFAULT_STRATEGY)
+      writer.write(data)
+      writer.close
+
+      buffer.string
+    end
+
+    def decompress(data)
+      buffer = StringIO.new(data)
+      reader = Zlib::GzipReader.new(buffer)
+      reader.read
+    end
+  end
+end
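Direct use of the codec, as a sketch. The `BINARY` encoding on the buffer matters because compressed bytes are not valid text:

```ruby
codec = Kafka::GzipCodec.new

compressed = codec.compress("a" * 1_000)
compressed.bytesize               # => far fewer than 1,000 bytes
codec.decompress(compressed).size # => 1000
```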
data/lib/kafka/produce_operation.rb
CHANGED
@@ -1,3 +1,5 @@
+require "kafka/protocol/message_set"
+
 module Kafka
   # A produce operation attempts to send all messages in a buffer to the Kafka cluster.
   # Since topics and partitions are spread among all brokers in a cluster, this usually
@@ -23,11 +25,12 @@ module Kafka
   # * `sent_message_count` – the number of messages that were successfully sent.
   #
   class ProduceOperation
-    def initialize(cluster:, buffer:, required_acks:, ack_timeout:, logger:)
+    def initialize(cluster:, buffer:, compression_codec:, required_acks:, ack_timeout:, logger:)
       @cluster = cluster
       @buffer = buffer
       @required_acks = required_acks
       @ack_timeout = ack_timeout
+      @compression_codec = compression_codec
       @logger = logger
     end
 
@@ -67,12 +70,20 @@ module Kafka
         end
       end
 
-      messages_for_broker.each do |broker,
+      messages_for_broker.each do |broker, message_buffer|
         begin
-          @logger.info "Sending #{
+          @logger.info "Sending #{message_buffer.size} messages to #{broker}"
+
+          messages_for_topics = {}
+
+          message_buffer.each do |topic, partition, messages|
+            message_set = Protocol::MessageSet.new(messages: messages, compression_codec: @compression_codec)
+            messages_for_topics[topic] ||= {}
+            messages_for_topics[topic][partition] = message_set
+          end
 
           response = broker.produce(
-            messages_for_topics:
+            messages_for_topics: messages_for_topics,
             required_acks: @required_acks,
             timeout: @ack_timeout * 1000, # Kafka expects the timeout in milliseconds.
           )
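The grouping loop above builds a nested topic → partition → message set hash. A plain-Ruby sketch of that shape (no Kafka required; the real code wraps each message list in a `Protocol::MessageSet`):

```ruby
# The buffer yields [topic, partition, messages] triples:
message_buffer = [
  ["greetings", 0, ["hello"]],
  ["greetings", 1, ["world"]],
]

messages_for_topics = {}

message_buffer.each do |topic, partition, messages|
  messages_for_topics[topic] ||= {}
  messages_for_topics[topic][partition] = messages
end

p messages_for_topics
# => {"greetings"=>{0=>["hello"], 1=>["world"]}}
```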
data/lib/kafka/producer.rb
CHANGED
@@ -2,6 +2,7 @@ require "kafka/partitioner"
 require "kafka/message_buffer"
 require "kafka/produce_operation"
 require "kafka/pending_message"
+require "kafka/compression"
 
 module Kafka
 
@@ -14,11 +15,11 @@ module Kafka
   # kafka = Kafka.new(...)
   #
   # # Will instantiate Kafka::Producer
-  # producer = kafka.
+  # producer = kafka.producer
   #
   # This is done in order to share a logger as well as a pool of broker connections across
   # different producers. This also means that you don't need to pass the `cluster` and
-  # `logger` options to `#
+  # `logger` options to `#producer`. See {#initialize} for the list of other options
   # you can pass in.
   #
   # ## Buffering
@@ -77,7 +78,7 @@ module Kafka
   #   logger: logger,
   # )
   #
-  # producer = kafka.
+  # producer = kafka.producer
   #
   # begin
   #   $stdin.each_with_index do |line, index|
|
   # @param max_buffer_size [Integer] the number of messages allowed in the buffer
   #   before new writes will raise BufferOverflow exceptions.
   #
-  def initialize(cluster:, logger:, ack_timeout: 5, required_acks: 1, max_retries: 2, retry_backoff: 1, max_buffer_size: 1000)
+  def initialize(cluster:, logger:, compression_codec: nil, ack_timeout: 5, required_acks: 1, max_retries: 2, retry_backoff: 1, max_buffer_size: 1000)
     @cluster = cluster
     @logger = logger
     @required_acks = required_acks
|
     @max_retries = max_retries
     @retry_backoff = retry_backoff
     @max_buffer_size = max_buffer_size
+    @compression_codec = Compression.find_codec(compression_codec)
 
     # A buffer organized by topic/partition.
     @buffer = MessageBuffer.new
|
   # the writes. The `ack_timeout` setting places an upper bound on the amount of
   # time the call will block before failing.
   #
-  # @raise [
+  # @raise [DeliveryFailed] if not all messages could be successfully sent.
   # @return [nil]
   def deliver_messages
     # There's no need to do anything if the buffer is empty.
|
       buffer: @buffer,
       required_acks: @required_acks,
       ack_timeout: @ack_timeout,
+      compression_codec: @compression_codec,
       logger: @logger,
     )
 
@@ -268,7 +271,7 @@ module Kafka
     unless @buffer.empty?
       partitions = @buffer.map {|topic, partition, _| "#{topic}/#{partition}" }.join(", ")
 
-      raise
+      raise DeliveryFailed, "Failed to send messages to #{partitions}"
     end
   end
 
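With the new `compression_codec` option wired through above, enabling compression is a one-liner. A sketch, assuming a `kafka` client instance:

```ruby
producer = kafka.producer(compression_codec: :gzip)

producer.produce("a" * 10_000, topic: "logs")
producer.deliver_messages # the message set is gzipped on the wire
```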
data/lib/kafka/protocol/message.rb
CHANGED
@@ -16,43 +16,77 @@ module Kafka
     class Message
       MAGIC_BYTE = 0
 
-      attr_reader :key, :value, :attributes
+      attr_reader :key, :value, :attributes, :offset
 
-      def initialize(
+      def initialize(value:, key: nil, attributes: 0, offset: -1)
         @key = key
         @value = value
         @attributes = attributes
+        @offset = offset
       end
 
       def encode(encoder)
-        data =
-        crc = Zlib.crc32(data)
+        data = encode_with_crc
 
-        encoder.
-        encoder.
+        encoder.write_int64(offset)
+        encoder.write_bytes(data)
       end
 
       def ==(other)
-        @key == other.key &&
+        @key == other.key &&
+          @value == other.value &&
+          @attributes == other.attributes &&
+          @offset == other.offset
+      end
+
+      def compressed?
+        @attributes != 0
+      end
+
+      # @return [Kafka::Protocol::MessageSet]
+      def decompress
+        codec = Compression.find_codec_by_id(@attributes)
+
+        # For some weird reason we need to cut out the first 20 bytes.
+        data = codec.decompress(value)
+        message_set_decoder = Decoder.from_string(data)
+
+        MessageSet.decode(message_set_decoder)
       end
 
       def self.decode(decoder)
-
-
+        offset = decoder.int64
+        message_decoder = Decoder.from_string(decoder.bytes)
+
+        crc = message_decoder.int32
+        magic_byte = message_decoder.int8
 
         unless magic_byte == MAGIC_BYTE
           raise Kafka::Error, "Invalid magic byte: #{magic_byte}"
         end
 
-        attributes =
-        key =
-        value =
+        attributes = message_decoder.int8
+        key = message_decoder.bytes
+        value = message_decoder.bytes
 
-        new(key: key, value: value, attributes: attributes)
+        new(key: key, value: value, attributes: attributes, offset: offset)
       end
 
       private
 
+      def encode_with_crc
+        buffer = StringIO.new
+        encoder = Encoder.new(buffer)
+
+        data = encode_without_crc
+        crc = Zlib.crc32(data)
+
+        encoder.write_int32(crc)
+        encoder.write(data)
+
+        buffer.string
+      end
+
       def encode_without_crc
         buffer = StringIO.new
         encoder = Encoder.new(buffer)
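A sketch of round-tripping a message through these protocol classes; it assumes `Encoder.encode_with` and `Decoder.from_string` (both shown elsewhere in this diff) live under `Kafka::Protocol`:

```ruby
message = Kafka::Protocol::Message.new(value: "hello", key: "greeting")

# Encoder.encode_with writes the message (offset + CRC-framed payload)
# into a string; Decoder.from_string reads it back.
data = Kafka::Protocol::Encoder.encode_with(message)
copy = Kafka::Protocol::Message.decode(Kafka::Protocol::Decoder.from_string(data))

copy == message # => true, via the extended #== above
```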
data/lib/kafka/protocol/message_set.rb
CHANGED
@@ -3,23 +3,65 @@ module Kafka
     class MessageSet
       attr_reader :messages
 
-      def initialize(messages:)
+      def initialize(messages: [], compression_codec: nil)
         @messages = messages
+        @compression_codec = compression_codec
+      end
+
+      def ==(other)
+        messages == other.messages
+      end
+
+      def encode(encoder)
+        if @compression_codec.nil?
+          encode_without_compression(encoder)
+        else
+          encode_with_compression(encoder)
+        end
       end
 
       def self.decode(decoder)
         fetched_messages = []
 
         until decoder.eof?
-
-          message_decoder = Decoder.from_string(decoder.bytes)
-          message = Message.decode(message_decoder)
+          message = Message.decode(decoder)
 
-
+          if message.compressed?
+            wrapped_message_set = message.decompress
+            fetched_messages.concat(wrapped_message_set.messages)
+          else
+            fetched_messages << message
+          end
         end
 
         new(messages: fetched_messages)
       end
+
+      private
+
+      def encode_with_compression(encoder)
+        codec = @compression_codec
+
+        buffer = StringIO.new
+        encode_without_compression(Encoder.new(buffer))
+        data = codec.compress(buffer.string)
+
+        wrapper_message = Protocol::Message.new(
+          value: data,
+          attributes: codec.codec_id,
+        )
+
+        message_set = MessageSet.new(messages: [wrapper_message])
+        message_set.encode(encoder)
+      end
+
+      def encode_without_compression(encoder)
+        # Messages in a message set are *not* encoded as an array. Rather,
+        # they are written in sequence.
+        @messages.each do |message|
+          message.encode(encoder)
+        end
+      end
     end
   end
 end
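A sketch of what `encode_with_compression` produces: a set containing a single wrapper message whose value is the compressed inner set and whose attributes carry the codec id. Decoding transparently unwraps it, assuming these internals compose as shown in this diff:

```ruby
require "stringio"

inner = Kafka::Protocol::MessageSet.new(
  messages: [Kafka::Protocol::Message.new(value: "hello")],
  compression_codec: Kafka::Compression.find_codec(:gzip),
)

buffer = StringIO.new
inner.encode(Kafka::Protocol::Encoder.new(buffer))

decoded = Kafka::Protocol::MessageSet.decode(
  Kafka::Protocol::Decoder.from_string(buffer.string)
)
decoded.messages.first.value # => "hello"
```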
data/lib/kafka/protocol/produce_request.rb
CHANGED
@@ -59,44 +59,18 @@ module Kafka
       encoder.write_array(@messages_for_topics) do |topic, messages_for_partition|
         encoder.write_string(topic)
 
-        encoder.write_array(messages_for_partition) do |partition,
+        encoder.write_array(messages_for_partition) do |partition, message_set|
+          encoder.write_int32(partition)
+
           # When encoding the message set into the request, the bytesize of the message
-          # set must precede the actual
+          # set must precede the actual data. Therefore we need to encode the entire
           # message set into a separate buffer first.
-          encoded_message_set =
-
-          encoder.write_int32(partition)
+          encoded_message_set = Encoder.encode_with(message_set)
 
-          # When encoding bytes, the 32 bit size of the byte buffer is encoded first.
           encoder.write_bytes(encoded_message_set)
         end
       end
     end
-
-    private
-
-    def encode_message_set(messages)
-      buffer = StringIO.new
-      encoder = Encoder.new(buffer)
-
-      # Messages in a message set are *not* encoded as an array. Rather,
-      # they are written in sequence with only the byte size prepended.
-      messages.each do |message|
-        offset = -1 # offsets don't matter here.
-
-        # When encoding a message into a message set, the bytesize of the message must
-        # precede the actual bytes. Therefore we need to encode the message into a
-        # separate buffer first.
-        encoded_message = Encoder.encode_with(message)
-
-        encoder.write_int64(offset)
-
-        # When encoding bytes, the 32 bit size of the byte buffer is encoded first.
-        encoder.write_bytes(encoded_message)
-      end
-
-      buffer.string
-    end
   end
 end
data/lib/kafka/snappy_codec.rb
ADDED
@@ -0,0 +1,20 @@
+module Kafka
+  class SnappyCodec
+    def initialize
+      require "snappy"
+    end
+
+    def codec_id
+      2
+    end
+
+    def compress(data)
+      Snappy.deflate(data)
+    end
+
+    def decompress(data)
+      buffer = StringIO.new(data)
+      Snappy::Reader.new(buffer).read
+    end
+  end
+end
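Since the `require "snappy"` above runs lazily in the constructor, the gem (only a development dependency in this release, per the metadata diff below) is needed just when snappy is actually selected. A sketch of failing fast:

```ruby
begin
  Kafka::Compression.find_codec(:snappy) # triggers require "snappy"
rescue LoadError
  abort "Install the snappy gem to use snappy compression"
end
```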
data/lib/kafka/version.rb
CHANGED
data/ruby-kafka.gemspec
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: ruby-kafka
 version: !ruby/object:Gem::Version
-  version: 0.1.5
+  version: 0.1.6
 platform: ruby
 authors:
 - Daniel Schierbeck
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2016-02-
+date: 2016-02-22 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
@@ -128,6 +128,20 @@ dependencies:
   - - "<"
     - !ruby/object:Gem::Version
       version: '5.1'
+- !ruby/object:Gem::Dependency
+  name: snappy
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
 description: |-
   A client library for the Kafka distributed commit log.
 
@@ -151,13 +165,16 @@ files:
 - examples/simple-consumer.rb
 - examples/simple-producer.rb
 - lib/kafka.rb
+- lib/kafka/async_producer.rb
 - lib/kafka/broker.rb
 - lib/kafka/broker_pool.rb
 - lib/kafka/client.rb
 - lib/kafka/cluster.rb
+- lib/kafka/compression.rb
 - lib/kafka/connection.rb
 - lib/kafka/fetch_operation.rb
 - lib/kafka/fetched_message.rb
+- lib/kafka/gzip_codec.rb
 - lib/kafka/instrumentation.rb
 - lib/kafka/message_buffer.rb
 - lib/kafka/partitioner.rb
@@ -178,6 +195,7 @@ files:
 - lib/kafka/protocol/produce_response.rb
 - lib/kafka/protocol/request_message.rb
 - lib/kafka/protocol/topic_metadata_request.rb
+- lib/kafka/snappy_codec.rb
 - lib/kafka/socket_with_timeout.rb
 - lib/kafka/version.rb
 - lib/ruby-kafka.rb