ruby-kafka 0.1.7 → 0.2.0

Files changed (36)
  1. checksums.yaml +4 -4
  2. data/CHANGELOG.md +10 -0
  3. data/README.md +12 -1
  4. data/lib/kafka.rb +18 -0
  5. data/lib/kafka/broker.rb +42 -0
  6. data/lib/kafka/client.rb +35 -5
  7. data/lib/kafka/cluster.rb +30 -0
  8. data/lib/kafka/compressor.rb +59 -0
  9. data/lib/kafka/connection.rb +1 -0
  10. data/lib/kafka/consumer.rb +211 -0
  11. data/lib/kafka/consumer_group.rb +172 -0
  12. data/lib/kafka/fetch_operation.rb +2 -2
  13. data/lib/kafka/produce_operation.rb +4 -8
  14. data/lib/kafka/producer.rb +7 -5
  15. data/lib/kafka/protocol.rb +27 -0
  16. data/lib/kafka/protocol/consumer_group_protocol.rb +17 -0
  17. data/lib/kafka/protocol/group_coordinator_request.rb +21 -0
  18. data/lib/kafka/protocol/group_coordinator_response.rb +25 -0
  19. data/lib/kafka/protocol/heartbeat_request.rb +25 -0
  20. data/lib/kafka/protocol/heartbeat_response.rb +15 -0
  21. data/lib/kafka/protocol/join_group_request.rb +39 -0
  22. data/lib/kafka/protocol/join_group_response.rb +31 -0
  23. data/lib/kafka/protocol/leave_group_request.rb +23 -0
  24. data/lib/kafka/protocol/leave_group_response.rb +15 -0
  25. data/lib/kafka/protocol/member_assignment.rb +40 -0
  26. data/lib/kafka/protocol/message_set.rb +5 -37
  27. data/lib/kafka/protocol/metadata_response.rb +5 -1
  28. data/lib/kafka/protocol/offset_commit_request.rb +42 -0
  29. data/lib/kafka/protocol/offset_commit_response.rb +27 -0
  30. data/lib/kafka/protocol/offset_fetch_request.rb +34 -0
  31. data/lib/kafka/protocol/offset_fetch_response.rb +51 -0
  32. data/lib/kafka/protocol/sync_group_request.rb +31 -0
  33. data/lib/kafka/protocol/sync_group_response.rb +21 -0
  34. data/lib/kafka/round_robin_assignment_strategy.rb +40 -0
  35. data/lib/kafka/version.rb +1 -1
  36. metadata +23 -2
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: cc49e2526024ee3dd1cb52b5abfef38dbc2e37ea
- data.tar.gz: a3c0111bccd3e8daf83eec647baec2fa9921fbf8
+ metadata.gz: e58cc6609bed1f99291323a15a90e1a8433370a9
+ data.tar.gz: eca9c8f1efc71812b4d4d7580f0917a7921864f1
  SHA512:
- metadata.gz: 94547594275850a3bacd22758fc0ccce8956c83f0e6e7c46dd1c4a5fb4aa5b52d6199383862cb5087a90b89ee2e0ca63b486082b0d44da2e0994f6cc76e794a7
- data.tar.gz: 69d7522e0ecd82057fd4b5879273026e6669441e0803c8d28bee83213a66bdbdd13f08051d4b3bb0725c3f19ad9df4dedde7c4a6c0236fac56079b1a666c7ed7
+ metadata.gz: 7be411f4c72bd6de0154bbee286df4a0d67052af54f371fc643248cc6f6a8cb4760a2f4c18c47921a6031722be8c280502617fb27aa38423cf6add20667db992
+ data.tar.gz: cb721b17f832fc9c4485e6db40902e5ebdd2fbbcf62d24b27879ffa697455f38eb2ebc05355b37a413264a2452169052a35c3f0910312752a328f35d787371a2
data/CHANGELOG.md ADDED
@@ -0,0 +1,10 @@
+ # Changelog
+
+ Changes and additions to the library will be listed here.
+
+ ## Unreleased
+
+ ## v0.2.0
+
+ - Add instrumentation of message compression.
+ - **New!** Consumer API – still alpha level. Expect many changes.
data/README.md CHANGED
@@ -4,7 +4,7 @@

  A Ruby client library for [Apache Kafka](http://kafka.apache.org/), a distributed log and message bus. The focus of this library will be operational simplicity, with good logging and metrics that can make debugging issues easier.

- Currently, only the Producer API has been implemented, but a fully-fledged Consumer implementation compatible with Kafka 0.9 is on the roadmap.
+ The Producer API is currently beta level and used in production. There's an alpha level Consumer Group API that has not yet been used in production and that may change without warning. Feel free to try it out, but don't expect it to be stable or correct quite yet.

  ## Installation

@@ -208,6 +208,17 @@ Note that there's a maximum buffer size; pass in a different value for `max_buff

  A final note on buffers: local buffers give resilience against broker and network failures, and allow higher throughput due to message batching, but they also trade off consistency guarantees for higher availability and resilience. If your local process dies while messages are buffered, those messages will be lost. If you require high levels of consistency, you should call `#deliver_messages` immediately after `#produce`.

+ ### Logging
+
+ It's a very good idea to configure the Kafka client with a logger. All important operations and errors are logged. When instantiating your client, simply pass in a valid logger:
+
+ ```ruby
+ logger = Logger.new("log/kafka.log")
+ kafka = Kafka.new(logger: logger, ...)
+ ```
+
+ By default, nothing is logged.
+
  ### Understanding Timeouts

  It's important to understand how timeouts work if you have a latency sensitive application. This library allows configuring timeouts on different levels:
data/lib/kafka.rb CHANGED
@@ -59,6 +59,12 @@ module Kafka
  class OffsetMetadataTooLarge < ProtocolError
  end

+ class GroupCoordinatorNotAvailable < ProtocolError
+ end
+
+ class NotCoordinatorForGroup < ProtocolError
+ end
+
  # For a request which attempts to access an invalid topic (e.g. one which has
  # an illegal name), or if an attempt is made to write to an internal topic
  # (such as the consumer offsets topic).
@@ -89,6 +95,18 @@ module Kafka
  class ReplicaNotAvailable < ProtocolError
  end

+ class UnknownMemberId < ProtocolError
+ end
+
+ class RebalanceInProgress < ProtocolError
+ end
+
+ class IllegalGeneration < ProtocolError
+ end
+
+ class InvalidSessionTimeout < ProtocolError
+ end
+
  # Raised when there's a network connection error.
  class ConnectionError < Error
  end
data/lib/kafka/broker.rb CHANGED
@@ -64,5 +64,47 @@ module Kafka

  @connection.send_request(request)
  end
+
+ def fetch_offsets(**options)
+ request = Protocol::OffsetFetchRequest.new(**options)
+
+ @connection.send_request(request)
+ end
+
+ def commit_offsets(**options)
+ request = Protocol::OffsetCommitRequest.new(**options)
+
+ @connection.send_request(request)
+ end
+
+ def join_group(**options)
+ request = Protocol::JoinGroupRequest.new(**options)
+
+ @connection.send_request(request)
+ end
+
+ def sync_group(**options)
+ request = Protocol::SyncGroupRequest.new(**options)
+
+ @connection.send_request(request)
+ end
+
+ def leave_group(**options)
+ request = Protocol::LeaveGroupRequest.new(**options)
+
+ @connection.send_request(request)
+ end
+
+ def find_group_coordinator(**options)
+ request = Protocol::GroupCoordinatorRequest.new(**options)
+
+ @connection.send_request(request)
+ end
+
+ def heartbeat(**options)
+ request = Protocol::HeartbeatRequest.new(**options)
+
+ @connection.send_request(request)
+ end
  end
  end
data/lib/kafka/client.rb CHANGED
@@ -1,13 +1,12 @@
  require "kafka/cluster"
  require "kafka/producer"
+ require "kafka/consumer"
  require "kafka/async_producer"
  require "kafka/fetched_message"
  require "kafka/fetch_operation"

  module Kafka
  class Client
- DEFAULT_CLIENT_ID = "ruby-kafka"
- DEFAULT_LOGGER = Logger.new("/dev/null")

  # Initializes a new Kafka client.
  #
@@ -16,7 +15,7 @@ module Kafka
  #
  # @param client_id [String] the identifier for this application.
  #
- # @param logger [Logger]
+ # @param logger [Logger] the logger that should be used by the client.
  #
  # @param connect_timeout [Integer, nil] the timeout setting for connecting
  # to brokers. See {BrokerPool#initialize}.
@@ -25,8 +24,8 @@ module Kafka
  # connections. See {BrokerPool#initialize}.
  #
  # @return [Client]
- def initialize(seed_brokers:, client_id: DEFAULT_CLIENT_ID, logger: DEFAULT_LOGGER, connect_timeout: nil, socket_timeout: nil)
- @logger = logger
+ def initialize(seed_brokers:, client_id: "ruby-kafka", logger: nil, connect_timeout: nil, socket_timeout: nil)
+ @logger = logger || Logger.new("/dev/null")

  broker_pool = BrokerPool.new(
  client_id: client_id,
@@ -52,6 +51,20 @@ module Kafka
  Producer.new(cluster: @cluster, logger: @logger, **options)
  end

+ # Creates a new AsyncProducer instance.
+ #
+ # All parameters allowed by {#producer} can be passed. In addition to this,
+ # a few extra parameters can be passed when creating an async producer.
+ #
+ # @param max_queue_size [Integer] the maximum number of messages allowed in
+ # the queue.
+ # @param delivery_threshold [Integer] if greater than zero, the number of
+ # buffered messages that will automatically trigger a delivery.
+ # @param delivery_interval [Integer] if greater than zero, the number of
+ # seconds between automatic message deliveries.
+ #
+ # @see AsyncProducer
+ # @return [AsyncProducer]
  def async_producer(delivery_interval: 0, delivery_threshold: 0, max_queue_size: 1000, **options)
  sync_producer = producer(**options)

@@ -63,6 +76,20 @@ module Kafka
  )
  end

+ # Creates a new Consumer instance.
+ #
+ # `options` are passed to {Consumer#initialize}.
+ #
+ # @see Consumer
+ # @return [Consumer]
+ def consumer(**options)
+ Consumer.new(
+ cluster: @cluster,
+ logger: @logger,
+ **options,
+ )
+ end
+
  # Fetches a batch of messages from a single partition. Note that it's possible
  # to get back empty batches.
  #
@@ -152,6 +179,9 @@ module Kafka
  @cluster.partitions_for(topic).count
  end

+ # Closes all connections to the Kafka brokers and frees up used resources.
+ #
+ # @return [nil]
  def close
  @cluster.disconnect
  end
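Taken together, the new `#consumer` factory and the now-documented async producer options are used roughly like this. A minimal sketch based on the signatures above; the broker address, topic, and group id are placeholders:

```ruby
require "logger"
require "kafka"

kafka = Kafka.new(seed_brokers: ["kafka1:9092"], client_id: "my-app", logger: Logger.new($stderr))

# Async producer: deliver automatically after 100 buffered messages or every 10 seconds.
producer = kafka.async_producer(delivery_threshold: 100, delivery_interval: 10)
producer.produce("hello", topic: "greetings")

# Consumers are built through the client so they share its cluster and logger.
consumer = kafka.consumer(group_id: "greetings-group")
```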
data/lib/kafka/cluster.rb CHANGED
@@ -66,6 +66,36 @@ module Kafka
  connect_to_broker(get_leader_id(topic, partition))
  end

+ def get_group_coordinator(group_id:)
+ @logger.debug "Getting group coordinator for `#{group_id}`"
+
+ refresh_metadata_if_necessary!
+
+ cluster_info.brokers.each do |broker_info|
+ begin
+ broker = connect_to_broker(broker_info.node_id)
+ response = broker.find_group_coordinator(group_id: group_id)
+
+ Protocol.handle_error(response.error_code)
+
+ coordinator_id = response.coordinator_id
+ coordinator = connect_to_broker(coordinator_id)
+
+ @logger.debug "Coordinator for group `#{group_id}` is #{coordinator}"
+
+ return coordinator
+ rescue GroupCoordinatorNotAvailable
+ @logger.debug "Coordinator not available; retrying in 1s"
+ sleep 1
+ retry
+ rescue ConnectionError => e
+ @logger.error "Failed to get group coordinator info from #{broker}: #{e}"
+ end
+ end
+
+ raise Kafka::Error, "Failed to find group coordinator"
+ end
+
  def partitions_for(topic)
  add_target_topics([topic])
  cluster_info.partitions_for(topic)
data/lib/kafka/compressor.rb ADDED
@@ -0,0 +1,59 @@
+ require "kafka/compression"
+
+ module Kafka
+
+ # Compresses message sets using a specified codec.
+ #
+ # A message set is only compressed if its size meets the defined threshold.
+ #
+ # ## Instrumentation
+ #
+ # Whenever a message set is compressed, the notification
+ # `compress.compressor.kafka` will be emitted with the following payload:
+ #
+ # * `message_count` – the number of messages in the message set.
+ # * `uncompressed_bytesize` – the byte size of the original data.
+ # * `compressed_bytesize` – the byte size of the compressed data.
+ #
+ class Compressor
+
+ # @param codec_name [Symbol, nil]
+ # @param threshold [Integer] the minimum number of messages in a message set
+ # that will trigger compression.
+ def initialize(codec_name:, threshold:)
+ @codec = Compression.find_codec(codec_name)
+ @threshold = threshold
+ end
+
+ # @param message_set [Protocol::MessageSet]
+ # @return [Protocol::MessageSet]
+ def compress(message_set)
+ return message_set if @codec.nil? || message_set.size < @threshold
+
+ compressed_data = compress_data(message_set)
+
+ wrapper_message = Protocol::Message.new(
+ value: compressed_data,
+ attributes: @codec.codec_id,
+ )
+
+ Protocol::MessageSet.new(messages: [wrapper_message])
+ end
+
+ private
+
+ def compress_data(message_set)
+ data = Protocol::Encoder.encode_with(message_set)
+
+ Instrumentation.instrument("compress.compressor.kafka") do |notification|
+ compressed_data = @codec.compress(data)
+
+ notification[:message_count] = message_set.size
+ notification[:uncompressed_bytesize] = data.bytesize
+ notification[:compressed_bytesize] = compressed_data.bytesize
+
+ compressed_data
+ end
+ end
+ end
+ end
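Given the payload keys documented above, a subscriber for the compression notification might look like the sketch below. It assumes the `Instrumentation` wrapper is backed by `ActiveSupport::Notifications` (the usual way `*.kafka` notifications are consumed); the log line itself is purely illustrative:

```ruby
require "active_support/notifications"

ActiveSupport::Notifications.subscribe("compress.compressor.kafka") do |name, start, finish, id, payload|
  ratio = payload[:compressed_bytesize].to_f / payload[:uncompressed_bytesize]

  # Report how well the codec did on this message set.
  puts "Compressed #{payload[:message_count]} messages: " \
       "#{payload[:uncompressed_bytesize]} -> #{payload[:compressed_bytesize]} bytes " \
       "(ratio #{ratio.round(2)})"
end
```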
data/lib/kafka/connection.rb CHANGED
@@ -126,6 +126,7 @@ module Kafka

  message = Kafka::Protocol::RequestMessage.new(
  api_key: request.api_key,
+ api_version: request.respond_to?(:api_version) ? request.api_version : 0,
  correlation_id: @correlation_id,
  client_id: @client_id,
  request: request,
data/lib/kafka/consumer.rb ADDED
@@ -0,0 +1,211 @@
+ require "kafka/consumer_group"
+ require "kafka/fetch_operation"
+
+ module Kafka
+
+ # @note This code is still alpha level. Don't use this for anything important.
+ # The API may also change without warning.
+ #
+ # A client that consumes messages from a Kafka cluster in coordination with
+ # other clients.
+ #
+ # A Consumer subscribes to one or more Kafka topics; all consumers with the
+ # same *group id* then agree on who should read from the individual topic
+ # partitions. When group members join or leave, the group synchronizes,
+ # making sure that all partitions are assigned to a single member, and that
+ # all members have some partitions to read from.
+ #
+ # ## Example
+ #
+ # A simple consumer that writes the messages it consumes to the
+ # console.
+ #
+ # require "kafka"
+ #
+ # kafka = Kafka.new(seed_brokers: ["kafka1:9092", "kafka2:9092"])
+ #
+ # # Create a new Consumer instance in the group `my-group`:
+ # consumer = kafka.consumer(group_id: "my-group")
+ #
+ # # Subscribe to a Kafka topic:
+ # consumer.subscribe("messages")
+ #
+ # begin
+ # # Loop forever, reading in messages from all topics that have been
+ # # subscribed to.
+ # consumer.each_message do |message|
+ # puts message.topic
+ # puts message.partition
+ # puts message.key
+ # puts message.value
+ # puts message.offset
+ # end
+ # ensure
+ # # Make sure to shut down the consumer after use. This lets
+ # # the consumer notify the Kafka cluster that it's leaving
+ # # the group, causing a synchronization and re-balancing of
+ # # the group.
+ # consumer.shutdown
+ # end
+ #
+ class Consumer
+
+ # Creates a new Consumer.
+ #
+ # @param cluster [Kafka::Cluster]
+ # @param logger [Logger]
+ # @param group_id [String] the id of the group that the consumer should join.
+ # @param session_timeout [Integer] the interval between consumer heartbeats,
+ # in seconds.
+ def initialize(cluster:, logger:, group_id:, session_timeout: 30)
+ @cluster = cluster
+ @logger = logger
+ @group_id = group_id
+ @session_timeout = session_timeout
+
+ @group = ConsumerGroup.new(
+ cluster: cluster,
+ logger: logger,
+ group_id: group_id,
+ session_timeout: @session_timeout,
+ )
+
+ @offsets = {}
+ @default_offsets = {}
+ end
+
+ # Subscribes the consumer to a topic.
+ #
+ # Typically you either want to start reading messages from the very
+ # beginning of the topic's partitions or you simply want to wait for new
+ # messages to be written. In the former case, set `default_offset` to
+ # `:earliest` (the default); in the latter, set it to `:latest`.
+ #
+ # @param topic [String] the name of the topic to subscribe to.
+ # @param default_offset [Symbol] whether to start from the beginning or the
+ # end of the topic's partitions.
+ # @return [nil]
+ def subscribe(topic, default_offset: :earliest)
+ @group.subscribe(topic)
+ @default_offsets[topic] = default_offset
+
+ nil
+ end
+
+ # Fetches and enumerates the messages in the topics that the consumer group
+ # subscribes to.
+ #
+ # Each message is yielded to the provided block. If the block returns
+ # without raising an exception, the message will be considered successfully
+ # processed. At regular intervals the offset of the most recent successfully
+ # processed message in each partition will be committed to the Kafka
+ # offset store. If the consumer crashes or leaves the group, the group member
+ # that is tasked with taking over processing of these partitions will resume
+ # at the last committed offsets.
+ #
+ # @yieldparam message [Kafka::FetchedMessage] a message fetched from Kafka.
+ # @return [nil]
+ def each_message
+ loop do
+ begin
+ batch = fetch_batch
+
+ batch.each do |message|
+ yield message
+
+ send_heartbeat_if_necessary
+ mark_message_as_processed(message)
+ end
+ rescue ConnectionError => e
+ @logger.error "Connection error while fetching messages: #{e}"
+ else
+ commit_offsets unless batch.nil? || batch.empty?
+ end
+ end
+ end
+
+ # Shuts down the consumer.
+ #
+ # In order to quickly have the consumer group re-balance itself, it's
+ # important that members explicitly tell Kafka when they're leaving.
+ # Therefore it's a good idea to call this method whenever your consumer
+ # is about to quit. If this method is not called, it may take up to
+ # the amount of time defined by the `session_timeout` parameter for
+ # Kafka to realize that this consumer is no longer present and trigger
+ # a group re-balance. In that period of time, the partitions that used
+ # to be assigned to this consumer won't be processed.
+ #
+ # @return [nil]
+ def shutdown
+ @group.leave
+ end
+
+ private
+
+ def fetch_batch
+ @group.join unless @group.member?
+
+ @logger.debug "Fetching a batch of messages"
+
+ assigned_partitions = @group.assigned_partitions
+
+ send_heartbeat_if_necessary
+
+ raise "No partitions assigned!" if assigned_partitions.empty?
+
+ operation = FetchOperation.new(
+ cluster: @cluster,
+ logger: @logger,
+ min_bytes: 1,
+ max_wait_time: 5,
+ )
+
+ offset_response = @group.fetch_offsets
+
+ assigned_partitions.each do |topic, partitions|
+ partitions.each do |partition|
+ offset = @offsets.fetch(topic, {}).fetch(partition) {
+ offset_response.offset_for(topic, partition)
+ }
+
+ offset = @default_offsets.fetch(topic) if offset < 0
+
+ @logger.debug "Fetching from #{topic}/#{partition} starting at offset #{offset}"
+
+ operation.fetch_from_partition(topic, partition, offset: offset)
+ end
+ end
+
+ messages = operation.execute
+
+ @logger.debug "Fetched #{messages.count} messages"
+
+ messages
+ end
+
+ def commit_offsets
+ @logger.debug "Committing offsets"
+ @group.commit_offsets(@offsets)
+ end
+
+ # Sends a heartbeat if it would be necessary in order to avoid getting
+ # kicked out of the consumer group.
+ #
+ # Each consumer needs to send a heartbeat with a frequency defined by
+ # `session_timeout`.
+ #
+ def send_heartbeat_if_necessary
+ @last_heartbeat ||= Time.at(0)
+
+ if @last_heartbeat <= Time.now - @session_timeout + 2
+ @group.heartbeat
+ @last_heartbeat = Time.now
+ end
+ end
+
+ def mark_message_as_processed(message)
+ @offsets[message.topic] ||= {}
+ @offsets[message.topic][message.partition] = message.offset + 1
+ end
+ end
+ end
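Beyond the class-level example above, `#subscribe`'s `default_offset:` option and `#shutdown` combine naturally. A small sketch (topic and group names are placeholders) that only reads newly written messages and leaves the group cleanly:

```ruby
require "kafka"

kafka = Kafka.new(seed_brokers: ["kafka1:9092"], logger: Logger.new("log/kafka.log"))
consumer = kafka.consumer(group_id: "metrics-group")

# Ignore the topic's history and only wait for new messages.
consumer.subscribe("metrics", default_offset: :latest)

begin
  consumer.each_message do |message|
    puts "#{message.topic}/#{message.partition}@#{message.offset}: #{message.value}"
  end
ensure
  # Leave the group explicitly so the remaining members re-balance right away
  # instead of waiting out the session timeout.
  consumer.shutdown
end
```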