ruby-kafka 0.3.9 → 0.3.10
- checksums.yaml +4 -4
- data/CHANGELOG.md +7 -0
- data/ISSUE_TEMPLATE.md +23 -0
- data/README.md +49 -42
- data/lib/kafka/client.rb +42 -3
- data/lib/kafka/cluster.rb +40 -0
- data/lib/kafka/consumer.rb +5 -3
- data/lib/kafka/consumer_group.rb +1 -1
- data/lib/kafka/datadog.rb +2 -0
- data/lib/kafka/fetch_operation.rb +8 -1
- data/lib/kafka/offset_manager.rb +12 -2
- data/lib/kafka/version.rb +1 -1
- metadata +3 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c81a0d1df2ae4667c58001fd625b1df17916fad6
+  data.tar.gz: 6dcd1593cbe60056a675eca9b26a86972af41ac7
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 0255775eedb66b4fe2dac35df58854748746ebcebdba436f636105d99a59279211bd8f876fcf760d47fecd42041338991aacd0ced17c771791819a7a4a72267f
+  data.tar.gz: 7b9e20157c087717d6178a6cf5c466e011123c966e44ab384fa4b62c36c113a85bf7435a5e9e7cec2ecfa686b375e0ade50db8507a4589e7d4350d9c3a216456
data/CHANGELOG.md
CHANGED
@@ -4,6 +4,13 @@ Changes and additions to the library will be listed here.
 
 ## Unreleased
 
+## v0.3.10
+
+- Handle brokers becoming unavailable while in a consumer loop (#228).
+- Handle edge case when consuming from the end of a topic (#230).
+- Ensure the library can be loaded without Bundler (#224).
+- Add an API for fetching the last offset in a partition (#232).
+
 ## v0.3.9
 
 - Improve the default durability setting. The producer setting `required_acks` now defaults to `:all` (#210).
data/ISSUE_TEMPLATE.md
ADDED
@@ -0,0 +1,23 @@
+If this is a bug report, please fill out the following:
+
+* Version of Ruby:
+* Version of Kafka:
+* Version of ruby-kafka:
+
+Please verify that the problem you're seeing hasn't been fixed by the current `master` of ruby-kafka.
+
+###### Steps to reproduce
+
+```ruby
+kafka = Kafka.new(...)
+
+# Please write an example that reproduces the problem you're describing.
+```
+
+###### Expected outcome
+
+What you thought would happen when running the example.
+
+###### Actual outcome
+
+What actually happened.
data/README.md
CHANGED
@@ -23,10 +23,11 @@ Although parts of this library work with Kafka 0.8 – specifically, the Produce
     7. [Compression](#compression)
     8. [Producing Messages from a Rails Application](#producing-messages-from-a-rails-application)
 3. [Consuming Messages from Kafka](#consuming-messages-from-kafka)
-    1. [Consumer
-    2. [
-    3. [
-    4. [
+    1. [Consumer Groups](#consumer-groups)
+    2. [Consumer Checkpointing](#consumer-checkpointing)
+    3. [Topic Subscriptions](#topic-subscriptions)
+    4. [Consuming Messages in Batches](#consuming-messages-in-batches)
+    5. [Balancing Throughput and Latency](#balancing-throughput-and-latency)
 4. [Thread Safety](#thread-safety)
 5. [Logging](#logging)
 6. [Instrumentation](#instrumentation)
@@ -110,52 +111,52 @@ kafka = Kafka.new(...)
 kafka.deliver_message("Hello, World!", topic: "greetings")
 ```
 
-This will write the message to a random partition in the `greetings` topic.
-
-#### Efficiently Producing Messages
-
-While `#deliver_message` works fine for infrequent writes, there are a number of downside:
-
-* Kafka is optimized for transmitting _batches_ of messages rather than individual messages, so there's a significant overhead and performance penalty in using the single-message API.
-* The message delivery can fail in a number of different ways, but this simplistic API does not provide automatic retries.
-* The message is not buffered, so if there is an error, it is lost.
-
-The Producer API solves all these problems and more:
+This will write the message to a random partition in the `greetings` topic. If you want to write to a _specific_ partition, pass the `partition` parameter:
 
 ```ruby
-
+# Will write to partition 42.
+kafka.deliver_message("Hello, World!", topic: "greetings", partition: 42)
 ```
 
-
+If you don't know exactly how many partitions are in the topic, or if you'd rather have some level of indirection, you can pass in `partition_key` instead. Two messages with the same partition key will always be assigned to the same partition. This is useful if you want to make sure all messages with a given attribute are always written to the same partition, e.g. all purchase events for a given customer id.
 
 ```ruby
-
+# Partition keys assign a partition deterministically.
+kafka.deliver_message("Hello, World!", topic: "greetings", partition_key: "hello")
 ```
 
-
+Kafka also supports _message keys_. When passed, a message key can be used instead of a partition key. The message key is written alongside the message value and can be read by consumers. Message keys in Kafka can be used for interesting things such as [Log Compaction](http://kafka.apache.org/documentation.html#compaction). See [Partitioning](#partitioning) for more information.
 
 ```ruby
-
+# Set a message key; the key will be used for partitioning since no explicit
+# `partition_key` is set.
+kafka.deliver_message("Hello, World!", key: "hello", topic: "greetings")
 ```
 
-If you need to control which partition a message should be assigned to, you can pass in the `partition` parameter.
 
-```ruby
-producer.produce("hello3", topic: "test-messages", partition: 1)
-```
+#### Efficiently Producing Messages
 
-
+While `#deliver_message` works fine for infrequent writes, there are a number of downsides:
 
-
-
-
+* Kafka is optimized for transmitting _batches_ of messages rather than individual messages, so there's a significant overhead and performance penalty in using the single-message API.
+* The message delivery can fail in a number of different ways, but this simplistic API does not provide automatic retries.
+* The message is not buffered, so if there is an error, it is lost.
 
-
+The Producer API solves all these problems and more:
 
 ```ruby
+# Instantiate a new producer.
+producer = kafka.producer
+
+# Add a message to the producer buffer.
+producer.produce("hello1", topic: "test-messages")
+
+# Deliver the messages to Kafka.
 producer.deliver_messages
 ```
 
+`#produce` will buffer the message in the producer but will _not_ actually send it to the Kafka cluster. Buffered messages are only delivered to the Kafka cluster once `#deliver_messages` is called. Since messages may be destined for different partitions, this could involve writing to more than one Kafka broker. Note that a failure to send all buffered messages after the configured number of retries will result in `Kafka::DeliveryFailed` being raised. This can be rescued and ignored; the messages will be kept in the buffer until the next attempt.
+
 Read the docs for [Kafka::Producer](http://www.rubydoc.info/gems/ruby-kafka/Kafka/Producer) for more details.
 
 #### Asynchronously Producing Messages
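The added paragraph above notes that `Kafka::DeliveryFailed` can be rescued and that failed messages stay in the buffer. A minimal sketch of that pattern (broker address and topic are placeholders; a real application would bound its retries):

```ruby
require "kafka"

kafka = Kafka.new(seed_brokers: ["kafka1:9092"])
producer = kafka.producer

producer.produce("hello", topic: "greetings")

begin
  # Attempt to flush the buffered messages to the cluster.
  producer.deliver_messages
rescue Kafka::DeliveryFailed
  # The failed messages are retained in the buffer, so retrying is safe.
  # A production app would cap the number of attempts instead of looping.
  sleep 1
  retry
end
```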
@@ -291,15 +292,23 @@ The producer is designed for resilience in the face of temporary network errors,
 
 Typically, you'd configure the producer to retry failed attempts at sending messages, but sometimes all retries are exhausted. In that case, `Kafka::DeliveryFailed` is raised from `Kafka::Producer#deliver_messages`. If you wish to have your application be resilient to this happening (e.g. if you're logging to Kafka from a web application) you can rescue this exception. The failed messages are still retained in the buffer, so a subsequent call to `#deliver_messages` will still attempt to send them.
 
-Note that there's a maximum buffer size;
+Note that there's a maximum buffer size; by default, it's set to 1,000 messages and 10MB. It's possible to configure both these numbers:
+
+```ruby
+producer = kafka.producer(
+  max_buffer_size: 5_000, # Allow at most 5K messages to be buffered.
+  max_buffer_bytesize: 100_000_000, # Allow at most 100MB to be buffered.
+  ...
+)
+```
 
-A final note on buffers: local buffers give resilience against broker and network failures, and allow higher throughput due to message batching, but they also trade off consistency guarantees for higher
+A final note on buffers: local buffers give resilience against broker and network failures, and allow higher throughput due to message batching, but they also trade off consistency guarantees for higher availability and resilience. If your local process dies while messages are buffered, those messages will be lost. If you require high levels of consistency, you should call `#deliver_messages` immediately after `#produce`.
 
 #### Message Durability
 
-Once the client has delivered a set of messages to a Kafka broker the broker will forward them to its replicas, thus ensuring that a single broker failure will not result in message loss. However, the client can choose _when the leader acknowledges the write_. At one extreme, the client can choose fire-and-forget delivery, not even bothering to check whether the messages have been acknowledged. At the other end, the client can ask the broker to wait until _all_ its replicas have acknowledged the write before returning. This is the safest option, and the default. It's also possible to have the broker return as soon as it has written the messages to its own log but before the replicas have done so. This leaves a window of time where a failure of the leader will result in the messages being lost, although this should not be a common
+Once the client has delivered a set of messages to a Kafka broker the broker will forward them to its replicas, thus ensuring that a single broker failure will not result in message loss. However, the client can choose _when the leader acknowledges the write_. At one extreme, the client can choose fire-and-forget delivery, not even bothering to check whether the messages have been acknowledged. At the other end, the client can ask the broker to wait until _all_ its replicas have acknowledged the write before returning. This is the safest option, and the default. It's also possible to have the broker return as soon as it has written the messages to its own log but before the replicas have done so. This leaves a window of time where a failure of the leader will result in the messages being lost, although this should not be a common occurrence.
 
-Write latency and throughput are
+Write latency and throughput are negatively impacted by having more replicas acknowledge a write, so if you require low-latency, high throughput writes you may want to accept lower durability.
 
 This behavior is controlled by the `required_acks` option to `#producer` and `#async_producer`:
 
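A sketch of the acknowledgement levels described in the hunk above (`kafka` is assumed to be a `Kafka.new(...)` client; pick exactly one level per producer):

```ruby
# Wait for all in-sync replicas to acknowledge (the default, most durable).
producer = kafka.producer(required_acks: :all)

# Wait only for the partition leader's acknowledgement (faster, less durable).
producer = kafka.producer(required_acks: 1)

# Fire and forget (fastest, no delivery guarantee at all).
producer = kafka.producer(required_acks: 0)
```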
@@ -343,7 +352,7 @@ producer.deliver_messages
 That is, once `#deliver_messages` returns we can be sure that Kafka has received the message. Note that there are some big caveats here:
 
 - Depending on how your cluster and topic is configured the message could still be lost by Kafka.
-- If you configure the producer to not require acknowledgements from the Kafka brokers by setting `required_acks` to zero there is no guarantee that the
+- If you configure the producer to not require acknowledgements from the Kafka brokers by setting `required_acks` to zero there is no guarantee that the message will ever make it to a Kafka broker.
 - If you use the asynchronous producer there's no guarantee that messages will have been delivered after `#deliver_messages` returns. A way of blocking until a message has been delivered with the asynchronous producer may be implemented in the future.
 
 It's possible to improve your chances of success when calling `#deliver_messages`, at the price of a longer max latency:
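A hedged sketch of the retry tuning this line refers to; `max_retries` and `retry_backoff` are existing `#producer` options, and the values here are illustrative only:

```ruby
producer = kafka.producer(
  required_acks: :all,
  max_retries: 5,    # Retry a failed delivery up to 5 times before raising Kafka::DeliveryFailed.
  retry_backoff: 5,  # Wait 5 seconds between attempts.
)
```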
@@ -435,25 +444,23 @@ end
 
 **Warning:** The Consumer API is still alpha level and will likely change. The consumer code should not be considered stable, as it hasn't been exhaustively tested in production environments yet.
 
-
+Consuming messages from a Kafka topic is simple:
 
 ```ruby
 require "kafka"
 
 kafka = Kafka.new(seed_brokers: ["kafka1:9092", "kafka2:9092"])
 
-
-
-messages.each do |message|
+kafka.each_message(topic: "greetings") do |message|
   puts message.offset, message.key, message.value
 end
 ```
 
 While this is great for extremely simple use cases, there are a number of downsides:
 
-- You can only fetch from a single topic
+- You can only fetch from a single topic at a time.
 - If you want to have multiple processes consume from the same topic, there's no way of coordinating which processes should fetch from which partitions.
-- If
+- If the process dies, there's no way to have another process resume fetching from the point in the partition that the original process had reached.
 
 
 #### Consumer Groups
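The Consumer Groups section introduced here addresses exactly those downsides. A minimal usage sketch (brokers, group id, and topic are placeholders):

```ruby
require "kafka"

kafka = Kafka.new(seed_brokers: ["kafka1:9092", "kafka2:9092"])

# Consumers sharing the same group id coordinate partition assignment
# between themselves.
consumer = kafka.consumer(group_id: "greetings-group")
consumer.subscribe("greetings")

consumer.each_message do |message|
  puts message.topic, message.partition, message.offset, message.value
end
```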
@@ -487,7 +494,7 @@ Each consumer process will be assigned one or more partitions from each topic th
 
 In order to be able to resume processing after a consumer crashes, each consumer will periodically _checkpoint_ its position within each partition it reads from. Since each partition has a monotonically increasing sequence of message offsets, this works by _committing_ the offset of the last message that was processed in a given partition. Kafka handles these commits and allows another consumer in a group to resume from the last commit when a member crashes or becomes unresponsive.
 
-By default, offsets are committed every 10 seconds. You can increase the frequency, known as the _offset commit interval_, to limit the duration of double-processing scenarios, at the cost of a lower throughput due to the added coordination. If you want to improve
+By default, offsets are committed every 10 seconds. You can increase the frequency, known as the _offset commit interval_, to limit the duration of double-processing scenarios, at the cost of a lower throughput due to the added coordination. If you want to improve throughput, and double-processing is of less concern to you, then you can decrease the frequency.
 
 In addition to the time based trigger it's possible to trigger checkpointing in response to _n_ messages having been processed, known as the _offset commit threshold_. This puts a bound on the number of messages that can be double-processed before the problem is detected. Setting this to 1 will cause an offset commit to take place every time a message has been processed. By default this trigger is disabled.
 
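A sketch of tuning the two checkpointing triggers described above via the `#consumer` options (the values are illustrative):

```ruby
consumer = kafka.consumer(
  group_id: "greetings-group",
  offset_commit_interval: 5,    # Commit offsets every 5 seconds...
  offset_commit_threshold: 100, # ...or after every 100 processed messages.
)
```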
@@ -696,7 +703,7 @@ After checking out the repo, run `bin/setup` to install dependencies. Then, run
 
 ## Roadmap
 
-The current stable release is v0.
+The current stable release is v0.3. This release is running in production at Zendesk, but it's still not recommended that you use it when data loss is unacceptable. It will take a little while until all edge cases have been uncovered and handled.
 
 ### v0.4
 
data/lib/kafka/client.rb
CHANGED
@@ -1,4 +1,5 @@
 require "openssl"
+require "uri"
 
 require "kafka/cluster"
 require "kafka/producer"
@@ -215,6 +216,7 @@ module Kafka
       )
 
       offset_manager = OffsetManager.new(
+        cluster: cluster,
        group: group,
        logger: @logger,
        commit_interval: offset_commit_interval,
@@ -311,9 +313,32 @@ module Kafka
       operation.execute.flat_map {|batch| batch.messages }
     end
 
-    #
-
-
+    # Enumerate all messages in a topic.
+    #
+    # @param topic [String] the topic to consume messages from.
+    #
+    # @param start_from_beginning [Boolean] whether to start from the beginning
+    #   of the topic or just subscribe to new messages being produced. This
+    #   only applies when first consuming a topic partition – once the consumer
+    #   has checkpointed its progress, it will always resume from the last
+    #   checkpoint.
+    #
+    # @param max_wait_time [Integer] the maximum amount of time to wait before
+    #   the server responds, in seconds.
+    #
+    # @param min_bytes [Integer] the minimum number of bytes to wait for. If set to
+    #   zero, the broker will respond immediately, but the response may be empty.
+    #   The default is 1 byte, which means that the broker will respond as soon as
+    #   a message is written to the partition.
+    #
+    # @param max_bytes [Integer] the maximum number of bytes to include in the
+    #   response message set. Default is 1 MB. You need to set this higher if you
+    #   expect messages to be larger than this.
+    #
+    # @return [nil]
+    def each_message(topic:, start_from_beginning: true, max_wait_time: 5, min_bytes: 1, max_bytes: 1048576, &block)
+      default_offset ||= start_from_beginning ? :earliest : :latest
+      offsets = Hash.new { default_offset }
 
       loop do
         operation = FetchOperation.new(
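A usage sketch for the `#each_message` API documented in this hunk (broker and topic are placeholders):

```ruby
require "kafka"

kafka = Kafka.new(seed_brokers: ["kafka1:9092"])

# Tail the topic rather than replaying it from the start, waiting at most
# 10 seconds for at least 1 KB of data per fetch.
kafka.each_message(
  topic: "greetings",
  start_from_beginning: false,
  max_wait_time: 10,
  min_bytes: 1024
) do |message|
  puts message.offset, message.value
end
```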
@@ -341,6 +366,7 @@ module Kafka
     #
     # @return [Array<String>] the list of topic names.
     def topics
+      @cluster.clear_target_topics
       @cluster.topics
     end
 
@@ -352,6 +378,19 @@ module Kafka
       @cluster.partitions_for(topic).count
     end
 
+    # Retrieve the offset of the last message in a partition. If there are no
+    # messages in the partition -1 is returned.
+    #
+    # @param topic [String]
+    # @param partition [Integer]
+    # @return [Integer] the offset of the last message in the partition, or -1 if
+    #   there are no messages in the partition.
+    def last_offset_for(topic, partition)
+      # The offset resolution API will return the offset of the "next" message to
+      # be written when resolving the "latest" offset, so we subtract one.
+      @cluster.resolve_offset(topic, partition, :latest) - 1
+    end
+
     # Closes all connections to the Kafka brokers and frees up used resources.
     #
     # @return [nil]
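A usage sketch for the new `#last_offset_for` API added by #232 (topic and partition are placeholders):

```ruby
require "kafka"

kafka = Kafka.new(seed_brokers: ["kafka1:9092"])

offset = kafka.last_offset_for("greetings", 0)

if offset == -1
  puts "partition 0 of greetings is empty"
else
  puts "last message in partition 0 has offset #{offset}"
end
```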
data/lib/kafka/cluster.rb
CHANGED
@@ -32,6 +32,11 @@ module Kafka
       @target_topics = Set.new
     end
 
+    # Adds a list of topics to the target list. Only the topics on this list will
+    # be queried for metadata.
+    #
+    # @param topics [Array<String>]
+    # @return [nil]
     def add_target_topics(topics)
       new_topics = Set.new(topics) - @target_topics
 
@@ -44,6 +49,15 @@ module Kafka
       end
     end
 
+    # Clears the list of target topics.
+    #
+    # @see #add_target_topics
+    # @return [nil]
+    def clear_target_topics
+      @target_topics.clear
+      refresh_metadata!
+    end
+
     def mark_as_stale!
       @stale = true
     end
@@ -105,6 +119,32 @@ module Kafka
       raise
     end
 
+    def resolve_offset(topic, partition, offset)
+      add_target_topics([topic])
+      refresh_metadata_if_necessary!
+      broker = get_leader(topic, partition)
+
+      if offset == :earliest
+        offset = -2
+      elsif offset == :latest
+        offset = -1
+      end
+
+      response = broker.list_offsets(
+        topics: {
+          topic => [
+            {
+              partition: partition,
+              time: offset,
+              max_offsets: 1,
+            }
+          ]
+        }
+      )
+
+      response.offset_for(topic, partition)
+    end
+
     def topics
       cluster_info.topics.map(&:topic_name)
     end
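A sketch of how `#resolve_offset` maps the symbolic offsets onto the sentinel values used by Kafka's offsets API. `Cluster` is internal, so `cluster` here is assumed to be obtained from the library's internals rather than the public API:

```ruby
# -2 asks the broker for the earliest available offset;
# -1 asks for the offset that the *next* message will be written to.
earliest = cluster.resolve_offset("greetings", 0, :earliest)
latest   = cluster.resolve_offset("greetings", 0, :latest)

# This is why Client#last_offset_for subtracts one from the :latest result.
last = latest - 1
```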
data/lib/kafka/consumer.rb
CHANGED
@@ -60,8 +60,8 @@ module Kafka
     #
     # Typically you either want to start reading messages from the very
     # beginning of the topic's partitions or you simply want to wait for new
-    # messages to be written. In the former case, set `
-    #
+    # messages to be written. In the former case, set `start_from_beginning`
+    # to true (the default); in the latter, set it to false.
     #
     # @param topic [String] the name of the topic to subscribe to.
     # @param default_offset [Symbol] whether to start from the beginning or the
@@ -189,8 +189,10 @@ module Kafka
       while @running
         begin
           yield
-        rescue HeartbeatError, OffsetCommitError
+        rescue HeartbeatError, OffsetCommitError
           join_group
+        rescue FetchError
+          @cluster.mark_as_stale!
         rescue LeaderNotAvailable => e
           @logger.error "Leader not available; waiting 1s before retrying"
           sleep 1
data/lib/kafka/consumer_group.rb
CHANGED
@@ -71,7 +71,7 @@ module Kafka
         Protocol.handle_error(error_code)
       end
     end
-  rescue
+  rescue Kafka::Error => e
     @logger.error "Error committing offsets: #{e}"
     raise OffsetCommitError, e
   end
data/lib/kafka/datadog.rb
CHANGED
@@ -93,6 +93,7 @@ module Kafka
 
       tags = {
         client: event.payload.fetch(:client_id),
+        group_id: event.payload.fetch(:group_id),
         topic: event.payload.fetch(:topic),
         partition: event.payload.fetch(:partition),
       }
@@ -112,6 +113,7 @@
 
       tags = {
        client: event.payload.fetch(:client_id),
+       group_id: event.payload.fetch(:group_id),
        topic: event.payload.fetch(:topic),
        partition: event.payload.fetch(:partition),
      }
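These hunks add a `group_id` tag to the consumer metrics. A minimal sketch of enabling the Datadog reporter (this assumes the `dogstatsd-ruby` gem is installed and a statsd agent is listening on its default address):

```ruby
require "kafka"

# Requiring this file subscribes to ruby-kafka's instrumentation events and
# emits them as metrics to the local Datadog statsd agent.
require "kafka/datadog"
```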
data/lib/kafka/fetch_operation.rb
CHANGED
@@ -69,7 +69,14 @@ module Kafka
 
       response.topics.flat_map {|fetched_topic|
         fetched_topic.partitions.map {|fetched_partition|
-
+          begin
+            Protocol.handle_error(fetched_partition.error_code)
+          rescue Kafka::Error => e
+            topic = fetched_topic.name
+            partition = fetched_partition.partition
+            @logger.error "Failed to fetch from #{topic}/#{partition}: #{e.message}"
+            raise e
+          end
 
           messages = fetched_partition.messages.map {|message|
             FetchedMessage.new(
data/lib/kafka/offset_manager.rb
CHANGED
@@ -1,6 +1,7 @@
 module Kafka
   class OffsetManager
-    def initialize(group:, logger:, commit_interval:, commit_threshold:)
+    def initialize(cluster:, group:, logger:, commit_interval:, commit_threshold:)
+      @cluster = cluster
       @group = group
       @logger = logger
       @commit_interval = commit_interval
@@ -28,7 +29,16 @@ module Kafka
         committed_offset_for(topic, partition)
       }
 
-      offset
+      # A negative offset means that no offset has been committed, so we need to
+      # resolve the default offset for the topic.
+      if offset < 0
+        offset = @default_offsets.fetch(topic)
+        offset = @cluster.resolve_offset(topic, partition, offset)
+
+        # Make sure we commit this offset so that we don't have to resolve the
+        # default offset every time.
+        mark_as_processed(topic, partition, offset - 1)
+      end
 
       offset
     end
data/lib/kafka/version.rb
CHANGED
@@ -1,3 +1,3 @@
 module Kafka
-  VERSION = "0.3.9"
+  VERSION = "0.3.10"
 end
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: ruby-kafka
 version: !ruby/object:Gem::Version
-  version: 0.3.9
+  version: 0.3.10
 platform: ruby
 authors:
 - Daniel Schierbeck
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2016-
+date: 2016-07-12 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
@@ -208,6 +208,7 @@ files:
 - CHANGELOG.md
 - Gemfile
 - Gemfile.lock
+- ISSUE_TEMPLATE.md
 - LICENSE.txt
 - Procfile
 - README.md