ruby-kafka 0.1.5 → 0.1.6
- checksums.yaml +4 -4
- data/README.md +71 -6
- data/examples/simple-producer.rb +1 -1
- data/lib/kafka.rb +1 -1
- data/lib/kafka/async_producer.rb +181 -0
- data/lib/kafka/client.rb +13 -1
- data/lib/kafka/compression.rb +23 -0
- data/lib/kafka/fetch_operation.rb +2 -2
- data/lib/kafka/gzip_codec.rb +28 -0
- data/lib/kafka/produce_operation.rb +15 -4
- data/lib/kafka/producer.rb +9 -6
- data/lib/kafka/protocol/message.rb +47 -13
- data/lib/kafka/protocol/message_set.rb +47 -5
- data/lib/kafka/protocol/produce_request.rb +5 -31
- data/lib/kafka/snappy_codec.rb +20 -0
- data/lib/kafka/version.rb +1 -1
- data/ruby-kafka.gemspec +1 -0
- metadata +20 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 50a1c28cf71285d37c57e3dbe7a5d156f891576c
+  data.tar.gz: f8a139dc4061ec8a771f86c0a5ca280692ddd5cc
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 6d3245db50893aba63b50b600903dc0baf345017c940dcaa36b0e2fbe0f50fe231bc15f9af2dc083d4969fd63ba0493781570811a3ba4ba0e7df21b06e0fd993
+  data.tar.gz: d5cddd687cc85f02b0826e95205114e2f64cffd9dad48ef692e717dbb3f7cac41ce2b0afd71ee271e8590e27b9eac42d8f67f247c71fe760dec2915191361273
data/README.md
CHANGED
@@ -24,7 +24,7 @@ Or install it yourself as:
 
 ## Usage
 
-Please see the [documentation site](http://www.rubydoc.info/gems/ruby-kafka) for detailed documentation on the latest release.
+Please see the [documentation site](http://www.rubydoc.info/gems/ruby-kafka) for detailed documentation on the latest release. Note that the documentation on GitHub may not match the version of the library you're using – many changes are still being made to the API.
 
 ### Producing Messages to Kafka
 
@@ -39,7 +39,7 @@ kafka = Kafka.new(seed_brokers: ["kafka1:9092", "kafka2:9092"])
 A producer buffers messages and sends them to the broker that is the leader of the partition a given message is assigned to.
 
 ```ruby
-producer = kafka.
+producer = kafka.producer
 ```
 
 `produce` will buffer the message in the producer but will _not_ actually send it to the Kafka cluster.
@@ -66,7 +66,7 @@ If you don't know exactly how many partitions are in the topic, or you'd rather
 producer.produce("hello4", topic: "test-messages", partition_key: "yo")
 ```
 
-`deliver_messages` will send the buffered messages to the cluster. Since messages may be destined for different partitions, this could involve writing to more than one Kafka broker. Note that a failure to send all buffered messages after the configured number of retries will result in `Kafka::
+`deliver_messages` will send the buffered messages to the cluster. Since messages may be destined for different partitions, this could involve writing to more than one Kafka broker. Note that a failure to send all buffered messages after the configured number of retries will result in `Kafka::DeliveryFailed` being raised. This can be rescued and ignored; the messages will be kept in the buffer until the next attempt.
 
 ```ruby
 producer.deliver_messages
@@ -74,6 +74,71 @@ producer.deliver_messages
 
 Read the docs for [Kafka::Producer](http://www.rubydoc.info/gems/ruby-kafka/Kafka/Producer) for more details.
 
+### Asynchronously Producing Messages
+
+A normal producer will block while `#deliver_messages` is sending messages to Kafka, possibly for tens of seconds or even minutes at a time, depending on your timeout and retry settings. Furthermore, you have to call `#deliver_messages` manually, with a frequency that balances batch size with message delay.
+
+In order to avoid blocking during message deliveries you can use the _asynchronous producer_ API. It is mostly similar to the synchronous API, with calls to `#produce` and `#deliver_messages`. The main difference is that rather than blocking, these calls will return immediately. The actual work will be done in a background thread, with the messages and operations being sent from the caller over a thread safe queue.
+
+```ruby
+# `#async_producer` will create a new asynchronous producer.
+producer = kafka.async_producer
+
+# The `#produce` API works as normal.
+producer.produce("hello", topic: "greetings")
+
+# `#deliver_messages` will return immediately.
+producer.deliver_messages
+
+# Make sure to call `#shutdown` on the producer in order to
+# avoid leaking resources.
+producer.shutdown
+```
+
+By default, the delivery policy will be the same as for a synchronous producer: only when `#deliver_messages` is called will the messages be delivered. However, the asynchronous producer offers two complementary policies for _automatic delivery_:
+
+1. Trigger a delivery once the producer's message buffer reaches a specified _threshold_. This can be used to improve efficiency by increasing the batch size when sending messages to the Kafka cluster.
+2. Trigger a delivery at a _fixed time interval_. This puts an upper bound on message delays.
+
+These policies can be used alone or in combination.
+
+```ruby
+# `#async_producer` will create a new asynchronous producer.
+producer = kafka.async_producer(
+  # Trigger a delivery once 100 messages have been buffered.
+  delivery_threshold: 100,
+
+  # Trigger a delivery every 30 seconds.
+  delivery_interval: 30,
+)
+
+producer.produce("hello", topic: "greetings")
+
+# ...
+```
+
+**Note:** if the calling thread produces messages faster than the producer can write them to Kafka, you'll eventually run into problems. The internal queue used for sending messages from the calling thread to the background worker has a size limit; once this limit is reached, a call to `#produce` will raise `Kafka::BufferOverflow`.
+
+### Serialization
+
+This library is agnostic to which serialization format you prefer. Both the value and key of a message are treated as binary strings of data. This makes it easier to use whatever serialization format you want, since you don't have to do anything special to make it work with ruby-kafka. Here's an example of encoding data with JSON:
+
+```ruby
+require "json"
+
+# ...
+
+event = {
+  "name" => "pageview",
+  "url" => "https://example.com/posts/123",
+  # ...
+}
+
+data = JSON.dump(event)
+
+producer.produce(data, topic: "events")
+```
+
 ### Partitioning
 
 Kafka topics are partitioned, with messages being assigned to a partition by the client. This allows a great deal of flexibility for the users. This section describes several strategies for partitioning and how they impact performance, data locality, etc.
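The `Kafka::BufferOverflow` behavior described in the note above suggests a simple backpressure pattern in the calling thread. A minimal sketch (not part of this diff; it assumes a `kafka` client instance):

```ruby
# Backpressure sketch: wait and retry instead of dropping the message
# when the async producer's internal queue is full.
producer = kafka.async_producer(delivery_threshold: 100)

begin
  producer.produce("hello", topic: "greetings")
rescue Kafka::BufferOverflow
  sleep 1
  retry
end
```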
@@ -137,9 +202,9 @@ producer.produce(event, topic: "events", partition: partition)
 
 The producer is designed for resilience in the face of temporary network errors, Kafka broker failovers, and other issues that prevent the client from writing messages to the destination topics. It does this by employing local, in-memory buffers. Only when messages are acknowledged by a Kafka broker will they be removed from the buffer.
 
-Typically, you'd configure the producer to retry failed attempts at sending messages, but sometimes all retries are exhausted. In that case, `Kafka::
+Typically, you'd configure the producer to retry failed attempts at sending messages, but sometimes all retries are exhausted. In that case, `Kafka::DeliveryFailed` is raised from `Kafka::Producer#deliver_messages`. If you wish to have your application be resilient to this happening (e.g. if you're logging to Kafka from a web application) you can rescue this exception. The failed messages are still retained in the buffer, so a subsequent call to `#deliver_messages` will still attempt to send them.
 
-Note that there's a maximum buffer size; pass in a different value for `max_buffer_size` when calling `#
+Note that there's a maximum buffer size; pass in a different value for `max_buffer_size` when calling `#producer` in order to configure this.
 
 A final note on buffers: local buffers give resilience against broker and network failures, and allow higher throughput due to message batching, but they also trade off consistency guarantees for higher availability and resilience. If your local process dies while messages are buffered, those messages will be lost. If you require high levels of consistency, you should call `#deliver_messages` immediately after `#produce`.
 
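The rescue pattern described above, as a minimal sketch (it assumes a `producer` built with `kafka.producer`):

```ruby
begin
  producer.deliver_messages
rescue Kafka::DeliveryFailed => e
  # Failed messages stay in the buffer, so a later call to
  # #deliver_messages will try them again.
  $stderr.puts "Delivery failed: #{e.message}"
end
```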
@@ -152,7 +217,7 @@ It's important to understand how timeouts work if you have a latency sensitive a
 * `connect_timeout` sets the number of seconds to wait while connecting to a broker for the first time. When ruby-kafka initializes, it needs to connect to at least one host in `seed_brokers` in order to discover the Kafka cluster. Each host is tried until there's one that works. Usually that means the first one, but if your entire cluster is down, or there's a network partition, you could wait up to `n * connect_timeout` seconds, where `n` is the number of seed brokers.
 * `socket_timeout` sets the number of seconds to wait when reading from or writing to a socket connection to a broker. After this timeout expires the connection will be killed. Note that some Kafka operations are by definition long-running, such as waiting for new messages to arrive in a partition, so don't set this value too low. When configuring timeouts relating to specific Kafka operations, make sure to make them shorter than this one.
 
-**Producer timeouts** can be configured when calling `#
+**Producer timeouts** can be configured when calling `#producer` on a client instance:
 
 * `ack_timeout` is a timeout executed by a broker when the client is sending messages to it. It defines the number of seconds the broker should wait for replicas to acknowledge the write before responding to the client with an error. As such, it relates to the `required_acks` setting. It should be set lower than `socket_timeout`.
 * `retry_backoff` configures the number of seconds to wait after a failed attempt to send messages to a Kafka broker before retrying. The `max_retries` setting defines the maximum number of retries to attempt, and so the total duration could be up to `max_retries * retry_backoff` seconds. The timeout can be arbitrarily long, and shouldn't be too short: if a broker goes down its partitions will be handed off to another broker, and that can take tens of seconds.
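Putting the timeout settings together, a sketch of a latency-conscious configuration. The `connect_timeout` and `socket_timeout` keywords to `Kafka.new` are an assumption based on this README; the producer options match the `Producer#initialize` signature later in this diff:

```ruby
# Assumed keywords for Kafka.new, based on this README's description.
kafka = Kafka.new(
  seed_brokers: ["kafka1:9092", "kafka2:9092"],
  connect_timeout: 10, # seconds per seed broker attempt
  socket_timeout: 30,  # seconds per socket read/write
)

producer = kafka.producer(
  ack_timeout: 5,   # broker-side wait for replica acks
  retry_backoff: 1, # seconds between retry attempts
  max_retries: 2,
)
```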
data/examples/simple-producer.rb
CHANGED
data/lib/kafka.rb
CHANGED
data/lib/kafka/async_producer.rb
ADDED
@@ -0,0 +1,181 @@
+require "thread"
+
+module Kafka
+
+  # A Kafka producer that does all its work in the background so as to not block
+  # the calling thread. Calls to {#deliver_messages} are asynchronous and return
+  # immediately.
+  #
+  # In addition to this property it's possible to define automatic delivery
+  # policies. These allow placing an upper bound on the number of buffered
+  # messages and the time between message deliveries.
+  #
+  # * If `delivery_threshold` is set to a value _n_ higher than zero, the producer
+  #   will automatically deliver its messages once its buffer size reaches _n_.
+  # * If `delivery_interval` is set to a value _n_ higher than zero, the producer
+  #   will automatically deliver its messages every _n_ seconds.
+  #
+  # By default, automatic delivery is disabled and you'll have to call
+  # {#deliver_messages} manually.
+  #
+  # The calling thread communicates with the background thread doing the actual
+  # work using a thread safe queue. While the background thread is busy delivering
+  # messages, new messages will be buffered in the queue. In order to avoid
+  # the queue growing uncontrollably in cases where the background thread gets
+  # stuck or can't follow the pace of the calling thread, there's a maximum
+  # number of messages that is allowed to be buffered. You can configure this
+  # value by setting `max_queue_size`.
+  #
+  # ## Example
+  #
+  #     producer = kafka.async_producer(
+  #       # Keep at most 1,000 messages in the buffer before delivering:
+  #       delivery_threshold: 1000,
+  #
+  #       # Deliver messages every 30 seconds:
+  #       delivery_interval: 30,
+  #     )
+  #
+  #     # There's no need to manually call #deliver_messages, it will happen
+  #     # automatically in the background.
+  #     producer.produce("hello", topic: "greetings")
+  #
+  #     # Remember to shut down the producer when you're done with it.
+  #     producer.shutdown
+  #
+  class AsyncProducer
+
+    # Initializes a new AsyncProducer.
+    #
+    # @param sync_producer [Kafka::Producer] the synchronous producer that should
+    #   be used in the background.
+    # @param max_queue_size [Integer] the maximum number of messages allowed in
+    #   the queue.
+    # @param delivery_threshold [Integer] if greater than zero, the number of
+    #   buffered messages that will automatically trigger a delivery.
+    # @param delivery_interval [Integer] if greater than zero, the number of
+    #   seconds between automatic message deliveries.
+    #
+    def initialize(sync_producer:, max_queue_size: 1000, delivery_threshold: 0, delivery_interval: 0)
+      raise ArgumentError unless max_queue_size > 0
+      raise ArgumentError unless delivery_threshold >= 0
+      raise ArgumentError unless delivery_interval >= 0
+
+      @queue = Queue.new
+      @max_queue_size = max_queue_size
+
+      @worker_thread = Thread.new do
+        worker = Worker.new(
+          queue: @queue,
+          producer: sync_producer,
+          delivery_threshold: delivery_threshold,
+        )
+
+        worker.run
+      end
+
+      @worker_thread.abort_on_exception = true
+
+      if delivery_interval > 0
+        Thread.new do
+          Timer.new(queue: @queue, interval: delivery_interval).run
+        end
+      end
+    end
+
+    # Produces a message to the specified topic.
+    #
+    # @see Kafka::Producer#produce
+    # @param (see Kafka::Producer#produce)
+    # @raise [BufferOverflow] if the message queue is full.
+    # @return [nil]
+    def produce(*args)
+      raise BufferOverflow if @queue.size >= @max_queue_size
+      @queue << [:produce, args]
+
+      nil
+    end
+
+    # Asynchronously delivers the buffered messages. This method will return
+    # immediately and the actual work will be done in the background.
+    #
+    # @see Kafka::Producer#deliver_messages
+    # @return [nil]
+    def deliver_messages
+      @queue << [:deliver_messages, nil]
+
+      nil
+    end
+
+    # Shuts down the producer, releasing the network resources used. This
+    # method will block until the buffered messages have been delivered.
+    #
+    # @see Kafka::Producer#shutdown
+    # @return [nil]
+    def shutdown
+      @queue << [:shutdown, nil]
+      @worker_thread.join
+
+      nil
+    end
+
+    class Timer
+      def initialize(interval:, queue:)
+        @queue = queue
+        @interval = interval
+      end
+
+      def run
+        loop do
+          sleep(@interval)
+          @queue << [:deliver_messages, nil]
+        end
+      end
+    end
+
+    class Worker
+      def initialize(queue:, producer:, delivery_threshold:)
+        @queue = queue
+        @producer = producer
+        @delivery_threshold = delivery_threshold
+      end
+
+      def run
+        loop do
+          operation, payload = @queue.pop
+
+          case operation
+          when :produce
+            @producer.produce(*payload)
+            deliver_messages if threshold_reached?
+          when :deliver_messages
+            deliver_messages
+          when :shutdown
+            # Deliver any pending messages first.
+            deliver_messages
+
+            # Stop the run loop.
+            break
+          else
+            raise "Unknown operation #{operation.inspect}"
+          end
+        end
+      ensure
+        @producer.shutdown
+      end
+
+      private
+
+      def deliver_messages
+        @producer.deliver_messages
+      rescue DeliveryFailed
+        # Delivery failed.
+      end
+
+      def threshold_reached?
+        @delivery_threshold > 0 &&
+          @producer.buffer_size >= @delivery_threshold
+      end
+    end
+  end
+end
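For reference, the class above is normally built via `Kafka::Client#async_producer` (added later in this diff), but it can also be constructed directly. A sketch, assuming a `kafka` client instance:

```ruby
sync_producer = kafka.producer

producer = Kafka::AsyncProducer.new(
  sync_producer: sync_producer,
  delivery_threshold: 100, # deliver once 100 messages are buffered
  delivery_interval: 30,   # ...or every 30 seconds, whichever comes first
)

producer.produce("hello", topic: "greetings")
producer.shutdown # drains the queue, delivers, and joins the worker thread
```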
data/lib/kafka/client.rb
CHANGED
@@ -1,5 +1,6 @@
 require "kafka/cluster"
 require "kafka/producer"
+require "kafka/async_producer"
 require "kafka/fetched_message"
 require "kafka/fetch_operation"
 
@@ -47,10 +48,21 @@ module Kafka
     #
     # @see Producer#initialize
     # @return [Kafka::Producer] the Kafka producer.
-    def
+    def producer(**options)
       Producer.new(cluster: @cluster, logger: @logger, **options)
     end
 
+    def async_producer(delivery_interval: 0, delivery_threshold: 0, max_queue_size: 1000, **options)
+      sync_producer = producer(**options)
+
+      AsyncProducer.new(
+        sync_producer: sync_producer,
+        delivery_interval: delivery_interval,
+        delivery_threshold: delivery_threshold,
+        max_queue_size: max_queue_size,
+      )
+    end
+
     # Fetches a batch of messages from a single partition. Note that it's possible
     # to get back empty batches.
     #
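Note how `#async_producer` forwards any remaining `**options` to `#producer`; a short sketch of that split:

```ruby
producer = kafka.async_producer(
  delivery_interval: 30,  # consumed by AsyncProducer
  max_buffer_size: 5_000, # passed through **options to Producer.new
)
```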
data/lib/kafka/compression.rb
ADDED
@@ -0,0 +1,23 @@
+require "kafka/snappy_codec"
+require "kafka/gzip_codec"
+
+module Kafka
+  module Compression
+    def self.find_codec(name)
+      case name
+      when nil then nil
+      when :snappy then SnappyCodec.new
+      when :gzip then GzipCodec.new
+      else raise "Unknown compression codec #{name}"
+      end
+    end
+
+    def self.find_codec_by_id(codec_id)
+      case codec_id
+      when 1 then GzipCodec.new
+      when 2 then SnappyCodec.new
+      else raise "Unknown codec id #{codec_id}"
+      end
+    end
+  end
+end
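A sketch of how a codec found by name round-trips data (`:gzip` uses only the stdlib; `:snappy` needs the snappy gem):

```ruby
codec = Kafka::Compression.find_codec(:gzip)

compressed = codec.compress("hello " * 100)
codec.decompress(compressed) # => "hello hello ..."
codec.codec_id               # => 1, the id stored in message attributes
```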
data/lib/kafka/fetch_operation.rb
CHANGED
@@ -69,13 +69,13 @@ module Kafka
       fetched_topic.partitions.flat_map {|fetched_partition|
         Protocol.handle_error(fetched_partition.error_code)
 
-        fetched_partition.messages.map {|
+        fetched_partition.messages.map {|message|
           FetchedMessage.new(
             value: message.value,
             key: message.key,
             topic: fetched_topic.name,
             partition: fetched_partition.partition,
-            offset: offset,
+            offset: message.offset,
           )
         }
       }
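The fix above means each `FetchedMessage` now carries the offset decoded from the message itself rather than a local variable. A usage sketch; the exact keyword arguments to `Client#fetch_messages` are an assumption based on the client.rb comment earlier in this diff:

```ruby
# Keyword names are an assumption; see Client#fetch_messages.
messages = kafka.fetch_messages(topic: "greetings", partition: 0, offset: 0)

messages.each do |message|
  puts "#{message.offset}: #{message.value}"
end
```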
data/lib/kafka/gzip_codec.rb
ADDED
@@ -0,0 +1,28 @@
+module Kafka
+  class GzipCodec
+    def initialize
+      require "zlib"
+    end
+
+    def codec_id
+      1
+    end
+
+    def compress(data)
+      buffer = StringIO.new
+      buffer.set_encoding(Encoding::BINARY)
+
+      writer = Zlib::GzipWriter.new(buffer, Zlib::DEFAULT_COMPRESSION, Zlib::DEFAULT_STRATEGY)
+      writer.write(data)
+      writer.close
+
+      buffer.string
+    end
+
+    def decompress(data)
+      buffer = StringIO.new(data)
+      reader = Zlib::GzipReader.new(buffer)
+      reader.read
+    end
+  end
+end
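Direct use of the codec, as a sketch. The `BINARY` encoding on the buffer matters because compressed bytes are not valid text:

```ruby
codec = Kafka::GzipCodec.new

compressed = codec.compress("a" * 1_000)
compressed.bytesize               # => far fewer than 1,000 bytes
codec.decompress(compressed).size # => 1000
```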
data/lib/kafka/produce_operation.rb
CHANGED
@@ -1,3 +1,5 @@
+require "kafka/protocol/message_set"
+
 module Kafka
   # A produce operation attempts to send all messages in a buffer to the Kafka cluster.
   # Since topics and partitions are spread among all brokers in a cluster, this usually
@@ -23,11 +25,12 @@ module Kafka
   # * `sent_message_count` – the number of messages that were successfully sent.
   #
   class ProduceOperation
-    def initialize(cluster:, buffer:, required_acks:, ack_timeout:, logger:)
+    def initialize(cluster:, buffer:, compression_codec:, required_acks:, ack_timeout:, logger:)
       @cluster = cluster
       @buffer = buffer
       @required_acks = required_acks
       @ack_timeout = ack_timeout
+      @compression_codec = compression_codec
       @logger = logger
     end
 
@@ -67,12 +70,20 @@ module Kafka
         end
       end
 
-      messages_for_broker.each do |broker,
+      messages_for_broker.each do |broker, message_buffer|
         begin
-          @logger.info "Sending #{
+          @logger.info "Sending #{message_buffer.size} messages to #{broker}"
+
+          messages_for_topics = {}
+
+          message_buffer.each do |topic, partition, messages|
+            message_set = Protocol::MessageSet.new(messages: messages, compression_codec: @compression_codec)
+            messages_for_topics[topic] ||= {}
+            messages_for_topics[topic][partition] = message_set
+          end
 
           response = broker.produce(
-            messages_for_topics:
+            messages_for_topics: messages_for_topics,
             required_acks: @required_acks,
             timeout: @ack_timeout * 1000, # Kafka expects the timeout in milliseconds.
           )
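The grouping loop above builds a nested topic → partition → message set hash. A plain-Ruby sketch of that shape (no Kafka required; the real code wraps each message list in a `Protocol::MessageSet`):

```ruby
# The buffer yields [topic, partition, messages] triples:
message_buffer = [
  ["greetings", 0, ["hello"]],
  ["greetings", 1, ["world"]],
]

messages_for_topics = {}

message_buffer.each do |topic, partition, messages|
  messages_for_topics[topic] ||= {}
  messages_for_topics[topic][partition] = messages
end

p messages_for_topics
# => {"greetings"=>{0=>["hello"], 1=>["world"]}}
```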
data/lib/kafka/producer.rb
CHANGED
@@ -2,6 +2,7 @@ require "kafka/partitioner"
 require "kafka/message_buffer"
 require "kafka/produce_operation"
 require "kafka/pending_message"
+require "kafka/compression"
 
 module Kafka
 
@@ -14,11 +15,11 @@ module Kafka
   # kafka = Kafka.new(...)
   #
   # # Will instantiate Kafka::Producer
-  # producer = kafka.
+  # producer = kafka.producer
   #
   # This is done in order to share a logger as well as a pool of broker connections across
   # different producers. This also means that you don't need to pass the `cluster` and
-  # `logger` options to `#
+  # `logger` options to `#producer`. See {#initialize} for the list of other options
   # you can pass in.
   #
   # ## Buffering
@@ -77,7 +78,7 @@ module Kafka
   #   logger: logger,
   # )
   #
-  # producer = kafka.
+  # producer = kafka.producer
   #
   # begin
   #   $stdin.each_with_index do |line, index|
|
   # @param max_buffer_size [Integer] the number of messages allowed in the buffer
   #   before new writes will raise BufferOverflow exceptions.
   #
-  def initialize(cluster:, logger:, ack_timeout: 5, required_acks: 1, max_retries: 2, retry_backoff: 1, max_buffer_size: 1000)
+  def initialize(cluster:, logger:, compression_codec: nil, ack_timeout: 5, required_acks: 1, max_retries: 2, retry_backoff: 1, max_buffer_size: 1000)
     @cluster = cluster
     @logger = logger
     @required_acks = required_acks
|
     @max_retries = max_retries
     @retry_backoff = retry_backoff
     @max_buffer_size = max_buffer_size
+    @compression_codec = Compression.find_codec(compression_codec)
 
     # A buffer organized by topic/partition.
     @buffer = MessageBuffer.new
|
   # the writes. The `ack_timeout` setting places an upper bound on the amount of
   # time the call will block before failing.
   #
-  # @raise [
+  # @raise [DeliveryFailed] if not all messages could be successfully sent.
   # @return [nil]
   def deliver_messages
     # There's no need to do anything if the buffer is empty.
|
       buffer: @buffer,
       required_acks: @required_acks,
       ack_timeout: @ack_timeout,
+      compression_codec: @compression_codec,
       logger: @logger,
     )
 
@@ -268,7 +271,7 @@ module Kafka
     unless @buffer.empty?
       partitions = @buffer.map {|topic, partition, _| "#{topic}/#{partition}" }.join(", ")
 
-      raise
+      raise DeliveryFailed, "Failed to send messages to #{partitions}"
     end
   end
 
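With the new `compression_codec` option wired through above, enabling compression is a one-liner. A sketch, assuming a `kafka` client instance:

```ruby
producer = kafka.producer(compression_codec: :gzip)

producer.produce("a" * 10_000, topic: "logs")
producer.deliver_messages # the message set is gzipped on the wire
```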
data/lib/kafka/protocol/message.rb
CHANGED
@@ -16,43 +16,77 @@ module Kafka
     class Message
       MAGIC_BYTE = 0
 
-      attr_reader :key, :value, :attributes
+      attr_reader :key, :value, :attributes, :offset
 
-      def initialize(
+      def initialize(value:, key: nil, attributes: 0, offset: -1)
         @key = key
         @value = value
         @attributes = attributes
+        @offset = offset
       end
 
       def encode(encoder)
-        data =
-        crc = Zlib.crc32(data)
+        data = encode_with_crc
 
-        encoder.
-        encoder.
+        encoder.write_int64(offset)
+        encoder.write_bytes(data)
       end
 
       def ==(other)
-        @key == other.key &&
+        @key == other.key &&
+          @value == other.value &&
+          @attributes == other.attributes &&
+          @offset == other.offset
+      end
+
+      def compressed?
+        @attributes != 0
+      end
+
+      # @return [Kafka::Protocol::MessageSet]
+      def decompress
+        codec = Compression.find_codec_by_id(@attributes)
+
+        # For some weird reason we need to cut out the first 20 bytes.
+        data = codec.decompress(value)
+        message_set_decoder = Decoder.from_string(data)
+
+        MessageSet.decode(message_set_decoder)
       end
 
       def self.decode(decoder)
-
-
+        offset = decoder.int64
+        message_decoder = Decoder.from_string(decoder.bytes)
+
+        crc = message_decoder.int32
+        magic_byte = message_decoder.int8
 
         unless magic_byte == MAGIC_BYTE
           raise Kafka::Error, "Invalid magic byte: #{magic_byte}"
         end
 
-        attributes =
-        key =
-        value =
+        attributes = message_decoder.int8
+        key = message_decoder.bytes
+        value = message_decoder.bytes
 
-        new(key: key, value: value, attributes: attributes)
+        new(key: key, value: value, attributes: attributes, offset: offset)
       end
 
       private
 
+      def encode_with_crc
+        buffer = StringIO.new
+        encoder = Encoder.new(buffer)
+
+        data = encode_without_crc
+        crc = Zlib.crc32(data)
+
+        encoder.write_int32(crc)
+        encoder.write(data)
+
+        buffer.string
+      end
+
       def encode_without_crc
         buffer = StringIO.new
         encoder = Encoder.new(buffer)
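A sketch of round-tripping a message through these protocol classes; it assumes `Encoder.encode_with` and `Decoder.from_string` (both shown elsewhere in this diff) live under `Kafka::Protocol`:

```ruby
message = Kafka::Protocol::Message.new(value: "hello", key: "greeting")

# Encoder.encode_with writes the message (offset + CRC-framed payload)
# into a string; Decoder.from_string reads it back.
data = Kafka::Protocol::Encoder.encode_with(message)
copy = Kafka::Protocol::Message.decode(Kafka::Protocol::Decoder.from_string(data))

copy == message # => true, via the extended #== above
```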
data/lib/kafka/protocol/message_set.rb
CHANGED
@@ -3,23 +3,65 @@ module Kafka
     class MessageSet
       attr_reader :messages
 
-      def initialize(messages:)
+      def initialize(messages: [], compression_codec: nil)
         @messages = messages
+        @compression_codec = compression_codec
+      end
+
+      def ==(other)
+        messages == other.messages
+      end
+
+      def encode(encoder)
+        if @compression_codec.nil?
+          encode_without_compression(encoder)
+        else
+          encode_with_compression(encoder)
+        end
       end
 
       def self.decode(decoder)
         fetched_messages = []
 
         until decoder.eof?
-
-          message_decoder = Decoder.from_string(decoder.bytes)
-          message = Message.decode(message_decoder)
+          message = Message.decode(decoder)
 
-
+          if message.compressed?
+            wrapped_message_set = message.decompress
+            fetched_messages.concat(wrapped_message_set.messages)
+          else
+            fetched_messages << message
+          end
         end
 
         new(messages: fetched_messages)
       end
+
+      private
+
+      def encode_with_compression(encoder)
+        codec = @compression_codec
+
+        buffer = StringIO.new
+        encode_without_compression(Encoder.new(buffer))
+        data = codec.compress(buffer.string)
+
+        wrapper_message = Protocol::Message.new(
+          value: data,
+          attributes: codec.codec_id,
+        )
+
+        message_set = MessageSet.new(messages: [wrapper_message])
+        message_set.encode(encoder)
+      end
+
+      def encode_without_compression(encoder)
+        # Messages in a message set are *not* encoded as an array. Rather,
+        # they are written in sequence.
+        @messages.each do |message|
+          message.encode(encoder)
+        end
+      end
     end
   end
 end
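A sketch of what `encode_with_compression` produces: a set containing a single wrapper message whose value is the compressed inner set and whose attributes carry the codec id. Decoding transparently unwraps it, assuming these internals compose as shown in this diff:

```ruby
require "stringio"

inner = Kafka::Protocol::MessageSet.new(
  messages: [Kafka::Protocol::Message.new(value: "hello")],
  compression_codec: Kafka::Compression.find_codec(:gzip),
)

buffer = StringIO.new
inner.encode(Kafka::Protocol::Encoder.new(buffer))

decoded = Kafka::Protocol::MessageSet.decode(
  Kafka::Protocol::Decoder.from_string(buffer.string)
)
decoded.messages.first.value # => "hello"
```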
data/lib/kafka/protocol/produce_request.rb
CHANGED
@@ -59,44 +59,18 @@ module Kafka
       encoder.write_array(@messages_for_topics) do |topic, messages_for_partition|
         encoder.write_string(topic)
 
-        encoder.write_array(messages_for_partition) do |partition,
+        encoder.write_array(messages_for_partition) do |partition, message_set|
+          encoder.write_int32(partition)
+
           # When encoding the message set into the request, the bytesize of the message
-          # set must precede the actual
+          # set must precede the actual data. Therefore we need to encode the entire
           # message set into a separate buffer first.
-          encoded_message_set =
-
-          encoder.write_int32(partition)
+          encoded_message_set = Encoder.encode_with(message_set)
 
-          # When encoding bytes, the 32 bit size of the byte buffer is encoded first.
           encoder.write_bytes(encoded_message_set)
         end
       end
     end
-
-    private
-
-    def encode_message_set(messages)
-      buffer = StringIO.new
-      encoder = Encoder.new(buffer)
-
-      # Messages in a message set are *not* encoded as an array. Rather,
-      # they are written in sequence with only the byte size prepended.
-      messages.each do |message|
-        offset = -1 # offsets don't matter here.
-
-        # When encoding a message into a message set, the bytesize of the message must
-        # precede the actual bytes. Therefore we need to encode the message into a
-        # separate buffer first.
-        encoded_message = Encoder.encode_with(message)
-
-        encoder.write_int64(offset)
-
-        # When encoding bytes, the 32 bit size of the byte buffer is encoded first.
-        encoder.write_bytes(encoded_message)
-      end
-
-      buffer.string
-    end
   end
 end
data/lib/kafka/snappy_codec.rb
ADDED
@@ -0,0 +1,20 @@
+module Kafka
+  class SnappyCodec
+    def initialize
+      require "snappy"
+    end
+
+    def codec_id
+      2
+    end
+
+    def compress(data)
+      Snappy.deflate(data)
+    end
+
+    def decompress(data)
+      buffer = StringIO.new(data)
+      Snappy::Reader.new(buffer).read
+    end
+  end
+end
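Since the `require "snappy"` above runs lazily in the constructor, the gem (only a development dependency in this release, per the metadata diff below) is needed just when snappy is actually selected. A sketch of failing fast:

```ruby
begin
  Kafka::Compression.find_codec(:snappy) # triggers require "snappy"
rescue LoadError
  abort "Install the snappy gem to use snappy compression"
end
```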
data/lib/kafka/version.rb
CHANGED
data/ruby-kafka.gemspec
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: ruby-kafka
 version: !ruby/object:Gem::Version
-  version: 0.1.5
+  version: 0.1.6
 platform: ruby
 authors:
 - Daniel Schierbeck
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2016-02-
+date: 2016-02-22 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
@@ -128,6 +128,20 @@ dependencies:
   - - "<"
     - !ruby/object:Gem::Version
       version: '5.1'
+- !ruby/object:Gem::Dependency
+  name: snappy
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
 description: |-
   A client library for the Kafka distributed commit log.
 
@@ -151,13 +165,16 @@ files:
 - examples/simple-consumer.rb
 - examples/simple-producer.rb
 - lib/kafka.rb
+- lib/kafka/async_producer.rb
 - lib/kafka/broker.rb
 - lib/kafka/broker_pool.rb
 - lib/kafka/client.rb
 - lib/kafka/cluster.rb
+- lib/kafka/compression.rb
 - lib/kafka/connection.rb
 - lib/kafka/fetch_operation.rb
 - lib/kafka/fetched_message.rb
+- lib/kafka/gzip_codec.rb
 - lib/kafka/instrumentation.rb
 - lib/kafka/message_buffer.rb
 - lib/kafka/partitioner.rb
@@ -178,6 +195,7 @@ files:
 - lib/kafka/protocol/produce_response.rb
 - lib/kafka/protocol/request_message.rb
 - lib/kafka/protocol/topic_metadata_request.rb
+- lib/kafka/snappy_codec.rb
 - lib/kafka/socket_with_timeout.rb
 - lib/kafka/version.rb
 - lib/ruby-kafka.rb