logstash_writer 0.0.1
- checksums.yaml +7 -0
- data/.gitignore +7 -0
- data/.rubocop.yml +104 -0
- data/.travis.yml +11 -0
- data/.yardopts +1 -0
- data/CODE_OF_CONDUCT.md +49 -0
- data/CONTRIBUTING.md +13 -0
- data/LICENCE +674 -0
- data/README.md +197 -0
- data/lib/logstash_writer.rb +498 -0
- data/logstash_writer.gemspec +48 -0
- metadata +252 -0
data/README.md
ADDED
@@ -0,0 +1,197 @@
The `LogstashWriter` is an opinionated, reliable, and standards-observant
implementation of a means of getting events to a logstash cluster.


# Installation

It's a gem:

    gem install logstash_writer

There's also the wonders of [the Gemfile](http://bundler.io):

    gem 'logstash_writer'

If you're the sturdy type that likes to run from git:

    rake install

Or, if you've eschewed the convenience of Rubygems entirely, then you
presumably know what to do already.

# Logstash Configuration

In order for logstash to receive the events being written, it must have a
`json_lines` TCP input configured. Something like this will do the trick:

    input {
      tcp {
        id    => "json_lines"
        port  => 5151
        codec => "json_lines"
      }
    }

We'd really like to support the more featureful lumberjack (or, these days,
"beats") protocol, but [Elastic refuses to document it
properly](https://github.com/elastic/libbeat/issues/279), so until such time
as that is fixed, we are stuck with the `json_lines` approach.

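On the wire, the `json_lines` codec is nothing more exotic than one JSON
object per line. A quick sketch of the framing `LogstashWriter` ends up
writing down the TCP connection (the event content here is made up):

```ruby
require 'json'

# Each event becomes a single JSON object terminated by a newline --
# that is the entirety of the json_lines framing.
event = { "message" => "hello", "@timestamp" => "2015-01-01T00:00:00Z" }
frame = event.to_json + "\n"

print frame  # {"message":"hello","@timestamp":"2015-01-01T00:00:00Z"}
```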
# Usage

An instance of `LogstashWriter` needs to be given the location of a server
(or servers) to send the events to. This can be any of:

    # An IPv4 address and port
    lw = LogstashWriter.new(server_name: "192.0.2.42:5151")

    # An IPv6 address and port
    lw = LogstashWriter.new(server_name: "[2001:db8::42]:5151")
    # ... or without the brackets, if you like to live dangerously:
    lw = LogstashWriter.new(server_name: "2001:db8::42:5151")

    # A hostname that resolves to one or more A/AAAA addresses, and port
    lw = LogstashWriter.new(server_name: "logstash:5151")

    # A DNS name that resolves to one or more SRV records (which
    # specify the port as part of the record)
    lw = LogstashWriter.new(server_name: "_logstash._tcp")

Once you have your `LogstashWriter` instance, you can start firing
events:

    lw.send_event(any: "hash", you: "like")

However, they won't actually be sent to the logstash server until you start
the background worker thread:

    lw.run

When it comes time to shut down, you can do so gracefully, like this:

    lw.stop

This will wait for all events in the queue to drain to the logstash server
before returning.

In the event that a logstash server is unavailable at the time your events
are sent, events will be queued until a server is contactable. However,
because memory is a finite resource, the backlog is limited to 1,000 events
by default. If you want a larger (or smaller) limit, tell the writer when
you create it:

    lw = LogstashWriter.new(server_name: "...", backlog: 1_000_000)

If you want to know what your writer is doing, give it a logger:

    lw = LogstashWriter.new(server_name: "...", logger: Logger.new("/dev/stderr"))

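Before queueing, `send_event` fills in `@timestamp` and `_id` fields on any
event that lacks them (per the library source in this listing). A
stand-alone sketch of that defaulting (`with_defaults` is a hypothetical
helper for illustration, not part of the gem's API):

```ruby
# Sketch of the defaulting #send_event applies before queueing: add an
# ISO8601-ish UTC @timestamp and a long random base-36 _id if the event
# doesn't already carry them.
def with_defaults(event)
  unless event.key?(:@timestamp) || event.key?("@timestamp")
    event[:@timestamp] = Time.now.utc.strftime("%FT%TZ")
  end

  unless event.key?(:_id) || event.key?("_id")
    # ~128 bits of randomness rendered in base 36 -- long and random,
    # but with no cryptographic guarantees (none are needed here).
    event[:_id] = rand(0x1000_0000_0000_0000_0000_0000_0000_0000).to_s(36)
  end

  event
end

e = with_defaults(any: "hash", you: "like")
puts e[:@timestamp]  # e.g. "2015-06-01T03:04:05Z"
puts e[:_id]         # e.g. a longish lowercase alphanumeric string
```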
## Prometheus Metrics

If you're instrumentally inclined, you can get Prometheus metrics
out of the writer by passing a client registry (which you'll presumably know
what to do with if you're into that sort of thing):

    reg = Prometheus::Client::Registry.new
    lw = LogstashWriter.new(server_name: "...", metrics_registry: reg)

The metrics that are exposed are:

* **`logstash_writer_events_received_total`** -- the number of events that
  have been submitted for writing by calling `#send_event`.

* **`logstash_writer_events_written_total`** -- the number of events that
  have been submitted to the logstash server, labelled by `server` (the
  `address:port` pair for the server that each event was submitted to).

* **`logstash_writer_events_dropped_total`** -- the number of events
  that were dropped due to the backlog buffer filling up. An increase
  in this value over time indicates that your logstash servers are either
  unreliable, or unable to cope with peak event ingestion loads.

* **`logstash_writer_queue_size`** -- the number of events currently in
  the backlog queue awaiting transmission. In *theory*, this value should
  always be `received - (sent + dropped)`, but this gauge is maintained
  separately as a cross-check in case of bugs.

* **`logstash_writer_last_sent_event_timestamp`** -- the UTC timestamp,
  represented as the number of (fractional) seconds since the Unix epoch, at
  which the most recent event sent to a logstash server was originally
  submitted via `#send_event`. This might require some unpacking.

  If everything is going along swimmingly, with no queued events and
  submitted events immediately forwarded to logstash, this gauge will simply
  be the time at which the last event was sent. No big problem. However, in
  the event of problems, this timestamp can tell you several things.

  Firstly, if there are queued events, you can tell how far behind in real
  time your logstash event history is, by calculating `NOW() -
  logstash_writer_last_sent_event_timestamp`. Thus, if you're not finding
  events in your Kibana dashboard you were expecting to see, you can tell
  that there's a clog in the pipes by looking at this.

  Alternately, if the queue is empty, but this timestamp is perhaps older
  than you'd expect, then you know the problem is "upstream" of
  `LogstashWriter`. If your code isn't calling `#send_event`, then this
  timestamp won't be progressing, and you can go look for a deadlock or
  something in your code, and don't need to check if logstash is misbehaving
  (again).

* **`logstash_writer_connected_to_server`** -- this flag timeseries (can be
  either `1` or `0`) is simply a way for you to quickly determine whether
  the writer has a server to talk to, if it wants one. That is, this time
  series will only be `0` if there's an event to write but no logstash
  server can be found to write it to.

* **`logstash_writer_connect_exceptions_total`** -- a count of exceptions
  raised whilst attempting to connect to a logstash server, labelled by the
  exception class and the server to which the connection was attempted.

* **`logstash_writer_write_exceptions_total`** -- a count of exceptions
  raised whilst attempting to write data to a connected logstash server,
  labelled by the exception class and the server to which the write was
  directed.

* **`logstash_writer_write_loop_exceptions_total`** -- a count of exceptions
  raised in the "write loop", which is the main infinite loop executed by
  the background worker thread. Exceptions which occur here are...
  concerning, because whilst exceptions are expected while connecting and
  writing to logstash servers, the write loop *itself* shouldn't normally
  be flinging exceptions around.

* **`logstash_writer_write_loop_ok`** -- a flag (can be either `1` or `0`)
  indicating whether the write loop is alive or not. This is, essentially,
  the `up` series for the logstash writer; if this is `0`, nothing useful is
  happening in the logstash writer.

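The two sanity checks described in the bullets above (the queue-size
identity, and the `NOW() - last_sent` lag calculation) are plain arithmetic
on scraped values. A sketch, using hypothetical readings rather than real
metrics output:

```ruby
# Hypothetical counter readings scraped from the registry.
received = 1_000  # logstash_writer_events_received_total
sent     = 950    # logstash_writer_events_written_total (summed over servers)
dropped  = 10     # logstash_writer_events_dropped_total

# What the queue_size gauge *should* read, per the identity above.
queue_size = received - (sent + dropped)

# How far behind real time the logstash event history is.
last_sent_ts = 1_420_070_400.25  # logstash_writer_last_sent_event_timestamp
now          = 1_420_070_430.75  # NOW(), as fractional Unix seconds
lag_seconds  = now - last_sent_ts

puts queue_size   # 40
puts lag_seconds  # 30.5
```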
# Contributing

Patches can be sent as [a Github pull
request](https://github.com/discourse/logstash-writer). This project is
intended to be a safe, welcoming space for collaboration, and contributors
are expected to adhere to the [Contributor Covenant code of
conduct](CODE_OF_CONDUCT.md).

# Licence

Unless otherwise stated, everything in this repo is covered by the following
copyright notice:

    Copyright (C) 2015 Civilized Discourse Construction Kit, Inc.

    This program is free software: you can redistribute it and/or modify it
    under the terms of the GNU General Public License version 3, as
    published by the Free Software Foundation.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program. If not, see <http://www.gnu.org/licenses/>.

data/lib/logstash_writer.rb
ADDED
@@ -0,0 +1,498 @@
```ruby
require 'ipaddr'
require 'json'
require 'logger'
require 'resolv'
require 'socket'
require 'prometheus/client'

# Write messages to a logstash server.
#
# Flings events, represented as JSON objects, to logstash using the
# `json_lines` codec (over TCP). Doesn't do any munging or modification of
# the event data given to it, other than adding `@timestamp` and `_id`
# fields if they do not already exist.
#
# We support highly-available logstash installations by means of multiple
# address records, or via SRV records. See the docs for .new for details
# as to the valid formats for the server.
#
class LogstashWriter
  # How long, in seconds, to pause the first time an error is encountered.
  # Each successive error will cause a longer wait, so as to prevent
  # thundering herds.
  INITIAL_RETRY_WAIT = 0.5

  # Create a new logstash writer.
  #
  # Once the object is created, you're ready to give it messages by
  # calling #send_event. No messages will actually be *delivered* to
  # logstash, though, until you call #run.
  #
  # If multiple addresses are returned from an A/AAAA resolution, or
  # multiple SRV records, then the records will all be tried in random
  # order (for A/AAAA records) or in line with the standard rules for
  # weight and priority (for SRV records).
  #
  # @param server_name [String] details for connecting to the logstash
  #   server(s). This can be:
  #
  #   * `<IPv4 address>:<port>` -- a literal IPv4 address, and mandatory
  #     port.
  #
  #   * `[<IPv6 address>]:<port>` -- a literal IPv6 address, and mandatory
  #     port. Enclosing the address in square brackets isn't required, but
  #     it's a serving suggestion to make it a little easier to discern
  #     address from port. Forgetting to include the port will end in
  #     confusion.
  #
  #   * `<hostname>:<port>` -- the given hostname will be resolved for
  #     A/AAAA records, and all returned addresses will be tried in random
  #     order until one is found that accepts a connection.
  #
  #   * `<dnsname>` -- the given dnsname will be resolved for SRV records,
  #     and the returned target hostnames and ports will be tried in the
  #     RFC2782-approved manner according to priority and weight until one
  #     is found which accepts a connection.
  #
  # @param logger [Logger] something to which we can write log entries
  #   for debugging and error-reporting purposes.
  #
  # @param backlog [Integer] a non-negative integer specifying the maximum
  #   number of events that should be queued during periods when the
  #   logstash server is unavailable. If the limit is exceeded, the oldest
  #   (= first event to be queued) will be dropped.
  #
  # @param metrics_registry [Prometheus::Client::Registry] where to register
  #   the metrics instrumenting the operation of the writer instance.
  #
  # @param metrics_prefix [#to_s] what to prefix all of the metrics used to
  #   instrument the operation of the writer instance. If you instantiate
  #   multiple LogstashWriter instances with the same `metrics_registry`,
  #   this parameter *must* be different for each of them, or you will get
  #   some inscrutable exception raised from the registry.
  #
  def initialize(server_name:, logger: Logger.new("/dev/null"), backlog: 1_000, metrics_registry: Prometheus::Client::Registry.new, metrics_prefix: :logstash_writer)
    @server_name, @logger, @backlog = server_name, logger, backlog

    @metrics = {
      received:   metrics_registry.counter(:"#{metrics_prefix}_events_received_total", "The number of logstash events which have been submitted for delivery"),
      sent:       metrics_registry.counter(:"#{metrics_prefix}_events_written_total", "The number of logstash events which have been delivered to the logstash server"),
      queue_size: metrics_registry.gauge(:"#{metrics_prefix}_queue_size", "The number of events currently in the queue to be sent"),
      dropped:    metrics_registry.counter(:"#{metrics_prefix}_events_dropped_total", "The number of events which have been dropped from the queue"),

      lag: metrics_registry.gauge(:"#{metrics_prefix}_last_sent_event_timestamp", "When the last event successfully sent to logstash was originally received"),

      connected:         metrics_registry.gauge(:"#{metrics_prefix}_connected_to_server", "Boolean flag indicating whether we are currently connected to a logstash server"),
      connect_exception: metrics_registry.counter(:"#{metrics_prefix}_connect_exceptions_total", "The number of exceptions that have occurred whilst attempting to connect to a logstash server"),
      write_exception:   metrics_registry.counter(:"#{metrics_prefix}_write_exceptions_total", "The number of exceptions that have occurred whilst attempting to write an event to a logstash server"),

      write_loop_exception: metrics_registry.counter(:"#{metrics_prefix}_write_loop_exceptions_total", "The number of exceptions that have occurred in the writing loop"),
      write_loop_ok:        metrics_registry.gauge(:"#{metrics_prefix}_write_loop_ok", "Boolean flag indicating whether the writing loop is currently operating correctly, or is in a post-apocalyptic hellscape of never-ending exceptions"),
    }

    @metrics[:lag].set({}, 0)
    @metrics[:queue_size].set({}, 0)

    @queue = []
    @queue_mutex = Mutex.new
    @queue_cv = ConditionVariable.new

    @socket_mutex = Mutex.new
    @worker_mutex = Mutex.new
  end

  # Add an event to the queue, to be sent to logstash. Actual event
  # delivery will happen in a worker thread that is started with
  # #run. If the event does not have a `@timestamp` or `_id` element, they
  # will be added with appropriate values.
  #
  # @param e [Hash] the event data to be sent.
  #
  # @return [NilClass]
  #
  def send_event(e)
    unless e.is_a?(Hash)
      raise ArgumentError, "Event must be a hash"
    end

    unless e.has_key?(:@timestamp) || e.has_key?("@timestamp")
      e[:@timestamp] = Time.now.utc.strftime("%FT%TZ")
    end

    unless e.has_key?(:_id) || e.has_key?("_id")
      # This is the quickest way I've found to get a long, random string.
      # We don't need any sort of cryptographic or unpredictability
      # guarantees for what we're doing here, so SecureRandom is unnecessary
      # overhead.
      e[:_id] = rand(0x1000_0000_0000_0000_0000_0000_0000_0000).to_s(36)
    end

    @queue_mutex.synchronize do
      @queue << { content: e, arrival_timestamp: Time.now }
      while @queue.length > @backlog
        @queue.shift
        stat_dropped
      end
      @queue_cv.signal

      stat_received
    end

    nil
  end

  # Start sending events.
  #
  # This method will return almost immediately, and actual event
  # transmission will commence in a separate thread.
  #
  # @return [NilClass]
  #
  def run
    @worker_mutex.synchronize do
      if @worker_thread.nil?
        m, cv = Mutex.new, ConditionVariable.new
        started = false

        @worker_thread = Thread.new do
          # Flag startup under the mutex, so the signal can't be lost if
          # it fires before the parent thread begins waiting.
          m.synchronize { started = true; cv.signal }
          write_loop
        end

        # Don't return until the thread has *actually* started
        m.synchronize { cv.wait(m) until started }
      end
    end

    nil
  end

  # Stop the worker thread.
  #
  # Politely ask the worker thread to please finish up once it's
  # finished sending all messages that have been queued. This will
  # return once the worker thread has finished.
  #
  # @return [NilClass]
  #
  def stop
    @worker_mutex.synchronize do
      if @worker_thread
        # Set the flag and signal under the queue mutex, so the worker
        # can't miss the wakeup between checking @terminate and waiting.
        @queue_mutex.synchronize do
          @terminate = true
          @queue_cv.signal
        end
        begin
          @worker_thread.join
        rescue Exception => ex
          @logger.error("LogstashWriter") { (["Worker thread terminated with exception: #{ex.message} (#{ex.class})"] + ex.backtrace).join("\n  ") }
        end
        @worker_thread = nil
        @socket_mutex.synchronize { (@current_socket.close; @current_socket = nil) if @current_socket }
      end
    end

    nil
  end

  # Disconnect from the currently-active server.
  #
  # In certain circumstances, you may wish to force the writer to stop
  # sending messages to the currently-connected logstash server, and
  # re-resolve the `server_name` to get a new address to talk to.
  # Calling this method will cause that to happen.
  #
  # @return [NilClass]
  #
  def force_disconnect!
    @socket_mutex.synchronize do
      return if @current_socket.nil?

      @logger.info("LogstashWriter") { "Forced disconnect from #{describe_peer(@current_socket)}" }
      @current_socket.close
      @current_socket = nil
    end

    nil
  end

  private

  # The main "worker" method for getting events out of the queue and
  # firing them at logstash.
  #
  def write_loop
    error_wait = INITIAL_RETRY_WAIT

    catch :terminate do
      loop do
        event = nil

        begin
          @queue_mutex.synchronize do
            while @queue.empty? && !@terminate
              @queue_cv.wait(@queue_mutex)
            end

            if @queue.empty? && @terminate
              @terminate = false
              throw :terminate
            end

            event = @queue.shift
          end

          current_socket do |s|
            s.puts event[:content].to_json
            stat_sent(describe_peer(s), event[:arrival_timestamp])
            @metrics[:write_loop_ok].set({}, 1)
            error_wait = INITIAL_RETRY_WAIT
          end
        rescue StandardError => ex
          @logger.error("LogstashWriter") { (["Exception in write_loop: #{ex.message} (#{ex.class})"] + ex.backtrace).join("\n  ") }
          @queue_mutex.synchronize { @queue.unshift(event) if event }
          @metrics[:write_loop_exception].increment(class: ex.class.to_s)
          @metrics[:write_loop_ok].set({}, 0)
          sleep error_wait
          # Increase the error wait timeout for next time, up to a maximum
          # interval of about 60 seconds
          error_wait *= 1.1
          error_wait = 60 if error_wait > 60
          error_wait += rand / 0.5
        end
      end
    end
  end

  # Yield a TCPSocket connected to the server we currently believe to be
  # accepting log entries, so that something can send log entries to it.
  #
  # The yielding allows us to centralise all error detection and handling
  # within this one method, and retry sending just by calling `yield` again
  # when we've connected to another server.
  #
  def current_socket
    # This could all be handled more cleanly with recursion, but I don't
    # want to fill the stack if we have to retry a lot of times. Also
    # can't just use `retry` because not all of the "go around again"
    # conditions are due to exceptions.
    done = false

    until done
      @socket_mutex.synchronize do
        if @current_socket
          begin
            @logger.debug("LogstashWriter") { "Using current server #{describe_peer(@current_socket)}" }
            yield @current_socket
            @metrics[:connected].set({}, 1)
            done = true
          rescue SystemCallError => ex
            # Something went wrong during the send; disconnect from this
            # server and recycle
            @metrics[:write_exception].increment(server: describe_peer(@current_socket), class: ex.class.to_s)
            @logger.info("LogstashWriter") { "Error while writing to current server: #{ex.message} (#{ex.class})" }
            @current_socket.close
            @current_socket = nil
            @metrics[:connected].set({}, 0)

            sleep INITIAL_RETRY_WAIT
          end
        else
          retry_delay = INITIAL_RETRY_WAIT * 10
          candidates = resolve_server_name
          @logger.debug("LogstashWriter") { "Server candidates: #{candidates.inspect}" }

          if candidates.empty?
            # A useful error message will (should?) have been logged by
            # something down in the bowels of resolve_server_name, so all we
            # have to do is wait a little while, then let the loop retry.
            sleep INITIAL_RETRY_WAIT * 10
          else
            begin
              next_server = candidates.shift

              if next_server
                @logger.debug("LogstashWriter") { "Trying to connect to #{next_server.to_s}" }
                @current_socket = next_server.socket
              else
                @logger.debug("LogstashWriter") { "Could not connect to any server; pausing before trying again" }
                @current_socket = nil
                sleep retry_delay

                # Calculate a longer retry delay next time we fail to connect
                # to every server in the list, up to a maximum of (roughly) 60
                # seconds.
                retry_delay *= 1.5
                retry_delay = 60 if retry_delay > 60
                # A bit of randomness to prevent the thundering herd never
                # goes amiss
                retry_delay += rand
              end
            rescue SystemCallError => ex
              # Connection failed for any number of reasons; try the next
              # one in the list
              @metrics[:connect_exception].increment(server: next_server.to_s, class: ex.class.to_s)
              @logger.error("LogstashWriter") { "Failed to connect to #{next_server.to_s}: #{ex.message} (#{ex.class})" }
              sleep INITIAL_RETRY_WAIT
              retry
            end
          end
        end
      end
    end
  end

  # Generate a human-readable description of the remote end of the given
  # socket.
  #
  def describe_peer(s)
    pa = s.peeraddr
    if pa[0] == "AF_INET6"
      "[#{pa[3]}]:#{pa[1]}"
    else
      "#{pa[3]}:#{pa[1]}"
    end
  end

  # Turn the server_name given in the constructor into a list of Target
  # objects, suitable for iterating through to find someone to talk to.
  #
  def resolve_server_name
    return [static_target] if static_target

    # The IPv6 literal case should have been taken care of by
    # static_target, so the only two cases we have to deal with
    # here are specified-port (assume A/AAAA) or no port (assume SRV).
    if @server_name =~ /:/
      host, port = @server_name.split(":", 2)
      targets_from_address_record(host, port)
    else
      targets_from_srv_record(@server_name)
    end
  end

  # Figure out whether the server spec we were given looks like an
  # address:port combo (in which case return a memoised target), else return
  # `nil` to let the DNS take over.
  def static_target
    # It is valid to memoize this because address literals don't change
    # their resolution over time.
    @static_target ||= begin
      if @server_name =~ /\A(.*):(\d+)\z/
        begin
          IPAddr.new($1)
        rescue ArgumentError
          # Whatever is on the LHS isn't a recognisable address literal;
          # assume hostname
          nil
        else
          Target.new($1, $2.to_i)
        end
      end
    end
  end

  # Resolve hostname as A/AAAA, and generate randomly-sorted list of Target
  # records from the list of addresses resolved.
  #
  def targets_from_address_record(hostname, port)
    addrs = Resolv::DNS.new.getaddresses(hostname)
    if addrs.empty?
      @logger.warn("LogstashWriter") { "No addresses resolved for server_name #{hostname.inspect}" }
    end
    addrs.sort_by { rand }.map { |a| Target.new(a.to_s, port.to_i) }
  end

  # Resolve the given hostname as a SRV record, and generate a list of
  # Target records from the resources returned. The list will be arranged
  # in line with the RFC2782-specified algorithm, respecting the weight and
  # priority of the records.
  #
  def targets_from_srv_record(hostname)
    [].tap do |list|
      left = Resolv::DNS.new.getresources(hostname, Resolv::DNS::Resource::IN::SRV)
      if left.empty?
        @logger.warn("LogstashWriter") { "No SRV records found for server_name #{hostname.inspect}" }
      end

      # Let the soft-SRV shuffle... BEGIN!
      until left.empty?
        prio = left.map { |rr| rr.priority }.uniq.min
        candidates = left.select { |rr| rr.priority == prio }
        left -= candidates
        candidates.sort_by! { |rr| [rr.weight, rr.target.to_s] }
        until candidates.empty?
          selector = rand(candidates.inject(1) { |n, rr| n + rr.weight })
          chosen = candidates.inject(0) do |n, rr|
            break rr if n + rr.weight >= selector
            n + rr.weight
          end
          candidates.delete(chosen)
          list << Target.new(chosen.target.to_s, chosen.port)
        end
      end
    end
  end

  def stat_received
    @metrics[:received].increment({})
    @metrics[:queue_size].increment({})
  end

  def stat_sent(peer, arrived_time)
    @metrics[:sent].increment(server: peer)
    @metrics[:queue_size].decrement({})
    @metrics[:lag].set({}, arrived_time.to_f)
  end

  def stat_dropped
    @metrics[:queue_size].decrement({})
    @metrics[:dropped].increment({})
  end

  # An individual target for logstash messages
  #
  # Takes a host and port, gives back a socket to send data down.
  #
  class Target
    # Create a new target.
    #
    # @param addr [String] an IP address or hostname to which to connect.
    #
    # @param port [Integer] the TCP port number, in the range 1-65535.
    #
    # @raise [ArgumentError] if `addr` is not a valid-looking IP address or
    #   hostname, or if the port number is not in the valid range.
    #
    def initialize(addr, port)
      #:nocov:
      unless addr.is_a? String
        raise ArgumentError, "addr #{addr.inspect} is not a string"
      end

      unless port.is_a? Integer
        raise ArgumentError, "port #{port.inspect} is not an integer"
      end

      unless (1..65535).include?(port)
        raise ArgumentError, "invalid port number #{port.inspect} (must be in range 1-65535)"
      end
      #:nocov:

      @addr, @port = addr, port
    end

    # Create a connection.
    #
    # @return [IO] a socket to the target.
    #
    # @raise [SystemCallError] if connection cannot be established
    #   for any reason.
    #
    def socket
      TCPSocket.new(@addr, @port)
    end

    # Simple string representation of the target.
    #
    # @return [String]
    #
    def to_s
      "#{@addr}:#{@port}"
    end
  end

  private_constant :Target
end
```