logstash_writer 0.0.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/.gitignore +7 -0
- data/.rubocop.yml +104 -0
- data/.travis.yml +11 -0
- data/.yardopts +1 -0
- data/CODE_OF_CONDUCT.md +49 -0
- data/CONTRIBUTING.md +13 -0
- data/LICENCE +674 -0
- data/README.md +197 -0
- data/lib/logstash_writer.rb +498 -0
- data/logstash_writer.gemspec +48 -0
- metadata +252 -0
data/README.md
ADDED
@@ -0,0 +1,197 @@

The `LogstashWriter` is an opinionated, reliable, and standards-observant
implementation of a means of getting events to a logstash cluster.


# Installation

It's a gem:

    gem install logstash_writer

There's also the wonders of [the Gemfile](http://bundler.io):

    gem 'logstash_writer'

If you're the sturdy type that likes to run from git:

    rake install

Or, if you've eschewed the convenience of Rubygems entirely, then you
presumably know what to do already.


# Logstash Configuration

In order for logstash to receive the events being written, it must have a
`json_lines` TCP input configured. Something like this will do the trick:

    input {
      tcp {
        id => "json_lines"
        port => 5151
        codec => "json_lines"
      }
    }

We'd really like to support the more featureful lumberjack (or, these days,
"beats") protocol, but [Elastic refuses to document it
properly](https://github.com/elastic/libbeat/issues/279), so until such time
as that is fixed, we are stuck with the `json_lines` approach.
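
The wire protocol really is as simple as the codec name suggests: each event
is one JSON document, serialised onto one line of the TCP connection. A
minimal sketch of the exchange, using a throwaway local TCP server (and a
made-up event) in place of a real logstash input:

```ruby
require 'json'
require 'socket'

# A hypothetical event, shaped like something you'd pass to #send_event.
event = { "@timestamp" => "2015-01-01T00:00:00Z", "message" => "hello" }

# A throwaway local TCP server standing in for logstash's json_lines input.
# Port 0 asks the OS for any free port.
server = TCPServer.new("127.0.0.1", 0)
port = server.addr[1]

sender = Thread.new do
  sock = TCPSocket.new("127.0.0.1", port)
  # One JSON document per line -- that's the entire json_lines protocol.
  sock.puts(event.to_json)
  sock.close
end

# What the "logstash" end sees: a single line it can parse back into a hash.
conn = server.accept
line = conn.gets
sender.join
conn.close
server.close

decoded = JSON.parse(line)
```
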


# Usage

An instance of `LogstashWriter` needs to be given the location of a server
(or servers) to send the events to. This can be any of:

    # An IPv4 address and port
    lw = LogstashWriter.new(server_name: "192.0.2.42:5151")

    # An IPv6 address and port
    lw = LogstashWriter.new(server_name: "[2001:db8::42]:5151")
    # ... or without the brackets, if you like to live dangerously:
    lw = LogstashWriter.new(server_name: "2001:db8::42:5151")

    # A hostname that resolves to one or more A/AAAA addresses, and port
    lw = LogstashWriter.new(server_name: "logstash:5151")

    # A DNS name that resolves to one or more SRV records (which
    # specify the port as part of the record)
    lw = LogstashWriter.new(server_name: "_logstash._tcp")
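
In the SRV case, targets are ordered by the RFC 2782 rules: lowest priority
first, and within a priority, randomly weighted by each record's weight. A
simplified, standalone rendition of that selection (using a hypothetical
`SRV` struct and made-up hostnames rather than real DNS resources):

```ruby
# Hypothetical stand-in for Resolv::DNS::Resource::IN::SRV records.
SRV = Struct.new(:priority, :weight, :target, :port)

# Order records RFC2782-style: lowest priority first; within a priority,
# pick targets randomly, weighted by their weight.
def srv_order(records)
  ordered = []
  left = records.dup
  until left.empty?
    prio = left.map(&:priority).min
    candidates = left.select { |rr| rr.priority == prio }
    left -= candidates
    until candidates.empty?
      selector = rand(candidates.sum(&:weight) + 1)
      chosen = candidates.inject(0) do |n, rr|
        break rr if n + rr.weight >= selector
        n + rr.weight
      end
      candidates.delete(chosen)
      ordered << chosen
    end
  end
  ordered
end

records = [
  SRV.new(10, 60, "ls1.example.com", 5151),
  SRV.new(10, 40, "ls2.example.com", 5151),
  SRV.new(20, 0,  "backup.example.com", 5151),
]

order = srv_order(records)
# The two priority-10 servers come first (in weighted-random order);
# the priority-20 backup is only ever tried last.
```
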

Once you have your `LogstashWriter` instance, you can start firing
events:

    lw.send_event(any: "hash", you: "like")

However, they won't actually be sent to the logstash server until you start
the background worker thread:

    lw.run

When it comes time to shut down, you can do so gracefully, like this:

    lw.stop

This will wait for all events in the queue to drain to the logstash server
before returning.
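
The drain-then-return behaviour of `#stop` is the classic condition-variable
worker pattern. A minimal standalone sketch of the same idea (this is an
illustration of the pattern, not the gem's actual internals):

```ruby
queue = []
mutex = Mutex.new
cv = ConditionVariable.new
terminate = false
drained = []

worker = Thread.new do
  loop do
    item = nil
    mutex.synchronize do
      # Sleep until there's work to do, or we've been asked to finish up.
      cv.wait(mutex) while queue.empty? && !terminate
      item = queue.shift
    end
    break if item.nil?  # queue empty *and* terminate set => drain complete
    drained << item     # stand-in for "write the event to logstash"
  end
end

# Enqueue some events, then politely ask the worker to finish.
5.times do |i|
  mutex.synchronize { queue << i; cv.signal }
end
mutex.synchronize { terminate = true; cv.signal }
worker.join
```

Because the worker re-checks the queue before exiting, every queued item is
drained before `join` returns, which is exactly the guarantee `#stop` makes.
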

In the event that a logstash server is unavailable at the time your events
are sent, events will be queued until a server is contactable. However,
because memory is a finite resource, the backlog is limited to 1,000 events
by default. If you want a larger (or smaller) limit, tell the writer when
you create it:

    lw = LogstashWriter.new(server_name: "...", backlog: 1_000_000)
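
When the limit is hit, it is the *oldest* queued event that gets dropped, so
the backlog always holds the most recent events. A sketch of that policy,
with a plain array standing in for the internal queue and a tiny hypothetical
limit:

```ruby
BACKLOG = 3  # hypothetical limit; the writer's default is 1_000

queue = []
dropped = 0

# Enqueue more events than the backlog can hold.
(1..5).each do |event|
  queue << event
  while queue.length > BACKLOG
    queue.shift      # discard the oldest event...
    dropped += 1     # ...and count the drop, as the metrics do
  end
end
# queue now holds only the newest BACKLOG events.
```
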

If you want to know what your writer is doing, give it a logger:

    lw = LogstashWriter.new(server_name: "...", logger: Logger.new("/dev/stderr"))


## Prometheus Metrics

If you're instrumentally inclined, you can get Prometheus metrics
out of the writer by passing a client registry (which you'll presumably know
what to do with if you're into that sort of thing):

    reg = Prometheus::Client::Registry.new
    lw = LogstashWriter.new(server_name: "...", metrics_registry: reg)

The metrics that are exposed are:

* **`logstash_writer_events_received_total`** -- the number of events that
  have been submitted for writing by calling `#send_event`.

* **`logstash_writer_events_written_total`** -- the number of events that
  have been submitted to the logstash server, labelled by `server` (the
  `address:port` pair for the server that each event was submitted to).

* **`logstash_writer_events_dropped_total`** -- the number of events
  that were dropped due to the backlog buffer filling up. An increase
  in this value over time indicates that your logstash servers are either
  unreliable, or unable to cope with peak event ingestion loads.

* **`logstash_writer_queue_size`** -- the number of events currently in
  the backlog queue awaiting transmission. In *theory*, this value should
  always be `received - (sent + dropped)`, but this gauge is maintained
  separately as a cross-check in case of bugs.

* **`logstash_writer_last_sent_event_timestamp`** -- the UTC timestamp,
  represented as the number of (fractional) seconds since the Unix epoch, at
  which the most recent event sent to a logstash server was originally
  submitted via `#send_event`. This might require some unpacking.

  If everything is going along swimmingly -- there are no queued events, and
  submitted events are immediately forwarded to logstash -- this gauge is
  simply the time at which the last event was sent. No big problem. In the
  event of problems, however, this timestamp can tell you several things.

  Firstly, if there are queued events, you can tell how far behind real time
  your logstash event history is, by calculating `NOW() -
  logstash_writer_last_sent_event_timestamp`. Thus, if you're not finding
  events in your Kibana dashboard that you were expecting to see, you can
  tell that there's a clog in the pipes by looking at this.

  Alternately, if the queue is empty, but this timestamp is older than you'd
  expect, then you know the problem is "upstream" of `LogstashWriter`. If
  your code isn't calling `#send_event`, then this timestamp won't be
  progressing, and you can go look for a deadlock or something in your code,
  and don't need to check whether logstash is misbehaving (again).

* **`logstash_writer_connected_to_server`** -- this flag timeseries (either
  `1` or `0`) is simply a way for you to quickly determine whether
  the writer has a server to talk to, if it wants one. That is, this time
  series will only be `0` if there's an event to write but no logstash
  server can be found to write it to.

* **`logstash_writer_connect_exceptions_total`** -- a count of exceptions
  raised whilst attempting to connect to a logstash server, labelled by the
  exception class and the server to which the connection was attempted.

* **`logstash_writer_write_exceptions_total`** -- a count of exceptions
  raised whilst attempting to write data to a connected logstash server,
  labelled by the exception class and the server to which the write was
  directed.

* **`logstash_writer_write_loop_exceptions_total`** -- a count of exceptions
  raised in the "write loop", which is the main infinite loop executed by
  the background worker thread. Exceptions which occur here are...
  concerning, because whilst exceptions are expected while connecting and
  writing to logstash servers, the write loop *itself* shouldn't normally
  be flinging exceptions around.

* **`logstash_writer_write_loop_ok`** -- a flag indicating whether the write
  loop is alive (`1`) or dead (`0`). This is, essentially, the `up` series
  for the logstash writer; if this is `0`, nothing useful is happening in
  the logstash writer.
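
The cross-check identity mentioned for `logstash_writer_queue_size` can be
expressed directly. With hypothetical values sampled from the three counters
at the same instant:

```ruby
# Hypothetical counter values sampled at one instant.
received = 1_204  # logstash_writer_events_received_total
sent     = 1_150  # logstash_writer_events_written_total (summed over servers)
dropped  = 12     # logstash_writer_events_dropped_total

# What the queue_size gauge *should* read at that instant.
expected_queue_size = received - (sent + dropped)
```

If the gauge disagrees with this calculation for any length of time, that's
the bug the cross-check exists to catch.
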


# Contributing

Patches can be sent as [a Github pull
request](https://github.com/discourse/logstash-writer). This project is
intended to be a safe, welcoming space for collaboration, and contributors
are expected to adhere to the [Contributor Covenant code of
conduct](CODE_OF_CONDUCT.md).


# Licence

Unless otherwise stated, everything in this repo is covered by the following
copyright notice:

    Copyright (C) 2015 Civilized Discourse Construction Kit, Inc.

    This program is free software: you can redistribute it and/or modify it
    under the terms of the GNU General Public License version 3, as
    published by the Free Software Foundation.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program. If not, see <http://www.gnu.org/licenses/>.
data/lib/logstash_writer.rb
ADDED
@@ -0,0 +1,498 @@
require 'ipaddr'
require 'json'
require 'logger'
require 'resolv'
require 'socket'
require 'prometheus/client'

# Write messages to a logstash server.
#
# Flings events, represented as JSON objects, to logstash using the
# `json_lines` codec (over TCP). Doesn't do any munging or modification of
# the event data given to it, other than adding `@timestamp` and `_id`
# fields if they do not already exist.
#
# We support highly-available logstash installations by means of multiple
# address records, or via SRV records. See the docs for .new for details
# as to the valid formats for the server.
#
class LogstashWriter
  # How long, in seconds, to pause the first time an error is encountered.
  # Each successive error will cause a longer wait, so as to prevent
  # thundering herds.
  INITIAL_RETRY_WAIT = 0.5

  # Create a new logstash writer.
  #
  # Once the object is created, you're ready to give it messages by
  # calling #send_event. No messages will actually be *delivered* to
  # logstash, though, until you call #run.
  #
  # If multiple addresses are returned from an A/AAAA resolution, or
  # multiple SRV records, then the records will all be tried in random
  # order (for A/AAAA records) or in line with the standard rules for
  # weight and priority (for SRV records).
  #
  # @param server_name [String] details for connecting to the logstash
  #   server(s). This can be:
  #
  #   * `<IPv4 address>:<port>` -- a literal IPv4 address, and mandatory
  #     port.
  #
  #   * `[<IPv6 address>]:<port>` -- a literal IPv6 address, and mandatory
  #     port. Enclosing the address in square brackets isn't required, but
  #     it's a serving suggestion to make it a little easier to discern
  #     address from port. Forgetting to include the port will end in
  #     confusion.
  #
  #   * `<hostname>:<port>` -- the given hostname will be resolved for
  #     A/AAAA records, and all returned addresses will be tried in random
  #     order until one is found that accepts a connection.
  #
  #   * `<dnsname>` -- the given dnsname will be resolved for SRV records,
  #     and the returned target hostnames and ports will be tried in the
  #     RFC2782-approved manner according to priority and weight until one
  #     is found which accepts a connection.
  #
  # @param logger [Logger] something to which we can write log entries
  #   for debugging and error-reporting purposes.
  #
  # @param backlog [Integer] a non-negative integer specifying the maximum
  #   number of events that should be queued during periods when the
  #   logstash server is unavailable. If the limit is exceeded, the oldest
  #   events (= the first to be queued) will be dropped.
  #
  # @param metrics_registry [Prometheus::Client::Registry] where to register
  #   the metrics instrumenting the operation of the writer instance.
  #
  # @param metrics_prefix [#to_s] what to prefix all of the metrics used to
  #   instrument the operation of the writer instance. If you instantiate
  #   multiple LogstashWriter instances with the same `metrics_registry`, this
  #   parameter *must* be different for each of them, or you will get some
  #   inscrutable exception raised from the registry.
  #
  def initialize(server_name:, logger: Logger.new("/dev/null"), backlog: 1_000, metrics_registry: Prometheus::Client::Registry.new, metrics_prefix: :logstash_writer)
    @server_name, @logger, @backlog = server_name, logger, backlog

    @metrics = {
      received: metrics_registry.counter(:"#{metrics_prefix}_events_received_total", "The number of logstash events which have been submitted for delivery"),
      sent: metrics_registry.counter(:"#{metrics_prefix}_events_written_total", "The number of logstash events which have been delivered to the logstash server"),
      queue_size: metrics_registry.gauge(:"#{metrics_prefix}_queue_size", "The number of events currently in the queue to be sent"),
      dropped: metrics_registry.counter(:"#{metrics_prefix}_events_dropped_total", "The number of events which have been dropped from the queue"),

      lag: metrics_registry.gauge(:"#{metrics_prefix}_last_sent_event_timestamp", "When the last event successfully sent to logstash was originally received"),

      connected: metrics_registry.gauge(:"#{metrics_prefix}_connected_to_server", "Boolean flag indicating whether we are currently connected to a logstash server"),
      connect_exception: metrics_registry.counter(:"#{metrics_prefix}_connect_exceptions_total", "The number of exceptions that have occurred whilst attempting to connect to a logstash server"),
      write_exception: metrics_registry.counter(:"#{metrics_prefix}_write_exceptions_total", "The number of exceptions that have occurred whilst attempting to write an event to a logstash server"),

      write_loop_exception: metrics_registry.counter(:"#{metrics_prefix}_write_loop_exceptions_total", "The number of exceptions that have occurred in the writing loop"),
      write_loop_ok: metrics_registry.gauge(:"#{metrics_prefix}_write_loop_ok", "Boolean flag indicating whether the writing loop is currently operating correctly, or is in a post-apocalyptic hellscape of never-ending exceptions"),
    }

    @metrics[:lag].set({}, 0)
    @metrics[:queue_size].set({}, 0)

    @queue = []
    @queue_mutex = Mutex.new
    @queue_cv = ConditionVariable.new

    @socket_mutex = Mutex.new
    @worker_mutex = Mutex.new
  end

  # Add an event to the queue, to be sent to logstash. Actual event
  # delivery will happen in a worker thread that is started with
  # #run. If the event does not have a `@timestamp` or `_id` element, they
  # will be added with appropriate values.
  #
  # @param e [Hash] the event data to be sent.
  #
  # @return [NilClass]
  #
  def send_event(e)
    unless e.is_a?(Hash)
      raise ArgumentError, "Event must be a hash"
    end

    unless e.has_key?(:@timestamp) || e.has_key?("@timestamp")
      e[:@timestamp] = Time.now.utc.strftime("%FT%TZ")
    end

    unless e.has_key?(:_id) || e.has_key?("_id")
      # This is the quickest way I've found to get a long, random string.
      # We don't need any sort of cryptographic or unpredictability
      # guarantees for what we're doing here, so SecureRandom is unnecessary
      # overhead.
      e[:_id] = rand(0x1000_0000_0000_0000_0000_0000_0000_0000).to_s(36)
    end

    @queue_mutex.synchronize do
      @queue << { content: e, arrival_timestamp: Time.now }
      while @queue.length > @backlog
        @queue.shift
        stat_dropped
      end
      @queue_cv.signal

      stat_received
    end

    nil
  end

  # Start sending events.
  #
  # This method will return almost immediately, and actual event
  # transmission will commence in a separate thread.
  #
  # @return [NilClass]
  #
  def run
    @worker_mutex.synchronize do
      if @worker_thread.nil?
        m, cv = Mutex.new, ConditionVariable.new
        started = false

        @worker_thread = Thread.new { m.synchronize { started = true; cv.signal }; write_loop }

        # Don't return until the thread has *actually* started; the
        # `started` flag guards against the signal firing before we wait.
        m.synchronize { cv.wait(m) until started }
      end
    end

    nil
  end

  # Stop the worker thread.
  #
  # Politely ask the worker thread to please finish up once it's
  # finished sending all messages that have been queued. This will
  # return once the worker thread has finished.
  #
  # @return [NilClass]
  #
  def stop
    @worker_mutex.synchronize do
      if @worker_thread
        @terminate = true
        @queue_cv.signal
        begin
          @worker_thread.join
        rescue Exception => ex
          @logger.error("LogstashWriter") { (["Worker thread terminated with exception: #{ex.message} (#{ex.class})"] + ex.backtrace).join("\n  ") }
        end
        @worker_thread = nil
        @socket_mutex.synchronize { (@current_socket.close; @current_socket = nil) if @current_socket }
      end
    end

    nil
  end

  # Disconnect from the currently-active server.
  #
  # In certain circumstances, you may wish to force the writer to stop
  # sending messages to the currently-connected logstash server, and
  # re-resolve the `server_name` to get a new address to talk to.
  # Calling this method will cause that to happen.
  #
  # @return [NilClass]
  #
  def force_disconnect!
    @socket_mutex.synchronize do
      return if @current_socket.nil?

      @logger.info("LogstashWriter") { "Forced disconnect from #{describe_peer(@current_socket)}" }
      @current_socket.close
      @current_socket = nil
    end

    nil
  end

  private

  # The main "worker" method for getting events out of the queue and
  # firing them at logstash.
  #
  def write_loop
    error_wait = INITIAL_RETRY_WAIT

    catch :terminate do
      loop do
        event = nil

        begin
          @queue_mutex.synchronize do
            while @queue.empty? && !@terminate
              @queue_cv.wait(@queue_mutex)
            end

            if @queue.empty? && @terminate
              @terminate = false
              throw :terminate
            end

            event = @queue.shift
          end

          current_socket do |s|
            s.puts event[:content].to_json
            stat_sent(describe_peer(s), event[:arrival_timestamp])
            @metrics[:write_loop_ok].set({}, 1)
            error_wait = INITIAL_RETRY_WAIT
          end
        rescue StandardError => ex
          @logger.error("LogstashWriter") { (["Exception in write_loop: #{ex.message} (#{ex.class})"] + ex.backtrace).join("\n  ") }
          @queue_mutex.synchronize { @queue.unshift(event) if event }
          @metrics[:write_loop_exception].increment(class: ex.class.to_s)
          @metrics[:write_loop_ok].set({}, 0)
          sleep error_wait
          # Increase the error wait timeout for next time, up to a maximum
          # interval of about 60 seconds, with a little jitter to avoid
          # thundering herds.
          error_wait *= 1.1
          error_wait = 60 if error_wait > 60
          error_wait += rand / 2
        end
      end
    end
  end

  # Yield a TCPSocket connected to the server we currently believe to be
  # accepting log entries, so that something can send log entries to it.
  #
  # The yielding allows us to centralise all error detection and handling
  # within this one method, and retry sending just by calling `yield` again
  # when we've connected to another server.
  #
  def current_socket
    # This could all be handled more cleanly with recursion, but I don't
    # want to fill the stack if we have to retry a lot of times. Also
    # can't just use `retry` because not all of the "go around again"
    # conditions are due to exceptions.
    done = false
    retry_delay = INITIAL_RETRY_WAIT * 10

    until done
      @socket_mutex.synchronize do
        if @current_socket
          begin
            @logger.debug("LogstashWriter") { "Using current server #{describe_peer(@current_socket)}" }
            yield @current_socket
            @metrics[:connected].set({}, 1)
            done = true
          rescue SystemCallError => ex
            # Something went wrong during the send; disconnect from this
            # server and recycle
            @metrics[:write_exception].increment(server: describe_peer(@current_socket), class: ex.class.to_s)
            @logger.info("LogstashWriter") { "Error while writing to current server: #{ex.message} (#{ex.class})" }
            @current_socket.close
            @current_socket = nil
            @metrics[:connected].set({}, 0)

            sleep INITIAL_RETRY_WAIT
          end
        else
          candidates = resolve_server_name
          @logger.debug("LogstashWriter") { "Server candidates: #{candidates.inspect}" }

          if candidates.empty?
            # A useful error message will (should?) have been logged by something
            # down in the bowels of resolve_server_name, so all we have to do
            # is wait a little while, then let the loop retry.
            sleep INITIAL_RETRY_WAIT * 10
          else
            begin
              next_server = candidates.shift

              if next_server
                @logger.debug("LogstashWriter") { "Trying to connect to #{next_server.to_s}" }
                @current_socket = next_server.socket
              else
                @logger.debug("LogstashWriter") { "Could not connect to any server; pausing before trying again" }
                @current_socket = nil
                sleep retry_delay

                # Calculate a longer retry delay next time we fail to connect
                # to every server in the list, up to a maximum of (roughly) 60
                # seconds.
                retry_delay *= 1.5
                retry_delay = 60 if retry_delay > 60
                # A bit of randomness to prevent the thundering herd never goes
                # amiss
                retry_delay += rand
              end
            rescue SystemCallError => ex
              # Connection failed for any number of reasons; try the next
              # one in the list
              @metrics[:connect_exception].increment(server: next_server.to_s, class: ex.class.to_s)
              @logger.error("LogstashWriter") { "Failed to connect to #{next_server.to_s}: #{ex.message} (#{ex.class})" }
              sleep INITIAL_RETRY_WAIT
              retry
            end
          end
        end
      end
    end
  end

  # Generate a human-readable description of the remote end of the given
  # socket.
  #
  def describe_peer(s)
    pa = s.peeraddr
    if pa[0] == "AF_INET6"
      "[#{pa[3]}]:#{pa[1]}"
    else
      "#{pa[3]}:#{pa[1]}"
    end
  end

  # Turn the server_name given in the constructor into a list of Target
  # objects, suitable for iterating through to find someone to talk to.
  #
  def resolve_server_name
    return [static_target] if static_target

    # The IPv6 literal case should have been taken care of by
    # static_target, so the only two cases we have to deal with
    # here are specified-port (assume A/AAAA) or no port (assume SRV).
    if @server_name =~ /:/
      host, port = @server_name.split(":", 2)
      targets_from_address_record(host, port)
    else
      targets_from_srv_record(@server_name)
    end
  end

  # Figure out whether the server spec we were given looks like an address:port
  # combo (in which case return a memoised target), else return `nil` to let
  # the DNS take over.
  #
  def static_target
    # It is valid to memoise this because address literals don't change
    # their resolution over time.
    @static_target ||= begin
      if @server_name =~ /\A(.*):(\d+)\z/
        begin
          IPAddr.new($1)
        rescue ArgumentError
          # Whatever is on the LHS isn't a recognisable address literal;
          # assume hostname
          nil
        else
          Target.new($1, $2.to_i)
        end
      end
    end
  end

  # Resolve hostname as A/AAAA, and generate a randomly-sorted list of Target
  # records from the list of addresses resolved.
  #
  def targets_from_address_record(hostname, port)
    addrs = Resolv::DNS.new.getaddresses(hostname)
    if addrs.empty?
      @logger.warn("LogstashWriter") { "No addresses resolved for server_name #{hostname.inspect}" }
    end
    addrs.sort_by { rand }.map { |a| Target.new(a.to_s, port.to_i) }
  end

  # Resolve the given hostname as a SRV record, and generate a list of
  # Target records from the resources returned. The list will be arranged
  # in line with the RFC2782-specified algorithm, respecting the weight and
  # priority of the records.
  #
  def targets_from_srv_record(hostname)
    [].tap do |list|
      left = Resolv::DNS.new.getresources(@server_name, Resolv::DNS::Resource::IN::SRV)
      if left.empty?
        @logger.warn("LogstashWriter") { "No SRV records found for server_name #{@server_name.inspect}" }
      end

      # Let the soft-SRV shuffle... BEGIN!
      until left.empty?
        prio = left.map { |rr| rr.priority }.uniq.min
        candidates = left.select { |rr| rr.priority == prio }
        left -= candidates
        candidates.sort_by! { |rr| [rr.weight, rr.target.to_s] }
        until candidates.empty?
          selector = rand(candidates.inject(1) { |n, rr| n + rr.weight })
          chosen = candidates.inject(0) do |n, rr|
            break rr if n + rr.weight >= selector
            n + rr.weight
          end
          candidates.delete(chosen)
          list << Target.new(chosen.target.to_s, chosen.port)
        end
      end
    end
  end

  def stat_received
    @metrics[:received].increment({})
    @metrics[:queue_size].increment({})
  end

  def stat_sent(peer, arrived_time)
    @metrics[:sent].increment(server: peer)
    @metrics[:queue_size].decrement({})
    @metrics[:lag].set({}, arrived_time.to_f)
  end

  def stat_dropped
    @metrics[:queue_size].decrement({})
    @metrics[:dropped].increment({})
  end

  # An individual target for logstash messages
  #
  # Takes a host and port, gives back a socket to send data down.
  #
  class Target
    # Create a new target.
    #
    # @param addr [String] an IP address or hostname to which to connect.
    #
    # @param port [Integer] the TCP port number, in the range 1-65535.
    #
    # @raise [ArgumentError] if `addr` is not a valid-looking IP address or
    #   hostname, or if the port number is not in the valid range.
    #
    def initialize(addr, port)
      #:nocov:
      unless addr.is_a? String
        raise ArgumentError, "addr #{addr.inspect} is not a string"
      end

      unless port.is_a? Integer
        raise ArgumentError, "port #{port.inspect} is not an integer"
      end

      unless (1..65535).include?(port)
        raise ArgumentError, "invalid port number #{port.inspect} (must be in range 1-65535)"
      end
      #:nocov:

      @addr, @port = addr, port
    end

    # Create a connection.
    #
    # @return [IO] a socket to the target.
    #
    # @raise [SystemCallError] if connection cannot be established
    #   for any reason.
    #
    def socket
      TCPSocket.new(@addr, @port)
    end

    # Simple string representation of the target.
    #
    # @return [String]
    #
    def to_s
      "#{@addr}:#{@port}"
    end
  end

  private_constant :Target
end