bosh-monitor 1.5.0.pre.1113
- data/README +80 -0
- data/bin/bosh-monitor +30 -0
- data/bin/bosh-monitor-console +51 -0
- data/bin/listener +58 -0
- data/lib/bosh/monitor.rb +72 -0
- data/lib/bosh/monitor/agent.rb +51 -0
- data/lib/bosh/monitor/agent_manager.rb +295 -0
- data/lib/bosh/monitor/api_controller.rb +18 -0
- data/lib/bosh/monitor/config.rb +71 -0
- data/lib/bosh/monitor/core_ext.rb +8 -0
- data/lib/bosh/monitor/director.rb +76 -0
- data/lib/bosh/monitor/director_monitor.rb +33 -0
- data/lib/bosh/monitor/errors.rb +19 -0
- data/lib/bosh/monitor/event_processor.rb +109 -0
- data/lib/bosh/monitor/events/alert.rb +92 -0
- data/lib/bosh/monitor/events/base.rb +70 -0
- data/lib/bosh/monitor/events/heartbeat.rb +139 -0
- data/lib/bosh/monitor/metric.rb +16 -0
- data/lib/bosh/monitor/plugins/base.rb +27 -0
- data/lib/bosh/monitor/plugins/cloud_watch.rb +56 -0
- data/lib/bosh/monitor/plugins/datadog.rb +78 -0
- data/lib/bosh/monitor/plugins/dummy.rb +20 -0
- data/lib/bosh/monitor/plugins/email.rb +135 -0
- data/lib/bosh/monitor/plugins/http_request_helper.rb +25 -0
- data/lib/bosh/monitor/plugins/logger.rb +13 -0
- data/lib/bosh/monitor/plugins/nats.rb +43 -0
- data/lib/bosh/monitor/plugins/pagerduty.rb +48 -0
- data/lib/bosh/monitor/plugins/paging_datadog_client.rb +24 -0
- data/lib/bosh/monitor/plugins/resurrector.rb +82 -0
- data/lib/bosh/monitor/plugins/resurrector_helper.rb +84 -0
- data/lib/bosh/monitor/plugins/tsdb.rb +43 -0
- data/lib/bosh/monitor/plugins/varz.rb +17 -0
- data/lib/bosh/monitor/protocols/tsdb.rb +68 -0
- data/lib/bosh/monitor/runner.rb +162 -0
- data/lib/bosh/monitor/version.rb +5 -0
- data/lib/bosh/monitor/yaml_helper.rb +18 -0
- metadata +246 -0
data/README
ADDED
@@ -0,0 +1,80 @@
h4. Synopsis

BOSH Health Monitor (BHM) is a component that monitors the health of one or more BOSH deployments. It processes heartbeats and alerts from BOSH agents and notifies interested parties if something goes wrong.

h4. Heartbeats

Each agent sends periodic heartbeats to HM. Heartbeats are sent via the message bus and have the following format:

| *Subject* | hm.agent.heartbeat.<agent_id> |
| *Payload* | none |

h6. Heartbeat processing

# If the agent is known to HM, its last heartbeat timestamp is updated. No analysis is attempted at this point; the agent analysis routine runs asynchronously to heartbeat processing.
# If the agent is unknown, it is registered with HM with a warning flag set (we call these rogue agents). The next director poll may add this agent to the list of managed agents and clear the flag. We might generate an alert if the flag has not been cleared for some (configurable) time.
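
The two heartbeat rules can be sketched roughly as follows. This is only an illustration, not the code shipped in the gem: the @HeartbeatRegistry@ class, its @Entry@ struct and the @mark_managed@ method are hypothetical names introduced here.

```ruby
# Hypothetical in-memory registry illustrating the two heartbeat rules.
class HeartbeatRegistry
  Entry = Struct.new(:id, :last_heartbeat_at, :rogue)

  def initialize
    @agents = {}
  end

  # Records a heartbeat and returns the entry for the sending agent.
  def heartbeat(agent_id, now = Time.now)
    entry = @agents[agent_id]
    if entry
      # Known agent: only refresh the timestamp, no analysis here.
      entry.last_heartbeat_at = now
    else
      # Unknown agent: register it with the warning (rogue) flag set.
      entry = Entry.new(agent_id, now, true)
      @agents[agent_id] = entry
    end
    entry
  end

  # Called when a director poll lists the agent as managed (assumed hook).
  def mark_managed(agent_id)
    entry = @agents[agent_id]
    entry.rogue = false if entry
    entry
  end
end
```

Keeping the analysis out of @heartbeat@ mirrors the note above that analysis runs asynchronously to heartbeat processing.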

h4. Agents discovery

HM polls the director periodically to get the list of managed VMs:

| *Endpoint* | GET /deployments/<deployment_name>/vms |
| *Response* | JSON including agent ids, job names and indices for all managed VMs |

When a new agent is discovered, it is registered and added to a managed deployment. No active operations are performed to reach the agent and query it; we rely only on heartbeats and agent alerts.
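
As a rough sketch of the discovery step, the helper below merges one director response into an in-memory index. The @sync_vms@ name and the shape of the index are assumptions made for illustration; the @agent_id@, @job@ and @index@ keys are the ones read by @AgentManager#add_agent@ in this release. The HTTP request itself is left out.

```ruby
require "json"
require "set"

# Hypothetical helper: merges the JSON body of a director response for one
# deployment into `agents` (agent id => attributes) and `deployments`
# (deployment name => set of agent ids).
def sync_vms(agents, deployments, deployment_name, response_body)
  vms = JSON.parse(response_body)
  active_ids = Set.new

  vms.each do |vm|
    # Skip entries that are not hashes or carry no agent id.
    next unless vm.is_a?(Hash) && vm["agent_id"]
    agents[vm["agent_id"]] = {
      deployment: deployment_name,
      job: vm["job"],
      index: vm["index"]
    }
    active_ids << vm["agent_id"]
  end

  # Forget agents the director no longer reports for this deployment.
  (deployments.fetch(deployment_name, Set.new) - active_ids).each do |id|
    agents.delete(id)
  end

  deployments[deployment_name] = active_ids
end
```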

h4. Agents analysis

This is a periodic operation that goes through all known agents. It first goes through all managed deployments, then analyzes rogue agents as well. The following procedure is used:

# If an agent has missed more than N heartbeats, the "Agent Missing" alert is generated.
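
One way to read the "missed more than N heartbeats" rule is sketched below. The @agent_missing?@ helper and its parameters are hypothetical; note that @Agent#timed_out?@ in this release compares elapsed time against a configured @agent_timeout@ instead of counting heartbeats.

```ruby
# Hypothetical check: true when more than `max_missed` heartbeats should
# have arrived since the last one was seen.
def agent_missing?(last_heartbeat_at, now, heartbeat_interval, max_missed)
  missed = ((now - last_heartbeat_at) / heartbeat_interval).floor
  missed > max_missed
end
```

With a 30-second interval and N = 3, an agent last heard from 100 seconds ago has missed three heartbeats and is not yet reported, while one silent for 120 seconds has missed four and is.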

h4. Alerts

An alert is a concept used by HM to flag and deliver information about important events. It includes the following data:

# Id
# Severity
# Source (usually deployment/job/index tuple)
# Timestamp
# Description
# Long description (optional)
# Tags (optional)
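
The field list could be modelled along these lines. @AlertRecord@ and @build_alert@ are hypothetical names, and treating the first five fields as required is an assumption drawn from the "(optional)" markers above.

```ruby
# Hypothetical value object carrying the alert fields listed above.
AlertRecord = Struct.new(:id, :severity, :source, :timestamp,
                         :description, :long_description, :tags)

REQUIRED_ALERT_FIELDS = [:id, :severity, :source, :timestamp, :description]

# Builds an AlertRecord from a hash of symbol keys, rejecting alerts that
# lack a required field. Tags default to an empty list.
def build_alert(attrs)
  missing = REQUIRED_ALERT_FIELDS.select { |field| attrs[field].nil? }
  unless missing.empty?
    raise ArgumentError, "missing alert fields: #{missing.join(', ')}"
  end

  AlertRecord.new(attrs[:id], attrs[:severity], attrs[:source],
                  attrs[:timestamp], attrs[:description],
                  attrs[:long_description], attrs[:tags] || [])
end
```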

h6. Alert Processor

Alert Processor is a module that registers incoming alerts and routes them to interested parties via the appropriate delivery agent. It should conform to the following interface:

| *Method* | *Arguments* | *Description* |
| *register_alert* | alert (object responding to :id, :severity, :timestamp, :description, :long_description, :source and :tags) | Registers an alert and invokes a delivery agent. The delivery agent might or might not deliver the alert immediately, depending on the implementation, so Alert Processor shouldn't make any assumptions about delivery (e.g. an agent might queue up several alerts and send them asynchronously). |
| *add_delivery_agent* | delivery_agent, options | Adds a delivery agent to a processor |

An alert id can be an arbitrary string; however, Alert Processor might use it to keep track of registered alerts and avoid processing the same alert twice. This way other HM modules can blindly register any incoming alerts and leave the dedup step to the alert processor.

Alerts are only persisted in HM memory (at least in the initial version), so losing HM means losing any undelivered alerts that might have been queued by a delivery agent or the alert processor.

If the alert processor has more than one delivery agent associated with it, it notifies all of them in order (e.g. we may want to notify both Zabbix and Pager Duty).
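
A minimal sketch of a processor with this interface, assuming id-based dedup kept in memory, might look like the following. @SketchAlertProcessor@ is a hypothetical class, not the @EventProcessor@ shipped in this release, and the @options@ argument is accepted but unused here.

```ruby
require "set"

# Hypothetical processor following the interface described above.
class SketchAlertProcessor
  def initialize
    @seen_ids = Set.new       # ids of alerts already registered (memory only)
    @delivery_agents = []     # notified in the order they were added
  end

  def add_delivery_agent(delivery_agent, options = {})
    @delivery_agents << delivery_agent
  end

  # Returns true when the alert was new and handed to every delivery agent,
  # false when an alert with the same id was registered before.
  def register_alert(alert)
    return false unless @seen_ids.add?(alert.id)
    @delivery_agents.each { |agent| agent.deliver(alert) }
    true
  end
end
```

Because the seen ids live only in the process, a restart forgets them along with any queued alerts, which matches the persistence caveat above.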

h6. Delivery Agent

Delivery Agent is a module that takes care of an alert delivery mechanism (such as an email, a Pager Duty alert, writing to a journal or even silently discarding the alert). It should conform to the following interface:

| *Method* | *Arguments* | *Description* |
| *deliver* | alert | Delivers alert or queues it for delivery. |

The initial implementation will have email and Pager Duty delivery agents.

Alert Processor is not pluggable; it is just one of the HM classes. Delivery agents are pluggable, but they are generally not changed at runtime; they are initialized from the HM configuration file on HM startup.
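
To illustrate the "delivers or queues" wording, here is a sketch of a delivery agent that queues alerts and sends them in batches. @QueueingDeliveryAgent@, its @flush@ method and the sender block are hypothetical; a real agent would presumably call @flush@ from a timer.

```ruby
# Hypothetical delivery agent that queues alerts and sends them in batches.
class QueueingDeliveryAgent
  # The block receives an array of alerts each time the queue is flushed.
  def initialize(&sender)
    @queue = []
    @sender = sender
  end

  # Queues the alert for later delivery; returns the current queue length.
  def deliver(alert)
    @queue << alert
    @queue.size
  end

  # Sends everything queued so far; returns the number of alerts sent.
  def flush
    batch = @queue.dup
    @queue.clear
    @sender.call(batch) unless batch.empty?
    batch.size
  end
end
```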

h4. Alerts from agent

HM subscribes to agent alerts on the message bus:

| *Subject* | hm.agent.alert.<agent_id> |
| *Payload* | JSON containing the following keys: id, service, event, action, description, timestamp, tags |

BOSH Agent is responsible for mapping any underlying supervisor alert format to the expected JSON payload and sending it to HM.

HM is responsible for interpreting the JSON payload, mapping it to a sequence of HM actions and possibly creating an HM alert compatible with the Alert Processor module. HM never dedups incoming alerts outside of Alert Processor (this adds some overhead to the incoming alert parser but shouldn't be too bad). Malformed payloads are ignored.

Job name and index are not included in an incoming agent alert; they are looked up in the director. If heartbeat came from a rogue agent and we have no job name and/or index, we note that fact in the alert description but don't worry too much about it (service name and agent id should be enough). We might consider including the agent IP address as part of the heartbeat so we can track down rogue agents.
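
The parsing step might be sketched as follows. @parse_agent_alert@ is a hypothetical helper, and it uses the standard @json@ library where the gem itself uses Yajl; the subject split mirrors the @subject.split('.', 4).last@ call in @AgentManager#process_event@. Treating any non-object JSON as malformed is an assumption.

```ruby
require "json"

# Keys named in the payload description above.
EXPECTED_ALERT_KEYS = %w[id service event action description timestamp tags]

# Hypothetical parser: returns a hash with the expected keys plus the agent
# id taken from the subject, or nil when the payload is malformed.
def parse_agent_alert(subject, payload)
  agent_id = subject.split(".", 4).last
  data = JSON.parse(payload)
  return nil unless data.is_a?(Hash)

  alert = data.select { |key, _value| EXPECTED_ALERT_KEYS.include?(key) }
  alert["agent_id"] = agent_id
  alert
rescue JSON::ParserError
  # Malformed payloads are ignored.
  nil
end
```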
data/bin/bosh-monitor
ADDED
@@ -0,0 +1,30 @@
#!/usr/bin/env ruby

require "bosh/monitor"
require "optparse"

config_file = nil

opts = OptionParser.new do |opts|
  opts.on("-c", "--config FILE", "configuration file") do |opt|
    config_file = opt
  end
end

opts.parse!(ARGV.dup)

if config_file.nil?
  puts opts
  exit 1
end

runner = Bosh::Monitor::Runner.new(config_file)

Signal.trap("INT") do
  runner.stop
  exit(1)
end

runner.run
data/bin/bosh-monitor-console
ADDED
@@ -0,0 +1,51 @@
#!/usr/bin/env ruby

require 'bosh/monitor'
require 'irb'
require 'irb/completion'

module Bosh
  module Monitor

    class Console
      include YamlHelper

      def self.start(context)
        new.start(context)
      end

      def start(context)
        config_file = nil

        opts = OptionParser.new do |opt|
          opt.on("-c", "--config [ARG]", "configuration file") { |c| config_file = c }
        end

        opts.parse!(ARGV)

        if config_file.nil?
          puts opts
          exit 1
        end

        puts "=> Loading #{config_file}"
        Bhm.config = load_yaml_file(config_file)

        begin
          require 'ruby-debug'
          puts "=> Debugger enabled"
        rescue LoadError
          puts "=> ruby-debug not found, debugger disabled"
        end

        puts "=> Welcome to BOSH Health Monitor console"

        IRB.start
      end

    end
  end
end

Bhm::Console.start(self)
data/bin/listener
ADDED
@@ -0,0 +1,58 @@
#!/usr/bin/env ruby

require "eventmachine"
require "nats/client"

class Listener

  def self.start
    new.start
  end

  def start
    filter = nil
    nats_uri = nil
    nats_subject = nil

    opts = OptionParser.new do |opt|
      opt.on("-f", "--filter ARG") { |f| filter = f }
      opt.on("-n", "--nats URI") { |n| nats_uri = n }
      opt.on("-s", "--subject ARG") { |s| nats_subject = s }
    end

    opts.parse!(ARGV)

    if nats_uri.nil?
      puts "Usage: listener [options] <nats_uri>"
    end

    nats_client_options = {
      :uri => nats_uri,
      :autostart => false
    }

    @nats = NATS.connect(nats_client_options)

    if nats_subject
      puts "> NATS subject is set to `#{nats_subject}'"
    else
      nats_subject = "bosh.hm.events"
    end

    if filter
      puts "> Filter is set to `#{filter}'"
    end

    puts "> Subscribing to events"
    @nats.subscribe(nats_subject) do |msg|
      if filter.nil? || msg =~ Regexp.new(Regexp.quote(filter))
        puts "#{Time.now.strftime("%Y-%m-%d %H:%M:%S")} >> " + msg
      end
    end
  end
end

EM.run do
  Listener.start
end
data/lib/bosh/monitor.rb
ADDED
@@ -0,0 +1,72 @@
module Bosh
  module Monitor
  end
end

Bhm = Bosh::Monitor

begin
  require 'fiber'
rescue LoadError
  unless defined? Fiber
    $stderr.puts 'FATAL: HealthMonitor requires Ruby implementation that supports fibers'
    exit 1
  end
end

require 'ostruct'
require 'set'

require 'em-http-request'
require 'eventmachine'
require 'logging'
require 'nats/client'
require 'sinatra'
require 'thin'
require 'securerandom'
require 'yajl'

# Helpers
require 'bosh/monitor/yaml_helper'

# Basic blocks
require 'bosh/monitor/agent'
require 'bosh/monitor/config'
require 'bosh/monitor/core_ext'
require 'bosh/monitor/director'
require 'bosh/monitor/director_monitor'
require 'bosh/monitor/errors'
require 'bosh/monitor/metric'
require 'bosh/monitor/runner'
require 'bosh/monitor/version'

# Processing
require 'bosh/monitor/agent_manager'
require 'bosh/monitor/event_processor'

# HTTP endpoints
require 'bosh/monitor/api_controller'

# Protocols
require 'bosh/monitor/protocols/tsdb'

# Events
require 'bosh/monitor/events/base'
require 'bosh/monitor/events/alert'
require 'bosh/monitor/events/heartbeat'

# Plugins
require 'bosh/monitor/plugins/base'
require 'bosh/monitor/plugins/dummy'
require 'bosh/monitor/plugins/http_request_helper'
require 'bosh/monitor/plugins/resurrector_helper'
require 'bosh/monitor/plugins/cloud_watch'
require 'bosh/monitor/plugins/datadog'
require 'bosh/monitor/plugins/paging_datadog_client'
require 'bosh/monitor/plugins/email'
require 'bosh/monitor/plugins/logger'
require 'bosh/monitor/plugins/nats'
require 'bosh/monitor/plugins/pagerduty'
require 'bosh/monitor/plugins/resurrector'
require 'bosh/monitor/plugins/tsdb'
require 'bosh/monitor/plugins/varz'
data/lib/bosh/monitor/agent.rb
ADDED
@@ -0,0 +1,51 @@
module Bosh::Monitor
  class Agent

    attr_reader :id
    attr_reader :discovered_at
    attr_accessor :updated_at

    ATTRIBUTES = [ :deployment, :job, :index, :cid ]

    ATTRIBUTES.each do |attribute|
      attr_accessor attribute
    end

    def initialize(id, opts={})
      raise ArgumentError, "Agent must have an id" if id.nil?

      @id = id
      @discovered_at = Time.now
      @updated_at = Time.now
      @logger = Bhm.logger
      @intervals = Bhm.intervals

      @deployment = opts[:deployment]
      @job = opts[:job]
      @index = opts[:index]
      @cid = opts[:cid]
    end

    def name
      if @deployment && @job && @index
        "#{@deployment}: #{@job}(#{@index}) [id=#{@id}, cid=#{@cid}]"
      else
        state = ATTRIBUTES.inject([]) do |acc, attribute|
          value = send(attribute)
          acc << "#{attribute}=#{value}" if value
          acc
        end

        "agent #{@id} [#{state.join(", ")}]"
      end
    end

    def timed_out?
      (Time.now - @updated_at) > @intervals.agent_timeout
    end

    def rogue?
      (Time.now - @discovered_at) > @intervals.rogue_agent_alert && @deployment.nil?
    end
  end
end
data/lib/bosh/monitor/agent_manager.rb
ADDED
@@ -0,0 +1,295 @@
module Bosh::Monitor

  class AgentManager
    attr_reader :heartbeats_received
    attr_reader :alerts_received
    attr_reader :alerts_processed

    attr_accessor :processor

    def initialize(event_processor)
      # hash of agent id to agent structure (see add_agent())
      @agents = { }

      # hash of deployment name to set of agent ids
      @deployments = { }

      @logger = Bhm.logger
      @heartbeats_received = 0
      @alerts_received = 0
      @alerts_processed = 0

      @processor = event_processor
    end

    # Get a hash of agent id -> agent object for all agents associated with the deployment
    def get_agents_for_deployment(deployment_name)
      agent_ids = @deployments[deployment_name]
      @agents.select { |key, value| agent_ids.include?(key) }
    end

    def lookup_plugin(name, options = {})
      plugin_class = nil
      begin
        class_name = name.to_s.split("_").map(&:capitalize).join
        plugin_class = Bosh::Monitor::Plugins.const_get(class_name)
      rescue NameError => e
        raise PluginError, "Cannot find `#{name}' plugin"
      end

      plugin_class.new(options)
    end

    def setup_events
      Bhm.set_varz("heartbeats_received", 0)

      @processor.enable_pruning(Bhm.intervals.prune_events)
      Bhm.plugins.each do |plugin|
        @processor.add_plugin(lookup_plugin(plugin["name"], plugin["options"]), plugin["events"])
      end

      Bhm.nats.subscribe("hm.agent.heartbeat.*") do |message, reply, subject|
        process_event(:heartbeat, subject, message)
      end

      Bhm.nats.subscribe("hm.agent.alert.*") do |message, reply, subject|
        process_event(:alert, subject, message)
      end

      Bhm.nats.subscribe("hm.agent.shutdown.*") do |message, reply, subject|
        process_event(:shutdown, subject, message)
      end
    end

    def agents_count
      @agents.size
    end

    def deployments_count
      @deployments.size
    end

    # Syncs deployments list received from director
    # with HM deployments.
    # @param deployments Array list of deployments returned by director
    def sync_deployments(deployments)
      managed = Set.new(deployments.map { |d| d["name"] })
      all = Set.new(@deployments.keys)

      (all - managed).each do |stale_deployment|
        @logger.warn("Found stale deployment #{stale_deployment}, removing...")
        remove_deployment(stale_deployment)
      end
    end

    def sync_agents(deployment, vms)
      managed_agent_ids = @deployments[deployment] || Set.new
      active_agent_ids = Set.new

      vms.each do |vm|
        if add_agent(deployment, vm)
          active_agent_ids << vm["agent_id"]
        end
      end

      (managed_agent_ids - active_agent_ids).each do |agent_id|
        remove_agent(agent_id)
      end
    end

    def remove_deployment(name)
      agent_ids = @deployments[name]

      agent_ids.to_a.each do |agent_id|
        @agents.delete(agent_id)
      end

      @deployments.delete(name)
    end

    def remove_agent(agent_id)
      @agents.delete(agent_id)
      @deployments.each_pair do |deployment, agents|
        agents.delete(agent_id)
      end
    end

    # Processes VM data from BOSH Director,
    # extracts relevant agent data, wraps it into Agent object
    # and adds it to a list of managed agents.
    def add_agent(deployment_name, vm_data)
      unless vm_data.kind_of?(Hash)
        @logger.error("Invalid format for VM data: expected Hash, got #{vm_data.class}: #{vm_data}")
        return false
      end

      agent_id = vm_data["agent_id"]
      agent_cid = vm_data["cid"]

      if agent_id.nil?
        @logger.warn("No agent id for VM: #{vm_data}")
        return false
      end

      # Idle VMs, we don't care about them, but we still want to track them
      if vm_data["job"].nil?
        @logger.debug("VM with no job found: #{agent_id}")
      end

      agent = @agents[agent_id]

      if agent.nil?
        @logger.debug("Discovered agent #{agent_id}")
        agent = Agent.new(agent_id)
        @agents[agent_id] = agent
      end

      agent.deployment = deployment_name
      agent.job = vm_data["job"]
      agent.index = vm_data["index"]
      agent.cid = vm_data["cid"]

      @deployments[deployment_name] ||= Set.new
      @deployments[deployment_name] << agent_id
      true
    end

    def analyze_agents
      @logger.info "Analyzing agents..."
      started = Time.now

      processed = Set.new
      count = 0

      # Agents from managed deployments
      @deployments.each_pair do |deployment_name, agent_ids|
        agent_ids.each do |agent_id|
          analyze_agent(agent_id)
          processed << agent_id
          count += 1
        end
      end

      # Rogue agents (hey there Solid Snake)
      (@agents.keys.to_set - processed).each do |agent_id|
        @logger.warn("Agent #{agent_id} is not a part of any deployment")
        analyze_agent(agent_id)
        count += 1
      end

      @logger.info("Analyzed %s, took %s seconds" % [ pluralize(count, "agent"), Time.now - started ])
      count
    end

    def analyze_agent(agent_id)
      agent = @agents[agent_id]
      ts = Time.now.to_i

      if agent.nil?
        @logger.error("Can't analyze agent #{agent_id} as it is missing from agents index, skipping...")
        return false
      end

      if agent.timed_out? && agent.rogue?
        # Agent has timed out but it was never
        # actually a proper member of the deployment,
        # so we don't really care about it
        remove_agent(agent.id)
        return
      end

      if agent.timed_out?
        @processor.process(:alert,
          severity: 2,
          source: agent.name,
          title: "#{agent.id} has timed out",
          created_at: ts,
          deployment: agent.deployment,
          job: agent.job,
          index: agent.index)
      end

      if agent.rogue?
        @processor.process(:alert,
          :severity => 2,
          :source => agent.name,
          :title => "#{agent.id} is not a part of any deployment",
          :created_at => ts)
      end

      true
    end

    def process_event(kind, subject, payload = {})
      kind = kind.to_s
      agent_id = subject.split('.', 4).last
      agent = @agents[agent_id]

      if agent.nil?
        # There might be more than a single shutdown event,
        # we are only interested in processing it if agent
        # is still managed
        return if kind == "shutdown"

        @logger.warn("Received #{kind} from unmanaged agent: #{agent_id}")
        agent = Agent.new(agent_id)
        @agents[agent_id] = agent
      else
        @logger.debug("Received #{kind} from #{agent_id}: #{payload}")
      end

      case payload
      when String
        message = Yajl::Parser.parse(payload)
      when Hash
        message = payload
      end

      case kind.to_s
      when "alert"
        on_alert(agent, message)
      when "heartbeat"
        on_heartbeat(agent, message)
      when "shutdown"
        on_shutdown(agent, message)
      else
        @logger.warn("No handler found for `#{kind}' event")
      end

    rescue Yajl::ParseError => e
      @logger.error("Cannot parse incoming event: #{e}")
    rescue Bhm::InvalidEvent => e
      @logger.error("Invalid event: #{e}")
    end

    def on_alert(agent, message)
      if message.is_a?(Hash) && !message.has_key?("source")
        message["source"] = agent.name
      end

      @processor.process(:alert, message)
      @alerts_processed += 1
      Bhm.set_varz("alerts_processed", @alerts_processed)
    end

    def on_heartbeat(agent, message)
      agent.updated_at = Time.now

      if message.is_a?(Hash)
        message["timestamp"] = Time.now.to_i if message["timestamp"].nil?
        message["agent_id"] = agent.id
        message["deployment"] = agent.deployment
      end

      @processor.process(:heartbeat, message)
      @heartbeats_received += 1
      Bhm.set_varz("heartbeats_received", @heartbeats_received)
    end

    def on_shutdown(agent, message)
      @logger.info("Agent `#{agent.id}' shutting down...")
      remove_agent(agent.id)
    end

  end
end