log2json 0.1.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/Gemfile ADDED
@@ -0,0 +1,2 @@
+ source "https://rubygems.org"
+ gemspec
data/Gemfile.lock ADDED
@@ -0,0 +1,24 @@
+ PATH
+   remote: .
+   specs:
+     log2json (0.1.0)
+       jls-grok (~> 0.10.10)
+       persistent_http (~> 1.0.5)
+       redis (~> 3.0.2)
+
+ GEM
+   remote: https://rubygems.org/
+   specs:
+     cabin (0.5.0)
+     gene_pool (1.3.0)
+     jls-grok (0.10.10)
+       cabin (~> 0.5.0)
+     persistent_http (1.0.5)
+       gene_pool (>= 1.3)
+     redis (3.0.3)
+
+ PLATFORMS
+   ruby
+
+ DEPENDENCIES
+   log2json!
data/README ADDED
@@ -0,0 +1,66 @@
+ Log2json lets you read, filter, and send logs as JSON objects via Unix pipes.
+ It is inspired by Logstash and is meant to be compatible with it at the JSON
+ event/record level, so that it can easily work with Kibana.
+
+ Reading logs is done via a shell script (e.g., `tail`) running in its own
+ process. You then configure (see the `syslog2json` or the `nginxlog2json`
+ script for examples) and run your filters in Ruby using the `Log2Json` module
+ and its contained helper classes.
+
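As a minimal sketch of such a filter setup (condensed from the bundled
`bin/syslog2json` script shown later in this diff; adapt the filter classes to
your own logs), the Ruby side looks roughly like this:

    #!/usr/bin/env ruby
    require 'log2json'
    require 'log2json/filters/syslog'

    # Map each log type to the list of filters that should process it.
    FILTERS = Hash.new { |hash, key| hash[key] = [] }
    [
      ::Log2Json::Filters::SyslogFilter.new('SysLogFilter'),
    ].each { |filter| FILTERS[filter.type] << filter }

    ENV['type'] = 'syslog'   # treat all input lines as syslog records
    ::Log2Json.main(FILTERS) # read stdin, filter, write JSON records to stdout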
+ `Log2Json` reads logs from stdin (one log record per line), parses the log
+ lines into JSON records, and then serializes and writes the records to stdout,
+ which can then be piped to another process for further processing or for
+ shipping elsewhere.
+
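Each record is a Logstash-style JSON event. Purely as an illustration (the
exact fields depend on the filters you configure; this one assumes the nginx
error-log filter from `bin/nginxlog2json`), a serialized record might look
roughly like:

    {"@timestamp":"2013-05-01T12:34:56+00:00",
     "@type":"nginx-error",
     "@tags":["nginx","http"],
     "@fields":{"level":"error","pid":"1234","tid":"0","message":"..."}}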
+ Currently, Log2json ships with a `tail-log` script that can be run as the input
+ process. It is the same as using the Linux `tail` utility with the `-v -F`
+ options, except that it also tracks the positions (as the number of lines read
+ from the beginning of each file) in a few files on the file system, so that if
+ the input process is interrupted, the next run can continue reading from where
+ it left off, provided the same files are followed again. This feature is
+ similar to the sincedb feature of Logstash's file input.
+
+ Note: If you don't need the tracking feature (i.e., you are fine with always
+ tailing from the end of the file with `-v -F -n0`), then you can just use the
+ `tail` utility that comes with your Linux distribution (or, more specifically,
+ the `tail` from GNU coreutils). Other versions of the `tail` utility may also
+ work, but are not tested. The input protocol expected by Log2json is very
+ simple and is documented in the source code.
+
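Roughly, the expected input is what `tail -v -F` produces: a header line naming
the file currently being read, followed by that file's log lines, e.g.:

    ==> /var/log/syslog <==
    <one syslog line>
    <another syslog line>

The patched `tail` bundled with the gem additionally appends an event marker,
"[new_file]" or "[truncated]", to the header whenever it detects a new or
truncated file (see the comments in `bin/tail-log.sh`).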
+ ** The `tail-log` script uses a patched version of `tail` from the GNU coreutils
+    package. A binary of the `tail` utility compiled for Ubuntu 12.04 LTS is
+    included with the Log2json gem. If the binary doesn't work for your
+    distribution, then you'll need to get GNU coreutils-8.13, apply the patch (it
+    can be found in the src/ directory of the installed gem), and then replace
+    the bin/tail binary in the directory of the installed gem with your version
+    of the binary. **
+
+ P.S. If you know of a way to configure and compile ONLY the tail program in
+ coreutils, please let me know! The reason I'm not building tail after gem
+ installation is that it takes too long to configure && make, because that
+ actually builds every utility in coreutils.
+
+
+ For shipping logs to Redis, there's the `lines2redis` script that can be used as
+ the output process in the pipe. For shipping logs from Redis to Elasticsearch,
+ Log2json provides a `redis2es` script.
+
+ Finally, here's an example of Log2json in action:
+
+ From a client machine:
+
+     tail-log /var/log/{sys,mail}log /var/log/{kern,auth}.log | syslog2json |
+       redis_queue=jsonlogs \
+       flush_size=20 \
+       flush_interval=30 \
+       lines2redis host.to.redis.server 6379 0  # use redis DB 0
+
+
+ On the Redis server:
+
+     redis_queue=jsonlogs redis2es host.to.es.server
+
+
+
+
+
data/bin/lines2redis ADDED
@@ -0,0 +1,73 @@
+ #!/usr/bin/env ruby
+ #
+ # A simple script that reads lines from STDIN and dumps them to a Redis list.
+
+ require 'thread'
+ require 'redis'
+ require 'logger'
+
+ @log = Logger.new(STDOUT)
+
+ def const(name, default)
+   name = name.to_s.downcase
+   val = ENV["lines2redis_#{name}"] || ENV[name]
+   val = val.to_i() if !val.nil? && default.is_a?(Fixnum)
+   Object.const_set(name.upcase, val || default)
+ end
+
+ const(:REDIS_QUEUE, 'jsonlogs')
+ const(:FLUSH_SIZE, 100)
+ const(:FLUSH_INTERVAL, 30) # seconds
+
+ config = {}
+ [:host, :port, :db].each_with_index do |s, i|
+   config[s] = ARGV[i] if ARGV[i]
+ end
+ @redis = Redis.new(config)
+ ARGV.clear()
+
+ @lock = Mutex.new
+ @queue = []
+
+ def main
+   Thread.new do
+     loop do
+       sleep(FLUSH_INTERVAL)
+       @lock.synchronize do
+         flush()
+       end
+     end
+   end
+   while line = gets()
+     line.chomp!
+     @lock.synchronize do
+       @queue << line
+       flush() if @queue.size >= FLUSH_SIZE
+     end
+   end
+   flush()
+ end
+
+ def flush
+   return if @queue.empty?
+   if @queue.size >= FLUSH_SIZE*2
+     @log.warn('Aborting, dumping queued log messages to stdout!')
+     @queue.each { |msg| puts msg }
+     raise "Queue has grown too big (size=#{@queue.size})!"
+   end
+   begin
+     @redis.rpush(REDIS_QUEUE, @queue)
+     @queue.clear()
+   rescue Redis::BaseConnectionError
+     @log.error($!)
+   end
+ end
+
+ begin
+   main()
+ ensure
+   @log.warn("Terminating! Flushing the queue...")
+   flush()
+ end
+
+
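For reference, `lines2redis` takes the Redis host, port, and DB number as
positional arguments, and its tuning knobs come from the environment (either
the plain lowercase name or the same name prefixed with `lines2redis_`, which
takes precedence). A sketch of an invocation, with a placeholder producer and
host name:

    some-json-log-producer |
      flush_size=50 flush_interval=10 \
      lines2redis redis.example.com 6379 0

Records accumulate in the Redis list named by `redis_queue` (`jsonlogs` by
default) until a `redis2es` consumer drains them.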
data/bin/nginxlog2json ADDED
@@ -0,0 +1,58 @@
+ #!/usr/bin/env ruby
+
+ require 'date'
+ require 'log2json'
+ require 'log2json/filters/nginx_access'
+ # Require your log2json filter gems here if needed
+
+ # FILTERS will be { type1=>[ filterX, filterY, ...], type2=>[...], ... }
+ FILTERS = Hash.new { |hash, key| hash[key] = [] }
+
+ # This method will be used later by the GrokFilter to process the JSON log records.
+ def nginx_error_log_proc(record)
+   return nil if record.nil? # return nil if the record doesn't match our regexes
+   fields = record['@fields']
+   record['@timestamp'] = DateTime.strptime(fields['datetime'], '%Y/%m/%d %T')
+   fields.delete('datetime')
+   record['@tags'] << 'nginx' << 'http'
+   record
+ end
+
+ # Configure log filters
+ [
+   # You can subclass the GrokFilter and use it here, like this NginxAccessLogFilter
+   ::Log2Json::Filters::NginxAccessLogFilter.new('NginxAccessLogFilter'),
+
+   # Or you can configure the GrokFilter directly here, like this:
+   ::Log2Json::Filters::GrokFilter.new(
+     'nginx-error',          # type
+     'NginxErrorLogFilter',  # name
+
+     # list of Grok regexes
+     ['%{DATESTAMP:datetime} \[(?<level>[^\]]+)\] %{NUMBER:pid}#%{NUMBER:tid}: %{GREEDYDATA:message}'],
+
+     &method(:nginx_error_log_proc)
+   ),
+
+   # You can add more filters if needed
+
+ ].each { |filter| FILTERS[filter.type] << filter }
+
+ # Set up the file-path-to-type map
+ SPITTER = ::Log2Json::Spitter.new(STDIN,
+   ENV['type'] || {
+     %r</access\.log$> => 'nginx-access',
+     %r</error\.log$> => 'nginx-error',
+     nil => 'unknown' # the default type to apply when there's no match.
+   },
+   # So, e.g., if a log record comes from /var/log/nginx/access.log, then it will be marked with type nginx-access
+   # and all filters of that type will process such a log record.
+
+   # Give users the ability to set tags and fields via ENV vars that will apply to ALL log records.
+   :TAGS => ENV['tags'] || '',
+   :FIELDS => ENV['fields'] || '',
+ )
+
+
+ # Start processing log lines
+ ::Log2Json.main(FILTERS, :spitter => SPITTER)
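An illustrative invocation (host names are placeholders), mirroring the README
example but with nginx logs as input; the path-to-type map above decides which
filters see each record:

    tail-log /var/log/nginx/access.log /var/log/nginx/error.log |
      nginxlog2json |
      lines2redis host.to.redis.server 6379 0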
data/bin/redis2es ADDED
@@ -0,0 +1,146 @@
+ #!/usr/bin/env ruby
+
+
+ require 'logger'
+ require 'date'
+ require 'net/http'
+ require 'json'
+ require 'redis'
+ require 'persistent_http' # 1.0.5
+ # depends on gene_pool 1.3.0
+
+ def show_usage_and_exit(status=1)
+   puts "Usage: #{$0} <elasticsearch_host> [port]"
+   exit status
+ end
+
+ ES_HOST = ARGV[0] || show_usage_and_exit
+ ES_PORT = ARGV[1] || 9200
+
+ def const(name, default)
+   name = name.to_s.downcase
+   val = ENV["redis2es_#{name}"] || ENV[name]
+   val = val.to_i() if !val.nil? && default.is_a?(Fixnum)
+   Object.const_set(name.upcase, val || default)
+ end
+
+ # These constants can be overridden via environment variables in lower case, or
+ # prefixed with redis2es_ (the latter has higher precedence).
+ # E.g., flush_size=100 or redis2es_flush_size=100
+
+ const(:REDIS_HOST, 'localhost')
+ const(:REDIS_PORT, 6379)
+
+ # name of the redis list that queues the incoming log messages.
+ const(:REDIS_QUEUE, 'jsonlogs')
+
+ # the encoding assumed for the log records.
+ const(:LOG_ENCODING, 'UTF-8')
+
+ # name of the ES index for the logs; used as the format string for DateTime#strftime.
+ const(:LOG_INDEX_NAME, 'log2json-%Y.%m.%d')
+
+ # max number of log records allowed in the queue.
+ const(:FLUSH_SIZE, 200)
+
+ # flush the queue roughly every FLUSH_TIMEOUT seconds.
+ # This value must be >= 2 and it must be a multiple of 2.
+ const(:FLUSH_TIMEOUT, 60)
+ if FLUSH_TIMEOUT < 2 or FLUSH_TIMEOUT % 2 != 0
+   STDERR.write("Invalid FLUSH_TIMEOUT=#{FLUSH_TIMEOUT}\n")
+   exit 1
+ end
+
+ LOG = Logger.new(STDOUT)
+ HTTP_LOG = Logger.new(STDOUT)
+ HTTP_LOG.level = Logger::WARN
+
+ @@http = PersistentHTTP.new(
+   :name => 'redis2es_http_client',
+   :logger => HTTP_LOG,
+
+   # this script is the only consumer of the pool and it uses only one connection at a time.
+   :pool_size => 1,
+   # Note: if the ES server can handle the load, we might be able to run multiple instances
+   # of this script to process the queue and send logs to ES with multiple connections.
+
+   # we'll retry posting to ES, since having duplicate data in ES is better than not having the data.
+   :force_retry => true,
+   :url => "http://#{ES_HOST}:#{ES_PORT}"
+ )
+
+ @queue = []
+ @redis = Redis.new(host: REDIS_HOST, port: REDIS_PORT)
+
+ def flush_queue
+   if not @queue.empty?
+     req = Net::HTTP::Post.new('/_bulk')
+     req.body = @queue.join("\n") + "\n"
+     response = nil
+     begin
+       response = @@http.request(req)
+     ensure
+       if response.nil? or response.code != '200'
+         LOG.error(response.body) if not response.nil?
+         #FIXME: might be a good idea to push the undelivered log records to another queue in redis.
+         LOG.warn("Failed sending bulk request (#{@queue.size} records) to ES! Logging the request body instead.")
+         LOG.info("Failed request body:\n" + req.body)
+       end
+     end
+     @queue.clear()
+   end
+ end
+
+ # Determines the name of the index in Elasticsearch from the given log record's timestamp.
+ def es_index(tstamp)
+   begin
+     t = DateTime.parse(tstamp)
+   rescue ArgumentError
+     LOG.warn("Failed parsing timestamp: #{tstamp}")
+     t = DateTime.now
+   end
+   t.strftime(LOG_INDEX_NAME)
+ end
+
+ def enqueue(logstr)
+   #FIXME: might be safer to do a transcoding with replacements for invalid or undefined characters.
+   log = JSON.load(logstr.force_encoding(LOG_ENCODING))
+
+   # add a header for each entry according to http://www.elasticsearch.org/guide/reference/api/bulk/
+   @queue << {"index" => {"_index" => es_index(log["@timestamp"]), "_type" => log["@type"]}}.to_json
+   @queue << log.to_json
+ end
+
+ def main
+   time_start = Time.now
+   loop do
+     # wait for input from the redis queue
+     ret = @redis.blpop(REDIS_QUEUE, timeout: FLUSH_TIMEOUT/2)
+     enqueue(ret[1]) if ret != nil
+
+     # try to queue up to FLUSH_SIZE
+     while @queue.size < FLUSH_SIZE do
+       # Logstash's redis input actually uses a Lua script to do the lpop in one request,
+       # but let's keep it simple and stupid first here.
+       body = @redis.lpop(REDIS_QUEUE)
+       break if body.nil?
+       enqueue(body)
+     end
+
+     # flush when the queue is full (>= guards against overshooting FLUSH_SIZE) or when time is up.
+     if @queue.size >= FLUSH_SIZE or (Time.now - time_start) >= FLUSH_TIMEOUT
+       time_start = Time.now # reset timer upon a flush or timeout
+       flush_queue()
+     end
+
+   end # loop
+ end
+
+ begin
+   main()
+ ensure
+   LOG.warn("Terminating! Flushing the queue (size=#{@queue.size})...")
+   flush_queue()
+ end
+
+
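For each record pulled off the Redis list, `enqueue` pushes the two lines that
Elasticsearch's bulk API expects: an action line naming the target index and
type, then the record itself. A flushed request body therefore looks roughly
like this (values are illustrative):

    POST /_bulk
    {"index":{"_index":"log2json-2013.05.01","_type":"syslog"}}
    {"@timestamp":"2013-05-01T12:34:56+00:00","@type":"syslog","@tags":[],"@fields":{}}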
data/bin/syslog2json ADDED
@@ -0,0 +1,23 @@
+ #!/usr/bin/env ruby
+
+ require 'log2json'
+ require 'log2json/filters/syslog'
+ # Require your log2json filter gems here if needed
+
+ # Configure log filters
+ # FILTERS will be { type1=>[ filterX, filterY, ...], type2=>[...], ... }
+ FILTERS = Hash.new { |hash, key| hash[key] = [] }
+
+ [
+   # As a demo, we set up the built-in syslog filter here.
+   ::Log2Json::Filters::SyslogFilter.new('SysLogFilter'),
+
+   # You can add more filters if needed
+
+ ].each { |filter| FILTERS[filter.type] << filter }
+
+ # Assume the type of the input logs
+ ENV['type'] = 'syslog'
+
+ # Start processing log lines
+ ::Log2Json.main(FILTERS)
data/bin/tail ADDED
Binary file
data/bin/tail-log ADDED
@@ -0,0 +1,7 @@
+ #!/usr/bin/env ruby
+ #
+ # Wrapper for running the tail-log.sh shell script.
+ require 'log2json'
+ loc = Log2Json.method(:main).source_location[0]
+ loc = File.expand_path(File.join(loc, '..', '..', 'bin', 'tail-log.sh'))
+ exec(loc, *ARGV)
data/bin/tail-log.sh ADDED
@@ -0,0 +1,67 @@
+ #!/bin/bash
+ #
+ set -e
+
+ # Find out the absolute path to the tail utility.
+ # This is a patched version of the tail utility from GNU coreutils-8.13, compiled for Ubuntu 12.04 LTS.
+ # The difference is that if a header will be shown (i.e., with -v or when multiple files are specified),
+ # it will also print "==> file.name <== [event]" to stdout whenever a file truncation or a new file is
+ # detected. [event] will be one of "[new_file]" or "[truncated]".
+ TAIL=$(
+ ruby -- - <<'EOF'
+ require 'log2json'
+ loc = Log2Json.method(:main).source_location[0]
+ puts File.expand_path(File.join(loc, '..', '..', 'bin', 'tail'))
+ EOF
+ )
+
+ # Turn each path argument into an absolute path.
+ OIFS=$IFS
+ IFS="
+ "
+ set -- $(ruby -e "ARGV.each {|p| puts File.absolute_path(p)}" "$@")
+ IFS=$OIFS
+
+ # This is where we store the files that track the positions of the
+ # files we are tailing.
+ SINCEDB_DIR=${SINCEDB_DIR:-~/.tail-log}
+ mkdir -p "$SINCEDB_DIR" || true
+
+
+ # Helper to build the arguments to tail.
+ # Specifically, we expect the use of GNU tail as found in GNU coreutils.
+ # It allows us to follow (with -F) files across rotations or truncations.
+ # It also lets us start tailing from the n-th line of a file.
+ build_tail_args() {
+   local i=${#TAIL_ARGS[*]}
+   local fpath t line sincedb_path
+   for fpath in "$@"
+   do
+     sincedb_path=$SINCEDB_DIR/$fpath.since
+     if [ -r "$sincedb_path" ]; then
+       read line < "$sincedb_path"
+       t=($line)
+       # if the inode number is unchanged and the current file size is not smaller,
+       # then we start tailing from 1 + the line number recorded in the sincedb.
+       if [[ ${t[0]} == $(stat -c "%i" "$fpath") && ${t[1]} -le $(stat -c "%s" "$fpath") ]]; then
+         TAIL_ARGS[$((i++))]="-n+$((t[2] + 1))"
+         # tail -n+N means start tailing from the N-th line of the file,
+         # and we're even allowed to specify different -n+N for different files!
+         TAIL_ARGS[$((i++))]=$fpath
+         continue
+       fi
+     fi
+     TAIL_ARGS[$((i++))]="-n+$(($(wc -l "$fpath" | cut -d' ' -f1) + 1))"
+     # Note: we can't just ask tail to seek to the end here (i.e., with -n0), since
+     # then we'd lose track of the line count.
+     # Note: if fpath doesn't exist yet, then the above evaluates to "-n+1", which
+     # is fine.
+     TAIL_ARGS[$((i++))]=$fpath
+   done
+ }
+
+ TAIL_ARGS=(-v -F)
+ build_tail_args "$@"
+
+ $TAIL "${TAIL_ARGS[@]}" | track-tails "$SINCEDB_DIR" "${TAIL_ARGS[@]}"
+
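For reference, each per-file state file lives at "$SINCEDB_DIR/<absolute file
path>.since" and, judging from the checks in build_tail_args above, holds a
single line of the form "<inode> <file size in bytes> <lines read>". An
illustrative entry:

    $ cat ~/.tail-log/var/log/syslog.since
    1048601 73256 912

Given that entry, and an unchanged inode and a size no smaller than recorded,
the script would pass "-n+913 /var/log/syslog" to tail, resuming right after
the last line previously read.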