reqsample 0.0.1

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: f40922b479c65c01c72272a9dffe7d596b9d91db
4
+ data.tar.gz: cabd856beb4100130e670a90f55aa93d2f2b461d
5
+ SHA512:
6
+ metadata.gz: 8e060ba69839a2abdd2ef117210c11fe6469ec32a6d4b7710f070747208929cf9e31cd1af109b4be563d4cce306b2cb03f69b5dc0300691ae87f6511c9f1868a
7
+ data.tar.gz: db8c94f6ad928c5926a8463eca20bc31813d6e592e0fa2896c6d8a0658db1f321e55c955f9d8459882022dc2672d2516f1a156d07ead9249ad6625b4646a69ca
data/.document ADDED
@@ -0,0 +1,5 @@
1
+ lib/**/*.rb
2
+ README.md
3
+ ChangeLog.md
4
+
5
+ LICENSE.txt
data/.gitignore ADDED
@@ -0,0 +1,6 @@
1
+ /.bundle
2
+ /Gemfile.lock
3
+ /html/
4
+ /pkg/
5
+ /vendor/cache/*.gem
6
+ *.gem
data/.rdoc_options ADDED
@@ -0,0 +1,16 @@
1
+ --- !ruby/object:RDoc::Options
2
+ encoding: UTF-8
3
+ static_path: []
4
+ rdoc_include:
5
+ - .
6
+ charset: UTF-8
7
+ exclude:
8
+ hyperlink_all: false
9
+ line_numbers: false
10
+ main_page: README.md
11
+ markup: markdown
12
+ show_hash: false
13
+ tab_width: 8
14
+ title: reqsample Documentation
15
+ visibility: :protected
16
+ webcvs:
data/.rspec ADDED
@@ -0,0 +1 @@
1
+ --colour --format documentation
data/ChangeLog.md ADDED
@@ -0,0 +1,12 @@
1
+ # Changelog
2
+
3
+ All notable changes to this project will be documented in this file.
4
+
5
+ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
6
+ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).
7
+
8
+ ## [Unreleased]
9
+
10
+ ## [0.0.1] (Sep 12, 2017)
11
+
12
+ - Initial release
data/Gemfile ADDED
@@ -0,0 +1,3 @@
1
+ source 'https://rubygems.org'
2
+
3
+ gemspec
data/Guardfile ADDED
@@ -0,0 +1,15 @@
1
+ notification ENV['INSIDE_EMACS'].nil? ? :tmux : :emacs,
2
+ display_message: true
3
+
4
+ guard :rspec, cmd: 'bundle exec rspec' do
5
+ require 'guard/rspec/dsl'
6
+ dsl = Guard::RSpec::Dsl.new(self)
7
+
8
+ # RSpec files
9
+ rspec = dsl.rspec
10
+ watch(rspec.spec_helper) { rspec.spec_dir }
11
+ watch(rspec.spec_support) { rspec.spec_dir }
12
+ watch(rspec.spec_files)
13
+
14
+ watch(%r{^lib/(.+)\.rb}) { |m| "spec/lib/#{m[1]}_spec.rb" }
15
+ end
data/LICENSE.txt ADDED
@@ -0,0 +1,20 @@
1
+ Copyright (c) 2017 Tyler Langlois
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,93 @@
1
+ # reqsample
2
+
3
+ * [Homepage](https://rubygems.org/gems/reqsample)
4
+ * [Documentation](http://rubydoc.info/gems/reqsample/frames)
5
+
6
+ ## Description
7
+
8
+ `reqsample` is a utility to generate somewhat-realistic public HTTP traffic. If you've ever needed a large corpus of Apache or nginx logs to test geoip processing, to exercise a Logstash pipeline, or to serve as the source for a demo, this utility is for you.
9
+
10
+ Data is sampled from publicly available data (sources noted in the [credits](#credits) section) and, whenever possible, the relative frequency of values in each dataset is reflected in the random data. For example, Chrome will appear frequently in the `User-Agent` string since it is a common browser, and the most common source IPs originate from China due to the high amount of traffic observed from the country.
11
+
12
+ Note that fine-tuning the generation scheme requires adjusting the normal distribution curve and a few other tricky parameters, but usable defaults work out of the box.
13
+
14
+ ### Quickstart
15
+
16
+ Generate 1,000 log entries in Apache combined log format, spanning the last 24 hours and peaking 12 hours ago, and print them all to stdout:
17
+
18
+ ```shell
19
+ $ gem install reqsample
20
+ $ reqsample generate
21
+ ```
22
+
23
+ See `reqsample help` for a list of commands, flags, and options.
24
+
25
+ ## Features
26
+
27
+ - Weighted sampling for country of origin, user agents, and response codes to simulate real traffic.
28
+ - Usable in standalone command form or as a Ruby library.
29
+ - Ability to generate all traffic at once in bulk or streamed over time.
30
+ - Frequency and count of request events following a statistically normal distribution.
31
+
32
+ There are several different parameters that can be changed to modify how data is generated. In general:
33
+
34
+ - A number of logs to be generated over a given period needs to be chosen, which by default is 1,000.
35
+ - These log events are generated along a normal distribution curve with a configurable peak, standard deviation, and time cutoff. The defaults assume that you want to generate 1,000 logs over the previous 24 hours.
36
+ - The peak is 12 hours ago by default.
37
+ - The standard deviation is set to 4 by default, which the generator interprets as 4 hours.
38
+ - The normal distribution of log data is truncated at 12 hours on either side of the peak by default, which means all timestamps fall within the past 24 hours.
39
+
40
+ ## Examples
41
+
42
+ There are two ways to use `reqsample`: through the installed executable or as a library.
43
+
44
+ ### Command-Line Utility
45
+
46
+ Stream 5,000 log events to stdout with a tighter standard deviation:
47
+
48
+ ```
49
+ reqsample stream --count 5000 --stdev 1
50
+ ```
51
+
52
+ ### Ruby Library
53
+
54
+ The `ReqSample::Generator` class needs to be instantiated first, which parses and sets up several enumerables from which values will be sampled.
55
+
56
+ ```ruby
57
+ gen = ReqSample::Generator.new
58
+ ```
59
+
60
+ The `produce` method is the central way to generate log values:
61
+
62
+ ```ruby
63
+ gen.produce
64
+ ```
65
+
66
+ This returns an array of logs generated with the previously mentioned parameters. If a block is given to the `produce` method, the results will instead be streamed to the block by yielding each log event, simulating live incoming traffic.
67
+
68
+ ## Install
69
+
70
+ ```shell
71
+ $ gem install reqsample
72
+ ```
73
+
74
+ ## Development
75
+
76
+ Standard bundler practices are used: set up your environment with `bundle install` and use `bundle exec rake test` to run the still-incomplete test suite.
77
+
78
+ Note that all of the source data is retrieved with rake tasks and vendored into the final library to avoid continually retrieving and parsing sources. See `rake -T` for the list of tasks, and re-run them if the data needs refreshing.
79
+
80
+ ## Credits
81
+
82
+ - Country IP Address Ranges
83
+ - http://www.nirsoft.net/countryip/
84
+ - Country internet connectivity stats
85
+ - https://en.wikipedia.org/wiki/List_of_countries_by_number_of_Internet_users
86
+ - User-Agents
87
+ - https://techblog.willshouse.com/2012/01/03/most-common-user-agents/
88
+
89
+ ## Copyright
90
+
91
+ Copyright (c) 2017 Tyler Langlois
92
+
93
+ See LICENSE.txt for details.
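The README above describes timestamps drawn from a normal distribution with a peak, a standard deviation in hours, and a hard cutoff. The following is a minimal sketch of that idea using only the Ruby standard library. The helper names `standard_normal` and `sample_times` are hypothetical and not part of the gem, and the Box-Muller transform is swapped in here as an assumption in place of whatever distribution library the gem relies on.

```ruby
# Hypothetical helper: draw one standard-normal value with the
# Box-Muller transform (an assumption, not the gem's own method).
def standard_normal(rng = Random.new)
  # 1.0 - rand keeps the argument of Math.log strictly positive.
  u1 = 1.0 - rng.rand
  u2 = rng.rand
  Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math::PI * u2)
end

# Hypothetical helper: sample `count` timestamps around `peak`,
# rejecting any draw further than `truncate` hours from the peak.
def sample_times(count:, peak:, stdev: 4.0, truncate: 12, rng: Random.new)
  times = []
  while times.size < count
    offset_hours = standard_normal(rng) * stdev
    next if offset_hours.abs > truncate
    times << peak + (offset_hours * 3600)
  end
  times.sort
end

# Defaults mirroring the README: 1,000 events peaking 12 hours ago.
peak = Time.now - (12 * 60 * 60)
times = sample_times(count: 1000, peak: peak)
puts "earliest: #{times.first}"
puts "latest:   #{times.last}"
```

Rejection sampling keeps the bell shape inside the window instead of clamping outliers onto its edges, which would create artificial spikes at the cutoff.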
data/Rakefile ADDED
@@ -0,0 +1,104 @@
1
+ # encoding: utf-8
2
+
3
+ require 'rubygems'
4
+
5
+ begin
6
+ require 'bundler/setup'
7
+ rescue LoadError => e
8
+ abort e.message
9
+ end
10
+
11
+ require 'json'
12
+ require 'iso_country_codes'
13
+ require 'mechanize'
14
+ require 'open-uri'
15
+ require 'rake'
16
+
17
+ COUNTRY_CONNECTIVITY = 'https://en.wikipedia.org/wiki/List_of_countries_by_number_of_Internet_users'.freeze
18
+ CONNECTIVITY_XPATH = '//h2[span[contains(text(), "List")]]/following-sibling::table/tr[not(descendant::th)]'.freeze
19
+ USER_AGENTS = 'https://techblog.willshouse.com/2012/01/03/most-common-user-agents/'.freeze
20
+
21
+ require 'rubygems/tasks'
22
+ Gem::Tasks.new
23
+
24
+ require 'rdoc/task'
25
+ RDoc::Task.new
26
+ task :doc => :rdoc
27
+
28
+ require 'rspec/core/rake_task'
29
+ RSpec::Core::RakeTask.new
30
+
31
+ task :test => :spec
32
+ task :default => :spec
33
+
34
+ task :pry do
35
+ require 'pry'
36
+ require 'reqsample'
37
+ subject = ReqSample::Generator.new
38
+ ARGV.clear
39
+ binding.pry
40
+ end
41
+
42
+ desc 'Load in country IP ranges into a unified JSON dump.'
43
+ task :load_country_networks do
44
+ agent = Mechanize.new
45
+ agent.get(URI('http://www.nirsoft.net/countryip/')) do |page|
46
+ page.links_with(:href => /^[a-z]{2}[.]html$/).reduce({}) do |h, country|
47
+ h[country.href.split('.').first] = country.click
48
+ .link_with(:href => /[.]csv$/).click.body
49
+ .strip.split("\n").map(&:strip).map do |ips|
50
+ ips.split(',')[0..1]
51
+ end
52
+ h
53
+ end.tap do |network_hash|
54
+ File.open('vendor/country_networks.json', 'w') do |fh|
55
+ fh.write JSON.dump(network_hash)
56
+ end
57
+ end
58
+ end
59
+ end
60
+
61
+ desc 'Retrieve list of internet-connected users by Country.'
62
+ task :load_country_connectivity do
63
+ Nokogiri::HTML(open(COUNTRY_CONNECTIVITY)).tap do |page|
64
+ page.xpath(CONNECTIVITY_XPATH).map do |row|
65
+ [
66
+ IsoCountryCodes.search_by_name(
67
+ case (c = row.xpath('td')[0].xpath('a').text.strip.downcase)
68
+ when 'vietnam' then 'viet nam'
69
+ when 'south korea' then 'korea (republic'
70
+ when 'czech republic' then 'czech'
71
+ when 'ivory coast' then 'côte'
72
+ when 'laos' then 'lao'
73
+ when /congo/ then 'congo'
74
+ when /gambia/ then 'gambia'
75
+ when /bahama/ then 'bahama'
76
+ when /são/ then 'sao'
77
+ else c
78
+ end
79
+ ).first.alpha2.downcase,
80
+ row.xpath('td')[1].text.delete(',').to_i
81
+ ]
82
+ end.to_h.tap do |statistics|
83
+ File.open('vendor/country_connectivity.json', 'w') do |fh|
84
+ fh.write JSON.dump(statistics)
85
+ end
86
+ end
87
+ end
88
+ end
89
+
90
+ desc 'Retrieve list of common User-Agents.'
91
+ task :load_user_agents do
92
+ Nokogiri::HTML(open(USER_AGENTS)).tap do |page|
93
+ page.at_css('.most-common-user-agents').xpath('tbody/tr').map do |row|
94
+ [
95
+ row.at_css('.useragent').text.strip,
96
+ row.at_css('.percent').text.strip.chomp('%').to_f
97
+ ]
98
+ end.to_h.tap do |list|
99
+ File.open('vendor/user_agents.json', 'w') do |fh|
100
+ fh.write JSON.dump(list)
101
+ end
102
+ end
103
+ end
104
+ end
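The `load_user_agents` task above writes a hash that maps each User-Agent string to its percentage share. As a rough illustration of how such weights could drive sampling, here is a minimal sketch. `weighted_pick` is a hypothetical helper and the weights are made up; the gem's own `ReqSample::Hash` implementation is not shown in this diff and may work differently.

```ruby
# Hypothetical helper: pick one key from a { value => weight } hash with
# probability proportional to its weight.
def weighted_pick(weights, rng = Random.new)
  total = weights.values.sum
  target = rng.rand * total
  weights.each do |value, weight|
    target -= weight
    return value if target < 0
  end
  # Floating-point rounding can leave a tiny remainder; fall back to the last key.
  weights.keys.last
end

# Made-up weights for illustration, in the shape the rake task writes to JSON.
agents = { 'Browser A' => 60.0, 'Browser B' => 30.0, 'Browser C' => 10.0 }
counts = Hash.new(0)
10_000.times { counts[weighted_pick(agents)] += 1 }
puts counts
```

Because the weights are only compared against their own total, they do not need to sum to 100, so truncated or partial percentage lists still sample sensibly.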
data/bin/reqsample ADDED
@@ -0,0 +1,6 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ libdir = File.expand_path('../lib', File.dirname(__FILE__))
4
+ $LOAD_PATH << libdir if File.exist?(File.join(libdir, 'reqsample', 'cli.rb'))
5
+ require 'reqsample/cli'
6
+ ReqSample::CLI.start
data/lib/reqsample.rb ADDED
@@ -0,0 +1,2 @@
1
+ require 'reqsample/version'
2
+ require 'reqsample/generator'
@@ -0,0 +1,45 @@
1
+ require 'chronic'
2
+ require 'thor'
3
+ require 'reqsample'
4
+
5
+ module ReqSample
6
+ # Command-line interface to the library
7
+ class CLI < Thor
8
+ class_option :count,
9
+ default: 1000,
10
+ type: :numeric
11
+ class_option :format,
12
+ default: :apache,
13
+ desc: 'Output format of generated logs'
14
+ class_option :stdev,
15
+ default: 4,
16
+ desc: 'Standard deviation to use for timespan normal distribution',
17
+ type: :numeric
18
+ class_option :truncate,
19
+ default: 12,
20
+ desc: 'Cutoff (in hours) that logs should remain +/- within',
21
+ type: :numeric
22
+
23
+ option :peak,
24
+ default: '12 hours ago',
25
+ desc: 'Time at which logs should peak (Chronic-style strings)'
26
+ desc 'generate', 'Generate a sample of webserver logs'
27
+ def generate
28
+ opts = options.dup
29
+ opts[:peak] = Chronic.parse options[:peak]
30
+ puts ReqSample::Generator.new(options[:stdev]).produce(opts).join("\n")
31
+ end
32
+
33
+ option :peak,
34
+ default: 'in 12 hours',
35
+ desc: 'Time at which logs should peak (Chronic-style strings)'
36
+ desc 'stream', 'Gradually stream generated logs over given time'
37
+ def stream
38
+ opts = options.dup
39
+ opts[:peak] = Chronic.parse options[:peak]
40
+ ReqSample::Generator.new(options[:stdev]).produce(opts) do |log|
41
+ puts log
42
+ end
43
+ end
44
+ end
45
+ end
@@ -0,0 +1,132 @@
1
+ require 'json'
2
+ require 'ipaddr'
3
+ require 'rubystats'
4
+
5
+ require 'reqsample/hash'
6
+ require 'reqsample/response_codes'
7
+ require 'reqsample/request_paths'
8
+ require 'reqsample/request_verbs'
9
+ require 'reqsample/time'
10
+
11
+ # Top-level module for ReqSample constants and classes.
12
+ module ReqSample
13
+ # Main class for creating randomized data.
14
+ class Generator
15
+ attr_accessor :agents,
16
+ :codes,
17
+ :connectivity,
18
+ :dist,
19
+ :max_bytes,
20
+ :networks,
21
+ :verbs
22
+
23
+ DEFAULT_COUNT = 1000
24
+ DEFAULT_DOMAIN = 'http://example.com'.freeze
25
+ DEFAULT_FORMAT = :apache
26
+ DEFAULT_MAX_BYTES = 512
27
+
28
+ # @param peak_sd [Float] standard deviation in the normal distribution
29
+ def initialize(peak_sd = 4.0)
30
+ @agents = ReqSample::Hash.weighted(vendor('user_agents.json'))
31
+ @codes = ReqSample::Hash.weighted(ReqSample::RESPONSE_CODES)
32
+ @connectivity = ReqSample::Hash.weighted(
33
+ vendor('country_connectivity.json')
34
+ )
35
+ # Peak at zero (will be summed with the Time object)
36
+ @dist = Rubystats::NormalDistribution.new(0, peak_sd)
37
+ @max_bytes = DEFAULT_MAX_BYTES
38
+ @networks = vendor('country_networks.json')
39
+ @verbs = ReqSample::Hash.weighted(ReqSample::REQUEST_VERBS)
40
+ end
41
+
42
+ # @option opts [Integer] :count how many logs to generate
43
+ # @option opts [String] :format form to return logs, :apache or :hash
44
+ # @option opts [Time] :peak normal distribution peak for log timestamps
45
+ # @option opts [Integer] :truncate hard limit to keep log range within
46
+ #
47
+ # @return [Array<String, Hash>] the collection of generated log events
48
+ def produce(opts = {})
49
+ opts[:count] ||= DEFAULT_COUNT
50
+ opts[:format] ||= DEFAULT_FORMAT
51
+ opts[:peak] ||= Time.now - (12 * 60 * 60)
52
+ opts[:truncate] ||= 12
53
+
54
+ 1.upto(opts[:count]).map do |_|
55
+ sample_time opts[:peak], opts[:truncate]
56
+ end.sort.map do |time|
57
+ if block_given?
58
+ if (delay = time - Time.now) > 0 then sleep delay end
59
+ yield sample time, opts[:format]
60
+ else
61
+ sample time, opts[:format]
62
+ end
63
+ end
64
+ end
65
+
66
+ def sample(time = nil, fmt = nil)
67
+ # Pull a random country, but ensure it's a valid country code for the
68
+ # list of networks that we have available.
69
+ country = connectivity.weighted_sample do |ccodes|
70
+ ccodes.detect do |ccode|
71
+ networks.keys.include? ccode
72
+ end
73
+ end
74
+
75
+ sample = {
76
+ address: sample_address(country),
77
+ agent: agents.weighted_sample,
78
+ bytes: rand(max_bytes),
79
+ code: codes.weighted_sample,
80
+ domain: DEFAULT_DOMAIN,
81
+ path: ReqSample::REQUEST_PATHS.sample,
82
+ time: time || sample_time(Time.now - (12 * 60 * 60), 12),
83
+ verb: verbs.weighted_sample
84
+ }
85
+
86
+ format fmt, country, sample
87
+ end
88
+
89
+ def format(style, country, sample)
90
+ case style.to_s
91
+ when 'apache'
92
+ [
93
+ "#{sample[:address]} - -",
94
+ "[#{sample[:time].strftime('%d/%b/%Y:%H:%M:%S %z')}]",
95
+ %("#{sample[:verb]} #{sample[:path]} HTTP/1.1"),
96
+ sample[:code],
97
+ sample[:bytes],
98
+ %("#{sample[:domain]}"),
99
+ %("#{sample[:agent]}")
100
+ ].join ' '
101
+ else
102
+ { country => sample }
103
+ end
104
+ end
105
+
106
+ def sample_address(country = nil)
107
+ country ||= networks.keys.sample
108
+
109
+ head, tail = networks[country].sample
110
+ IPAddr.new(
111
+ rand(IPAddr.new(head).to_i..IPAddr.new(tail).to_i),
112
+ Socket::AF_INET
113
+ )
114
+ end
115
+
116
+ # Limit the normal distribution to +/- `truncate` hours around `peak`
117
+ # (12 by default, which keeps all logs within a 24-hour period).
118
+ def sample_time(peak, truncate)
119
+ loop do
120
+ sample = ReqSample::Time.at((peak + (dist.rng * 60 * 60)).to_i)
121
+ break sample if sample.within peak, truncate
122
+ end
123
+ end
124
+
125
+ private
126
+
127
+ def vendor(file)
128
+ v = File.expand_path('../../../vendor', __FILE__)
129
+ JSON.parse(File.read(File.join(v, file)))
130
+ end
131
+ end
132
+ end
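The `sample_address` method above picks a random address between the first and last IP of a vendored network range. A standalone sketch of that step, using only the standard library, might look like the following. `random_ipv4` is a hypothetical name, and the range shown is a documentation network chosen purely for illustration.

```ruby
require 'ipaddr'
require 'socket'

# Hypothetical helper: pick a random IPv4 address between two
# dotted-quad bounds (inclusive).
def random_ipv4(head, tail, rng = Random.new)
  low = IPAddr.new(head).to_i
  high = IPAddr.new(tail).to_i
  # Passing an Integer requires the address family to be given explicitly.
  IPAddr.new(rng.rand(low..high), Socket::AF_INET)
end

puts random_ipv4('192.0.2.0', '192.0.2.255')
```

Converting both bounds to integers lets one uniform draw cover the entire range without enumerating every address in it.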