reqsample 0.0.1

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+   metadata.gz: f40922b479c65c01c72272a9dffe7d596b9d91db
+   data.tar.gz: cabd856beb4100130e670a90f55aa93d2f2b461d
+ SHA512:
+   metadata.gz: 8e060ba69839a2abdd2ef117210c11fe6469ec32a6d4b7710f070747208929cf9e31cd1af109b4be563d4cce306b2cb03f69b5dc0300691ae87f6511c9f1868a
+   data.tar.gz: db8c94f6ad928c5926a8463eca20bc31813d6e592e0fa2896c6d8a0658db1f321e55c955f9d8459882022dc2672d2516f1a156d07ead9249ad6625b4646a69ca
data/.document ADDED
@@ -0,0 +1,5 @@
+ lib/**/*.rb
+ README.md
+ ChangeLog.md
+
+ LICENSE.txt
data/.gitignore ADDED
@@ -0,0 +1,6 @@
+ /.bundle
+ /Gemfile.lock
+ /html/
+ /pkg/
+ /vendor/cache/*.gem
+ *.gem
data/.rdoc_options ADDED
@@ -0,0 +1,16 @@
+ --- !ruby/object:RDoc::Options
+ encoding: UTF-8
+ static_path: []
+ rdoc_include:
+ - .
+ charset: UTF-8
+ exclude:
+ hyperlink_all: false
+ line_numbers: false
+ main_page: README.md
+ markup: markdown
+ show_hash: false
+ tab_width: 8
+ title: reqsample Documentation
+ visibility: :protected
+ webcvs:
data/.rspec ADDED
@@ -0,0 +1 @@
+ --colour --format documentation
data/ChangeLog.md ADDED
@@ -0,0 +1,12 @@
+ # Changelog
+
+ All notable changes to this project will be documented in this file.
+
+ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
+ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).
+
+ ## [Unreleased]
+
+ ## [0.0.1] (Sep 12, 2017)
+
+ - Initial release
data/Gemfile ADDED
@@ -0,0 +1,3 @@
+ source 'https://rubygems.org'
+
+ gemspec
data/Guardfile ADDED
@@ -0,0 +1,15 @@
+ notification ENV['INSIDE_EMACS'].nil? ? :tmux : :emacs,
+              display_message: true
+
+ guard :rspec, cmd: 'bundle exec rspec' do
+   require 'guard/rspec/dsl'
+   dsl = Guard::RSpec::Dsl.new(self)
+
+   # RSpec files
+   rspec = dsl.rspec
+   watch(rspec.spec_helper) { rspec.spec_dir }
+   watch(rspec.spec_support) { rspec.spec_dir }
+   watch(rspec.spec_files)
+
+   watch(%r{^lib/(.+)\.rb}) { |m| "spec/lib/#{m[1]}_spec.rb" }
+ end
data/LICENSE.txt ADDED
@@ -0,0 +1,20 @@
+ Copyright (c) 2017 Tyler Langlois
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,93 @@
+ # reqsample
+
+ * [Homepage](https://rubygems.org/gems/reqsample)
+ * [Documentation](http://rubydoc.info/gems/reqsample/frames)
+
+ ## Description
+
+ `reqsample` is a utility to generate somewhat-realistic public HTTP traffic. If you've ever needed a large corpus of Apache or nginx logs to test geoip processing, exercise a Logstash pipeline, or serve as the source for a demo, this utility is for you.
+
+ Values are sampled from publicly available data (sources noted in the [credits](#credits) section) and, whenever possible, the observed frequency of each value is reflected in the random data. For example, Chrome will appear frequently in the `User-Agent` string since it is a common browser, and the most common source IPs originate from China due to the high amount of traffic observed from the country.
+
+ Note that fine-tuning the generation scheme requires adjusting the normal distribution curve and a few other tricky parameters, but usable defaults are provided out of the box.
+
+ ### Quickstart
+
+ Generate 1,000 log entries in Apache combined log format, spanning the last 24 hours and peaking 12 hours ago, and print them all to stdout:
+
+ ```shell
+ $ gem install reqsample
+ $ reqsample generate
+ ```
+
+ See `reqsample help` for a list of commands, flags, and options.
+
+ ## Features
+
+ - Weighted sampling for country of origin, user agents, and response codes to simulate real traffic.
+ - Usable in standalone command form or as a Ruby library.
+ - Ability to generate all traffic at once in bulk or streamed over time.
+ - Frequency and count of request events following a statistically normal distribution.
+
+ There are several different parameters that can be changed to modify how data is generated. In general:
+
+ - The number of logs to generate over a given period needs to be chosen; the default is 1,000.
+ - That many log events are generated over a normal distribution curve, with a configurable peak, standard deviation, and time cutoff. Defaults are chosen with the assumption that you want to generate 1,000 logs over the previous 24 hours.
+   - The peak is 12 hours ago by default.
+   - The standard deviation is set to 4 by default, which translates to 4 hours in the logic of the random generation.
+   - The normal distribution of log data is truncated at 12 hours either side of the peak by default, which means every log timestamp falls within the past 24 hours.
+
+ ## Examples
+
+ There are two ways to use `reqsample`: through the installed executable or as a library.
+
+ ### Command-Line Utility
+
+ Stream 5,000 log events to stdout with a tighter standard deviation:
+
+ ```
+ reqsample stream --count 5000 --stdev 1
+ ```
+
+ ### Ruby Library
+
+ The `ReqSample::Generator` class needs to be instantiated first, which parses and sets up several enumerables from which values will be sampled.
+
+ ```ruby
+ gen = ReqSample::Generator.new
+ ```
+
+ The `produce` method is the central way to generate log values:
+
+ ```ruby
+ gen.produce
+ ```
+
+ This returns an array of logs generated with the previously mentioned parameters. If a block is given to the `produce` method, the results will instead be streamed to the block by yielding each log event, simulating live incoming traffic.
+
+ ## Install
+
+ ```shell
+ $ gem install reqsample
+ ```
+
+ ## Development
+
+ Standard bundler practices are used: set up your environment with `bundle install` and use `bundle exec rake test` to run the still-incomplete test suite.
+
+ Note that all of the source data is retrieved with rake tasks and vendored into the final library to avoid continually retrieving and parsing sources. See `rake -T` for the list of tasks and re-run them if needed.
+
+ ## Credits
+
+ - Country IP Address Ranges
+   - http://www.nirsoft.net/countryip/
+ - Country internet connectivity stats
+   - https://en.wikipedia.org/wiki/List_of_countries_by_number_of_Internet_users
+ - User-Agents
+   - https://techblog.willshouse.com/2012/01/03/most-common-user-agents/
+
+ ## Copyright
+
+ Copyright (c) 2017 Tyler Langlois
+
+ See LICENSE.txt for details.
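The timestamp parameters described in the README (peak, standard deviation in hours, and truncation window) can be illustrated with a standalone sketch. This is not the gem's code: `standard_normal` and `sample_timestamps` are hypothetical names, and the Box-Muller transform stands in for the `rubystats` distribution the gem depends on.

```ruby
# Sketch only: these helpers are hypothetical and not part of the reqsample API.

# Draw one value from a standard normal distribution (Box-Muller transform).
def standard_normal(rng)
  u1 = 1.0 - rng.rand # in (0, 1], so Math.log(u1) is finite
  u2 = rng.rand
  Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math::PI * u2)
end

# Return `count` sorted Time objects centred on `peak`, with a standard
# deviation of `stdev_hours`, none further than `truncate_hours` from the peak.
def sample_timestamps(peak:, count: 1000, stdev_hours: 4.0, truncate_hours: 12.0, rng: Random.new)
  Array.new(count) do
    offset = nil
    loop do
      offset = standard_normal(rng) * stdev_hours
      break if offset.abs <= truncate_hours # rejection step: redraw outliers
    end
    peak + (offset * 3600)
  end.sort
end

peak = Time.now - (12 * 3600)
times = sample_timestamps(peak: peak, count: 5)
puts times.map { |t| t.strftime('%d/%b/%Y:%H:%M:%S %z') }
```

With the defaults, the cutoff sits three standard deviations from the peak, so very few draws should need to be rejected.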
data/Rakefile ADDED
@@ -0,0 +1,104 @@
+ # encoding: utf-8
+
+ require 'rubygems'
+
+ begin
+   require 'bundler/setup'
+ rescue LoadError => e
+   abort e.message
+ end
+
+ require 'json'
+ require 'iso_country_codes'
+ require 'mechanize'
+ require 'open-uri'
+ require 'rake'
+
+ COUNTRY_CONNECTIVITY = 'https://en.wikipedia.org/wiki/List_of_countries_by_number_of_Internet_users'.freeze
+ CONNECTIVITY_XPATH = '//h2[span[contains(text(), "List")]]/following-sibling::table/tr[not(descendant::th)]'.freeze
+ USER_AGENTS = 'https://techblog.willshouse.com/2012/01/03/most-common-user-agents/'
+
+ require 'rubygems/tasks'
+ Gem::Tasks.new
+
+ require 'rdoc/task'
+ RDoc::Task.new
+ task :doc => :rdoc
+
+ require 'rspec/core/rake_task'
+ RSpec::Core::RakeTask.new
+
+ task :test => :spec
+ task :default => :spec
+
+ task :pry do
+   require 'pry'
+   require 'reqsample'
+   subject = ReqSample::Generator.new
+   ARGV.clear
+   binding.pry
+ end
+
+ desc 'Load in country IP ranges into a unified JSON dump.'
+ task :load_country_networks do
+   agent = Mechanize.new
+   agent.get(URI('http://www.nirsoft.net/countryip/')) do |page|
+     page.links_with(:href => /^[a-z]{2}[.]html$/).reduce({}) do |h, country|
+       h[country.href.split('.').first] = country.click
+         .link_with(:href => /[.]csv$/).click.body
+         .strip.split("\n").map(&:strip).map do |ips|
+           ips.split(',')[0..1]
+         end
+       h
+     end.tap do |network_hash|
+       File.open('vendor/country_networks.json', 'w') do |fh|
+         fh.write JSON.dump(network_hash)
+       end
+     end
+   end
+ end
+
+ desc 'Retrieve list of internet-connected users by Country.'
+ task :load_country_connectivity do
+   Nokogiri::HTML(open(COUNTRY_CONNECTIVITY)).tap do |page|
+     page.xpath(CONNECTIVITY_XPATH).map do |row|
+       [
+         IsoCountryCodes.search_by_name(
+           case (c = row.xpath('td')[0].xpath('a').text.strip.downcase)
+           when 'vietnam' then 'viet nam'
+           when 'south korea' then 'korea (republic'
+           when 'czech republic' then 'czech'
+           when 'ivory coast' then 'côte'
+           when 'laos' then 'lao'
+           when /congo/ then 'congo'
+           when /gambia/ then 'gambia'
+           when /bahama/ then 'bahama'
+           when /são/ then 'sao'
+           else c
+           end
+         ).first.alpha2.downcase,
+         row.xpath('td')[1].text.delete(',').to_i
+       ]
+     end.to_h.tap do |statistics|
+       File.open('vendor/country_connectivity.json', 'w') do |fh|
+         fh.write JSON.dump(statistics)
+       end
+     end
+   end
+ end
+
+ desc 'Retrieve list of common User-Agents.'
+ task :load_user_agents do
+   Nokogiri::HTML(open(USER_AGENTS)).tap do |page|
+     page.at_css('.most-common-user-agents').xpath('tbody/tr').map do |row|
+       [
+         row.at_css('.useragent').text.strip,
+         row.at_css('.percent').text.strip.chomp('%').to_f
+       ]
+     end.to_h.tap do |list|
+       File.open('vendor/user_agents.json', 'w') do |fh|
+         fh.write JSON.dump(list)
+       end
+     end
+   end
+ end
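The `load_country_connectivity` and `load_user_agents` tasks above each write a JSON object mapping a key to a numeric weight (user count or usage percentage). The gem's `ReqSample::Hash.weighted` helper is not included in this diff, so the following is only a sketch of how a weighted draw over such a hash could work; `weighted_sample` is a hypothetical name and the weights shown are made up.

```ruby
# Sketch only: weighted_sample is a hypothetical helper, not the gem's
# ReqSample::Hash implementation (which is not part of this diff).

# Pick one key from a { key => weight } hash with probability proportional
# to its weight. Weights must be non-negative with a positive total.
def weighted_sample(weights, rng: Random.new)
  total = weights.values.sum
  raise ArgumentError, 'weights must sum to a positive number' unless total.positive?

  target = rng.rand * total # uniform point in [0, total)
  weights.each do |key, weight|
    return key if target < weight

    target -= weight
  end
  weights.keys.last # guard against floating-point drift
end

# Made-up weights in the same shape as vendor/user_agents.json.
agents = { 'Chrome' => 60.0, 'Firefox' => 25.0, 'Safari' => 15.0 }
puts weighted_sample(agents)
```

A linear scan like this is simple and adequate for a few hundred keys; a cumulative-sum table with binary search would be a reasonable alternative for larger datasets.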
data/bin/reqsample ADDED
@@ -0,0 +1,6 @@
+ #!/usr/bin/env ruby
+
+ libdir = File.expand_path('../lib', File.dirname(__FILE__))
+ $LOAD_PATH << libdir if File.exist?(File.join(libdir, 'reqsample', 'cli.rb'))
+ require 'reqsample/cli'
+ ReqSample::CLI.start
data/lib/reqsample.rb ADDED
@@ -0,0 +1,2 @@
+ require 'reqsample/version'
+ require 'reqsample/generator'
data/lib/reqsample/cli.rb ADDED
@@ -0,0 +1,45 @@
+ require 'chronic'
+ require 'thor'
+ require 'reqsample'
+
+ module ReqSample
+   # Command-line interface to the library
+   class CLI < Thor
+     class_option :count,
+                  default: 1000,
+                  type: :numeric
+     class_option :format,
+                  default: :apache,
+                  desc: 'Output format of generated logs'
+     class_option :stdev,
+                  default: 4,
+                  desc: 'Standard deviation to use for timespan normal distribution',
+                  type: :numeric
+     class_option :truncate,
+                  default: 12,
+                  desc: 'Cutoff (in hours) that logs should remain +/- within',
+                  type: :numeric
+
+     option :peak,
+            default: '12 hours ago',
+            desc: 'Time at which logs should peak (Chronic-style strings)'
+     desc 'generate', 'Generate a sample of webserver logs'
+     def generate
+       opts = options.dup
+       opts[:peak] = Chronic.parse options[:peak]
+       puts ReqSample::Generator.new(options[:stdev]).produce(opts).join("\n")
+     end
+
+     option :peak,
+            default: 'in 12 hours',
+            desc: 'Time at which logs should peak (Chronic-style strings)'
+     desc 'stream', 'Gradually stream generated logs over given time'
+     def stream
+       opts = options.dup
+       opts[:peak] = Chronic.parse options[:peak]
+       ReqSample::Generator.new(options[:stdev]).produce(opts) do |log|
+         puts log
+       end
+     end
+   end
+ end
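The `stream` command relies on the generator yielding each log only once its timestamp is due. As a rough, standalone illustration of that pacing idea (not the gem's code; `replay` is a hypothetical helper name), timestamps can be sorted and replayed with a sleep before each one:

```ruby
# Sketch only: replay is a hypothetical helper, not part of reqsample.

# Yield each timestamp in chronological order, sleeping until it is due.
# Timestamps already in the past are yielded immediately.
def replay(times)
  times.sort.each do |time|
    delay = time - Time.now
    sleep delay if delay.positive?
    yield time
  end
end

now = Time.now
replay([now + 0.2, now - 60, now + 0.1]) do |time|
  puts time.strftime('%d/%b/%Y:%H:%M:%S %z')
end
```

This is presumably why the `stream` command defaults its peak to `in 12 hours` rather than `12 hours ago`: with a peak in the past, most events would already be due and would be emitted almost at once.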
data/lib/reqsample/generator.rb ADDED
@@ -0,0 +1,132 @@
+ require 'json'
+ require 'ipaddr'
+ require 'rubystats'
+
+ require 'reqsample/hash'
+ require 'reqsample/response_codes'
+ require 'reqsample/request_paths'
+ require 'reqsample/request_verbs'
+ require 'reqsample/time'
+
+ # Top-level module for ReqSample constants and classes.
+ module ReqSample
+   # Main class for creating randomized data.
+   class Generator
+     attr_accessor :agents,
+                   :codes,
+                   :connectivity,
+                   :dist,
+                   :max_bytes,
+                   :networks,
+                   :verbs
+
+     DEFAULT_COUNT = 1000
+     DEFAULT_DOMAIN = 'http://example.com'.freeze
+     DEFAULT_FORMAT = :apache
+     DEFAULT_MAX_BYTES = 512
+
+     # @param peak_sd [Float] standard deviation in the normal distribution
+     def initialize(peak_sd = 4.0)
+       @agents = ReqSample::Hash.weighted(vendor('user_agents.json'))
+       @codes = ReqSample::Hash.weighted(ReqSample::RESPONSE_CODES)
+       @connectivity = ReqSample::Hash.weighted(
+         vendor('country_connectivity.json')
+       )
+       # Peak at zero (will be summed with the Time object)
+       @dist = Rubystats::NormalDistribution.new(0, peak_sd)
+       @max_bytes = DEFAULT_MAX_BYTES
+       @networks = vendor('country_networks.json')
+       @verbs = ReqSample::Hash.weighted(ReqSample::REQUEST_VERBS)
+     end
+
+     # @option opts [Integer] :count how many logs to generate
+     # @option opts [String] :format form to return logs, :apache or :hash
+     # @option opts [Time] :peak normal distribution peak for log timestamps
+     # @option opts [Integer] :truncate hard limit to keep log range within
+     #
+     # @return [Array<String, Hash>] the collection of generated log events
+     def produce(opts = {})
+       opts[:count] ||= DEFAULT_COUNT
+       opts[:format] ||= DEFAULT_FORMAT
+       opts[:peak] ||= Time.now - (12 * 60 * 60)
+       opts[:truncate] ||= 12
+
+       1.upto(opts[:count]).map do |_|
+         sample_time opts[:peak], opts[:truncate]
+       end.sort.map do |time|
+         if block_given?
+           if (delay = time - Time.now) > 0 then sleep delay end
+           yield sample time, opts[:format]
+         else
+           sample time, opts[:format]
+         end
+       end
+     end
+
+     def sample(time = nil, fmt = nil)
+       # Pull a random country, but ensure it's a valid country code for the
+       # list of networks that we have available.
+       country = connectivity.weighted_sample do |ccodes|
+         ccodes.detect do |ccode|
+           networks.keys.include? ccode
+         end
+       end
+
+       sample = {
+         address: sample_address(country),
+         agent: agents.weighted_sample,
+         bytes: rand(max_bytes),
+         code: codes.weighted_sample,
+         domain: DEFAULT_DOMAIN,
+         path: ReqSample::REQUEST_PATHS.sample,
+         time: time || sample_time(Time.now - (12 * 60 * 60), 12),
+         verb: verbs.weighted_sample
+       }
+
+       format fmt, country, sample
+     end
+
+     def format(style, country, sample)
+       case style.to_s
+       when 'apache'
+         [
+           "#{sample[:address]} - -",
+           "[#{sample[:time].strftime('%d/%b/%Y:%H:%M:%S %z')}]",
+           %("#{sample[:verb]} #{sample[:path]} HTTP/1.1"),
+           sample[:code],
+           sample[:bytes],
+           %("#{sample[:domain]}"),
+           %("#{sample[:agent]}")
+         ].join ' '
+       else
+         { country => sample }
+       end
+     end
+
+     def sample_address(country = nil)
+       country ||= networks.keys.sample
+
+       head, tail = networks[country].sample
+       IPAddr.new(
+         rand(IPAddr.new(head).to_i..IPAddr.new(tail).to_i),
+         Socket::AF_INET
+       )
+     end
+
+     # Redraw until the sample lies within +/- `truncate` hours of `peak` (the
+     # default of 12 keeps every timestamp inside a 24-hour period).
+     def sample_time(peak, truncate)
+       loop do
+         sample = ReqSample::Time.at((peak + (dist.rng * 60 * 60)).to_i)
+         break sample if sample.within peak, truncate
+       end
+     end
+
+     private
+
+     def vendor(file)
+       v = File.expand_path('../../../vendor', __FILE__)
+       JSON.parse(File.read(File.join(v, file)))
+     end
+   end
+ end
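The `sample_address` method picks a random IPv4 address between the first and last address of a vendored range. The same idea can be tried in isolation with only the standard library; `random_address_in` is a hypothetical helper name, and the ranges below are documentation addresses rather than real vendored data.

```ruby
require 'ipaddr'
require 'socket'

# Sketch only: random_address_in is a hypothetical helper, not part of reqsample.

# Return a random IPv4 address between `head` and `tail`, inclusive.
def random_address_in(head, tail, rng: Random.new)
  low = IPAddr.new(head).to_i
  high = IPAddr.new(tail).to_i
  # Converting through integers lets an ordinary Range drive the random draw.
  IPAddr.new(rng.rand(low..high), Socket::AF_INET)
end

puts random_address_in('192.0.2.0', '192.0.2.255')
```

Passing `Socket::AF_INET` explicitly is needed because `IPAddr.new` cannot infer the address family from a bare integer.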