compare_compressors 0.0.1

@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 8962f01ee986bcfd6c301100f04565ae6844d4ed
4
+ data.tar.gz: e1e02bb86cc4bc150c599f439e59a06718c5be9d
5
+ SHA512:
6
+ metadata.gz: 41d405711e8947af177e741d819df1416e24838ca77ef672571e811d6419e21e0c732b0feba92d928acee5becac5f5e4dda70ef64e872d96d9d3ed18edbb1812
7
+ data.tar.gz: eb4ce9b89346ee291e9778ea5450180772c2788792a2272bd8757a21fd018e1702a2dfec5fe1ebc053f04784132fd6d57ad0d59e272c38c029c4925eb1add646
@@ -0,0 +1,132 @@
1
+ # compare_compressors
2
+
3
+ https://github.com/jdleesmiller/compare_compressors
4
+
5
+ [![Build Status](https://travis-ci.org/jdleesmiller/compare_compressors.svg?branch=master)](https://travis-ci.org/jdleesmiller/compare_compressors)
6
+
7
+ ## Synopsis
8
+
9
+ Evaluate different compression tools and their settings by running them on a sample of data.
10
+
11
+ See [this blog post for an example of how to use the tool](TODO).
12
+
13
+ ### Usage
14
+
15
+ This utility has many system dependencies, so the easiest way to run it is via Docker:
16
+
17
+ ```shell
18
+ $ docker pull jdleesmiller/compare_compressors
19
+ ```
20
+
21
+ Generally you will run a `compare` step, followed by a `plot` or `summarize` step.
22
+
23
+ #### Compare
24
+
25
+ This step runs the compressors on the sample files and saves the results to a CSV file. Assuming that your sample files are in a folder called `data` in the current directory and are named `test1`, `test2`, etc., the command would look like:
26
+
27
+ ```
28
+ docker run --rm \
29
+ --volume `pwd`/data:/home/app/compare_compressors/data:ro \
30
+ --volume /tmp:/tmp \
31
+ jdleesmiller/compare_compressors compare data/test* >data/compare.csv
32
+ ```
33
+
34
+ where:
35
+
36
+ - The `--rm` flag tells Docker to remove the container when it's finished.
37
+
38
+ - The ```--volume `pwd`/data:/home/app/compare_compressors/data:ro``` flag mounts `./data` on the host inside the container, so the utility can access the sample files. The trick here is that `/home/app/compare_compressors` is the utility's working directory inside the container, so the relative paths `data/test*` for the sample files will be the same both inside and outside of the container. The `:ro` makes it a read-only mount; this is optional, but it provides added assurance that the utility won't change your data files.
39
+
40
+ - The `--volume /tmp:/tmp` flag is optional but may improve performance. The utility does its compression and decompression in `/tmp` inside the container, and all of the writes inside the container go through Docker's union file system. By mounting `/tmp` on the host, we bypass the union file system. (Ideally, we'd just do this in the Dockerfile, but unfortunately it's 10x slower on Docker for Mac; hopefully that will improve soon.)
41
+
42
+ #### Plot
43
+
44
+ Once you've generated a CSV with results, the tool can read the CSV and generate a `gnuplot` script to plot the results. Note that you need to have `gnuplot` installed on the host for this to work.
45
+
46
+ There are several plotting commands: `plot` gives you a 2D plot of compression time vs compressed size. There is also a `--decompression` option to plot decompression time vs compressed size instead.
47
+
48
+ ```
49
+ docker run --rm \
50
+ --volume `pwd`/data:/home/app/compare_compressors/data:ro \
51
+ --volume /tmp:/tmp \
52
+ jdleesmiller/compare_compressors plot data/compare.csv | gnuplot
53
+ ```
54
+
55
+ The `plot_costs` command takes three cost coefficients: cost per GiB of stored data, cost per hour to run the compression program, and cost per hour to run the decompression program. The program then computes a simple linear cost function. To keep the plot in 2D, the two time costs are added together.
56
+
57
+ ```
58
+ docker run --rm \
59
+ --volume `pwd`/data:/home/app/compare_compressors/data:ro \
60
+ jdleesmiller/compare_compressors plot_costs data/compare.csv \
61
+ --gibyte-cost 56.05 \
62
+ --compression-hour-cost 32.35 \
63
+ --decompression-hour-cost 177.91 \
64
+ --currency '£' | gnuplot
65
+ ```
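
As a rough illustration of the linear cost function described above, here is a minimal Ruby sketch. It is not the gem's `CostModel`; the `linear_cost` helper, its parameter names, and the sample measurements are assumptions made for this example. Only the three coefficients (cost per GiB, cost per compression hour, cost per decompression hour) come from the text.

```ruby
# Hypothetical helper, not part of compare_compressors: a linear cost in the
# spirit of plot_costs. Sizes are in bytes and times are in seconds.
BYTES_PER_GIB = 1024.0**3
SECONDS_PER_HOUR = 3600.0

def linear_cost(size_bytes:, compression_seconds:, decompression_seconds:,
                gibyte_cost:, compression_hour_cost:, decompression_hour_cost:)
  storage = gibyte_cost * size_bytes / BYTES_PER_GIB
  # The two time costs are added together so the plot stays two-dimensional.
  time = compression_hour_cost * compression_seconds / SECONDS_PER_HOUR +
         decompression_hour_cost * decompression_seconds / SECONDS_PER_HOUR
  { storage: storage, time: time, total: storage + time }
end

# Made-up measurements: 2 GiB compressed, 30 min to compress, 6 min to
# decompress, using the coefficients from the command above.
cost = linear_cost(
  size_bytes: 2 * 1024**3,
  compression_seconds: 1800,
  decompression_seconds: 360,
  gibyte_cost: 56.05,
  compression_hour_cost: 32.35,
  decompression_hour_cost: 177.91
)
puts format('storage %.2f + time %.2f = total %.2f',
            cost[:storage], cost[:time], cost[:total])
```

With these inputs, storage cost and combined time cost would be the two axes of the 2D plot.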
66
+
67
+ #### Summarize
68
+
69
+ Print a list of the compressors and settings in descending order by cost. The cost function is of the same form as for `plot_costs`.
70
+
71
+ ```
72
+ docker run --rm \
73
+ --volume `pwd`/data:/home/app/compare_compressors/data \
74
+ jdleesmiller/compare_compressors summarize data/compare.csv \
75
+ --gibyte-cost 56.05 \
76
+ --compression-hour-cost 32.35 \
77
+ --decompression-hour-cost 177.91 \
78
+ --currency '£'
79
+ ```
80
+
81
+ ## Requirements
82
+
83
+ A Linux-like `/usr/bin/time` utility is required, along with several system packages. See the [Dockerfile](Dockerfile) for a list of the packages that this utility depends on. To make the plots, you will also need `gnuplot`.
84
+
85
+ If you are installing natively, without Docker, you will need Ruby; then install the gem:
86
+
87
+ ```
88
+ $ gem install compare_compressors
89
+ ```
90
+
91
+ ## Development
92
+
93
+ For development, you will probably want to (1) override the default entrypoint and (2) mount the application root inside the container. To run the tests, for example:
94
+
95
+ ```
96
+ docker run --rm -it --entrypoint='' \
97
+ --volume=compare_compressors_bundle:/home/app/compare_compressors/.bundle \
98
+ --volume=`pwd`:/home/app/compare_compressors \
99
+ compare_compressors bundle exec rake
100
+ ```
101
+
102
+ The only caveat is that you need to preserve the `.bundle` folder inside the container by mounting it as a volume; the above command does this using a named volume, `compare_compressors_bundle`, which will persist between runs and be easier to identify in the `docker volume ls` output.
103
+
104
+ ## Related
105
+
106
+ - See [this blog post for an example of how to use the tool](TODO).
107
+ - For many more compression algorithms, see https://quixdb.github.io/squash-benchmark/
108
+
109
+ ## License
110
+
111
+ (The MIT License)
112
+
113
+ Copyright (c) 2017 John Lees-Miller
114
+
115
+ Permission is hereby granted, free of charge, to any person obtaining
116
+ a copy of this software and associated documentation files (the
117
+ 'Software'), to deal in the Software without restriction, including
118
+ without limitation the rights to use, copy, modify, merge, publish,
119
+ distribute, sublicense, and/or sell copies of the Software, and to
120
+ permit persons to whom the Software is furnished to do so, subject to
121
+ the following conditions:
122
+
123
+ The above copyright notice and this permission notice shall be
124
+ included in all copies or substantial portions of the Software.
125
+
126
+ THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
127
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
128
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
129
+ IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
130
+ CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
131
+ TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
132
+ SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,6 @@
1
+ #!/usr/bin/env ruby
2
+ # frozen_string_literal: true
3
+
4
+ require 'compare_compressors'
5
+
6
+ CompareCompressors::CommandLineInterface.start(ARGV)
@@ -0,0 +1,40 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'English'
4
+
5
+ require_relative 'compare_compressors/version'
6
+
7
+ require_relative 'compare_compressors/comparer'
8
+ require_relative 'compare_compressors/cost_model'
9
+ require_relative 'compare_compressors/result'
10
+ require_relative 'compare_compressors/group_result'
11
+ require_relative 'compare_compressors/costed_group_result'
12
+
13
+ require_relative 'compare_compressors/plotter'
14
+ require_relative 'compare_compressors/plotters/cost_plotter'
15
+ require_relative 'compare_compressors/plotters/raw_plotter'
16
+ require_relative 'compare_compressors/plotters/size_plotter'
17
+
18
+ require_relative 'compare_compressors/compressor'
19
+ require_relative 'compare_compressors/compressors/brotli_compressor'
20
+ require_relative 'compare_compressors/compressors/bzip2_compressor'
21
+ require_relative 'compare_compressors/compressors/gzip_compressor'
22
+ require_relative 'compare_compressors/compressors/seven_zip_compressor'
23
+ require_relative 'compare_compressors/compressors/xz_compressor'
24
+ require_relative 'compare_compressors/compressors/zstd_compressor'
25
+
26
+ require_relative 'compare_compressors/command_line_interface'
27
+
28
+ #
29
+ # Compare compression algorithms.
30
+ #
31
+ module CompareCompressors
32
+ COMPRESSORS = [
33
+ BrotliCompressor,
34
+ Bzip2Compressor,
35
+ GzipCompressor,
36
+ SevenZipCompressor,
37
+ XzCompressor,
38
+ ZstdCompressor
39
+ ].map(&:new).freeze
40
+ end
@@ -0,0 +1,223 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'csv'
4
+ require 'thor'
5
+
6
+ module CompareCompressors
7
+ #
8
+ # Handle generic command line options and run the relevant command.
9
+ #
10
+ class CommandLineInterface < Thor
11
+ desc \
12
+ 'version',
13
+ 'print version (also available as --version)'
14
+ def version
15
+ puts "compare_compressors-#{CompareCompressors::VERSION}"
16
+ COMPRESSORS.each do |compressor|
17
+ puts format('%10s: %s', compressor.name, compressor.version || '?')
18
+ end
19
+ end
20
+ map %w(--version -v) => :version
21
+
22
+ desc \
23
+ 'compare <target files>',
24
+ 'Run compression tools on targets and write a CSV'
25
+ def compare(*targets)
26
+ CSV do |csv|
27
+ Comparer.new.run(csv, COMPRESSORS, targets)
28
+ end
29
+ end
30
+
31
+ class <<self
32
+ def scale_option
33
+ option \
34
+ :scale,
35
+ type: :numeric,
36
+ desc: 'scale factor from sample targets to full dataset',
37
+ default: 1.0
38
+ end
39
+
40
+ def use_cpu_time_option
41
+ option \
42
+ :use_cpu_time,
43
+ type: :boolean,
44
+ desc: 'use CPU time rather than elapsed time',
45
+ default: CostModel::DEFAULT_USE_CPU_TIME
46
+ end
47
+
48
+ def cost_options
49
+ option \
50
+ :gibyte_cost,
51
+ type: :numeric,
52
+ desc: 'storage cost per gibibyte (GiB) of compressed output',
53
+ default: CostModel::DEFAULT_GIBYTE_COST
54
+ option \
55
+ :compression_hour_cost,
56
+ type: :numeric,
57
+ desc: 'compute cost per hour of CPU time for compression',
58
+ default: CostModel::DEFAULT_HOUR_COST
59
+ option \
60
+ :decompression_hour_cost,
61
+ type: :numeric,
62
+ desc: 'compute cost per hour of CPU time for decompression',
63
+ default: CostModel::DEFAULT_HOUR_COST
64
+ option \
65
+ :currency,
66
+ type: :string,
67
+ desc: 'currency symbol for display',
68
+ default: CostModel::DEFAULT_CURRENCY
69
+ end
70
+
71
+ def plot_options
72
+ option \
73
+ :terminal,
74
+ desc: 'the terminal line for gnuplot',
75
+ default: Plotter::DEFAULT_TERMINAL
76
+ option \
77
+ :output,
78
+ desc: 'the output name for gnuplot',
79
+ default: Plotter::DEFAULT_OUTPUT
80
+ option \
81
+ :pareto_only,
82
+ desc: 'plot only non-dominated compressor-level pairs',
83
+ type: :boolean,
84
+ default: true
85
+ option \
86
+ :logscale_size,
87
+ desc: 'use a log10 scale for the size (lucky you if you need this)',
88
+ type: :boolean,
89
+ default: Plotter::DEFAULT_LOGSCALE_SIZE
90
+ option \
91
+ :autoscale_fix,
92
+ desc: 'zoom axes to fit the points tightly',
93
+ type: :boolean,
94
+ default: Plotter::DEFAULT_AUTOSCALE_FIX
95
+ option \
96
+ :show_labels,
97
+ desc: 'show compression level labels on the plot',
98
+ type: :boolean,
99
+ default: Plotter::DEFAULT_SHOW_LABELS
100
+ option \
101
+ :lmargin,
102
+ desc: 'adjust lmargin (workaround if y label is cut off on png)',
103
+ type: :numeric,
104
+ default: Plotter::DEFAULT_LMARGIN
105
+ option \
106
+ :title,
107
+ desc: 'main title (must not contain double quotes)',
108
+ type: :string,
109
+ default: Plotter::DEFAULT_TITLE
110
+ end
111
+ end
112
+
113
+ desc \
114
+ 'plot [csv file]',
115
+ 'Write a gnuplot script for a basic 2D plot with the CSV from compare'
116
+ scale_option
117
+ use_cpu_time_option
118
+ plot_options
119
+ option \
120
+ :decompression,
121
+ desc: 'show decompression time instead of compression time',
122
+ type: :boolean,
123
+ default: SizePlotter::DEFAULT_DECOMPRESSION
124
+ def plot(csv_file = nil)
125
+ results = read_results(csv_file)
126
+ group_results = GroupResult.group(results, scale: options[:scale])
127
+ plotter = make_plotter(
128
+ SizePlotter, options,
129
+ decompression: options[:decompression]
130
+ )
131
+ plotter.plot(group_results, pareto_only: options[:pareto_only])
132
+ end
133
+
134
+ desc \
135
+ 'plot_3d [csv file]',
136
+ 'Write a gnuplot script for a 3D plot with the CSV from compare'
137
+ scale_option
138
+ use_cpu_time_option
139
+ plot_options
140
+ def plot_3d(csv_file = nil)
141
+ results = read_results(csv_file)
142
+ group_results = GroupResult.group(results, scale: options[:scale])
143
+ plotter = make_plotter(RawPlotter, options)
144
+ plotter.plot(group_results, pareto_only: options[:pareto_only])
145
+ end
146
+
147
+ desc \
148
+ 'plot_costs [csv file]',
149
+ 'Write a gnuplot script for a 2D cost plot with the CSV from compare'
150
+ scale_option
151
+ use_cpu_time_option
152
+ cost_options
153
+ plot_options
154
+ option \
155
+ :show_cost_contours,
156
+ desc: 'show cost function contours',
157
+ type: :boolean,
158
+ default: CostPlotter::DEFAULT_SHOW_COST_CONTOURS
159
+ def plot_costs(csv_file = nil)
160
+ results = read_results(csv_file)
161
+ group_results = GroupResult.group(results, scale: options[:scale])
162
+ cost_model = make_cost_model(options)
163
+ costed_group_results =
164
+ CostedGroupResult.from_group_results(cost_model, group_results)
165
+ plotter = make_plotter(
166
+ CostPlotter, options, cost_model,
167
+ show_cost_contours: options[:show_cost_contours]
168
+ )
169
+ plotter.plot(costed_group_results, pareto_only: options[:pareto_only])
170
+ end
171
+
172
+ desc \
173
+ 'summarize [csv file]',
174
+ 'Read CSV from compare and write out a summary'
175
+ scale_option
176
+ use_cpu_time_option
177
+ cost_options
178
+ option \
179
+ :top,
180
+ desc: 'number of results to include',
181
+ type: :numeric,
182
+ default: CostModel::DEFAULT_SUMMARIZE_TOP
183
+ def summarize(csv_file = nil)
184
+ results = read_results(csv_file)
185
+ group_results = GroupResult.group(results, scale: options[:scale])
186
+ cost_model = make_cost_model(options)
187
+ costed_group_results =
188
+ CostedGroupResult.from_group_results(cost_model, group_results)
189
+ puts cost_model.summarize(costed_group_results, options[:top])
190
+ end
191
+
192
+ private
193
+
194
+ def make_cost_model(options)
195
+ CostModel.new(
196
+ gibyte_cost: options[:gibyte_cost],
197
+ compression_hour_cost: options[:compression_hour_cost],
198
+ decompression_hour_cost: options[:decompression_hour_cost],
199
+ use_cpu_time: options[:use_cpu_time],
200
+ currency: options[:currency]
201
+ )
202
+ end
203
+
204
+ def make_plotter(klass, options, *args, **kwargs)
205
+ klass.new(
206
+ *args,
207
+ terminal: options[:terminal],
208
+ output: options[:output],
209
+ logscale_size: options[:logscale_size],
210
+ autoscale_fix: options[:autoscale_fix],
211
+ show_labels: options[:show_labels],
212
+ lmargin: options[:lmargin],
213
+ title: options[:title],
214
+ use_cpu_time: options[:use_cpu_time],
215
+ **kwargs
216
+ )
217
+ end
218
+
219
+ def read_results(csv_file)
220
+ Result.read_csv(csv_file ? File.read(csv_file) : STDIN)
221
+ end
222
+ end
223
+ end
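
The `pareto_only` option above is described as plotting only non-dominated compressor-level pairs. The sketch below shows one common way to compute such a set for two "lower is better" measures. It is an illustration only: the `Point` struct, the `dominated?` and `pareto_front` helpers, and the sample numbers are hypothetical and are not taken from the gem's plotters.

```ruby
# Hypothetical data point: one compressor-level pair with two measures where
# lower is better (for example, time in seconds and size in bytes).
Point = Struct.new(:name, :time, :size)

# A point is dominated if some other point is at least as good on both
# measures and strictly better on at least one.
def dominated?(point, others)
  others.any? do |other|
    other.time <= point.time && other.size <= point.size &&
      (other.time < point.time || other.size < point.size)
  end
end

# Keep only the non-dominated points (the Pareto front).
def pareto_front(points)
  points.reject { |point| dominated?(point, points) }
end

points = [
  Point.new('a', 1, 10),
  Point.new('b', 2, 5),
  Point.new('c', 3, 7), # slower and larger than 'b', so it is dominated
  Point.new('d', 4, 4)
]
front = pareto_front(points)
puts front.map(&:name).join(', ')
```

This quadratic scan is fine for the few dozen compressor-level pairs a run produces; a sort-based approach would only matter for much larger inputs.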
@@ -0,0 +1,70 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'digest'
4
+ require 'fileutils'
5
+ require 'tmpdir'
6
+
7
+ module CompareCompressors
8
+ #
9
+ # Run compressors on targets and record the results.
10
+ #
11
+ # The general approach is, for each target:
12
+ #
13
+ # 1. Copy the original target (read only) to a temporary folder (read write);
14
+ # the copy is the 'work target'.
15
+ # 2. Hash the work target so we can make sure we don't change it.
16
+ # 3. For each compressor and level, compress the work target.
17
+ # 4. Remove the work target (if the compressor left it).
18
+ # 5. Decompress the compressed target; this should restore the work target.
19
+ # 6. Check the work target's hash before we start the next compressor or
20
+ # level, to make sure the compression hasn't broken it somehow.
21
+ #
22
+ # This approach is a bit complicated, but it lets us (1) make sure we don't
23
+ # change the original targets, since they're copied, (2) make sure we
24
+ # don't accidentally change the work target during the run, which would
25
+ # invalidate the results, and (3) avoid copying the work target from the
26
+ # target repeatedly.
27
+ #
28
+ class Comparer
29
+ #
30
+ # @param [CSV] csv CSV writer for output
31
+ # @param [Array<Compressor>] compressors
32
+ # @param [Array<String>] targets pathnames of targets (read only)
33
+ #
34
+ def run(csv, compressors, targets)
35
+ csv << Result.members
36
+ targets.each do |target|
37
+ Dir.mktmpdir do |tmp|
38
+ work_target = stage_target(tmp, target)
39
+ evaluate_target(csv, compressors, target, work_target)
40
+ end
41
+ end
42
+ nil
43
+ end
44
+
45
+ private
46
+
47
+ def stage_target(tmp, target)
48
+ pathname = File.join(tmp, 'data')
49
+ FileUtils.cp target, pathname
50
+ pathname
51
+ end
52
+
53
+ def evaluate_target(csv, compressors, target, work_target)
54
+ target_digest = find_digest(work_target)
55
+ compressors.each do |compressor|
56
+ compressor.levels.each do |level|
57
+ if find_digest(work_target) != target_digest
58
+ raise "digest mismatch: #{compressor.name}" \
59
+ " level #{level} on #{target}"
60
+ end
61
+ csv << compressor.evaluate(target, work_target, level)
62
+ end
63
+ end
64
+ end
65
+
66
+ def find_digest(pathname)
67
+ Digest::SHA256.file(pathname).hexdigest
68
+ end
69
+ end
70
+ end
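
The stage, hash, compress, decompress, and verify cycle described in the `Comparer` comments can be sketched in a self-contained way as follows. This is a simplified illustration under stated assumptions: it uses Ruby's built-in `Zlib` gzip classes as a stand-in for the external compression tools, and the `round_trip_ok?` helper is hypothetical, not a method of the gem.

```ruby
require 'digest'
require 'fileutils'
require 'tmpdir'
require 'zlib'

# Hypothetical helper: copy the target to a temporary folder, compress and
# decompress the copy, and check that its SHA256 digest is unchanged.
# Zlib stands in for the real command line compressors (an assumption).
def round_trip_ok?(target, level)
  Dir.mktmpdir do |tmp|
    # 1. Stage: work on a copy so the original target is never modified.
    work_target = File.join(tmp, 'data')
    FileUtils.cp target, work_target

    # 2. Hash the work target before touching it.
    digest = Digest::SHA256.file(work_target).hexdigest

    # 3. Compress the work target at the given level.
    compressed = "#{work_target}.gz"
    Zlib::GzipWriter.open(compressed, level) do |gz|
      gz.write File.binread(work_target)
    end

    # 4. Remove the work target so decompression has to restore it.
    FileUtils.rm work_target

    # 5. Decompress, which should restore the work target.
    Zlib::GzipReader.open(compressed) do |gz|
      File.binwrite(work_target, gz.read)
    end

    # 6. Verify that the restored file matches the original digest.
    Digest::SHA256.file(work_target).hexdigest == digest
  end
end
```

For example, `round_trip_ok?('data/test1', 6)` would return `true` if the gzip round trip at level 6 restores the staged copy byte for byte; the path here follows the sample file names used in the README.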