compare_compressors 0.0.1

@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 8962f01ee986bcfd6c301100f04565ae6844d4ed
4
+ data.tar.gz: e1e02bb86cc4bc150c599f439e59a06718c5be9d
5
+ SHA512:
6
+ metadata.gz: 41d405711e8947af177e741d819df1416e24838ca77ef672571e811d6419e21e0c732b0feba92d928acee5becac5f5e4dda70ef64e872d96d9d3ed18edbb1812
7
+ data.tar.gz: eb4ce9b89346ee291e9778ea5450180772c2788792a2272bd8757a21fd018e1702a2dfec5fe1ebc053f04784132fd6d57ad0d59e272c38c029c4925eb1add646
@@ -0,0 +1,132 @@
1
+ # compare_compressors
2
+
3
+ https://github.com/jdleesmiller/compare_compressors
4
+
5
+ [![Build Status](https://travis-ci.org/jdleesmiller/compare_compressors.svg?branch=master)](https://travis-ci.org/jdleesmiller/compare_compressors)
6
+
7
+ ## Synopsis
8
+
9
+ Evaluate different compression tools and their settings by running them on a sample of data.
10
+
11
+ See [this blog post for an example of how to use the tool](TODO).
12
+
13
+ ### Usage
14
+
15
+ This utility has many system dependencies, so the easiest way to run it is via Docker:
16
+
17
+ ```shell
18
+ $ docker pull jdleesmiller/compare_compressors
19
+ ```
20
+
21
+ Generally you will run a `compare` step, followed by a `plot` or `summarize` step.
22
+
23
+ #### Compare
24
+
25
+ This step runs the compressors on the sample files and saves the results to a CSV. Assuming that your sample files are in a folder called `data` in the current directory, and they are called `test1`, `test2`, etc., the command would look like:
26
+
27
+ ```
28
+ docker run --rm \
29
+ --volume `pwd`/data:/home/app/compare_compressors/data:ro \
30
+ --volume /tmp:/tmp \
31
+ jdleesmiller/compare_compressors compare data/test* >data/compare.csv
32
+ ```
33
+
34
+ where:
35
+
36
+ - The `--rm` flag tells Docker to remove the container when it's finished.
37
+
38
+ - The ```--volume `pwd`/data:/home/app/compare_compressors/data:ro``` flag mounts `./data` on the host inside the container, so the utility can access the sample files. The trick here is that `/home/app/compare_compressors` is the utility's working directory inside the container, so the relative paths `data/test*` for the sample files will be the same both inside and outside of the container. The `:ro` makes it a read only mount; this is optional, but it provides added assurance that the utility won't change your data files.
39
+
40
+ - The `--volume /tmp:/tmp` flag is optional but may improve performance. The utility does its compression and decompression in `/tmp` inside the container, and all of the writes inside the container go through Docker's union file system. By mounting `/tmp` on the host, we bypass the union file system. (Ideally, we'd just do this in the Dockerfile, but unfortunately it's 10x slower on Docker for Mac; hopefully that will improve soon.)
41
+
42
+ #### Plot
43
+
44
+ Once you've generated a CSV with results, the tool can read the CSV and generate a `gnuplot` script to plot the results. Note that you need to have `gnuplot` installed on the host for this to work.
45
+
46
+ There are several plotting commands: `plot` gives you a 2D plot of compression time vs compressed size. There is also a `--decompression` option to plot decompression time vs compressed size instead.
47
+
48
+ ```
49
+ docker run --rm \
50
+ --volume `pwd`/data:/home/app/compare_compressors/data:ro \
51
+ --volume /tmp:/tmp \
52
+ jdleesmiller/compare_compressors plot data/compare.csv | gnuplot
53
+ ```
54
+
55
+ The `plot_costs` command takes three cost coefficients: cost per GiB of stored data, cost per hour to run the compression program, and cost per hour to run the decompression program. The program then computes a simple linear cost function. To keep the plot in 2D, the two time costs are added together.
56
+
57
+ ```
58
+ docker run --rm \
59
+ --volume `pwd`/data:/home/app/compare_compressors/data:ro \
60
+ jdleesmiller/compare_compressors plot_costs data/compare.csv \
61
+ --gibyte-cost 56.05 \
62
+ --compression-hour-cost 32.35 \
63
+ --decompression-hour-cost 177.91 \
64
+ --currency '£' | gnuplot
65
+ ```
66
+
67
+ #### Summarize
68
+
69
+ Print a list of the compressors and settings in descending order by cost. The cost function is of the same form as for `plot_costs`.
70
+
71
+ ```
72
+ docker run --rm \
73
+ --volume `pwd`/data:/home/app/compare_compressors/data \
74
+ jdleesmiller/compare_compressors summarize data/compare.csv \
75
+ --gibyte-cost 56.05 \
76
+ --compression-hour-cost 32.35 \
77
+ --decompression-hour-cost 177.91 \
78
+ --currency '£'
79
+ ```
80
+
81
+ ## Requirements
82
+
83
+ A Linux-like `/usr/bin/time` utility is required, along with several system packages. See the [Dockerfile](Dockerfile) for a list of the packages that this utility depends on. To make the plots, you will also need `gnuplot`.
84
+
85
+ If you are installing natively, without Docker, you will need Ruby, and then you can install the gem:
86
+
87
+ ```
88
+ $ gem install compare_compressors
89
+ ```
90
+
91
+ ## Development
92
+
93
+ For development, you will probably want to (1) override the default entrypoint and (2) mount the application root inside the container. To run the tests, for example:
94
+
95
+ ```
96
+ docker run --rm -it --entrypoint='' \
97
+ --volume=compare_compressors_bundle:/home/app/compare_compressors/.bundle \
98
+ --volume=`pwd`:/home/app/compare_compressors \
99
+ compare_compressors bundle exec rake
100
+ ```
101
+
102
+ The only caveat is that you need to preserve the `.bundle` folder inside the container by mounting it as a volume; the above command does this using a named volume, `compare_compressors_bundle`, which will persist between runs and be easier to identify in the `docker volume ls` output.
103
+
104
+ ## Related
105
+
106
+ - See [this blog post for an example of how to use the tool](TODO).
107
+ - For many more compression algorithms, see https://quixdb.github.io/squash-benchmark/
108
+
109
+ ## License
110
+
111
+ (The MIT License)
112
+
113
+ Copyright (c) 2017 John Lees-Miller
114
+
115
+ Permission is hereby granted, free of charge, to any person obtaining
116
+ a copy of this software and associated documentation files (the
117
+ 'Software'), to deal in the Software without restriction, including
118
+ without limitation the rights to use, copy, modify, merge, publish,
119
+ distribute, sublicense, and/or sell copies of the Software, and to
120
+ permit persons to whom the Software is furnished to do so, subject to
121
+ the following conditions:
122
+
123
+ The above copyright notice and this permission notice shall be
124
+ included in all copies or substantial portions of the Software.
125
+
126
+ THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
127
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
128
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
129
+ IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
130
+ CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
131
+ TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
132
+ SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
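
The README's `plot_costs` and `summarize` sections describe a simple linear cost function over three coefficients. The sketch below shows one way such a function could look. It is not taken from the gem: `linear_cost`, the constants, and the choice of bytes and seconds as input units are assumptions made for illustration, and the sample size and timings are invented. Only the three coefficient values come from the README's example commands.

```ruby
# Hypothetical helper, not part of compare_compressors: a linear cost in the
# three coefficients named in the README. Input units (bytes, seconds) are an
# assumption of this sketch.
BYTES_PER_GIB = 1024.0**3
SECONDS_PER_HOUR = 3600.0

def linear_cost(size_bytes:, compression_seconds:, decompression_seconds:,
                gibyte_cost:, compression_hour_cost:, decompression_hour_cost:)
  storage = gibyte_cost * size_bytes / BYTES_PER_GIB
  compression = compression_hour_cost * compression_seconds / SECONDS_PER_HOUR
  decompression =
    decompression_hour_cost * decompression_seconds / SECONDS_PER_HOUR
  {
    storage: storage,
    # The README says the two time costs are added together for the 2D plot.
    time: compression + decompression,
    total: storage + compression + decompression
  }
end

# Invented measurements: 2 GiB compressed, 30 min to compress, 6 min to
# decompress; coefficients as in the README's example commands.
cost = linear_cost(
  size_bytes: 2 * 1024**3,
  compression_seconds: 1800,
  decompression_seconds: 360,
  gibyte_cost: 56.05,
  compression_hour_cost: 32.35,
  decompression_hour_cost: 177.91
)
puts format('storage %.2f + time %.2f = total %.2f',
            cost[:storage], cost[:time], cost[:total])
```

Because the function is linear, doubling any one coefficient doubles only its own term, which makes it easy to see which of storage, compression, or decompression dominates for a given workload.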
@@ -0,0 +1,6 @@
1
+ #!/usr/bin/env ruby
2
+ # frozen_string_literal: true
3
+
4
+ require 'compare_compressors'
5
+
6
+ CompareCompressors::CommandLineInterface.start(ARGV)
@@ -0,0 +1,40 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'English'
4
+
5
+ require_relative 'compare_compressors/version'
6
+
7
+ require_relative 'compare_compressors/comparer'
8
+ require_relative 'compare_compressors/cost_model'
9
+ require_relative 'compare_compressors/result'
10
+ require_relative 'compare_compressors/group_result'
11
+ require_relative 'compare_compressors/costed_group_result'
12
+
13
+ require_relative 'compare_compressors/plotter'
14
+ require_relative 'compare_compressors/plotters/cost_plotter'
15
+ require_relative 'compare_compressors/plotters/raw_plotter'
16
+ require_relative 'compare_compressors/plotters/size_plotter'
17
+
18
+ require_relative 'compare_compressors/compressor'
19
+ require_relative 'compare_compressors/compressors/brotli_compressor'
20
+ require_relative 'compare_compressors/compressors/bzip2_compressor'
21
+ require_relative 'compare_compressors/compressors/gzip_compressor'
22
+ require_relative 'compare_compressors/compressors/seven_zip_compressor'
23
+ require_relative 'compare_compressors/compressors/xz_compressor'
24
+ require_relative 'compare_compressors/compressors/zstd_compressor'
25
+
26
+ require_relative 'compare_compressors/command_line_interface'
27
+
28
+ #
29
+ # Compare compression algorithms.
30
+ #
31
+ module CompareCompressors
32
+ COMPRESSORS = [
33
+ BrotliCompressor,
34
+ Bzip2Compressor,
35
+ GzipCompressor,
36
+ SevenZipCompressor,
37
+ XzCompressor,
38
+ ZstdCompressor
39
+ ].map(&:new).freeze
40
+ end
@@ -0,0 +1,223 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'csv'
4
+ require 'thor'
5
+
6
+ module CompareCompressors
7
+ #
8
+ # Handle generic command line options and run the relevant command.
9
+ #
10
+ class CommandLineInterface < Thor
11
+ desc \
12
+ 'version',
13
+ 'print version (also available as --version)'
14
+ def version
15
+ puts "compare_compressors-#{CompareCompressors::VERSION}"
16
+ COMPRESSORS.each do |compressor|
17
+ puts format('%10s: %s', compressor.name, compressor.version || '?')
18
+ end
19
+ end
20
+ map %w(--version -v) => :version
21
+
22
+ desc \
23
+ 'compare <target files>',
24
+ 'Run compression tools on targets and write a CSV'
25
+ def compare(*targets)
26
+ CSV do |csv|
27
+ Comparer.new.run(csv, COMPRESSORS, targets)
28
+ end
29
+ end
30
+
31
+ class <<self
32
+ def scale_option
33
+ option \
34
+ :scale,
35
+ type: :numeric,
36
+ desc: 'scale factor from sample targets to full dataset',
37
+ default: 1.0
38
+ end
39
+
40
+ def use_cpu_time_option
41
+ option \
42
+ :use_cpu_time,
43
+ type: :boolean,
44
+ desc: 'use CPU time rather than elapsed time',
45
+ default: CostModel::DEFAULT_USE_CPU_TIME
46
+ end
47
+
48
+ def cost_options
49
+ option \
50
+ :gibyte_cost,
51
+ type: :numeric,
52
+ desc: 'storage cost per gibibyte (GiB) of compressed output',
53
+ default: CostModel::DEFAULT_GIBYTE_COST
54
+ option \
55
+ :compression_hour_cost,
56
+ type: :numeric,
57
+ desc: 'compute cost per hour of CPU time for compression',
58
+ default: CostModel::DEFAULT_HOUR_COST
59
+ option \
60
+ :decompression_hour_cost,
61
+ type: :numeric,
62
+ desc: 'compute cost per hour of CPU time for decompression',
63
+ default: CostModel::DEFAULT_HOUR_COST
64
+ option \
65
+ :currency,
66
+ type: :string,
67
+ desc: 'currency symbol for display',
68
+ default: CostModel::DEFAULT_CURRENCY
69
+ end
70
+
71
+ def plot_options
72
+ option \
73
+ :terminal,
74
+ desc: 'the terminal line for gnuplot',
75
+ default: Plotter::DEFAULT_TERMINAL
76
+ option \
77
+ :output,
78
+ desc: 'the output name for gnuplot',
79
+ default: Plotter::DEFAULT_OUTPUT
80
+ option \
81
+ :pareto_only,
82
+ desc: 'plot only non-dominated compressor-level pairs',
83
+ type: :boolean,
84
+ default: true
85
+ option \
86
+ :logscale_size,
87
+ desc: 'use a log10 scale for the size (lucky you if you need this)',
88
+ type: :boolean,
89
+ default: Plotter::DEFAULT_LOGSCALE_SIZE
90
+ option \
91
+ :autoscale_fix,
92
+ desc: 'zoom axes to fit the points tightly',
93
+ type: :boolean,
94
+ default: Plotter::DEFAULT_AUTOSCALE_FIX
95
+ option \
96
+ :show_labels,
97
+ desc: 'show compression level labels on the plot',
98
+ type: :boolean,
99
+ default: Plotter::DEFAULT_SHOW_LABELS
100
+ option \
101
+ :lmargin,
102
+ desc: 'adjust lmargin (workaround if y label is cut off on png)',
103
+ type: :numeric,
104
+ default: Plotter::DEFAULT_LMARGIN
105
+ option \
106
+ :title,
107
+ desc: 'main title (must not contain double quotes)',
108
+ type: :string,
109
+ default: Plotter::DEFAULT_TITLE
110
+ end
111
+ end
112
+
113
+ desc \
114
+ 'plot [csv file]',
115
+ 'Write a gnuplot script for a basic 2D plot with the CSV from compare'
116
+ scale_option
117
+ use_cpu_time_option
118
+ plot_options
119
+ option \
120
+ :decompression,
121
+ desc: 'show decompression time instead of compression time',
122
+ type: :boolean,
123
+ default: SizePlotter::DEFAULT_DECOMPRESSION
124
+ def plot(csv_file = nil)
125
+ results = read_results(csv_file)
126
+ group_results = GroupResult.group(results, scale: options[:scale])
127
+ plotter = make_plotter(
128
+ SizePlotter, options,
129
+ decompression: options[:decompression]
130
+ )
131
+ plotter.plot(group_results, pareto_only: options[:pareto_only])
132
+ end
133
+
134
+ desc \
135
+ 'plot_3d [csv file]',
136
+ 'Write a gnuplot script for a 3D plot with the CSV from compare'
137
+ scale_option
138
+ use_cpu_time_option
139
+ plot_options
140
+ def plot_3d(csv_file = nil)
141
+ results = read_results(csv_file)
142
+ group_results = GroupResult.group(results, scale: options[:scale])
143
+ plotter = make_plotter(RawPlotter, options)
144
+ plotter.plot(group_results, pareto_only: options[:pareto_only])
145
+ end
146
+
147
+ desc \
148
+ 'plot_costs [csv file]',
149
+ 'Write a gnuplot script for a 2D cost plot with the CSV from compare'
150
+ scale_option
151
+ use_cpu_time_option
152
+ cost_options
153
+ plot_options
154
+ option \
155
+ :show_cost_contours,
156
+ desc: 'show cost function contours',
157
+ type: :boolean,
158
+ default: CostPlotter::DEFAULT_SHOW_COST_CONTOURS
159
+ def plot_costs(csv_file = nil)
160
+ results = read_results(csv_file)
161
+ group_results = GroupResult.group(results, scale: options[:scale])
162
+ cost_model = make_cost_model(options)
163
+ costed_group_results =
164
+ CostedGroupResult.from_group_results(cost_model, group_results)
165
+ plotter = make_plotter(
166
+ CostPlotter, options, cost_model,
167
+ show_cost_contours: options[:show_cost_contours]
168
+ )
169
+ plotter.plot(costed_group_results, pareto_only: options[:pareto_only])
170
+ end
171
+
172
+ desc \
173
+ 'summarize [csv file]',
174
+ 'Read CSV from compare and write out a summary'
175
+ scale_option
176
+ use_cpu_time_option
177
+ cost_options
178
+ option \
179
+ :top,
180
+ desc: 'number of results to include',
181
+ type: :numeric,
182
+ default: CostModel::DEFAULT_SUMMARIZE_TOP
183
+ def summarize(csv_file = nil)
184
+ results = read_results(csv_file)
185
+ group_results = GroupResult.group(results, scale: options[:scale])
186
+ cost_model = make_cost_model(options)
187
+ costed_group_results =
188
+ CostedGroupResult.from_group_results(cost_model, group_results)
189
+ puts cost_model.summarize(costed_group_results, options[:top])
190
+ end
191
+
192
+ private
193
+
194
+ def make_cost_model(options)
195
+ CostModel.new(
196
+ gibyte_cost: options[:gibyte_cost],
197
+ compression_hour_cost: options[:compression_hour_cost],
198
+ decompression_hour_cost: options[:decompression_hour_cost],
199
+ use_cpu_time: options[:use_cpu_time],
200
+ currency: options[:currency]
201
+ )
202
+ end
203
+
204
+ def make_plotter(klass, options, *args, **kwargs)
205
+ klass.new(
206
+ *args,
207
+ terminal: options[:terminal],
208
+ output: options[:output],
209
+ logscale_size: options[:logscale_size],
210
+ autoscale_fix: options[:autoscale_fix],
211
+ show_labels: options[:show_labels],
212
+ lmargin: options[:lmargin],
213
+ title: options[:title],
214
+ use_cpu_time: options[:use_cpu_time],
215
+ **kwargs
216
+ )
217
+ end
218
+
219
+ def read_results(csv_file)
220
+ Result.read_csv(csv_file ? File.read(csv_file) : STDIN)
221
+ end
222
+ end
223
+ end
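
The `--pareto-only` option above is described as plotting only non-dominated compressor-level pairs. As a reading aid, here is a minimal sketch of that idea for the 2D case (time vs compressed size). The `Point` struct, `dominates?`, `pareto_front`, and the sample numbers are all hypothetical and are not the gem's implementation, which this diff does not show.

```ruby
# Hypothetical illustration of a 2D Pareto filter; not the gem's code.
Point = Struct.new(:name, :time, :size)

# `a` dominates `b` if it is at least as good on both axes and strictly
# better on at least one (smaller is better for both time and size).
def dominates?(a, b)
  a.time <= b.time && a.size <= b.size &&
    (a.time < b.time || a.size < b.size)
end

# Keep only the points that no other point dominates.
def pareto_front(points)
  points.reject do |candidate|
    points.any? { |other| dominates?(other, candidate) }
  end
end

# Invented sample measurements (seconds, arbitrary size units).
points = [
  Point.new('gzip -1', 1.0, 50),
  Point.new('gzip -9', 5.0, 40),
  Point.new('bzip2 -9', 6.0, 45), # slower and larger than gzip -9
  Point.new('xz -6', 20.0, 30)
]

front = pareto_front(points).map(&:name)
puts front.inspect
```

In this made-up data set, `bzip2 -9` is dropped because `gzip -9` is both faster and smaller; the other three each represent a different time/size trade-off and stay on the front.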
@@ -0,0 +1,70 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'digest'
4
+ require 'fileutils'
5
+ require 'tmpdir'
6
+
7
+ module CompareCompressors
8
+ #
9
+ # Run compressors on targets and record the results.
10
+ #
11
+ # The general approach is, for each target:
12
+ #
13
+ # 1. Copy the original target (read only) to a temporary folder (read write);
14
+ # the copy is the 'work target'.
15
+ # 2. Hash the work target so we can make sure we don't change it.
16
+ # 3. For each compressor and level, compress the work target.
17
+ # 4. Remove the work target (if the compressor left it).
18
+ # 5. Decompress the compressed target; this should restore the work target.
19
+ # 6. Check the work target's hash before we start the next compressor or
20
+ # level, to make sure the compression hasn't broken it somehow.
21
+ #
22
+ # This approach is a bit complicated, but it lets us (1) make sure we don't
23
+ # change the original targets, since they're copied, (2) make sure we
24
+ # don't accidentally change the work target during the run, which would
25
+ # invalidate the results, and (3) avoid copying the work target from the
26
+ # target repeatedly.
27
+ #
28
+ class Comparer
29
+ #
30
+ # @param [CSV] csv CSV writer for output
31
+ # @param [Array<Compressor>] compressors
32
+ # @param [Array<String>] targets pathnames of targets (read only)
33
+ #
34
+ def run(csv, compressors, targets)
35
+ csv << Result.members
36
+ targets.each do |target|
37
+ Dir.mktmpdir do |tmp|
38
+ work_target = stage_target(tmp, target)
39
+ evaluate_target(csv, compressors, target, work_target)
40
+ end
41
+ end
42
+ nil
43
+ end
44
+
45
+ private
46
+
47
+ def stage_target(tmp, target)
48
+ pathname = File.join(tmp, 'data')
49
+ FileUtils.cp target, pathname
50
+ pathname
51
+ end
52
+
53
+ def evaluate_target(csv, compressors, target, work_target)
54
+ target_digest = find_digest(work_target)
55
+ compressors.each do |compressor|
56
+ compressor.levels.each do |level|
57
+ if find_digest(work_target) != target_digest
58
+ raise "digest mismatch: #{compressor.name}" \
59
+ " level #{level} on #{target}"
60
+ end
61
+ csv << compressor.evaluate(target, work_target, level)
62
+ end
63
+ end
64
+ end
65
+
66
+ def find_digest(pathname)
67
+ Digest::SHA256.file(pathname).hexdigest
68
+ end
69
+ end
70
+ end
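
The six-step approach in the `Comparer` class comment can be tried in isolation with a stand-in compressor. The sketch below uses Ruby's standard `Zlib` in-process instead of the gem's `Compressor` classes (whose `evaluate` method is not shown in this diff), so `round_trip`, `sha256`, and the generated sample file are assumptions for illustration only.

```ruby
require 'digest'
require 'fileutils'
require 'tmpdir'
require 'zlib'

def sha256(pathname)
  Digest::SHA256.file(pathname).hexdigest
end

# Stand-in for a real compressor run: gzip via Zlib at the given level.
# Returns the compressed size in bytes.
def round_trip(work_target, level)
  compressed = "#{work_target}.gz"
  Zlib::GzipWriter.open(compressed, level) do |gz| # step 3: compress
    gz.write(File.binread(work_target))
  end
  File.delete(work_target)                         # step 4: remove work target
  data = Zlib::GzipReader.open(compressed, &:read) # step 5: decompress
  File.binwrite(work_target, data)
  size = File.size(compressed)
  File.delete(compressed)
  size
end

sizes = {}
Dir.mktmpdir do |tmp|
  # A generated sample stands in for a real (read-only) target file.
  original = File.join(tmp, 'original')
  File.binwrite(original, 'sample data ' * 1000)

  work_target = File.join(tmp, 'data')
  FileUtils.cp(original, work_target) # step 1: stage a writable copy
  expected = sha256(work_target)      # step 2: hash the work target

  [1, 6, 9].each do |level|
    # Step 6: verify the work target before starting the next level.
    if sha256(work_target) != expected
      raise "digest mismatch before level #{level}"
    end
    sizes[level] = round_trip(work_target, level)
  end
  raise 'digest mismatch after last level' if sha256(work_target) != expected
end

puts sizes.inspect
```

Hashing once up front and re-checking before each run means the work target is copied only once per target, while a round trip that fails to restore it is still caught before it can affect the next measurement.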