readorder 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/HISTORY ADDED
@@ -0,0 +1,4 @@
+ = Changelog
+ == Version 1.0.0
+
+ * Initial public release
data/LICENSE ADDED
@@ -0,0 +1,13 @@
+ Copyright (c) 2009, Jeremy Hinegardner
+
+ Permission to use, copy, modify, and/or distribute this software for any
+ purpose with or without fee is hereby granted, provided that the above
+ copyright notice and this permission notice appear in all copies.
+
+ THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+ WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+ MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+ OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
data/README ADDED
@@ -0,0 +1,158 @@
+ == Readorder
+
+ * Homepage[http://copiousfreetime.rubyforge.org/readorder/]
+ * {Rubyforge Project}[http://rubyforge.org/projects/copiousfreetime/]
+ * email jeremy at copiousfreetime dot org
+ * git clone git://github.com/copiousfreetime/readorder.git
+
+ == DESCRIPTION
+
+ Readorder orders a list of files into a more effective read order.
+
+ You would want to use readorder in a case where you know ahead of
+ time that you have a large quantity of files on disc to process.
+ Give readorder that list of files and it will report back the
+ order in which you should process them to make the most effective
+ use of your disc I/O.
+
+ Given a list of filenames, either on the command line or via stdin,
+ readorder will output the filenames in an order that should increase
+ the I/O throughput when the files corresponding to the filenames are
+ read off of disc.
+
+ The output order of the filenames can either be in inode order or
+ physical disc block order. This is dependent upon operating system
+ support and the permission level of the user running readorder.
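The inode ordering is the portable half of this technique: stat each file and sort by inode number. A minimal sketch, not readorder's actual implementation (physical block order additionally needs OS-specific support, such as the FIBMAP ioctl on Linux, and usually elevated privileges):

```ruby
# Sort filenames by inode number, so reads roughly follow the order the
# filesystem allocated the inodes. File::Stat#ino is standard Ruby.
def inode_order(filenames)
  filenames.sort_by { |f| File.stat(f).ino }
end
```

Feeding the result to a sequential reader tends to reduce seek distance on filesystems where inode order correlates with on-disc layout.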
+
+ == COMMANDS
+
+ === Sort
+
+ Given a list of filenames, either on the command line or via stdin,
+ output the filenames in an order that should increase the I/O
+ throughput when the contents of the files are read from disc.
+
+ ==== Synopsis
+
+ readorder sort [filelist*] [options]+
+
+ filelist (-1 ~> filelist=#<IO:0x1277e4>)
+ The files containing filenames
+ --inode
+ Only use inode order; do not attempt physical block order
+ --log-level=log-level (0 ~> log-level=info)
+ The verbosity of logging, one of [ debug, info, warn, error, fatal ]
+ --log-file=log-file (0 ~> log-file)
+ Log to this file instead of stderr
+ --output=output (0 ~> output)
+ Where to write the output
+ --error-filelist=error-filelist (0 ~> error-filelist)
+ Write all the files from the filelist that had errors to this file
+ --help, -h
+
+ ==== Example Output
+
+ === Analyze
+
+ Take the list of filenames and output an analysis of the volume of
+ data in those files.
+
+ ==== Synopsis
+
+ readorder analyze [filelist*] [options]+
+
+ filelist (-1 ~> filelist=#<IO:0x1277e4>)
+ The files containing filenames
+ --log-level=log-level (0 ~> log-level=info)
+ The verbosity of logging, one of [ debug, info, warn, error, fatal ]
+ --log-file=log-file (0 ~> log-file)
+ Log to this file instead of stderr
+ --output=output (0 ~> output)
+ Where to write the output
+ --error-filelist=error-filelist (0 ~> error-filelist)
+ Write all the files from the filelist that had errors to this file
+ --data-csv=data-csv (0 ~> data-csv)
+ Write the raw data collected to this csv file
+ --help, -h
+
+ ==== Example Output
+
+ === Test
+
+ Given a list of filenames, either on the command line or via stdin,
+ take a random subsample of them and read all the contents of those
+ files in different orders:
+
+ * in the initial given order
+ * in inode order
+ * in physical block order
+
+ Output a report of the various times taken to read the files.
+
+ This command requires elevated privileges to run. It will purge your disc
+ cache multiple times while running, and will spike the I/O of your machine.
+ Run with care.
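The core of such a comparison can be sketched with the Ruby standard library alone. This is a simplified illustration, not readorder's implementation: readorder times with the hitimes gem and also purges the OS disc cache between runs, which is what requires the elevated privileges.

```ruby
require 'benchmark'

# Read every file in the order given and report elapsed wall-clock time.
# Comparing this figure across orderings of the same file list is the
# essence of what `readorder test` reports.
def time_read(filenames)
  Benchmark.realtime { filenames.each { |f| File.binread(f) } }
end
```

Without dropping the page cache between runs, the second ordering measured benefits from cached data, which is why a real comparison must flush the cache first.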
+
+ ==== Synopsis
+
+ readorder test [filelist*] [options]+
+
+ filelist (-1 ~> filelist=#<IO:0x1277e4>)
+ The files containing filenames
+ --percentage=percentage (0 ~> int(percentage))
+ What random percentage of input files to select
+ --log-level=log-level (0 ~> log-level=info)
+ The verbosity of logging, one of [ debug, info, warn, error, fatal ]
+ --log-file=log-file (0 ~> log-file)
+ Log to this file instead of stderr
+ --error-filelist=error-filelist (0 ~> error-filelist)
+ Write all the files from the filelist that had errors to this file
+ --help, -h
+
+ ==== Example result
+
+
+ Test Using First Of
+ ========================================================================
+
+ Total files read : 8052
+ Total bytes read : 6575824
+ Minimum filesize : 637
+ Average filesize : 816.670
+ Maximum filesize : 1393
+ Stddev of sizes  : 86.936
+
+ read order                    Elapsed time (sec)    Read rate (bytes/sec)
+ ------------------------------------------------------------------------
+ original_order                           352.403                18659.944
+ inode_number                              53.606               122669.175
+ first_physical_block_number               47.520               138379.024
+
+ This is the output of a <tt>readorder test</tt> command run on a directory on
+ a ReiserFS filesystem containing 805,038 files, constituting 657,543,700 bytes
+ of data. A sample of 1% of the files was used for the test.
+
+ If we process the files in their original order, it will potentially
+ take us 9.78 hours. If we process them in physical block number order,
+ that is reduced to 1.31 hours.
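Those projections follow directly from dividing the total data volume by the measured read rates from the table above; nothing else is measured here:

```ruby
# Back-of-the-envelope check of the projections: bytes / (bytes/sec) = sec.
total_bytes = 657_543_700                      # the full data set, not the 1% sample
hours = lambda { |rate| total_bytes / rate / 3600.0 }
original = hours.call(18_659.944)              # roughly 9.8 hours
physical = hours.call(138_379.024)             # roughly 1.3 hours
```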
+
+ == CREDITS
+
+ * Linux System Programming by Robert Love
+ * {readahead project}[https://fedorahosted.org/readahead/]
+
+ == ISC LICENSE
+
+ Copyright (c) 2009, Jeremy Hinegardner
+
+ Permission to use, copy, modify, and/or distribute this software for any
+ purpose with or without fee is hereby granted, provided that the above
+ copyright notice and this permission notice appear in all copies.
+
+ THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+ WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+ MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+ OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
data/bin/readorder ADDED
@@ -0,0 +1,11 @@
+ #!/usr/bin/env ruby
+
+ #--
+ # Copyright (c) 2009
+ # All rights reserved. See LICENSE and/or COPYING for details.
+ #++
+
+ $:.unshift File.expand_path(File.join(File.dirname(__FILE__),"..","lib"))
+ require 'readorder'
+
+ ::Readorder::Cli.new( ARGV, ENV ).run
data/gemspec.rb ADDED
@@ -0,0 +1,53 @@
+ require 'rubygems'
+ require 'readorder/version'
+ require 'tasks/config'
+
+ Readorder::GEM_SPEC = Gem::Specification.new do |spec|
+   proj = Configuration.for('project')
+   spec.name    = proj.name
+   spec.version = Readorder::VERSION
+
+   spec.author      = proj.author
+   spec.email       = proj.email
+   spec.homepage    = proj.homepage
+   spec.summary     = proj.summary
+   spec.description = proj.description
+   spec.platform    = Gem::Platform::RUBY
+
+
+   pkg = Configuration.for('packaging')
+   spec.files       = pkg.files.all
+   spec.executables = pkg.files.bin.collect { |b| File.basename(b) }
+
+   # add dependencies here
+   spec.add_dependency("configuration", "~> 0.0.5")
+   spec.add_dependency("rbtree", "~> 0.2.1")
+   spec.add_dependency("main", "~> 2.8.3")
+   spec.add_dependency("logging", "~> 1.1.4")
+   spec.add_dependency("hitimes", "~> 1.0.1")
+
+   spec.add_development_dependency("rake", "~> 0.8.3")
+
+   if ext_conf = Configuration.for_if_exist?("extension") then
+     spec.extensions << ext_conf.configs
+     spec.extensions.flatten!
+     spec.require_paths << "ext"
+   end
+
+   if rdoc = Configuration.for_if_exist?('rdoc') then
+     spec.has_rdoc         = true
+     spec.extra_rdoc_files = pkg.files.rdoc
+     spec.rdoc_options     = rdoc.options + [ "--main", rdoc.main_page ]
+   else
+     spec.has_rdoc = false
+   end
+
+   if test = Configuration.for_if_exist?('testing') then
+     spec.test_files = test.files
+   end
+
+   if rf = Configuration.for_if_exist?('rubyforge') then
+     spec.rubyforge_project = rf.project
+   end
+
+ end
data/lib/readorder/analyzer.rb ADDED
@@ -0,0 +1,170 @@
+ require 'hitimes'
+ require 'readorder/datum'
+ require 'rbtree'
+
+ module Readorder
+   #
+   # Use the given Filelist and traverse all the files, collecting the
+   # appropriate Datum instances.
+   #
+   class Analyzer
+     # an Array of Datum instances for files that cannot be processed
+     attr_accessor :bad_data
+
+     # an Array of Datum instances in the order they were processed
+     attr_accessor :good_data
+
+     # an RBTree of Datum instances of those files that were analyzed,
+     # in order by physical disc block number. This only has items if
+     # the physical block number was obtained. It is empty otherwise.
+     attr_accessor :physical_order
+
+     # an RBTree of Datum instances of those files that were analyzed,
+     # in order by inode
+     attr_accessor :inode_order
+
+     #
+     # Initialize the Analyzer with the Filelist object and whether or
+     # not to gather the physical block size.
+     #
+     def initialize( filelist, get_physical = true )
+       @filelist = filelist
+       @bad_data = []
+       @good_data = []
+       @physical_order = ::MultiRBTree.new
+       @inode_order = ::MultiRBTree.new
+       @get_physical = get_physical
+       @size_metric = ::Hitimes::ValueMetric.new( 'size' )
+       @time_metric = ::Hitimes::TimedMetric.new( 'time' )
+     end
+
+     #
+     # call-seq:
+     #   analyzer.logger -> Logger
+     #
+     # return the Logger instance for the Analyzer
+     #
+     def logger
+       ::Logging::Logger[self]
+     end
+
+     #
+     # call-seq:
+     #   analyzer.collect_data -> nil
+     #
+     # Run data collection over the Filelist and store the results into
+     # *good_data* or *bad_data* as appropriate. A status message is
+     # written to the log every 10,000 files processed.
+     #
+     def collect_data
+       logger.info "Begin data collection"
+       original_order = 0
+       @filelist.each_line do |fname|
+         # logger.debug " analyzing #{fname.strip}"
+         @time_metric.measure do
+           d = Datum.new( fname )
+           d.collect( @get_physical )
+           d.original_order = original_order
+           if d.valid? then
+             @good_data << d
+             @size_metric.measure d.stat.size
+             @inode_order[d.inode_number] = d
+             if @get_physical then
+               @physical_order[d.first_physical_block_number] = d
+             end
+           else
+             @bad_data << d
+           end
+         end
+
+         if @time_metric.count % 10_000 == 0 then
+           logger.info "  processed #{@time_metric.count} at #{"%0.3f" % @time_metric.rate} files/sec"
+         end
+         original_order += 1
+       end
+       logger.info "  processed #{@time_metric.count} at #{"%0.3f" % @time_metric.rate} files/sec"
+       logger.info "  yielded #{@good_data.size} data points"
+       logger.info "End data collection"
+       nil
+     end
+
+     #
+     # call-seq:
+     #   analyzer.log_summary_report -> nil
+     #
+     # Write the summary report to the #logger
+     #
+     def log_summary_report
+       summary_report.split("\n").each do |l|
+         logger.info l
+       end
+     end
+
+     #
+     # call-seq:
+     #   analyzer.summary_report -> String
+     #
+     # Generate a summary report of how long it took to analyze the files
+     # and the filesizes found. Return it as a String.
+     #
+     def summary_report
+       s = StringIO.new
+       s.puts "Files analyzed   : #{"%12d" % @time_metric.count}"
+       s.puts "Elapsed time     : #{"%12d" % @time_metric.duration} seconds"
+       s.puts "Collection Rate  : #{"%16.3f" % @time_metric.rate} files/sec"
+       s.puts "Good files       : #{"%12d" % @good_data.size}"
+       s.puts "  average size   : #{"%16.3f" % @size_metric.mean} bytes"
+       s.puts "  minimum size   : #{"%16.3f" % @size_metric.min} bytes"
+       s.puts "  maximum size   : #{"%16.3f" % @size_metric.max} bytes"
+       s.puts "  sum of sizes   : #{"%12d" % @size_metric.sum} bytes"
+       s.puts "Bad files        : #{"%12d" % @bad_data.size}"
+       return s.string
+     end
+
+     #
+     # call-seq:
+     #   analyzer.dump_bad_data_to( IO ) -> nil
+     #
+     # Write a csv to the _IO_ object passed in. The format is:
+     #
+     #   error_reason,filename
+     #
+     # If there are no bad Datum instances then do not write anything.
+     #
+     def dump_bad_data_to( io )
+       if bad_data.size > 0 then
+         io.puts "error_reason,filename"
+         bad_data.each do |d|
+           io.puts "#{d.error_reason},#{d.filename}"
+         end
+       end
+       nil
+     end
+
+
+     #
+     # call-seq:
+     #   analyzer.dump_good_data_to( IO ) -> nil
+     #
+     # Write a csv to the _IO_ object passed in. The format is:
+     #
+     #   filename,size,inode_number,physical_block_count,first_physical_block_number
+     #
+     # The last two fields, *physical_block_count* and *first_physical_block_number*,
+     # are only written if the analyzer was able to gather physical block information.
+     #
+     def dump_good_data_to( io )
+       fields = %w[ filename size inode_number ]
+       if @get_physical then
+         fields << 'physical_block_count'
+         fields << 'first_physical_block_number'
+       end
+
+       io.puts fields.join(",")
+       good_data.each do |d|
+         row = fields.collect { |f| d.send( f ) }
+         io.puts row.join(",")
+       end
+     end
+   end
+ end
data/lib/readorder/cli.rb ADDED
@@ -0,0 +1,159 @@
+ require 'main'
+ require 'readorder/runner'
+
+ module Readorder
+   Cli = Main.create {
+     author  "Copyright 2009 (c) Jeremy Hinegardner"
+     version ::Readorder::VERSION
+
+     description <<-txt
+       Readorder orders a list of files into a more efficient read order.
+
+       Given a list of filenames, either on the command line or via stdin,
+       output the filenames in an order that should increase the I/O
+       throughput when the contents of the files are read from disc.
+     txt
+
+     run { help! }
+
+     ## --- Modes ---
+     ## Default mode is sort, which is used when no mode is given.
+
+     mode( :sort ) {
+       description <<-txt
+         Given a list of filenames, either on the command line or via stdin,
+         output the filenames in an order that should increase the I/O
+         throughput when the contents of the files are read from disc.
+       txt
+
+       option( 'inode' ) {
+         description "Only use inode order; do not attempt physical block order"
+         cast :boolean
+       }
+
+       mixin :option_log_level
+       mixin :option_log_file
+       mixin :argument_filelist
+       mixin :option_output
+       mixin :option_error_filelist
+
+       run { Cli.run_command_with_params( 'sort', params ) }
+     }
+
+     mode( :analyze ) {
+       description <<-txt
+         Take the list of filenames and output an analysis of the volume of
+         data in those files.
+       txt
+
+       mixin :option_log_level
+       mixin :option_log_file
+       mixin :argument_filelist
+       mixin :option_output
+       mixin :option_error_filelist
+
+       option( 'data-csv' ) {
+         description "Write the raw data collected to this csv file"
+         argument :required
+         validate { |f| File.directory?( File.dirname( File.expand_path( f ) ) ) }
+       }
+
+       run { Cli.run_command_with_params( 'analyze', params ) }
+     }
+
+     mode( :test ) {
+       description <<-txt
+         Given a list of filenames, either on the command line or via stdin,
+         take a random subsample of them and read all the contents of those
+         files in different orders:
+
+         1) in the initial given order
+         2) in inode order
+         3) in physical block order
+
+         Output a report of the various times taken to read the files.
+
+         This command requires elevated privileges to run and will spike the
+         I/O of your machine. Run with care.
+       txt
+       option( :percentage ) {
+         description "What random percentage of input files to select"
+         argument :required
+         default "10"
+         validate { |p|
+           pi = Float(p)
+           (pi > 0) and (pi <= 100)
+         }
+         cast :float
+       }
+       mixin :option_log_level
+       mixin :option_log_file
+       mixin :option_output
+       mixin :argument_filelist
+       mixin :option_error_filelist
+
+       run { Cli.run_command_with_params( 'test', params ) }
+     }
+
+     ## --- Mixins ---
+     mixin :argument_filelist do
+       argument( 'filelist' ) {
+         description "The files containing filenames"
+         arity '*'
+         default [ $stdin ]
+         required false
+       }
+     end
+
+     mixin :option_log_level do
+       option( 'log-level' ) do
+         description "The verbosity of logging, one of [ #{::Logging::LNAMES.map { |l| l.downcase }.join(', ')} ]"
+         argument :required
+         default 'info'
+         validate { |l| %w[ debug info warn error fatal off ].include?( l.downcase ) }
+       end
+     end
+
+     mixin :option_log_file do
+       option( 'log-file' ) do
+         description "Log to this file instead of stderr"
+         argument :required
+         validate { |f| File.directory?( File.dirname( File.expand_path( f ) ) ) }
+       end
+     end
+
+     mixin :option_output do
+       option( 'output' ) do
+         description "Where to write the output"
+         argument :required
+         validate { |f| File.directory?( File.dirname( File.expand_path( f ) ) ) }
+       end
+     end
+
+     mixin :option_error_filelist do
+       option( 'error-filelist' ) do
+         description "Write all the files from the filelist that had errors to this file"
+         argument :required
+         validate { |f| File.directory?( File.dirname( File.expand_path( f ) ) ) }
+       end
+     end
+   }
+
+
+   #
+   # Convert the Parameters::List passed in from Main into a plain Hash,
+   # flattening single-element value arrays to their single value.
+   #
+   def Cli.params_to_hash( params )
+     ( hash = params.to_hash ).keys.each do |key|
+       v = hash[key].values
+       v = v.first if v.size <= 1
+       hash[key] = v
+     end
+     return hash
+   end
+
+   def Cli.run_command_with_params( command, params )
+     ::Readorder::Runner.new( Cli.params_to_hash( params ) ).run( command )
+   end
+ end