readorder 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/HISTORY ADDED
@@ -0,0 +1,4 @@
+ = Changelog
+ == Version 1.0.0
+
+ * Initial public release
data/LICENSE ADDED
@@ -0,0 +1,13 @@
+ Copyright (c) 2009, Jeremy Hinegardner
+
+ Permission to use, copy, modify, and/or distribute this software for any
+ purpose with or without fee is hereby granted, provided that the above
+ copyright notice and this permission notice appear in all copies.
+
+ THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+ WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+ MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+ OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
data/README ADDED
@@ -0,0 +1,158 @@
+ == Readorder
+
+ * Homepage[http://copiousfreetime.rubyforge.org/readorder/]
+ * {Rubyforge Project}[http://rubyforge.org/projects/copiousfreetime/]
+ * email jeremy at copiousfreetime dot org
+ * git clone git://github.com/copiousfreetime/readorder.git
+
+ == DESCRIPTION
+
+ Readorder orders a list of files into a more effective read order.
+
+ You would want to use readorder in a case where you know ahead of
+ time that you have a large quantity of files on disc to process.
+ Give readorder that list of files and it will report back the
+ order in which you should process them to make the most effective
+ use of your disc I/O.
+
+ Given a list of filenames, either on the command line or via stdin,
+ readorder will output the filenames in an order that should increase
+ the I/O throughput when the files corresponding to the filenames are
+ read off of disc.
+
+ The output order of the filenames can either be in inode order or
+ physical disc block order. This is dependent upon operating system
+ support and the permission level of the user running readorder.
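The inode ordering is the portable half of this technique: stat each file and sort by inode number. A minimal sketch, not readorder's actual implementation (physical block order additionally needs OS-specific support, such as the FIBMAP ioctl on Linux, and usually elevated privileges):

```ruby
# Sort filenames by inode number, so reads roughly follow the order the
# filesystem allocated the inodes. File::Stat#ino is standard Ruby.
def inode_order(filenames)
  filenames.sort_by { |f| File.stat(f).ino }
end
```

Feeding the result to a sequential reader tends to reduce seek distance on filesystems where inode order correlates with on-disc layout.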
+
+ == COMMANDS
+
+ === Sort
+
+ Given a list of filenames, either on the command line or via stdin,
+ output the filenames in an order that should increase the I/O
+ throughput when the contents of the files are read from disc.
+
+ ==== Synopsis
+
+ readorder sort [filelist*] [options]+
+
+ filelist (-1 ~> filelist=#<IO:0x1277e4>)
+ The files containing filenames
+ --inode
+ Only use inode order; do not attempt physical block order
+ --log-level=log-level (0 ~> log-level=info)
+ The verbosity of logging, one of [ debug, info, warn, error, fatal ]
+ --log-file=log-file (0 ~> log-file)
+ Log to this file instead of stderr
+ --output=output (0 ~> output)
+ Where to write the output
+ --error-filelist=error-filelist (0 ~> error-filelist)
+ Write all the files from the filelist that had errors to this file
+ --help, -h
+
+ ==== Example Output
+
+ === Analyze
+
+ Take the list of filenames and output an analysis of the volume of
+ data in those files.
+
+ ==== Synopsis
+
+ readorder analyze [filelist*] [options]+
+
+ filelist (-1 ~> filelist=#<IO:0x1277e4>)
+ The files containing filenames
+ --log-level=log-level (0 ~> log-level=info)
+ The verbosity of logging, one of [ debug, info, warn, error, fatal ]
+ --log-file=log-file (0 ~> log-file)
+ Log to this file instead of stderr
+ --output=output (0 ~> output)
+ Where to write the output
+ --error-filelist=error-filelist (0 ~> error-filelist)
+ Write all the files from the filelist that had errors to this file
+ --data-csv=data-csv (0 ~> data-csv)
+ Write the raw data collected to this csv file
+ --help, -h
+
+ ==== Example Output
+
+ === Test
+
+ Given a list of filenames, either on the command line or via stdin,
+ take a random subsample of them and read all the contents of those
+ files in different orders:
+
+ * in the initial given order
+ * in inode order
+ * in physical block order
+
+ Output a report of the various times taken to read the files.
+
+ This command requires elevated privileges to run. It will purge your disc
+ cache multiple times while running, and will spike the I/O of your machine.
+ Run with care.
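The core of such a comparison can be sketched with the Ruby standard library alone. This is a simplified illustration, not readorder's implementation: readorder times with the hitimes gem and also purges the OS disc cache between runs, which is what requires the elevated privileges.

```ruby
require 'benchmark'

# Read every file in the order given and report elapsed wall-clock time.
# Comparing this figure across orderings of the same file list is the
# essence of what `readorder test` reports.
def time_read(filenames)
  Benchmark.realtime { filenames.each { |f| File.binread(f) } }
end
```

Without dropping the page cache between runs, the second ordering measured benefits from cached data, which is why a real comparison must flush the cache first.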
+
+ ==== Synopsis
+
+ readorder test [filelist*] [options]+
+
+ filelist (-1 ~> filelist=#<IO:0x1277e4>)
+ The files containing filenames
+ --percentage=percentage (0 ~> int(percentage))
+ What random percentage of input files to select
+ --log-level=log-level (0 ~> log-level=info)
+ The verbosity of logging, one of [ debug, info, warn, error, fatal ]
+ --log-file=log-file (0 ~> log-file)
+ Log to this file instead of stderr
+ --error-filelist=error-filelist (0 ~> error-filelist)
+ Write all the files from the filelist that had errors to this file
+ --help, -h
+
+ ==== Example result
+
+
+ Test Using First Of
+ ========================================================================
+
+ Total files read : 8052
+ Total bytes read : 6575824
+ Minimum filesize : 637
+ Average filesize : 816.670
+ Maximum filesize : 1393
+ Stddev of sizes  : 86.936
+
+ read order                    Elapsed time (sec)    Read rate (bytes/sec)
+ ------------------------------------------------------------------------
+ original_order                           352.403                18659.944
+ inode_number                              53.606               122669.175
+ first_physical_block_number               47.520               138379.024
+
+ This is the output of a <tt>readorder test</tt> command run on a directory on
+ a ReiserFS filesystem containing 805,038 files, constituting 657,543,700 bytes
+ of data. A sample of 1% of the files was used for the test.
+
+ If we process the files in their original order, it will potentially
+ take us 9.78 hours. If we process them in physical block number order,
+ that is reduced to 1.31 hours.
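Those projections follow directly from dividing the total data volume by the measured read rates from the table above; nothing else is measured here:

```ruby
# Back-of-the-envelope check of the projections: bytes / (bytes/sec) = sec.
total_bytes = 657_543_700                      # the full data set, not the 1% sample
hours = lambda { |rate| total_bytes / rate / 3600.0 }
original = hours.call(18_659.944)              # roughly 9.8 hours
physical = hours.call(138_379.024)             # roughly 1.3 hours
```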
+
+ == CREDITS
+
+ * Linux System Programming by Robert Love
+ * {readahead project}[https://fedorahosted.org/readahead/]
+
+ == ISC LICENSE
+
+ Copyright (c) 2009, Jeremy Hinegardner
+
+ Permission to use, copy, modify, and/or distribute this software for any
+ purpose with or without fee is hereby granted, provided that the above
+ copyright notice and this permission notice appear in all copies.
+
+ THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+ WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+ MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+ OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
data/bin/readorder ADDED
@@ -0,0 +1,11 @@
+ #!/usr/bin/env ruby
+
+ #--
+ # Copyright (c) 2009
+ # All rights reserved. See LICENSE and/or COPYING for details.
+ #++
+
+ $:.unshift File.expand_path(File.join(File.dirname(__FILE__),"..","lib"))
+ require 'readorder'
+
+ ::Readorder::Cli.new( ARGV, ENV ).run
data/gemspec.rb ADDED
@@ -0,0 +1,53 @@
+ require 'rubygems'
+ require 'readorder/version'
+ require 'tasks/config'
+
+ Readorder::GEM_SPEC = Gem::Specification.new do |spec|
+   proj = Configuration.for('project')
+   spec.name    = proj.name
+   spec.version = Readorder::VERSION
+
+   spec.author      = proj.author
+   spec.email       = proj.email
+   spec.homepage    = proj.homepage
+   spec.summary     = proj.summary
+   spec.description = proj.description
+   spec.platform    = Gem::Platform::RUBY
+
+
+   pkg = Configuration.for('packaging')
+   spec.files       = pkg.files.all
+   spec.executables = pkg.files.bin.collect { |b| File.basename(b) }
+
+   # add dependencies here
+   spec.add_dependency("configuration", "~> 0.0.5")
+   spec.add_dependency("rbtree", "~> 0.2.1")
+   spec.add_dependency("main", "~> 2.8.3")
+   spec.add_dependency("logging", "~> 1.1.4")
+   spec.add_dependency("hitimes", "~> 1.0.1")
+
+   spec.add_development_dependency("rake", "~> 0.8.3")
+
+   if ext_conf = Configuration.for_if_exist?("extension") then
+     spec.extensions << ext_conf.configs
+     spec.extensions.flatten!
+     spec.require_paths << "ext"
+   end
+
+   if rdoc = Configuration.for_if_exist?('rdoc') then
+     spec.has_rdoc         = true
+     spec.extra_rdoc_files = pkg.files.rdoc
+     spec.rdoc_options     = rdoc.options + [ "--main", rdoc.main_page ]
+   else
+     spec.has_rdoc = false
+   end
+
+   if test = Configuration.for_if_exist?('testing') then
+     spec.test_files = test.files
+   end
+
+   if rf = Configuration.for_if_exist?('rubyforge') then
+     spec.rubyforge_project = rf.project
+   end
+
+ end
data/lib/readorder/analyzer.rb ADDED
@@ -0,0 +1,170 @@
+ require 'hitimes'
+ require 'readorder/datum'
+ require 'rbtree'
+
+ module Readorder
+   #
+   # Use the given Filelist and traverse all the files, collecting the
+   # appropriate Datum instances.
+   #
+   class Analyzer
+     # an Array of Datum instances for files that cannot be processed
+     attr_accessor :bad_data
+
+     # an Array of Datum instances in the order they were processed
+     attr_accessor :good_data
+
+     # an RBTree of Datum instances of those files that were analyzed,
+     # in order by physical disc block number. This only has items if
+     # the physical block number was obtained. It is empty otherwise.
+     attr_accessor :physical_order
+
+     # an RBTree of Datum instances of those files that were analyzed,
+     # in order by inode
+     attr_accessor :inode_order
+
+     #
+     # Initialize the Analyzer with the Filelist object and whether or
+     # not to gather the physical block size.
+     #
+     def initialize( filelist, get_physical = true )
+       @filelist = filelist
+       @bad_data = []
+       @good_data = []
+       @physical_order = ::MultiRBTree.new
+       @inode_order = ::MultiRBTree.new
+       @get_physical = get_physical
+       @size_metric = ::Hitimes::ValueMetric.new( 'size' )
+       @time_metric = ::Hitimes::TimedMetric.new( 'time' )
+     end
+
+     #
+     # call-seq:
+     #   analyzer.logger -> Logger
+     #
+     # return the Logger instance for the Analyzer
+     #
+     def logger
+       ::Logging::Logger[self]
+     end
+
+     #
+     # call-seq:
+     #   analyzer.collect_data -> nil
+     #
+     # Run data collection over the Filelist and store the results into
+     # *good_data* or *bad_data* as appropriate. A status message is
+     # written to the log every 10,000 files processed.
+     #
+     def collect_data
+       logger.info "Begin data collection"
+       original_order = 0
+       @filelist.each_line do |fname|
+         # logger.debug " analyzing #{fname.strip}"
+         @time_metric.measure do
+           d = Datum.new( fname )
+           d.collect( @get_physical )
+           d.original_order = original_order
+           if d.valid? then
+             @good_data << d
+             @size_metric.measure d.stat.size
+             @inode_order[d.inode_number] = d
+             if @get_physical then
+               @physical_order[d.first_physical_block_number] = d
+             end
+           else
+             @bad_data << d
+           end
+         end
+
+         if @time_metric.count % 10_000 == 0 then
+           logger.info "  processed #{@time_metric.count} at #{"%0.3f" % @time_metric.rate} files/sec"
+         end
+         original_order += 1
+       end
+       logger.info "  processed #{@time_metric.count} at #{"%0.3f" % @time_metric.rate} files/sec"
+       logger.info "  yielded #{@good_data.size} data points"
+       logger.info "End data collection"
+       nil
+     end
+
+     #
+     # call-seq:
+     #   analyzer.log_summary_report -> nil
+     #
+     # Write the summary report to the #logger
+     #
+     def log_summary_report
+       summary_report.split("\n").each do |l|
+         logger.info l
+       end
+     end
+
+     #
+     # call-seq:
+     #   analyzer.summary_report -> String
+     #
+     # Generate a summary report of how long it took to analyze the files
+     # and the filesizes found. Return it as a String.
+     #
+     def summary_report
+       s = StringIO.new
+       s.puts "Files analyzed   : #{"%12d" % @time_metric.count}"
+       s.puts "Elapsed time     : #{"%12d" % @time_metric.duration} seconds"
+       s.puts "Collection Rate  : #{"%16.3f" % @time_metric.rate} files/sec"
+       s.puts "Good files       : #{"%12d" % @good_data.size}"
+       s.puts "  average size   : #{"%16.3f" % @size_metric.mean} bytes"
+       s.puts "  minimum size   : #{"%16.3f" % @size_metric.min} bytes"
+       s.puts "  maximum size   : #{"%16.3f" % @size_metric.max} bytes"
+       s.puts "  sum of sizes   : #{"%12d" % @size_metric.sum} bytes"
+       s.puts "Bad files        : #{"%12d" % @bad_data.size}"
+       return s.string
+     end
+
+     #
+     # call-seq:
+     #   analyzer.dump_bad_data_to( IO ) -> nil
+     #
+     # Write a csv to the _IO_ object passed in. The format is:
+     #
+     #   error_reason,filename
+     #
+     # If there are no bad Datum instances then do not write anything.
+     #
+     def dump_bad_data_to( io )
+       if bad_data.size > 0 then
+         io.puts "error_reason,filename"
+         bad_data.each do |d|
+           io.puts "#{d.error_reason},#{d.filename}"
+         end
+       end
+       nil
+     end
+
+
+     #
+     # call-seq:
+     #   analyzer.dump_good_data_to( IO ) -> nil
+     #
+     # Write a csv to the _IO_ object passed in. The format is:
+     #
+     #   filename,size,inode_number,physical_block_count,first_physical_block_number
+     #
+     # The last two fields, *physical_block_count* and *first_physical_block_number*,
+     # are only written if the analyzer was able to gather physical block information.
+     #
+     def dump_good_data_to( io )
+       fields = %w[ filename size inode_number ]
+       if @get_physical then
+         fields << 'physical_block_count'
+         fields << 'first_physical_block_number'
+       end
+
+       io.puts fields.join(",")
+       good_data.each do |d|
+         row = fields.collect { |f| d.send( f ) }
+         io.puts row.join(",")
+       end
+     end
+   end
+ end
data/lib/readorder/cli.rb ADDED
@@ -0,0 +1,159 @@
+ require 'main'
+ require 'readorder/runner'
+
+ module Readorder
+   Cli = Main.create {
+     author  "Copyright 2009 (c) Jeremy Hinegardner"
+     version ::Readorder::VERSION
+
+     description <<-txt
+       Readorder orders a list of files into a more efficient read order.
+
+       Given a list of filenames, either on the command line or via stdin,
+       output the filenames in an order that should increase the I/O
+       throughput when the contents of the files are read from disc.
+     txt
+
+     run { help! }
+
+     ## --- Modes ---
+     ## Default mode is sort, which is used when no mode is given.
+
+     mode( :sort ) {
+       description <<-txt
+         Given a list of filenames, either on the command line or via stdin,
+         output the filenames in an order that should increase the I/O
+         throughput when the contents of the files are read from disc.
+       txt
+
+       option( 'inode' ) {
+         description "Only use inode order; do not attempt physical block order"
+         cast :boolean
+       }
+
+       mixin :option_log_level
+       mixin :option_log_file
+       mixin :argument_filelist
+       mixin :option_output
+       mixin :option_error_filelist
+
+       run { Cli.run_command_with_params( 'sort', params ) }
+     }
+
+     mode( :analyze ) {
+       description <<-txt
+         Take the list of filenames and output an analysis of the volume of
+         data in those files.
+       txt
+
+       mixin :option_log_level
+       mixin :option_log_file
+       mixin :argument_filelist
+       mixin :option_output
+       mixin :option_error_filelist
+
+       option( 'data-csv' ) {
+         description "Write the raw data collected to this csv file"
+         argument :required
+         validate { |f| File.directory?( File.dirname( File.expand_path( f ) ) ) }
+       }
+
+       run { Cli.run_command_with_params( 'analyze', params ) }
+     }
+
+     mode( :test ) {
+       description <<-txt
+         Given a list of filenames, either on the command line or via stdin,
+         take a random subsample of them and read all the contents of those
+         files in different orders:
+
+         1) in the initial given order
+         2) in inode order
+         3) in physical block order
+
+         Output a report of the various times taken to read the files.
+
+         This command requires elevated privileges to run and will spike the
+         I/O of your machine. Run with care.
+       txt
+       option( :percentage ) {
+         description "What random percentage of input files to select"
+         argument :required
+         default "10"
+         validate { |p|
+           pi = Float(p)
+           (pi > 0) and (pi <= 100)
+         }
+         cast :float
+       }
+       mixin :option_log_level
+       mixin :option_log_file
+       mixin :option_output
+       mixin :argument_filelist
+       mixin :option_error_filelist
+
+       run { Cli.run_command_with_params( 'test', params ) }
+     }
+
+     ## --- Mixins ---
+     mixin :argument_filelist do
+       argument( 'filelist' ) {
+         description "The files containing filenames"
+         arity '*'
+         default [ $stdin ]
+         required false
+       }
+     end
+
+     mixin :option_log_level do
+       option( 'log-level' ) do
+         description "The verbosity of logging, one of [ #{::Logging::LNAMES.map { |l| l.downcase }.join(', ')} ]"
+         argument :required
+         default 'info'
+         validate { |l| %w[ debug info warn error fatal off ].include?( l.downcase ) }
+       end
+     end
+
+     mixin :option_log_file do
+       option( 'log-file' ) do
+         description "Log to this file instead of stderr"
+         argument :required
+         validate { |f| File.directory?( File.dirname( File.expand_path( f ) ) ) }
+       end
+     end
+
+     mixin :option_output do
+       option( 'output' ) do
+         description "Where to write the output"
+         argument :required
+         validate { |f| File.directory?( File.dirname( File.expand_path( f ) ) ) }
+       end
+     end
+
+     mixin :option_error_filelist do
+       option( 'error-filelist' ) do
+         description "Write all the files from the filelist that had errors to this file"
+         argument :required
+         validate { |f| File.directory?( File.dirname( File.expand_path( f ) ) ) }
+       end
+     end
+   }
+
+
+   #
+   # Convert the Parameters::List passed in from Main into a plain Hash,
+   # flattening single-element value arrays to their single value.
+   #
+   def Cli.params_to_hash( params )
+     ( hash = params.to_hash ).keys.each do |key|
+       v = hash[key].values
+       v = v.first if v.size <= 1
+       hash[key] = v
+     end
+     return hash
+   end
+
+   def Cli.run_command_with_params( command, params )
+     ::Readorder::Runner.new( Cli.params_to_hash( params ) ).run( command )
+   end
+ end