RubyGems - digestif - Versions diffs - 1.0.1 - Mend

digestif 1.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (18)

data/Gemfile +5 -0
data/Gemfile.lock +40 -0
data/History.txt +0 -0
data/LICENSE +19 -0
data/README.textile +58 -0
data/Rakefile +17 -0
data/bin/digestif +7 -0
data/features/basic.feature +22 -0
data/features/fast_hash.feature +42 -0
data/features/input.feature +33 -0
data/features/step_definitions/digest_steps.rb +22 -0
data/features/step_definitions/stack_trace_output_steps.rb +13 -0
data/features/support/env.rb +3 -0
data/lib/digestif.rb +1 -0
data/lib/digestif/cli.rb +101 -0
data/lib/digestif/hasher.rb +28 -0
data/lib/digestif/version.rb +9 -0
metadata +120 -0

data/Gemfile ADDED

@@ -0,0 +1,5 @@
+source "http://rubygems.org"
+gem "cucumber"
+gem "aruba"
+gem "rake"

data/Gemfile.lock ADDED

@@ -0,0 +1,40 @@
+GEM
+  remote: http://rubygems.org/
+  specs:
+    aruba (0.3.2)
+      childprocess (~> 0.1.6)
+      cucumber (~> 0.10.0)
+      rspec (~> 2.3.0)
+    builder (2.1.2)
+    childprocess (0.1.6)
+      ffi (~> 0.6.3)
+    cucumber (0.10.0)
+      builder (>= 2.1.2)
+      diff-lcs (~> 1.1.2)
+      gherkin (~> 2.3.2)
+      json (~> 1.4.6)
+      term-ansicolor (~> 1.0.5)
+    diff-lcs (1.1.2)
+    ffi (0.6.3)
+      rake (>= 0.8.7)
+    gherkin (2.3.3)
+      json (~> 1.4.6)
+    json (1.4.6)
+    rake (0.8.7)
+    rspec (2.3.0)
+      rspec-core (~> 2.3.0)
+      rspec-expectations (~> 2.3.0)
+      rspec-mocks (~> 2.3.0)
+    rspec-core (2.3.1)
+    rspec-expectations (2.3.0)
+      diff-lcs (~> 1.1.2)
+    rspec-mocks (2.3.0)
+    term-ansicolor (1.0.5)
+PLATFORMS
+  ruby
+DEPENDENCIES
+  aruba
+  cucumber
+  rake

data/History.txt ADDED

File without changes

data/LICENSE ADDED

@@ -0,0 +1,19 @@
+Copyright (c) 2011 Andrew Roberts
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.

data/README.textile ADDED

@@ -0,0 +1,58 @@
+h1. Digestif
+An aid for creating hash digests of large files
+h2. Synopsis
+Digestif lets you create fast checksums of large files by
+skipping sections of the file.  It was created with compressed media
+files in mind, which generally have such a high information density
+that we can get away with a checksum that doesn't actually consider all
+the bits.  Someday I'd like to understand the likelihood-of-collision
+implications for specific compression algorithms (mp3, h.264, xvid, et al.),
+but right now I'm going to settle for guessing at where "good enough for me"
+might lie.
+One side-effect of this approach is that the error-detecting nature of
+digests is, of course, lost.  This is really more of an inescapable artifact
+of the problem we're trying to solve.  To create a hash of a really large
+file, the biggest bottleneck with modern computers is streaming
+5-10 gigs off of the disk.  The actual checksumming is not hard.
+By looking at less data, we speed up the hash process immensely, but
+we become vulnerable to undetected file corruption.  Because the
+purpose I have in mind for this tool is identity checking, not
+corruption detection, this issue is not a problem for me.
+h2. Installation TODO
+h2. Usage
+Just like md5 on the command line, but it only works on files, not on
+streaming data (can't seek a stream).
+<pre>
+digestif some_large_file
+</pre>
+Since this program is designed to get around file limitations specifically, it
+didn't make sense for me to invest in making streams work.
+For a detailed look at the options, see
+<pre>
+digestif --help
+</pre>
+h2. Motivation
+I wrote digestif to solve a problem for a media catalogue I was working on.
+I wanted a filename-independent way to evaluate whether or not a file was in
+the catalogue yet, but the files were so large that streaming the whole file
+off of the hard drive was too slow for the response time I was hoping for.
+(Interested parties, I was getting 5 gigs hashed using md5 in about 2.4
+minutes.)
+h2. Author
+Copyright 2011 Andrew Roberts

data/Rakefile ADDED

@@ -0,0 +1,17 @@
+require 'rubygems'
+require 'cucumber'
+require 'cucumber/rake/task'
+require 'rake/gempackagetask'
+desc 'Default: run the cucumber features.'
+task :default => :features
+Cucumber::Rake::Task.new(:features) do |t|
+  t.cucumber_opts = "features --format pretty"
+end
+eval("$specification = begin; #{IO.read('digestif.gemspec')}; end")
+Rake::GemPackageTask.new($specification) do |package|
+  package.need_zip = true
+  package.need_tar = true
+end

data/bin/digestif ADDED

@@ -0,0 +1,7 @@
+#!/usr/bin/env ruby
+lib_dir = File.expand_path(File.join(File.dirname(__FILE__), '..', 'lib'))
+$LOAD_PATH.unshift(lib_dir) unless $LOAD_PATH.include?(lib_dir)
+require 'digestif'
+Digestif::CLI.run(ARGV)

data/features/basic.feature ADDED

@@ -0,0 +1,22 @@
+Feature: Basic application operation
+  In order to compare files based on content
+  As a user
+  I want to be told the hash digest of files
+  Background:
+    Given a file named "test_file" with:
+    """
+    This file is a test file for md5 to hash
+    """
+    And a file named "test_file_2" with:
+    """
+    This is another test file.
+    """
+  Scenario: Hashing a file
+    When I run "digestif -d md5 test_file"
+    Then the output should be a digest
+  Scenario: Hashing 2 files
+    When I run "digestif -d sha1 test_file test_file_2"
+    Then the output should be 2 digests

data/features/fast_hash.feature ADDED

@@ -0,0 +1,42 @@
+Feature: Hash files quickly
+  In order to hash large files quickly
+  As a user
+  I want to ensure that the hasher does not look at the whole file
+  Scenario: Changing a file without affecting the hash
+    # Given a file named "input" with:
+    # """
+    # This "feature" is really more of an inescapable artifact of the
+    # problem we're trying to solve.  To create a hash of a really large
+    # file, the biggest bottleneck with modern computers is streaming
+    # 5-10 gigs off of the disk.  The actual checksumming is not hard.
+    # By looking at less data, we speed up the hash process immensely, and
+    # we incur the cost of vulnerability of file corruption.  Because the
+    # purpose I have in mind for this tool is identity checking, not
+    # corruption detection, this issue is not a problem for me.
+    # """
+    # And a file named "modified" with:
+    # """
+    # Th     ea    "     ea    mo    f     ne    ab    rt    t     he
+    # problem we're trying to solve.  To create a hash of a really large
+    # file, the biggest bottleneck with modern computers is streaming
+    # 5-10 gigs off of the disk.  The actual checksumming is not hard.
+    # By looking at less data, we speed up the hash process immensely, and
+    # we incur the cost of vulnerability of file corruption.  Because the
+    # purpose I have in mind for this tool is identity checking, not
+    # corruption detection, this issue is not a problem for me.
+    # """
+    Given a file named "input" with:
+    """
+    two words, and not a moment too soon
+    """
+    And a file named "modified" with:
+    """
+    tw0000rd0000nd0000 a0000en0000o 0000
+    """
+    When I run "digestif -s 4 -r 2 input"
+    And I run "digestif -s 4 -r 2 modified"
+    Then the output should be 2 identical digests
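
Reviewer's sketch (not part of the gem): the scenario above passes because, with `-s 4 -r 2`, only two of every six bytes ever reach the digest. The helper `sampled_chunks` below is a hypothetical name introduced purely for illustration. It imitates the read-then-skip pattern on an in-memory string rather than a file, and it assumes `read_size + seek_size` is positive.

```ruby
# Hypothetical helper (not in the gem): list the chunks a read/skip loop
# would feed to the digest, working on a string instead of a file.
# Assumes read_size + seek_size > 0, otherwise the loop would not advance.
def sampled_chunks(data, read_size, seek_size)
  bytes = data.b                      # treat the input as raw bytes
  chunks = []
  pos = 0
  while pos < bytes.bytesize
    chunks << bytes.byteslice(pos, read_size)
    pos += read_size + seek_size      # read a chunk, then skip ahead
  end
  chunks
end

input    = "two words, and not a moment too soon"
modified = "tw0000rd0000nd0000 a0000en0000o 0000"

p sampled_chunks(input, 2, 4)
p sampled_chunks(input, 2, 4) == sampled_chunks(modified, 2, 4)
```

Both fixtures yield the same six two-byte chunks, so any digest computed over those chunks alone cannot tell the files apart.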

data/features/input.feature ADDED

@@ -0,0 +1,33 @@
+Feature: Application input handling
+  In order to understand what was wrong with my input
+  As a user
+  I should be presented with sensical error messages
+  Scenario: program invoked with bad options
+    When I run "digestif --campari"
+    Then the output should not contain a stack trace
+    And there should be an error message
+    And the exit status should not be 0
+    When I run "digestif -d campari"
+    Then the output should not contain a stack trace
+    And there should be an error message
+    And the exit status should not be 0
+  Scenario: Program invoked on nonexistent file
+    When I run "digestif nonexistent_file"
+    Then the output should not contain a stack trace
+    And there should be an error message
+    And the exit status should not be 0
+  Scenario: Program invoked on existent and nonexistent files, together
+    Given a file named "test_file" with:
+    """
+    test data inside
+    """
+    When I run "digestif test_file test_file_2"
+    Then there should be an error message
+    And the output should not contain a stack trace
+    And the output should not contain a digest
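
Reviewer's sketch (not part of the gem): the bad-option scenarios above rely on `OptionParser` raising a `ParseError` subclass both for an unknown flag and for a value outside an enumerated list. The helper `parse_digest` and its return shape are assumptions made for illustration. Only the `-d` option and the `digestif: ` message prefix come from the package.

```ruby
require 'optparse'

# Hypothetical helper: returns [digest, nil] on success or
# [nil, error_message] when OptionParser rejects the arguments.
def parse_digest(argv)
  digest = :sha1
  parser = OptionParser.new do |p|
    # The array restricts the accepted values to the listed symbols.
    p.on("-d", "--digest DIGEST", [:md5, :sha1]) { |d| digest = d }
  end
  parser.parse!(argv.dup)             # dup so the caller's array is untouched
  [digest, nil]
rescue OptionParser::ParseError => e
  [nil, "digestif: #{e.message}"]
end

p parse_digest(["-d", "md5"])
p parse_digest(["-d", "campari"])
p parse_digest(["--campari"])
```

Turning the exception into a one-line message, rather than letting it propagate, is what keeps a stack trace out of the output.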

data/features/step_definitions/digest_steps.rb ADDED

@@ -0,0 +1,22 @@
+require 'aruba/api'
+Then /^the output should be (a|\d+) digest(?:s?)$/ do |count|
+  count = 1 if count == 'a'
+  count = count.to_i
+  lines = all_output.split("\n")
+  lines.size.should == count
+  lines.each { |line| line.should match(/^[a-z0-9]+$/) }
+end
+Then /^the output should not contain a digest$/ do
+  all_output.split("\n").each { |l| l.should_not match(/^[a-z0-9]+$/) }
+end
+Then /^the output should be (\d+) identical digests$/ do |count|
+  count = count.to_i
+  lines = all_output.split("\n")
+  lines.size.should == count
+  lines.each { |line| line.should == lines[0] }
+end

data/features/step_definitions/stack_trace_output_steps.rb ADDED

@@ -0,0 +1,13 @@
+require 'aruba/api'
+Then /^the output should not contain a stack trace$/ do
+  all_output.should_not match(/from \/.+:\d+:in `\w+'/)
+end
+Then /^there should be an error message$/ do
+  all_stderr.should match(/^digestif: /)
+end
+Then /^the output should be empty$/ do
+  all_output.should match(/\A\z/)
+end

data/features/support/env.rb ADDED

@@ -0,0 +1,3 @@
+require 'rubygems'
+require 'aruba/cucumber'

data/lib/digestif.rb ADDED

@@ -0,0 +1 @@
+require 'digestif/cli'

data/lib/digestif/cli.rb ADDED

@@ -0,0 +1,101 @@
+require 'optparse'
+require 'ostruct'
+require 'digestif/hasher'
+require 'digestif/version'
+module Digestif
+  class CLI
+    def self.run(args)
+      new(args).run
+    end
+    attr_accessor :args, :options
+    def initialize(args)
+      self.args = args
+      self.options = parse_options
+    end
+    def run
+      # validate files first - fail fast
+      args.each do |file|
+        unless File.exist?(file)
+          error "file not found: #{file}"
+        end
+      end
+      # engage hasher
+      args.each do |file|
+        puts Hasher.new(file, options).digest
+      end
+    end
+    def parse_options
+      # defaults
+      options = OpenStruct.new
+      options.digest = :sha1
+      options.seek_size = 1024
+      options.read_size = 512
+      parser = OptionParser.new do |p|
+        p.banner = "Usage: digestif [options] filename"
+        p.separator ""
+        p.separator "Options:"
+        p.separator ""
+        p.on("-d", "--digest DIGEST", [:md5, :sha1],
+             "Digest algorithm to use.  Currently supported:",
+             "  md5", "  sha1", ' ') do |digest|
+          options.digest = digest
+        end
+        p.on("-r", "--read-size SIZE", Integer,
+             "Size of chunk to read, in bytes " +
+             "(#{options.read_size})") do |size|
+          options.read_size = size
+        end
+        p.on("-s", "--seek-size SIZE", Integer,
+             "Size of chunk to skip after each read, in bytes " +
+             "(#{options.seek_size})") do |size|
+          options.seek_size = size
+        end
+        p.separator ""
+        p.separator "Common options:"
+        p.separator ""
+        p.on_tail("-v", "--version", "Show version") do
+          puts Digestif.version_string
+          exit 0
+        end
+        p.on_tail("-h", "--help", "Show this message") do
+          puts p
+          exit 0
+        end
+      end
+      begin
+        parser.parse!(args)
+      rescue OptionParser::ParseError => e
+        error e
+      end
+      options
+    end
+    def error(error_obj_or_str, code = 1)
+      if error_obj_or_str.respond_to?('to_s')
+        error_str = error_obj_or_str.to_s
+      else
+        error_str = error_obj_or_str.inspect
+      end
+      $stderr.puts "digestif: #{error_str}"
+      exit code
+    end
+  end
+end
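
Reviewer's sketch (not part of the gem): with the defaults set in `parse_options` (read 512 bytes, skip 1024), roughly one third of a large file is hashed. The hypothetical helper `bytes_hashed` below works out the exact count for a given file size, assuming the read-then-skip loop stops at end of file and that `read_size + seek_size` is positive.

```ruby
# Hypothetical helper: how many bytes a read/skip loop feeds to the digest
# for a file of `size` bytes. Assumes read_size + seek_size > 0.
def bytes_hashed(size, read_size, seek_size)
  stride = read_size + seek_size          # one read plus one skip
  full, rest = size.divmod(stride)        # whole strides, then the tail
  full * read_size + [rest, read_size].min
end

size = 5 * 1024**3                        # a 5 GiB file, as in the README
hashed = bytes_hashed(size, 512, 1024)
puts format("%.1f%% of the file is hashed", 100.0 * hashed / size)
```

The tail term matters only for the last partial stride: a short final read contributes whatever bytes remain, up to `read_size`.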

data/lib/digestif/hasher.rb ADDED

@@ -0,0 +1,28 @@
+require 'digest/sha1'
+require 'digest/md5'
+module Digestif
+  class Hasher
+    attr_accessor :options, :filename
+    def initialize(filename, options)
+      self.filename = filename
+      self.options = options
+    end
+    def digest
+      hasher = Digest.const_get(options.digest.to_s.upcase).new
+      File.open(filename, 'rb') do |f|
+        until f.eof
+          hasher.update(f.read(options.read_size))
+          f.seek(options.seek_size, IO::SEEK_CUR)
+        end
+      end
+      hasher.hexdigest
+    end
+  end
+end
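
Reviewer's sketch (not part of the gem): a standalone version of the read/skip loop in `Hasher#digest`, using only the standard library so it can be tried without installing the package. The names `sampled_digest` and `digest_of`, the keyword arguments, and the argument guard are assumptions added for illustration. The released code takes its sizes from the options object and does not validate them, so a read size and seek size of zero would presumably never reach end of file.

```ruby
require 'digest/sha1'
require 'tempfile'

# Hypothetical standalone helper mirroring the read/skip loop above.
def sampled_digest(path, read_size: 512, seek_size: 1024, algorithm: Digest::SHA1)
  # Guard added in this sketch only; it ensures the loop always advances.
  unless read_size > 0 && seek_size >= 0
    raise ArgumentError, "read_size must be positive and seek_size non-negative"
  end
  hasher = algorithm.new
  File.open(path, 'rb') do |f|
    until f.eof?
      hasher.update(f.read(read_size))      # hash one chunk
      f.seek(seek_size, IO::SEEK_CUR)       # skip the next chunk
    end
  end
  hasher.hexdigest
end

# Write content to a temporary file and return its sampled digest.
def digest_of(content, **opts)
  Tempfile.create('digestif-sketch') do |f|
    f.binmode
    f.write(content)
    f.flush
    sampled_digest(f.path, **opts)
  end
end

a = digest_of("two words, and not a moment too soon", read_size: 2, seek_size: 4)
b = digest_of("tw0000rd0000nd0000 a0000en0000o 0000", read_size: 2, seek_size: 4)
puts a == b
```

The two strings are the fixtures from features/fast_hash.feature; they differ only in bytes the loop skips.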

data/lib/digestif/version.rb ADDED

@@ -0,0 +1,9 @@
+module Digestif
+  def self.version
+    "1.0.1"
+  end
+  def self.version_string
+    "digestif version #{self.version}"
+  end
+end

metadata ADDED

@@ -0,0 +1,120 @@
+--- !ruby/object:Gem::Specification
+name: digestif
+version: !ruby/object:Gem::Version
+  hash: 21
+  prerelease: false
+  segments:
+  - 1
+  - 0
+  - 1
+  version: 1.0.1
+platform: ruby
+authors:
+- Andrew Roberts
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2011-01-12 00:00:00 -05:00
+default_executable: digestif
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: cucumber
+  prerelease: false
+  requirement: &id001 !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        hash: 3
+        segments:
+        - 0
+        version: "0"
+  type: :development
+  version_requirements: *id001
+- !ruby/object:Gem::Dependency
+  name: aruba
+  prerelease: false
+  requirement: &id002 !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        hash: 3
+        segments:
+        - 0
+        version: "0"
+  type: :development
+  version_requirements: *id002
+description: |-
+  Digestif lets you create fast checksums of
+          large files by skipping sections of the file.  It was created
+          with compressed media files in mind, which generally have such
+          a high information density that we can get away with a checksum
+          that doesn't actually consider all the bits.
+email: adroberts@gmail.com
+executables:
+- digestif
+extensions: []
+extra_rdoc_files: []
+files:
+- Gemfile
+- Gemfile.lock
+- History.txt
+- LICENSE
+- Rakefile
+- README.textile
+- lib/digestif/cli.rb
+- lib/digestif/hasher.rb
+- lib/digestif/version.rb
+- lib/digestif.rb
+- features/basic.feature
+- features/fast_hash.feature
+- features/input.feature
+- features/step_definitions/digest_steps.rb
+- features/step_definitions/stack_trace_output_steps.rb
+- features/support/env.rb
+- bin/digestif
+has_rdoc: true
+homepage: http://github.com/aroberts/digestif
+licenses: []
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      hash: 3
+      segments:
+      - 0
+      version: "0"
+required_rubygems_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      hash: 3
+      segments:
+      - 0
+      version: "0"
+requirements: []
+rubyforge_project:
+rubygems_version: 1.3.7
+signing_key:
+specification_version: 3
+summary: Easy digest generation for large files
+test_files:
+- features/basic.feature
+- features/fast_hash.feature
+- features/input.feature
+- features/step_definitions/digest_steps.rb
+- features/step_definitions/stack_trace_output_steps.rb
+- features/support/env.rb