RubyGems - encoding_sampler - Versions diffs - 0.3.0 - Mend

encoding_sampler 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (18)

data/.gitignore +21 -0
data/.rspec +2 -0
data/.yardopts +6 -0
data/CHANGELOG.md +10 -0
data/Gemfile +3 -0
data/LICENSE +22 -0
data/README.md +113 -0
data/Rakefile +2 -0
data/encoding_sampler.gemspec +32 -0
data/lib/encoding_sampler.rb +1 -0
data/lib/encoding_sampler/diff_callbacks.rb +66 -0
data/lib/encoding_sampler/sampler.rb +156 -0
data/lib/encoding_sampler/version.rb +3 -0
data/spec/sampler_spec.rb +525 -0
data/spec/sampler_with_real_files_spec.rb +42 -0
data/spec/sampler_with_selected_file_spec.rb +32 -0
data/spec/spec_helper.rb +15 -0
metadata +209 -0

data/.gitignore ADDED

@@ -0,0 +1,21 @@
+*.gem
+*.rbc
+.bundle
+.config
+.yardoc
+Gemfile.lock
+InstalledFiles
+_yardoc
+coverage
+doc/
+lib/bundler/man
+pkg
+rdoc
+spec/files
+spec/reports
+test/tmp
+test/version_tmp
+tmp
+*.sublime*
+.rbenv-version
+.DS_Store

data/.rspec ADDED

@@ -0,0 +1,2 @@
+--color
+--format progress

data/.yardopts ADDED

@@ -0,0 +1,6 @@
+--markup markdown
+--title "encoding_sampler"
+--readme README.md
+-
+CHANGELOG.md
+LICENSE

data/CHANGELOG.md ADDED

@@ -0,0 +1,10 @@
+# master
+## 0.3.0 / 2013-3-19
+* Deprecated #unique_valid_encodings, changed name to #unique_valid_encoding_groups for clarity
+* Added YARD-generated documentation
+## < 0.2.1
+* Abandon hope, all ye who enter here.  Grievous errors undiscovered due to bad test rig until v 0.2.

data/Gemfile ADDED

@@ -0,0 +1,3 @@
+source 'https://rubygems.org'
+gemspec

data/LICENSE ADDED

@@ -0,0 +1,22 @@
+Copyright (c) 2012 Roll No Rocks LLC
+MIT License
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,113 @@
+# EncodingSampler
+EncodingSampler helps solve the problem of what to do when the character encoding is unknown,
+for example when a user is uploading a file but has no idea of its encoding (or typically, even what "character encoding" means.)
+EncodingSampler extracts a concise set of samples from the selected file for display so the user can choose wisely.
+For a given file, some encodings may be dismissed out of hand because they would result in invalid
+characters or sequences.  However, in the general case you have to let the user see the differences and choose.
+For example, it's easy to determine that an 8-bit character is _not_ encoded as US_ASCII because it is simply invalid,
+but it's impossible to tell whether the character __0xA4__ should be displayed as a
+generic currency symbol (&curren;) using ISO-8859-1 or as a Euro symbol (&euro;) using ISO-8859-15
+without asking the user.
+EncodingSampler solves the problem by collecting a reasonably (but not rigorously) minimal sample by reading the file line-by-line.  Lines that demonstrate the difference between any pair of encodings are noted, and when a line is encountered that cannot be "decoded" with a specific encoding, that encoding is considered invalid and removed from the running.  When the sampling is complete, each encoding is grouped with other encoding(s) that yield identical decoding results.
+There are three possible results:
+* There may be no valid encodings.  This could mean that none of the proposed encodings match the file, but often it means the file is either malformed, or is not a text file.  This is generally what you will see if you try to determine the encoding of a non-text binary file.
+* There may be only one group of valid encodings, all of which yield the same decoded data.  In this case there are no samples to look at because there are no differences to show.  A straight ASCII file may yield this result for many encodings.
+* There may be more than one group of valid encodings, each of which yields different decoded data.  This is the interesting case!  Samples will then be available so a user can visually determine which is the correct interpretation.  The "diff-lcs" gem is used to diff the samples, providing a simple way to highlight the (usually few) differences.
+## Performance
+Because this method works by reading file lines and "decoding" each line with all the remaining valid encodings, it can be slow. For most files, the number of line "decodings" will equal the number of lines in the file times the number of encodings tested, and at this writing, Ruby 1.9.3 supports 168 encodings!  It's recommended to try and use a much smaller set.
+## Installation
+Add this line to your application's Gemfile:
+    gem 'encoding_sampler'
+And then execute:
+    $ bundle
+Or install it yourself as:
+    $ gem install encoding_sampler
+## Usage
+Creating a new `EncodingSampler::Sampler` reads the file and completes the analysis.
+```ruby
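+    EncodingSampler::Sampler.new(file_name, encodings, diff_options = {})
+    # encodings: array of encoding names to test, e.g. ['UTF-8', 'ISO-8859-1']
+    # diff_options: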
+    #  :difference_start => inserted into the diffed samples to mark the start of a "different" section
+    #  :difference_end => inserted into the diffed samples to mark the end of a "different" section
+```
+Once you have an instance of an EncodingSampler, you can use the object's instance methods to determine which encodings are valid, which are unique (that is, which yield unique results,) and get samples to compare the differences visually.  For example, imagining you have a file that turns out to be ISO-8859-15 (which includes the Euro sign,) you might get these results:
+```ruby
+    sampler = EncodingSampler::Sampler.new(
+      'some/file/name.csv',
+      ['ASCII-8BIT', 'UTF-8', 'ISO-8859-1', 'ISO-8859-2', 'ISO-8859-15'])
+    sampler.valid_encodings
+            # ["ASCII-8BIT", "ISO-8859-1", "ISO-8859-2", "ISO-8859-15"]
+    sampler.unique_valid_encoding_groups
+            # [["ASCII-8BIT"], ["ISO-8859-1", 'ISO-8859-2'], ["ISO-8859-15"]]
+    sampler.sample('ASCII-8BIT')
+            # ["?ABCDEFabcdef0123456789?ABCDEFabcdef0123456789?"]
+    sampler.sample('ISO-8859-1')
+            # ["¤ABCDEFabcdef0123456789¤ABCDEFabcdef0123456789¤"]
+    sampler.sample('ISO-8859-15')
+            # ["€ABCDEFabcdef0123456789€ABCDEFabcdef0123456789€"]
+    sampler.samples(["ASCII-8BIT", "ISO-8859-1", "ISO-8859-15"])
+            # {"ASCII-8BIT"=>["?ABCDEFabcdef0123456789?ABCDEFabcdef0123456789?"],
+            #   "ISO-8859-1"=>["¤ABCDEFabcdef0123456789¤ABCDEFabcdef0123456789¤"],
+            #   "ISO-8859-15"=>["€ABCDEFabcdef0123456789€ABCDEFabcdef0123456789€"]}
+    sampler.diffed_samples(["ASCII-8BIT", "ISO-8859-1", "ISO-8859-15"])
+            # {"ASCII-8BIT"=>["<span class=\"difference\">?</span>ABCDEFabcdef0123456789<span class=\"difference\">?</span>ABCDEFabcdef0123456789<span class=\"difference\">?</span>"],
+            #   "ISO-8859-1"=>["<span class=\"difference\">¤</span>ABCDEFabcdef0123456789<span class=\"difference\">¤</span>ABCDEFabcdef0123456789<span class=\"difference\">¤</span>"],
+            #   "ISO-8859-15"=>["<span class=\"difference\">€</span>ABCDEFabcdef0123456789<span class=\"difference\">€</span>ABCDEFabcdef0123456789<span class=\"difference\">€</span>"]}
+```
+Notes:
+* Valid encodings don't include UTF-8, indicating it was invalid for one or more lines in the file
+* Results show that ISO-8859-1 and ISO-8859-2 decoded the sample file exactly the same, so they are grouped together in
+the unique_valid_encoding_groups.
+In raw form the `diffed_samples` don't seem impressive, but they can be displayed via HTML, for example, to highlight and clarify the differences.
+<table>
+<tr>
+  <th>ASCII-8BIT</th>
+  <td><span style="font-weight:bold; color:#ff0000;">?</span>ABCDEFabcdef0123456789<span style="font-weight:bold; color:#ff0000;">?</span>ABCDEFabcdef0123456789<span style="font-weight:bold; color:#ff0000;">?</span></td>
+</tr>
+<tr>
+  <th>ISO-8859-1</th>
+  <td><span style="font-weight:bold; color:#ff0000;">¤</span>ABCDEFabcdef0123456789<span style="font-weight:bold; color:#ff0000;">¤</span>ABCDEFabcdef0123456789<span style="font-weight:bold; color:#ff0000;">¤</span></td>
+</tr>
+<tr>
+  <th>ISO-8859-15</th>
+  <td><span style="font-weight:bold; color:#ff0000;">€</span>ABCDEFabcdef0123456789<span style="font-weight:bold; color:#ff0000;">€</span>ABCDEFabcdef0123456789<span style="font-weight:bold; color:#ff0000;">€</span></td>
+</tr>
+</table>
+## Contributing
+EncodingSampler provides a functional but not-so-elegant solution.
+I'd love to see improvements or alternate ideas in regard to the concept, the algorithms, the interface, etc.
+1. Fork it
+2. Create your feature branch (`git checkout -b my-new-feature`)
+3. Commit your changes (`git commit -am 'Added some feature'`)
+4. Push to the branch (`git push origin my-new-feature`)
+5. Create new Pull Request
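
The README's 0xA4 example can be reproduced with plain Ruby. The following is a minimal sketch, not the gem's public API: `decode_bytes` is a hypothetical helper, loosely modelled on the force-encoding and validity check that the gem's sampler applies to each line. It shows that the same byte is rejected by US-ASCII and UTF-8 but decodes to two different characters under ISO-8859-1 and ISO-8859-15.

```ruby
# Hypothetical helper for illustration (not part of the gem's public API):
# decode raw bytes under a candidate encoding, or return nil when the bytes
# are not valid in that encoding.
def decode_bytes(bytes, encoding)
  candidate = bytes.dup.force_encoding(encoding)
  candidate.valid_encoding? ? candidate.encode('UTF-8') : nil
end

raw = "\xA4 100".b # byte 0xA4 followed by plain ASCII text

results = %w(US-ASCII UTF-8 ISO-8859-1 ISO-8859-15).each_with_object({}) do |name, hash|
  hash[name] = decode_bytes(raw, name)
end

results.each { |name, text| puts "#{name}: #{text.inspect}" }
# US-ASCII: nil
# UTF-8: nil
# ISO-8859-1: "¤ 100"
# ISO-8859-15: "€ 100"
```

Both surviving encodings are valid for the byte, so only a human looking at the two decoded strings can say which one is right.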

data/Rakefile ADDED

@@ -0,0 +1,2 @@
+#!/usr/bin/env rake
+require "bundler/gem_tasks"

data/encoding_sampler.gemspec ADDED

@@ -0,0 +1,32 @@
+# -*- encoding: utf-8 -*-
+require File.expand_path('../lib/encoding_sampler/version', __FILE__)
+Gem::Specification.new do |s|
+  s.authors       = ["Tom Wilson"]
+  s.email         = ["tom@rollnorocks.com"]
+  s.summary       = %q{Encoding Sampler extracts a concise sample from a text file to simplify selecting the right encoding.}
+  s.description   = %q{EncodingSampler helps solve the problem of what to do when the character encoding is unknown, for example when a user is uploading a file but has no idea of its encoding (or typically, even what "character encoding" means.) EncodingSampler extracts a concise set of samples from the selected file for display so the user can choose wisely.}
+  s.homepage      = ""
+  s.files         = `git ls-files`.split($\)
+  s.executables   = s.files.grep(%r{^bin/}).map{ |f| File.basename(f) }
+  s.test_files    = s.files.grep(%r{^(test|spec|features)/})
+  s.name          = "encoding_sampler"
+  s.require_paths = ["lib"]
+  s.version       = EncodingSampler::VERSION
+  s.rdoc_options      = %w(--line-numbers --inline-source --title encoding_sampler --main README.md)
+  s.extra_rdoc_files  = %w(README.md CHANGELOG.md LICENSE)
+  s.add_dependency('diff-lcs', '1.1.3')
+  s.add_development_dependency("rake")
+  s.add_development_dependency("debugger")
+  s.add_development_dependency("rspec")
+  s.add_development_dependency("fakefs")
+  s.add_development_dependency("simplecov")
+  s.add_development_dependency("yard")
+  s.add_development_dependency("redcarpet")
+end

data/lib/encoding_sampler.rb ADDED

@@ -0,0 +1 @@
+require "encoding_sampler/sampler"

data/lib/encoding_sampler/diff_callbacks.rb ADDED

@@ -0,0 +1,66 @@
+require "cgi"
+module EncodingSampler
+  # Simple formatter to override Diff::LCS::DiffCallbacks in diff-lcs Gem to generate diffed output.
+  class DiffCallbacks
+    attr_accessor :output
+    # @!attribute output
+    #   @return [String] Storage for the resultant diffed output.
+    attr_reader :difference_start
+    # @!attribute [r] difference_start
+    #   @return [String] The string inserted in the diff results __before__ a segment where the samples differ.
+    #     Set as option on initialization.
+    attr_reader :difference_end
+    # @!attribute [r] difference_end
+    #   @return [String] The string inserted in the diff results __after__ a segment where the samples differ.
+    #     Set as option on initialization.
+    # @return [DiffCallbacks] Returns a new instance of EncodingSampler::DiffCallbacks.
+    # @param [Hash] options
+    #   Valid keys are :difference_start and :difference_end.
+    # @see #difference_start
+    # @see #difference_end
+    def initialize(output, options = {})
+      @output = output
+      options ||= {}
+      @difference_start = options[:difference_start] ||= '<span class="difference">'
+      @difference_end = options[:difference_end] ||= '</span>'
+    end
+    # Called when both strings are the same
+    def match(event)
+      output_matched event.old_element
+    end
+    # Called when there is a substring in A that isn't in B
+    def discard_a(event)
+      output_changed event.old_element
+    end
+    # Called when there is a substring in B that isn't in A
+    def discard_b(event)
+      output_changed event.new_element
+    end
+  private
+    def output_matched(element)
+      element = CGI.escapeHTML(element.chomp)
+      @output << "#{element}" unless element.empty?
+    end
+    def output_changed(element)
+      element = CGI.escapeHTML(element.chomp)
+      return if element.empty?
+      @output << "#{@difference_start}#{element}#{@difference_end}"
+      # Join adjacent changed sections (gsub! so that @output is modified in place)
+      @output.gsub! "#{@difference_end}#{@difference_start}", ''
+    end
+  end
+end
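
The idea of wrapping differing segments in start and end markers and then joining adjacent marked sections can be sketched without diff-lcs. This is an illustration under stated assumptions: `mark_differences` and its `[text, changed]` input format are hypothetical and do not appear in the gem. Only the default marker strings are taken from the class above.

```ruby
require 'cgi'

# Hypothetical helper for illustration; not part of the gem.
# Each segment is a [text, changed] pair. Changed segments are wrapped in
# markers, then runs of adjacent markers are merged into one section.
def mark_differences(segments, start_marker = '<span class="difference">', end_marker = '</span>')
  output = ''.dup
  segments.each do |text, changed|
    escaped = CGI.escapeHTML(text)
    next if escaped.empty?
    output << (changed ? "#{start_marker}#{escaped}#{end_marker}" : escaped)
  end
  # String#gsub returns a new string; String#gsub! would modify the receiver.
  output.gsub("#{end_marker}#{start_marker}", '')
end

marked = mark_differences([['<', true], ['€', true], ['ABC', false], ['?', true]])
puts marked
# <span class="difference">&lt;€</span>ABC<span class="difference">?</span>
```

The first two changed segments end up inside a single marked section, while the changed segment after the unchanged text keeps its own markers.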

data/lib/encoding_sampler/sampler.rb ADDED

@@ -0,0 +1,156 @@
+require 'encoding_sampler/version'
+require 'encoding_sampler/diff_callbacks'
+require 'diff-lcs'
+module EncodingSampler
+    # @!attribute [r] filename
+    # @!attribute [r] unique_valid_encoding_groups
+    class Sampler
+    # Full name of the target file used to create the sample.
+    # @return [String]
+    attr_reader :filename
+    # Groups of valid encoding names, such that the encodings in a group all result in the same decoding for the target file.
+    # @example When ISO-8859-1 and ISO-8859-2 decode the target file in exactly the same way, but unlike ISO-8859-15,
+    #   [["ISO-8859-1", 'ISO-8859-2'], ["ISO-8859-15"]]
+    # @return [Array]
+    attr_reader :unique_valid_encoding_groups
+    # Attribute renamed for clarity.
+    # @deprecated Use {#unique_valid_encoding_groups} instead.
+    def unique_valid_encodings
+      unique_valid_encoding_groups
+    end
+    # All valid encodings.
+    # @return [Array] Names of encodings that return valid results for the entire file.
+    def valid_encodings
+      unique_valid_encoding_groups.flatten
+    end
+    # Sample file lines, decoded by _encoding_.
+    # @return [Array]
+    def sample(encoding)
+      @binary_samples.values.map {|line| decode_binary_string(line, encoding)}
+    end
+    # Returns a hash of samples, keyed by encoding
+    # @return [Hash]
+    def samples(encodings = valid_encodings)
+      encodings.inject({}) {|hash, encoding| hash.merge! encoding => sample(encoding)}
+    end
+    # Returns all the "best" encodings. Assumes shortest strings are most likely to be correct.
+    # @return [Array]
+    def best_encodings
+      candidates = samples(unique_valid_encoding_groups.collect {|encoding_group| encoding_group.first})
+      min_length = candidates.values.collect {|ary| ary.join('').size}.min
+      candidates.keys.select {|key| candidates[key].join('').size == min_length}
+    end
+    # Multiple encodings often return the exact same decoded sample.
+    # Return only unique samples, keyed on the first encoding to return each sample.
+    # What's first in each grouping is based on the original order of encodings given to the constructor.
+    # @return [Hash]
+    def unique_samples
+      samples(unique_valid_encoding_groups.collect {|encoding_group| encoding_group.first})
+    end
+    # Decoded sample, diffed against __all__ of the samples, and marked up to show differences.
+    # @param [String] encoding
+    # @return [Array]
+    def diffed_sample(encoding)
+      diffed_encoded_samples[encoding]
+    end
+    def diffed_samples(encodings = valid_encodings)
+      encodings.inject({}) {|hash, encoding| hash.merge! encoding => diffed_sample(encoding)}
+    end
+    # Same as {#unique_samples}, but the samples are diffed and marked up.
+    def unique_diffed_samples
+      diffed_samples(unique_valid_encoding_groups.collect {|encoding_group| encoding_group.first})
+    end
+  private
+    def initialize(file_name, encodings, diff_options = {})
+      @diff_options = diff_options
+      @filename = file_name.freeze
+      @unique_valid_encoding_groups, @binary_samples, solutions = [], {}, {}
+      encodings.sort.combination(2).to_a.each {|pair| solutions[pair] = nil}
+      # read the entire file to verify encodings and collect samples for comparison of encodings
+      File.open(@filename, 'rb') do |file|
+        until file.eof?
+          binary_line = file.readline.strip
+          decoded_lines = multi_decode_binary_string(binary_line, encodings)
+          # eliminate any newly-invalid encodings from the scope
+          decoded_lines.select {|encoding, decoded_line| decoded_line.nil?}.keys.each do |invalid_encoding|
+            encodings.delete invalid_encoding
+            solutions.delete_if {|pair, lineno| pair.include? invalid_encoding}
+            @binary_samples.keep_if {|id, string| solutions.keys.flatten.include? id}
+          end
+          # add sample to solutions when binary string decodes differently for any two previously-undifferentiated encodings
+          solutions.select {|pair, lineno| lineno.nil?}.keys.each do |unsolved_pair|
+            solutions[unsolved_pair], @binary_samples[file.lineno] = file.lineno, binary_line if decoded_lines[unsolved_pair[0]] != decoded_lines[unsolved_pair[1]]
+          end
+        end
+      end
+      # group undifferentiated encodings
+      (solutions.select {|pair, lineno| lineno.nil?}.keys + encodings.collect {|encoding| [encoding]}).each do |subgroup|
+        group_index = @unique_valid_encoding_groups.index {|group| !(group & subgroup).empty?}
+        group_index ? @unique_valid_encoding_groups[group_index] |= subgroup : @unique_valid_encoding_groups << subgroup
+      end
+      @unique_valid_encoding_groups = @unique_valid_encoding_groups.each {|group| group.freeze}.freeze
+      @binary_samples.freeze
+    end
+    def decode_binary_string(binary_string, encoding)
+      encoded_string = binary_string.dup.force_encoding(encoding)
+      encoded_string.valid_encoding? ? encoded_string.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?') : nil
+    end
+    def multi_decode_binary_string(binary_string, encodings)
+      decoded_lines = {}
+      encodings.each {|encoding| decoded_lines[encoding] = decode_binary_string(binary_string, encoding)}
+      decoded_lines
+    end
+    def diffed_strings(array_of_strings)
+      lcs = array_of_strings.inject {|intermediate_lcs, string| Diff::LCS.LCS(intermediate_lcs, string).join }
+      callbacks = DiffCallbacks.new(diff_output = '', @diff_options)
+      array_of_strings.map do |string|
+        diff_output.clear
+        Diff::LCS.traverse_sequences(lcs, string, callbacks)
+        diff_output.dup
+      end
+    end
+    def diffed_encoded_samples
+      return @diffed_encoded_samples if @diffed_encoded_samples
+      encodings = valid_encodings.freeze
+      decoded_samples = samples(encodings)
+      @diffed_encoded_samples = encodings.inject({}) {|hash, key| hash.merge! key => []}
+      @binary_samples.values.each_index do |i|
+        decoded_lines = encodings.map {|encoding| decoded_samples[encoding][i]}
+        diffed_encoded_lines = diffed_strings(decoded_lines)
+        encodings.each_index {|j| @diffed_encoded_samples[encodings[j]] << diffed_encoded_lines[j] }
+      end
+      @diffed_encoded_samples.freeze
+    end
+  end
+end
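
The grouping result can be illustrated with a much simpler approach than the pairwise bookkeeping above. This is a minimal sketch, assuming every line is decoded under every encoding up front; `decode_or_nil` and `group_encodings` are hypothetical names, and the gem itself instead tracks pairs of encodings line by line and keeps only the lines that distinguish them.

```ruby
# Hypothetical, simplified helpers for illustration; not the gem's algorithm.
def decode_or_nil(bytes, encoding)
  candidate = bytes.dup.force_encoding(encoding)
  candidate.valid_encoding? ? candidate.encode('UTF-8') : nil
end

# Drops encodings that fail on any line, then groups the remaining encodings
# whose decoded output is identical for every line.
def group_encodings(binary_lines, encodings)
  decoded = encodings.each_with_object({}) do |name, hash|
    hash[name] = binary_lines.map { |line| decode_or_nil(line, name) }
  end
  valid = decoded.reject { |_name, lines| lines.include?(nil) }
  valid.keys.group_by { |name| valid[name] }.values
end

lines = ["plain ascii".b, "price: \xA4 5".b]
groups = group_encodings(lines, %w(UTF-8 ISO-8859-1 ISO-8859-2 ISO-8859-15))
p groups
# [["ISO-8859-1", "ISO-8859-2"], ["ISO-8859-15"]]
```

UTF-8 is eliminated because 0xA4 alone is not a valid UTF-8 sequence, and ISO-8859-1 and ISO-8859-2 share a group because both map 0xA4 to the generic currency sign. The sketch holds all decoded lines in memory, which is the cost the line-by-line approach avoids.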

data/lib/encoding_sampler/version.rb ADDED

@@ -0,0 +1,3 @@
+module EncodingSampler
+  VERSION = "0.3.0"
+end

data/spec/sampler_spec.rb ADDED

@@ -0,0 +1,525 @@
+require "spec_helper.rb"
+include EncodingSampler
+describe Sampler do
+  context 'with fakefs', fakefs: true do
+    before(:each) do
+      @filedir = '/test'
+      @filename = '/test/testfile'
+      @lines = %w(one two three four five)
+      FileUtils.mkdir(@filedir) unless Dir.exists?(@filedir)
+      File.open(@filename, "w") do |f|
+        @lines.each do |line|
+          f.puts line
+        end
+      end
+      @test_sampler = Sampler.new(@filename, %w(US-ASCII UTF-8))
+    end
+    describe 'verifying fakefs just to make sure' do
+      # Make sure this works right after all the trouble with home-grown file system stubs!!
+      it 'can open and readline without error' do
+        expect {
+          File.open(@filename, 'r') do |file|
+            until file.eof?
+              file.readline
+            end
+          end
+        }.to_not raise_error
+      end
+      it 'raises EOFError when readline called past eof' do
+        expect {
+          File.open(@filename, 'r') do |file|
+            until file.eof?
+              file.readline
+            end
+            file.readline
+          end
+        }.to raise_error(EOFError)
+      end
+      it 'readline returns lines' do
+        lines_read = []
+        File.open(@filename, 'r') do |file|
+          until file.eof?
+            lines_read << file.readline.chomp
+          end
+        end
+        lines_read.should eq @lines
+      end
+    end
+    describe 'creation' do
+      it 'works with required arguments' do
+        Sampler.new(@filename, []).should be_a Sampler
+      end
+      it 'requires a filename' do
+        expect {Sampler.new()}.to raise_error
+      end
+      it 'requires encodings' do
+        expect {Sampler.new(@filename)}.to raise_error
+      end
+      it 'passes error raised on File.open' do
+        File.stub(:open).and_raise 'some error'
+        expect {Sampler.new(@filename, [])}.to raise_error('some error')
+      end
+      it 'passes error raised on file.readline' do
+        File.any_instance.stub(:readline).and_raise 'some error'
+        expect {Sampler.new(@filename, [])}.to raise_error('some error')
+      end
+    end
+    describe "#filename" do
+      it 'returns the same filename used to create the instance' do
+        Sampler.new(@filename, []).filename.should eq @filename
+      end
+      it 'is read-only' do
+        expect {Sampler.new(@filename, []).filename = 'anything'}.to raise_error NoMethodError
+      end
+    end
+    describe '#unique_valid_encoding_groups' do
+      before(:each) do
+        Sampler.any_instance.stub(:decode_binary_string) do |*args|
+          if ['ENCODING1', 'LIKE_ENCODING1'].include? args[1]
+            args[0]
+          else
+            args[0].gsub(/t/, 'T')
+          end
+        end
+      end
+      it 'is read-only' do
+        expect {Sampler.new(@filename, []).unique_valid_encoding_groups = 'anything'}.to raise_error NoMethodError
+      end
+      it 'is frozen' do
+        Sampler.new(@filename, []).unique_valid_encoding_groups.should be_frozen
+      end
+      shared_examples 'unique_valid_encoding_groups format is correct' do
+        it 'returns an array' do
+          @sampler.unique_valid_encoding_groups.should be_a Array
+        end
+        it 'each array element is an array of strings (encoding names)' do
+          @sampler.unique_valid_encoding_groups.each do |element|
+            element.should be_a Array
+            element.each do |encoding|
+              encoding.should be_a String
+            end
+          end
+        end
+        it 'array elements do not share members with other elements' do
+          @sampler.unique_valid_encoding_groups.flatten.size.should eq @sampler.unique_valid_encoding_groups.flatten.uniq.size
+        end
+      end
+      context 'when there are no lines read' do
+        before(:each) do
+          File.any_instance.stub(:eof?).and_return true
+          @sampler = Sampler.new(@filename, %w(ENCODING1 LIKE_ENCODING1 UNLIKE_ENCODING1))
+        end
+        it_behaves_like 'unique_valid_encoding_groups format is correct'
+        it 'returns all encodings in a single array element' do
+          @sampler.unique_valid_encoding_groups.count.should eq 1
+        end
+        it 'contains all valid encodings' do
+          @sampler.unique_valid_encoding_groups.flatten.size.should eq 3
+        end
+      end
+      context 'when all encodings work the same' do
+        before(:each) do
+          @sampler = Sampler.new(@filename, %w(ENCODING1 LIKE_ENCODING1))
+        end
+        it_behaves_like 'unique_valid_encoding_groups format is correct'
+        it 'returns all encodings in a single array element' do
+          @sampler.unique_valid_encoding_groups.count.should eq 1
+        end
+        it 'the single array element contains all valid encodings' do
+          @sampler.unique_valid_encoding_groups[0].should eq %w(ENCODING1 LIKE_ENCODING1)
+        end
+      end
+      context 'when encodings are different' do
+        before(:each) do
+          @sampler = Sampler.new(@filename, %w(ENCODING1 UNLIKE_ENCODING1))
+        end
+        it_behaves_like 'unique_valid_encoding_groups format is correct'
+        it 'returns all encodings in two array elements' do
+          @sampler.unique_valid_encoding_groups.count.should eq 2
+        end
+        it 'the first array element contains one of the valid encodings' do
+          %w(ENCODING1 UNLIKE_ENCODING1).should include @sampler.unique_valid_encoding_groups[0][0]
+        end
+        it 'the second array element contains one of the valid encodings' do
+          %w(ENCODING1 UNLIKE_ENCODING1).should include @sampler.unique_valid_encoding_groups[1][0]
+        end
+        it 'the array elements contain all valid encodings' do
+          @sampler.unique_valid_encoding_groups.flatten.sort.should eq %w(ENCODING1 UNLIKE_ENCODING1).sort
+        end
+      end
+    end
+    describe '#valid_encodings' do
+      it 'should contain all encodings in unique_valid_encoding_groups' do
+        @test_sampler.valid_encodings.sort.should eq @test_sampler.unique_valid_encodings.flatten.sort
+      end
+    end
+    describe '#sample' do
+      before(:each) do
+        Sampler.any_instance.stub(:decode_binary_string) do |*args|
+          if ['ENCODING1', 'LIKE_ENCODING1'].include? args[1]
+            args[0]
+          else
+            args[0].gsub(/t/, 'T')
+          end
+        end
+      end
+      shared_examples 'sample format is correct' do
+        it 'returns an array for each valid encoding' do
+          @sampler.valid_encodings.each do |encoding|
+            @sampler.sample(encoding).should be_a Array
+          end
+        end
+        it 'elements are strings (decoded lines)' do
+          @sampler.sample('ENCODING1').each do |element|
+            element.should be_a String
+          end
+        end
+      end
+      context 'when there are no lines read' do
+        before(:each) do
+          File.any_instance.stub(:eof?).and_return true
+          @sampler = Sampler.new(@filename, %w(ENCODING1 LIKE_ENCODING1 UNLIKE_ENCODING1))
+        end
+        it_behaves_like 'sample format is correct'
+        it 'it is empty' do
+          %w(ENCODING1 LIKE_ENCODING1 UNLIKE_ENCODING1).each do |encoding|
+            @sampler.sample(encoding).should be_empty
+          end
+        end
+      end
+      context 'when all encodings work the same' do
+        before(:each) do
+          @sampler = Sampler.new(@filename, %w(ENCODING1 LIKE_ENCODING1))
+        end
+        it_behaves_like 'sample format is correct'
+        it 'it is empty' do
+          %w(ENCODING1 LIKE_ENCODING1).each do |encoding|
+            @sampler.sample(encoding).should be_empty
+          end
+        end
+      end
+      context 'when encodings are different' do
+        before(:each) do
+          @sampler = Sampler.new(@filename, %w(ENCODING1 UNLIKE_ENCODING1))
+        end
+        it_behaves_like 'sample format is correct'
+        it 'it is not empty' do
+          %w(ENCODING1 UNLIKE_ENCODING1).each do |encoding|
+            @sampler.sample(encoding).should_not be_empty
+          end
+        end
+        it 'the samples values should not be equal' do
+          @sampler.sample('ENCODING1').should_not eq @sampler.sample('UNLIKE_ENCODING1')
+        end
+      end
+    end
+    describe '#samples' do
+      before(:each) do
+        Sampler.any_instance.stub(:decode_binary_string) do |*args|
+          if ['ENCODING1', 'LIKE_ENCODING1'].include? args[1]
+            args[0]
+          else
+            args[0].gsub(/t/, 'T')
+          end
+        end
+      end
+      context 'when there are no lines read' do
+        before(:each) do
+          File.any_instance.stub(:eof?).and_return true
+          @sampler = Sampler.new(@filename, %w(ENCODING1 LIKE_ENCODING1 UNLIKE_ENCODING1))
+        end
+        it 'each included sample is empty' do
+          @sampler.samples.each {|encoding, sample| sample.should be_empty}
+        end
+      end
+      context 'when all encodings work the same' do
+        before(:each) do
+          @sampler = Sampler.new(@filename, %w(ENCODING1 LIKE_ENCODING1))
+        end
+        it 'each included sample is empty' do
+          @sampler.samples.each {|encoding, sample| sample.should be_empty}
+        end
+      end
+      context 'when encodings are different' do
+        before(:each) do
+          @sampler = Sampler.new(@filename, %w(ENCODING1 UNLIKE_ENCODING1))
+        end
+        it 'it is not empty' do
+          %w(ENCODING1 LIKE_ENCODING1).each do |encoding|
+            @sampler.samples.should_not be_empty
+          end
+        end
+        it 'should have a sample for each valid encoding' do
+          (@sampler.samples.keys & @sampler.valid_encodings).sort.should eq @sampler.valid_encodings.sort
+        end
+        it 'each sample value (the string samples) should be the same size' do
+          sample_values = @sampler.samples.values
+          sample_values.each do |sample_value|
+            sample_value.size.should eq sample_values.first.size
+          end
+        end
+        # it 'the sample values should not be equal, duh' do
+          # samples = @sampler.samples
+          # samples.values.each do |string_array|
+            # samples['ENCODING1'][key].should_not eq samples['UNLIKE_ENCODING1'][key]
+          # end
+        # end
+      end
+    end
+    describe '#best_encodings' do
+      before(:each) do
+       Sampler.any_instance.stub(:decode_binary_string) do |*args|
+          case args[1]
+          when 'SHORTEST_ENCODING' then args[0]
+          when 'LIKE_SHORTEST_ENCODING' then args[0].reverse # same length and different is all that matters
+          when 'INVALID_ENCODING' then nil
+          else args[0].gsub(/t/, 'T&#') # force longer faked encoding for letter 't'
+          end
+        end
+      end
+      context 'no valid encodings' do
+        before(:each) do
+          @sampler = Sampler.new(@filename, %w(INVALID_ENCODING))
+        end
+        it 'returns empty array' do
+          @sampler.best_encodings.should eq []
+        end
+      end
+      context 'one valid encoding' do
+        before(:each) do
+          @sampler = Sampler.new(@filename, %w(SHORTEST_ENCODING))
+        end
+        it 'returns an array with the one shortest encoding' do
+          @sampler.best_encodings.should eq ['SHORTEST_ENCODING']
+        end
+      end
+      context 'when one shortest encoding' do
+        before(:each) do
+          @sampler = Sampler.new(@filename, %w(SHORTEST_ENCODING LONGER_ENCODING))
+        end
+        it 'returns an array with the one shortest encoding' do
+          @sampler.best_encodings.should eq ['SHORTEST_ENCODING']
+        end
+      end
+      context 'when more than one shortest encoding' do
+        before(:each) do
+          @sampler = Sampler.new(@filename, %w(SHORTEST_ENCODING LIKE_SHORTEST_ENCODING LONGER_ENCODING))
+        end
+        it 'returns an array with the shortest encodings' do
+          @sampler.best_encodings.should eq ['SHORTEST_ENCODING', 'LIKE_SHORTEST_ENCODING']
+        end
+      end
+    end
+    describe '#unique_samples' do
+      it 'should return a Hash' do
+        @test_sampler.unique_samples.should be_a Hash
+      end
+      it 'should have keys equal to first item from each valid_encoding_group' do
+        @test_sampler.unique_samples.keys.should eq @test_sampler.unique_valid_encoding_groups.collect {|group| group.first}
+      end
+      it 'should provide the right sample value for each key' do
+        @test_sampler.unique_samples.keys.each do |encoding|
+          @test_sampler.unique_samples[encoding].should eq @test_sampler.sample(encoding)
+        end
+      end
+    end
+    describe '#diffed_sample' do
+      before(:each) do
+        Sampler.any_instance.stub(:decode_binary_string) do |*args|
+          if ['ENCODING1', 'LIKE_ENCODING1'].include? args[1]
+            args[0]
+          else
+            args[0].gsub(/t/, 'T')
+          end
+        end
+        @sampler = Sampler.new(@filename, %w(ENCODING1 LIKE_ENCODING1 UNLIKE_ENCODING1))
+      end
+      it 'works' do
+        @sampler.diffed_sample('ENCODING1')
+      end
+      it 'returns an array' do
+        # note: two encodings that produce different results take only one sample
+        @sampler.diffed_sample('ENCODING1').should be_a Array
+      end
+      it 'has one line for each sample' do
+        # note: two encodings that produce different results take only one sample
+        @sampler.diffed_sample('ENCODING1').size.should eq 1
+      end
+      it 'returns identical results when the decoded strings are the same' do
+        @sampler.diffed_sample('ENCODING1').should eq @sampler.diffed_sample('LIKE_ENCODING1')
+      end
+      it 'returns different results when the decoded strings are different' do
+        @sampler.diffed_sample('ENCODING1').should_not eq @sampler.diffed_sample('UNLIKE_ENCODING1')
+      end
+    end
+    describe '#diffed_samples' do
+      before(:each) do
+        Sampler.any_instance.stub(:decode_binary_string) do |*args|
+          if ['ENCODING1', 'LIKE_ENCODING1'].include? args[1]
+            args[0]
+          else
+            args[0].gsub(/t/, 'T')
+          end
+        end
+      end
+      context 'with default options' do
+        before(:each) do
+          @sampler = Sampler.new(@filename, %w(ENCODING1 LIKE_ENCODING1 UNLIKE_ENCODING1))
+        end
+        it 'works' do
+          @sampler.diffed_samples(['ENCODING1'])
+        end
+        it 'returns a hash' do
+          # note: two encodings that produce different results take only one sample
+          @sampler.diffed_samples(['ENCODING1']).should be_a Hash
+        end
+        it 'keys match encodings in argument' do
+          # note: two encodings that produce different results take only one sample
+          @sampler.diffed_samples(['ENCODING1','UNLIKE_ENCODING1']).keys.should eq ['ENCODING1','UNLIKE_ENCODING1']
+        end
+        it 'values match the values from diffed_sample for the same encoding' do
+          @sampler.diffed_samples(['ENCODING1'])['ENCODING1'].should eq @sampler.diffed_sample('ENCODING1')
+        end
+      end
+      context 'with custom :difference_start, :difference_end options' do
+        before(:each) do
+          @sampler = Sampler.new(@filename, %w(ENCODING1 LIKE_ENCODING1 UNLIKE_ENCODING1), difference_start: '<start>', difference_end: '<end>')
+        end
+        it 'uses difference_start value specified in options hash' do
+          @sampler.diffed_sample('ENCODING1').join.should include '<start>'
+        end
+        it 'uses difference_end value specified in options hash' do
+          @sampler.diffed_sample('ENCODING1').join.should include '<end>'
+        end
+      end
+    end
+    describe '#unique_diffed_samples' do
+      it 'should return a Hash' do
+        @test_sampler.unique_diffed_samples.should be_a Hash
+      end
+      it 'should have keys equal to first item from each valid_encoding_group' do
+        @test_sampler.unique_diffed_samples.keys.should eq @test_sampler.unique_valid_encoding_groups.collect {|group| group.first}
+      end
+      it 'should provide the right sample value for each key' do
+        @test_sampler.unique_diffed_samples.keys.each do |encoding|
+          @test_sampler.unique_diffed_samples[encoding].should eq @test_sampler.diffed_samples[encoding]
+        end
+      end
+    end
+  end
+end
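
The `#best_encodings` specs above stub `decode_binary_string` so that some fake encodings decode to longer strings, and they expect only the encodings with the shortest decoded output to come back. The following is a minimal standalone sketch of that selection rule, assuming the decoded samples are available as a hash of encoding name to string. The helper name `shortest_encodings` and the sample data are hypothetical; this is not the gem's implementation.

```ruby
# Hypothetical helper (not part of encoding_sampler): returns the names whose
# decoded string is shortest, in the hash's insertion order.
def shortest_encodings(decoded)
  return [] if decoded.empty?
  min_length = decoded.values.map(&:length).min
  decoded.select { |_encoding, text| text.length == min_length }.keys
end

# Illustrative data that mirrors the stub in the specs: the "longer" encoding
# expands every 't' into three characters.
decoded = {
  'SHORTEST_ENCODING'      => 'test',
  'LIKE_SHORTEST_ENCODING' => 'test'.reverse,
  'LONGER_ENCODING'        => 'test'.gsub(/t/, 'T&#')
}

p shortest_encodings(decoded)
```

With no valid encodings the hash is empty and the helper returns an empty array, which matches the 'no valid encodings' context in the specs.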

data/spec/sampler_with_real_files_spec.rb ADDED

@@ -0,0 +1,42 @@
+require 'spec_helper'
+include EncodingSampler
+describe Sampler do
+  context 'with real files' do
+    before(:all) do
+      # create some encoded strings
+      @encodings = %w(ASCII-8BIT UTF-8 WINDOWS-1252 ISO-8859-1 ISO-8859-2 ISO-8859-15)
+      @special_chars = "\u20AC\u201C\u201d\u00A1\u00A2\u00A3\u00A9\u00AE\u00C4\u00C5\u00E4\u00E5"
+      @ascii_chars = "ABCDEFabcdef0123456789"
+      @mixed_lines = []
+      3.times do
+        @mixed_lines << @ascii_chars # first three lines are plain ASCII, the same in every encoding
+      end
+      @special_chars.each_char do |char|
+        @mixed_lines << char + @ascii_chars + char + @ascii_chars + char
+      end
+      # create temp files
+      @encoding_file_dir = './spec/files/'
+      Dir.mkdir(@encoding_file_dir) unless Dir.exist? @encoding_file_dir
+      @file_names = {}
+      @encodings.each do |encoding|
+        file_name = "#{@encoding_file_dir}#{encoding}.txt"
+        @file_names[encoding] = file_name
+        File.open(file_name, "w:#{encoding}") do |file|
+          # replace: '' to omit characters unavailable for the selected encoding, creating clean valid files
+          file.write @mixed_lines.join("\n").encode(encoding, invalid: :replace, undef: :replace, replace: '')
+        end
+      end
+    end
+    it 'can be created for each file encoding' do
+      @encodings.each do |encoding|
+        expect { Sampler.new(@file_names[encoding], @encodings) }.to_not raise_error
+      end
+    end
+  end
+end
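
The spec above writes each fixture with `String#encode` and `replace: ''`, so characters that the target encoding cannot represent are dropped rather than raising an error. A small sketch of that behaviour on a shorter, illustrative string (the variable names are assumptions, not taken from the gem):

```ruby
# Euro sign, three ASCII letters, a-umlaut. "\u" escapes make this a UTF-8 string.
text = "\u20ACABC\u00E4"

# Characters missing from the target encoding are replaced with the empty string.
options = { invalid: :replace, undef: :replace, replace: '' }
latin1  = text.encode('ISO-8859-1', **options)
cp1252  = text.encode('WINDOWS-1252', **options)

p latin1.bytes.to_a # ISO-8859-1 has no euro sign, so that character is dropped
p cp1252.bytes.to_a # WINDOWS-1252 maps the euro sign to byte 0x80
```

This is why the generated files can differ in length between encodings while each one remains valid in its own encoding.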

data/spec/sampler_with_selected_file_spec.rb ADDED

@@ -0,0 +1,32 @@
+require 'spec_helper'
+include EncodingSampler
+# For ad-hoc testing using local file.
+# Set env var FILENAME='filename'
+# Optionally set ENCODINGS='encoding1 encoding2' etc
+describe Sampler do
+  context "when ENV['FILENAME'] is set to a selected filename" do
+    let(:default_encodings) {%w(ASCII-8BIT UTF-8 WINDOWS-1252 ISO-8859-1 ISO-8859-2 ISO-8859-15)}
+    it 'works and displays the results' do
+      sampler = nil
+      filename = ENV['FILENAME']
+      encodings = ENV['ENCODINGS'] ? ENV['ENCODINGS'].split : default_encodings
+      if filename.nil?
+        p "ENV['FILENAME'] is nil, skipping ad-hoc test."
+      else
+        filename.should_not be_nil
+        expect { sampler = Sampler.new(filename, encodings) }.to_not raise_error
+        p ''
+        p "Results for #{filename}:"
+        pp sampler.inspect
+        pp sampler.unique_diffed_samples
+      end
+    end
+  end
+end
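
The header comment in this spec describes `ENCODINGS` as a space-separated list, and environment variables always arrive as strings. One way to turn such a value into an array of encoding names, with a fallback to the defaults, is sketched below. The helper `encodings_from` and the constant name are hypothetical and are not part of the gem.

```ruby
DEFAULT_ENCODINGS = %w(ASCII-8BIT UTF-8 WINDOWS-1252 ISO-8859-1 ISO-8859-2 ISO-8859-15)

# Hypothetical helper: splits a space-separated string into encoding names and
# falls back to the defaults when the value is nil or blank.
def encodings_from(value, defaults = DEFAULT_ENCODINGS)
  names = value.to_s.split
  names.empty? ? defaults : names
end

p encodings_from('UTF-8 ISO-8859-1')
p encodings_from(nil)
```

In the spec this could be called as `encodings_from(ENV['ENCODINGS'])`, which also treats an empty variable the same way as an unset one.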

data/spec/spec_helper.rb ADDED

@@ -0,0 +1,15 @@
+if ENV['COVERAGE']
+  require 'simplecov'
+  SimpleCov.start { add_filter '/spec/' }
+end
+require 'encoding_sampler'
+require 'fakefs/spec_helpers'
+RSpec.configure do |config|
+  config.treat_symbols_as_metadata_keys_with_true_values = true
+  config.run_all_when_everything_filtered = true
+  config.filter_run :focus
+  config.include FakeFS::SpecHelpers, fakefs: true
+end

metadata ADDED

@@ -0,0 +1,209 @@
+--- !ruby/object:Gem::Specification
+name: encoding_sampler
+version: !ruby/object:Gem::Version
+  version: 0.3.0
+  prerelease:
+platform: ruby
+authors:
+- Tom Wilson
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2013-03-19 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: diff-lcs
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - '='
+      - !ruby/object:Gem::Version
+        version: 1.1.3
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - '='
+      - !ruby/object:Gem::Version
+        version: 1.1.3
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: debugger
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: fakefs
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: simplecov
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: yard
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: redcarpet
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+description: EncodingSampler helps solve the problem of what to do when the character
+  encoding is unknown, for example when a user is uploading a file but has no idea
+  of its encoding (or typically, even what "character encoding" means.) EncodingSampler
+  extracts a concise set of samples from the selected file for display so the user
+  can choose wisely.
+email:
+- tom@rollnorocks.com
+executables: []
+extensions: []
+extra_rdoc_files:
+- README.md
+- CHANGELOG.md
+- LICENSE
+files:
+- .gitignore
+- .rspec
+- .yardopts
+- CHANGELOG.md
+- Gemfile
+- LICENSE
+- README.md
+- Rakefile
+- encoding_sampler.gemspec
+- lib/encoding_sampler.rb
+- lib/encoding_sampler/diff_callbacks.rb
+- lib/encoding_sampler/sampler.rb
+- lib/encoding_sampler/version.rb
+- spec/sampler_spec.rb
+- spec/sampler_with_real_files_spec.rb
+- spec/sampler_with_selected_file_spec.rb
+- spec/spec_helper.rb
+homepage: ''
+licenses: []
+post_install_message:
+rdoc_options:
+- --line-numbers
+- --inline-source
+- --title
+- encoding_sampler
+- --main
+- README.md
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 1.8.24
+signing_key:
+specification_version: 3
+summary: Encoding Sampler extracts a concise sample from a text file to simplify selecting
+  the right encoding.
+test_files:
+- spec/sampler_spec.rb
+- spec/sampler_with_real_files_spec.rb
+- spec/sampler_with_selected_file_spec.rb
+- spec/spec_helper.rb
+has_rdoc: