encoding_sampler 0.3.0

@@ -0,0 +1,21 @@
1
+ *.gem
2
+ *.rbc
3
+ .bundle
4
+ .config
5
+ .yardoc
6
+ Gemfile.lock
7
+ InstalledFiles
8
+ _yardoc
9
+ coverage
10
+ doc/
11
+ lib/bundler/man
12
+ pkg
13
+ rdoc
14
+ spec/files
15
+ spec/reports
16
+ test/tmp
17
+ test/version_tmp
18
+ tmp
19
+ *.sublime*
20
+ .rbenv-version
21
+ .DS_Store
data/.rspec ADDED
@@ -0,0 +1,2 @@
1
+ --color
2
+ --format progress
@@ -0,0 +1,6 @@
1
+ --markup markdown
2
+ --title "encoding_sampler"
3
+ --readme README.md
4
+ -
5
+ CHANGELOG.md
6
+ LICENSE
data/CHANGELOG.md ADDED
@@ -0,0 +1,10 @@
1
+ # master
2
+
3
+ ## 0.3.0 / 2013-3-19
4
+
5
+ * Deprecated #unique_valid_encodings, changed name to #unique_valid_encoding_groups for clarity
6
+ * Added YARD-generated documentation
7
+
8
+ ## < 0.2.1
9
+
10
+ * Abandon hope, all ye who enter here. Grievous errors undiscovered due to bad test rig until v 0.2.
data/Gemfile ADDED
@@ -0,0 +1,3 @@
1
+ source 'https://rubygems.org'
2
+
3
+ gemspec
data/LICENSE ADDED
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2012 Roll No Rocks LLC
2
+
3
+ MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,113 @@
1
+ # EncodingSampler
2
+
3
+ EncodingSampler helps solve the problem of what to do when the character encoding is unknown,
4
+ for example when a user uploads a file but has no idea of its encoding (or, typically, even what "character encoding" means).
5
+ EncodingSampler extracts a concise set of samples from the selected file for display so the user can choose wisely.
6
+
7
+ For a given file, some encodings may be dismissed out of hand because they would result in invalid
8
+ characters or sequences. However, in the general case you have to let the user see the differences and choose.
9
+ For example, it's easy to determine that an 8-bit character is _not_ encoded as US-ASCII because it is simply invalid,
10
+ but it's impossible to tell whether the character __0xA4__ should be displayed as a
11
+ generic currency symbol (&curren;) using ISO-8859-1 or as a Euro symbol (&euro;) using ISO-8859-15
12
+ without asking the user.
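
The 0xA4 ambiguity described above can be reproduced with nothing but Ruby's built-in transcoding. This is a minimal sketch, not part of the gem: `decode_bytes` is a hypothetical helper name, and it mirrors the idea of treating a byte sequence as invalid when `valid_encoding?` is false.

```ruby
# Decode raw bytes under a named encoding.
# Returns the UTF-8 result, or nil when the bytes are invalid in that encoding.
# (decode_bytes is a hypothetical helper, not part of EncodingSampler.)
def decode_bytes(bytes, encoding_name)
  candidate = bytes.dup.force_encoding(encoding_name)
  candidate.valid_encoding? ? candidate.encode('UTF-8') : nil
end

byte = "\xA4"

p    decode_bytes(byte, 'US-ASCII')     # nil: 8-bit bytes are invalid in US-ASCII
puts decode_bytes(byte, 'ISO-8859-1')   # ¤ (generic currency sign)
puts decode_bytes(byte, 'ISO-8859-15')  # € (Euro sign)
```

Both ISO-8859 variants accept the byte, so nothing in the data itself says which reading the author intended.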
13
+
14
+ EncodingSampler solves the problem by collecting a reasonably (but not rigorously) minimal sample by reading the file line-by-line. Lines that demonstrate the difference between any pair of encodings are noted, and when a line is encountered that cannot be "decoded" with a specific encoding, that encoding is considered invalid and removed from the running. When the sampling is complete, each encoding is grouped with other encoding(s) that yield identical decoding results.
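
The grouping idea can be sketched in a few lines. Note that this is a simplified variant under stated assumptions, not the gem's algorithm: it decodes every line with every encoding and groups encodings by their complete decoded output, whereas the description above tracks pairs of encodings and keeps only the lines that tell them apart. The names `try_decode` and `group_encodings` are hypothetical.

```ruby
# Decode one binary line; nil signals that the line is invalid in that encoding.
def try_decode(binary_line, encoding_name)
  candidate = binary_line.dup.force_encoding(encoding_name)
  candidate.valid_encoding? ? candidate.encode('UTF-8') : nil
end

# Returns groups of encoding names whose decoded output is identical for every line.
# Encodings that fail on any line are dropped from the result.
def group_encodings(binary_lines, encoding_names)
  decoded = {}
  encoding_names.each do |name|
    lines = binary_lines.map { |line| try_decode(line, name) }
    decoded[name] = lines unless lines.include?(nil)
  end
  decoded.keys.group_by { |name| decoded[name] }.values
end

lines = ["plain ascii", "price: \xA4 5"]
p group_encodings(lines, %w(UTF-8 ISO-8859-1 ISO-8859-2 ISO-8859-15))
# [["ISO-8859-1", "ISO-8859-2"], ["ISO-8859-15"]]
```

UTF-8 drops out because a lone 0xA4 byte is not a valid UTF-8 sequence, and ISO-8859-1 and ISO-8859-2 land in one group because both map 0xA4 to the currency sign.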
15
+
16
+ There are three possible results:
17
+
18
+ * There may be no valid encodings. This could mean that none of the proposed encodings match the file, but often it means the file is either malformed, or is not a text file. This is generally what you will see if you try to determine the encoding of a non-text binary file.
19
+
20
+ * There may be only one group of valid encodings, all of which yield the same decoded data. In this case there are no samples to look at because there are no differences to show. A straight ASCII file may yield this result for many encodings.
21
+
22
+ * There may be more than one group of valid encodings, each of which yields different decoded data. This is the interesting case! Samples will then be available so a user can visually determine which is the correct interpretation. The "diff-lcs" gem is used to diff the samples, providing a simple way to highlight the (usually few) differences.
23
+
24
+ ## Performance
25
+
26
+ Because this method works by reading the file line by line and "decoding" each line with every encoding still considered valid, it can be slow. For most files, the number of line "decodings" will equal the number of lines in the file times the number of encodings tested, and at this writing, Ruby 1.9.3 supports 168 encodings! Testing a much smaller set of likely candidates is recommended.
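
To get a feel for the cost, the worst case can be estimated before running the sampler. This is a rough sketch: `worst_case_decodings` is a hypothetical helper, the short list of encodings is only an illustration, and the count returned by `Encoding.list` depends on the Ruby version in use.

```ruby
# Upper bound on line decodings: every line is decoded once per encoding.
# (worst_case_decodings is a hypothetical helper, not part of the gem.)
def worst_case_decodings(line_count, encoding_count)
  line_count * encoding_count
end

# Every encoding the running interpreter knows about (count varies by Ruby version).
all_encodings = Encoding.list
# A hand-picked short list; choose whatever your users are likely to upload.
short_list = %w(UTF-8 ISO-8859-1 ISO-8859-15 Windows-1252)

puts worst_case_decodings(10_000, all_encodings.size)
puts worst_case_decodings(10_000, short_list.size)  # 40000
```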
27
+
28
+ ## Installation
29
+
30
+ Add this line to your application's Gemfile:
31
+
32
+ gem 'encoding_sampler'
33
+
34
+ And then execute:
35
+
36
+ $ bundle
37
+
38
+ Or install it yourself as:
39
+
40
+ $ gem install encoding_sampler
41
+
42
+ ## Usage
43
+
44
+ Creating a new `EncodingSampler::Sampler` runs the complete file analysis immediately.
45
+
46
+ ```ruby
47
+ EncodingSampler::Sampler.new(file_name, encodings, diff_options = {})
48
+
49
+ # options:
50
+ # :difference_start => inserted into the diffed samples to mark the start of a "different" section
51
+ # :difference_end => inserted into the diffed samples to mark the end of a "different" section
52
+ ```
53
+
54
+ Once you have an `EncodingSampler::Sampler` instance, you can use its instance methods to determine which encodings are valid and which are unique (that is, which yield unique results), and to get samples for comparing the differences visually. For example, given a file that turns out to be ISO-8859-15 (which includes the Euro sign), you might get these results:
55
+
56
+ ```ruby
57
+ sampler = EncodingSampler::Sampler.new(
58
+ 'some/file/name.csv',
59
+ ['ASCII-8BIT', 'UTF-8', 'ISO-8859-1', 'ISO-8859-2', 'ISO-8859-15'])
60
+
61
+ sampler.valid_encodings
62
+ # ["ASCII-8BIT", "ISO-8859-1", "ISO-8859-2", "ISO-8859-15"]
63
+ sampler.unique_valid_encoding_groups
64
+ # [["ASCII-8BIT"], ["ISO-8859-1", 'ISO-8859-2'], ["ISO-8859-15"]]
65
+
66
+ sampler.sample('ASCII-8BIT')
67
+ # ["?ABCDEFabcdef0123456789?ABCDEFabcdef0123456789?"]
68
+ sampler.sample('ISO-8859-1')
69
+ # ["¤ABCDEFabcdef0123456789¤ABCDEFabcdef0123456789¤"]
70
+ sampler.sample('ISO-8859-15')
71
+ # ["€ABCDEFabcdef0123456789€ABCDEFabcdef0123456789€"]
72
+ sampler.samples(["ASCII-8BIT", "ISO-8859-1", "ISO-8859-15"])
73
+ # {"ASCII-8BIT"=>["?ABCDEFabcdef0123456789?ABCDEFabcdef0123456789?"],
74
+ # "ISO-8859-1"=>["¤ABCDEFabcdef0123456789¤ABCDEFabcdef0123456789¤"],
75
+ # "ISO-8859-15"=>["€ABCDEFabcdef0123456789€ABCDEFabcdef0123456789€"]}
76
+
77
+ sampler.diffed_samples(["ASCII-8BIT", "ISO-8859-1", "ISO-8859-15"])
78
+ # {"ASCII-8BIT"=>["<span class=\"difference\">?</span>ABCDEFabcdef0123456789<span class=\"difference\">?</span>ABCDEFabcdef0123456789<span class=\"difference\">?</span>"],
79
+ # "ISO-8859-1"=>["<span class=\"difference\">¤</span>ABCDEFabcdef0123456789<span class=\"difference\">¤</span>ABCDEFabcdef0123456789<span class=\"difference\">¤</span>"],
80
+ # "ISO-8859-15"=>["<span class=\"difference\">€</span>ABCDEFabcdef0123456789<span class=\"difference\">€</span>ABCDEFabcdef0123456789<span class=\"difference\">€</span>"]}
81
+ ```
82
+ Notes:
83
+
84
+ * Valid encodings don't include UTF-8, indicating it was invalid for one or more lines in the file
85
+ * Results show that ISO-8859-1 and ISO-8859-2 decoded the sample file exactly the same, so they are grouped together in
86
+ the unique_valid_encoding_groups.
87
+
88
+ In raw form the `diffed_samples` don't seem impressive, but the results can be displayed via HTML, for example, to highlight and clarify the differences.
89
+
90
+ <table>
91
+ <tr>
92
+ <th>ASCII-8BIT</th>
93
+ <td><span style="font-weight:bold; color:#ff0000;">?</span>ABCDEFabcdef0123456789<span style="font-weight:bold; color:#ff0000;">?</span>ABCDEFabcdef0123456789<span style="font-weight:bold; color:#ff0000;">?</span></td>
94
+ </tr>
95
+ <tr>
96
+ <th>ISO-8859-1</th>
97
+ <td><span style="font-weight:bold; color:#ff0000;">¤</span>ABCDEFabcdef0123456789<span style="font-weight:bold; color:#ff0000;">¤</span>ABCDEFabcdef0123456789<span style="font-weight:bold; color:#ff0000;">¤</span></td>
98
+ </tr>
99
+ <tr><th>ISO-8859-15</th>
100
+ <td><span style="font-weight:bold; color:#ff0000;">€</span>ABCDEFabcdef0123456789<span style="font-weight:bold; color:#ff0000;">€</span>ABCDEFabcdef0123456789<span style="font-weight:bold; color:#ff0000;">€</span></td>
101
+ </tr>
102
+ </table>
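
A table like the one above could be produced from a `diffed_samples`-style hash with a small helper. This is a sketch under assumptions: `samples_table` is a hypothetical name, and the sample lines are inserted as-is because the gem's diff callbacks already pass them through `CGI.escapeHTML` before adding the markers.

```ruby
require 'cgi'

# Build an HTML table from a hash shaped like the diffed_samples result:
# encoding name => array of already-escaped, already-marked-up lines.
# (samples_table is a hypothetical helper, not part of the gem.)
def samples_table(diffed)
  rows = diffed.map do |encoding, lines|
    "<tr><th>#{CGI.escapeHTML(encoding)}</th><td>#{lines.join('<br>')}</td></tr>"
  end
  "<table>\n#{rows.join("\n")}\n</table>"
end

diffed = {
  'ISO-8859-1'  => ['<span class="difference">¤</span>ABC'],
  'ISO-8859-15' => ['<span class="difference">€</span>ABC']
}
puts samples_table(diffed)
```

Styling the `difference` class in CSS (or passing different `:difference_start` and `:difference_end` markers) then controls how the differing characters stand out.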
103
+
104
+ ## Contributing
105
+
106
+ EncodingSampler provides a functional but not-so-elegant solution.
107
+ I'd love to see improvements or alternate ideas in regard to the concept, the algorithms, the interface, etc.
108
+
109
+ 1. Fork it
110
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
111
+ 3. Commit your changes (`git commit -am 'Added some feature'`)
112
+ 4. Push to the branch (`git push origin my-new-feature`)
113
+ 5. Create new Pull Request
@@ -0,0 +1,2 @@
1
+ #!/usr/bin/env rake
2
+ require "bundler/gem_tasks"
@@ -0,0 +1,32 @@
1
+ # -*- encoding: utf-8 -*-
2
+ require File.expand_path('../lib/encoding_sampler/version', __FILE__)
3
+
4
+ Gem::Specification.new do |s|
5
+ s.authors = ["Tom Wilson"]
6
+ s.email = ["tom@rollnorocks.com"]
7
+ s.summary = %q{Encoding Sampler extracts a concise sample from a text file to simplify selecting the right encoding.}
8
+ s.description = %q{EncodingSampler helps solve the problem of what to do when the character encoding is unknown, for example when a user is uploading a file but has no idea of its encoding (or typically, even what "character encoding" means.) EncodingSampler extracts a concise set of samples from the selected file for display so the user can choose wisely.}
9
+ s.homepage = ""
10
+
11
+ s.files = `git ls-files`.split($\)
12
+ s.executables = s.files.grep(%r{^bin/}).map{ |f| File.basename(f) }
13
+ s.test_files = s.files.grep(%r{^(test|spec|features)/})
14
+ s.name = "encoding_sampler"
15
+ s.require_paths = ["lib"]
16
+ s.version = EncodingSampler::VERSION
17
+
18
+ s.rdoc_options = %w(--line-numbers --inline-source --title encoding_sampler --main README.md)
19
+ s.extra_rdoc_files = %w(README.md CHANGELOG.md LICENSE)
20
+
21
+ s.add_dependency('diff-lcs', '1.1.3')
22
+
23
+ s.add_development_dependency("rake")
24
+ s.add_development_dependency("debugger")
25
+
26
+ s.add_development_dependency("rspec")
27
+ s.add_development_dependency("fakefs")
28
+ s.add_development_dependency("simplecov")
29
+
30
+ s.add_development_dependency("yard")
31
+ s.add_development_dependency("redcarpet")
32
+ end
@@ -0,0 +1 @@
1
+ require "encoding_sampler/sampler"
data/lib/encoding_sampler/diff_callbacks.rb ADDED
@@ -0,0 +1,66 @@
1
+ require "cgi"
2
+
3
+ module EncodingSampler
4
+
5
+ # Simple formatter to override Diff::LCS::DiffCallbacks in diff-lcs Gem to generate diffed output.
6
+ class DiffCallbacks
7
+
8
+ attr_accessor :output
9
+ # @!attribute output
10
+ # @return [String] Storage for the resultant diffed output.
11
+
12
+ attr_reader :difference_start
13
+ # @!attribute [r] difference_start
14
+ # @return [String] The string inserted in the diff results __before__ a segment where the samples differ.
15
+ # Set as option on initialization.
16
+
17
+ attr_reader :difference_end
18
+ # @!attribute [r] difference_end
19
+ # @return [String] The string inserted in the diff results __after__ a segment where the samples differ.
20
+ # Set as option on initialization.
21
+
22
+ # @return [DiffCallbacks] Returns a new instance of EncodingSampler::DiffCallbacks.
23
+ # @param [Hash] options
24
+ # Valid keys are :difference_start and :difference_end.
25
+ # @see #difference_start
26
+ # @see #difference_end
27
+ def initialize(output, options = {})
28
+ @output = output
29
+ options ||= {}
30
+ @difference_start = options[:difference_start] ||= '<span class="difference">'
31
+ @difference_end = options[:difference_end] ||= '</span>'
32
+ end
33
+
34
+ # Called when both strings are the same
35
+ def match(event)
36
+ output_matched event.old_element
37
+ end
38
+
39
+ # Called when there is a substring in A that isn't in B
40
+ def discard_a(event)
41
+ output_changed event.old_element
42
+ end
43
+
44
+ # Called when there is a substring in B that isn't in A
45
+ def discard_b(event)
46
+ output_changed event.new_element
47
+ end
48
+
49
+ private
50
+
51
+ def output_matched(element)
52
+ element = CGI.escapeHTML(element.chomp)
53
+ @output << "#{element}" unless element.empty?
54
+ end
55
+
56
+ def output_changed(element)
57
+ element = CGI.escapeHTML(element.chomp)
58
+ return if element.empty?
59
+ @output << "#{@difference_start}#{element}#{@difference_end}"
60
+ # Join adjacent changed sections
61
+ @output.gsub! "#{@difference_end}#{@difference_start}", ''
62
+ end
63
+
64
+ end
65
+
66
+ end
data/lib/encoding_sampler/sampler.rb ADDED
@@ -0,0 +1,156 @@
1
+ require 'encoding_sampler/version'
2
+ require 'encoding_sampler/diff_callbacks'
3
+ require 'diff-lcs'
4
+
5
+ module EncodingSampler
6
+
7
+ # @!attribute [r] filename
8
+ # @!attribute [r] unique_valid_encoding_groups
9
+ class Sampler
10
+
11
+ # Full name of the target file used to create the sample.
12
+ # @return [String]
13
+ attr_reader :filename
14
+
15
+ # Groups of valid encoding names, such that the encodings in a group all result in the same decoding for the target file.
16
+ # @example When ISO-8859-1 and ISO-8859-2 decode the target file in exactly the same way, but unlike ISO-8859-15,
17
+ # [["ISO-8859-1", 'ISO-8859-2'], ["ISO-8859-15"]]
18
+ # @return [Array]
19
+ attr_reader :unique_valid_encoding_groups
20
+
21
+ # Attribute renamed for clarity.
22
+ # @deprecated Use {#unique_valid_encoding_groups} instead.
23
+ def unique_valid_encodings
24
+ unique_valid_encoding_groups
25
+ end
26
+
27
+ # All valid encodings.
28
+ # @return [Array] Names of encodings that return valid results for the entire file.
29
+ def valid_encodings
30
+ unique_valid_encoding_groups.flatten
31
+ end
32
+
33
+ # Sample file lines, decoded by _encoding_.
34
+ # @return [Array]
35
+ def sample(encoding)
36
+ @binary_samples.values.map {|line| decode_binary_string(line, encoding)}
37
+ end
38
+
39
+ # Returns a hash of samples, keyed by encoding
40
+ # @return [Hash]
41
+ def samples(encodings = valid_encodings)
42
+ encodings.inject({}) {|hash, encoding| hash.merge! encoding => sample(encoding)}
43
+ end
44
+
45
+ # Returns all the "best" encodings. Assumes shortest strings are most likely to be correct.
46
+ # @return [Array]
47
+ def best_encodings
48
+ candidates = samples(unique_valid_encoding_groups.collect {|encoding_group| encoding_group.first})
49
+ min_length = candidates.values.collect {|ary| ary.join('').size}.min
50
+ candidates.keys.select {|key| candidates[key].join('').size == min_length}
51
+ end
52
+
53
+ # Multiple encodings often return the exact same decoded sample.
54
+ # Return only unique samples, keyed on the first encoding to return each sample.
55
+ # What's first in each grouping is based on the original order of the encodings given to the constructor.
56
+ # @return [Hash]
57
+ def unique_samples
58
+ samples(unique_valid_encoding_groups.collect {|encoding_group| encoding_group.first})
59
+ end
60
+
61
+ # Decoded sample, diffed against __all__ of the samples, and marked up to show differences.
62
+ # @param [String] encoding
63
+ # @return [Array]
64
+ def diffed_sample(encoding)
65
+ diffed_encoded_samples[encoding]
66
+ end
67
+
68
+ def diffed_samples(encodings = valid_encodings)
69
+ encodings.inject({}) {|hash, encoding| hash.merge! encoding => diffed_sample(encoding)}
70
+ end
71
+
72
+ # Like {#unique_samples}, but the samples are diffed.
73
+ def unique_diffed_samples
74
+ diffed_samples(unique_valid_encoding_groups.collect {|encoding_group| encoding_group.first})
75
+ end
76
+
77
+ private
78
+
79
+ def initialize(file_name, encodings, diff_options = {})
80
+ @diff_options = diff_options
81
+ @filename = file_name.freeze
82
+ @unique_valid_encoding_groups, @binary_samples, solutions = [], {}, {}
83
+
84
+ solutions = {}
85
+ encodings.sort.combination(2).to_a.each {|pair| solutions[pair] = nil}
86
+
87
+ # read the entire file to verify encodings and collect samples for comparison of encodings
88
+ File.open(@filename, 'rb') do |file|
89
+ until file.eof?
90
+ binary_line = file.readline.strip
91
+ decoded_lines = multi_decode_binary_string(binary_line, encodings)
92
+
93
+ # eliminate any newly-invalid encodings from the scope
94
+ decoded_lines.select {|encoding, decoded_line| decoded_line.nil?}.keys.each do |invalid_encoding|
95
+ encodings.delete invalid_encoding
96
+ solutions.delete_if {|pair, lineno| pair.include? invalid_encoding}
97
+ @binary_samples.keep_if {|id, string| solutions.keys.flatten.include? id}
98
+ end
99
+
100
+ # add sample to solutions when binary string decodes differently for any two previously-undifferentiated encodings
101
+ solutions.select {|pair, lineno| lineno.nil?}.keys.each do |unsolved_pair|
102
+ solutions[unsolved_pair], @binary_samples[file.lineno] = file.lineno, binary_line if decoded_lines[unsolved_pair[0]] != decoded_lines[unsolved_pair[1]]
103
+ end
104
+ end
105
+ end
106
+
107
+ # group undifferentiated encodings
108
+ (solutions.select {|pair, lineno| lineno.nil?}.keys + encodings.collect {|encoding| [encoding]}).each do |subgroup|
109
+ group_index = @unique_valid_encoding_groups.index {|group| !(group & subgroup).empty?}
110
+ group_index ? @unique_valid_encoding_groups[group_index] |= subgroup : @unique_valid_encoding_groups << subgroup
111
+ end
112
+
113
+ @unique_valid_encoding_groups = @unique_valid_encoding_groups.each {|group| group.freeze}.freeze
114
+ @binary_samples.freeze
115
+ end
116
+
117
+ def decode_binary_string(binary_string, encoding)
118
+ encoded_string = binary_string.dup.force_encoding(encoding)
119
+ encoded_string.valid_encoding? ? encoded_string.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?') : nil
120
+ end
121
+
122
+ def multi_decode_binary_string(binary_string, encodings)
123
+ decoded_lines = {}
124
+ encodings.each {|encoding| decoded_lines[encoding] = decode_binary_string(binary_string, encoding)}
125
+ decoded_lines
126
+ end
127
+
128
+ def diffed_strings(array_of_strings)
129
+ lcs = array_of_strings.inject {|intermediate_lcs, string| Diff::LCS.LCS(intermediate_lcs, string).join }
130
+ callbacks = DiffCallbacks.new(diff_output = '', @diff_options)
131
+ array_of_strings.map do |string|
132
+ diff_output.clear
133
+ Diff::LCS.traverse_sequences(lcs, string, callbacks)
134
+ diff_output.dup
135
+ end
136
+ end
137
+
138
+ def diffed_encoded_samples
139
+ return @diffed_encoded_samples if @diffed_encoded_samples
140
+
141
+ encodings = valid_encodings.freeze
142
+ decoded_samples = samples(encodings)
143
+ @diffed_encoded_samples = encodings.inject({}) {|hash, key| hash.merge! key => []}
144
+
145
+ @binary_samples.values.each_index do |i|
146
+ decoded_lines = encodings.map {|encoding| decoded_samples[encoding][i]}
147
+ diffed_encoded_lines = diffed_strings(decoded_lines)
148
+ encodings.each_index {|j| @diffed_encoded_samples[encodings[j]] << diffed_encoded_lines[j] }
149
+ end
150
+
151
+ @diffed_encoded_samples.freeze
152
+ end
153
+
154
+ end
155
+
156
+ end
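
The `best_encodings` heuristic above prefers the encodings whose decoded samples are shortest. One reason that tends to work: when UTF-8 data is misread as a single-byte encoding, every multi-byte character expands into several characters. The sketch below illustrates this with plain Ruby; `decoded_length` is a hypothetical helper, not a method of `Sampler`.

```ruby
# Number of characters obtained when the raw bytes are read under an encoding.
# (decoded_length is a hypothetical helper, not part of the gem.)
def decoded_length(bytes, encoding_name)
  bytes.dup.force_encoding(encoding_name).encode('UTF-8').size
end

# "5 €" stored as UTF-8: three characters, five bytes.
raw = "5 \u20AC".dup.force_encoding('BINARY')

puts decoded_length(raw, 'UTF-8')       # 3
puts decoded_length(raw, 'ISO-8859-1')  # 5
```

The heuristic is only a hint: two single-byte encodings always produce strings of equal length, so it cannot separate ISO-8859-1 from ISO-8859-15.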
@@ -0,0 +1,3 @@
1
+ module EncodingSampler
2
+ VERSION = "0.3.0"
3
+ end
@@ -0,0 +1,525 @@
1
+ require "spec_helper.rb"
2
+
3
+ include EncodingSampler
4
+
5
+ describe Sampler do
6
+ context 'with fakefs', fakefs: true do
7
+ before(:each) do
8
+ @filedir = '/test'
9
+ @filename = '/test/testfile'
10
+ @lines = %w(one two three four five)
11
+ FileUtils.mkdir(@filedir) unless Dir.exists?(@filedir)
12
+ File.open(@filename, "w") do |f|
13
+ @lines.each do |line|
14
+ f.puts line
15
+ end
16
+ end
17
+ @test_sampler = Sampler.new(@filename, %w(US-ASCII UTF-8))
18
+ end
19
+
20
+ describe 'verifying fakefs just to make sure' do
21
+ # Make sure this works right after all the trouble with home-grown file system stubs!!
22
+
23
+ it 'can open and readline without error' do
24
+ expect {
25
+ File.open(@filename, 'r') do |file|
26
+ until file.eof?
27
+ file.readline
28
+ end
29
+ end
30
+ }.to_not raise_error
31
+ end
32
+
33
+ it 'raises EOFError when readline called past eof' do
34
+ expect {
35
+ File.open(@filename, 'r') do |file|
36
+ until file.eof?
37
+ file.readline
38
+ end
39
+ file.readline
40
+ end
41
+ }.to raise_error(EOFError)
42
+ end
43
+
44
+ it 'readline returns lines' do
45
+ lines_read = []
46
+ File.open(@filename, 'r') do |file|
47
+ until file.eof?
48
+ lines_read << file.readline.chomp
49
+ end
50
+ end
51
+ lines_read.should eq @lines
52
+ end
53
+
54
+ end
55
+
56
+ describe 'creation' do
57
+
58
+ it 'works with required arguments' do
59
+ Sampler.new(@filename, []).should be_a Sampler
60
+ end
61
+
62
+ it 'requires a filename' do
63
+ expect {Sampler.new()}.to raise_error ArgumentError
64
+ end
65
+
66
+ it 'requires encodings' do
67
+ expect {Sampler.new(@filename)}.to raise_error ArgumentError
68
+ end
69
+
70
+ it 'passes error raised on File.open' do
71
+ File.stub(:open).and_raise 'some error'
72
+ expect {Sampler.new(@filename, [])}.to raise_error('some error')
73
+ end
74
+
75
+ it 'passes error raised on file.readline' do
76
+ File.any_instance.stub(:readline).and_raise 'some error'
77
+ expect {Sampler.new(@filename, [])}.to raise_error('some error')
78
+ end
79
+
80
+ end
81
+
82
+ describe "#filename" do
83
+
84
+ it 'returns the same filename used to create the instance' do
85
+ Sampler.new(@filename, []).filename.should eq @filename
86
+ end
87
+
88
+ it 'is read-only' do
89
+ expect {Sampler.new(@filename, []).filename = 'anything'}.to raise_error NoMethodError
90
+ end
91
+
92
+ end
93
+
94
+ describe '#unique_valid_encoding_groups' do
95
+ before(:each) do
96
+ Sampler.any_instance.stub(:decode_binary_string) do |*args|
97
+ if ['ENCODING1', 'LIKE_ENCODING1'].include? args[1]
98
+ args[0]
99
+ else
100
+ args[0].gsub(/t/, 'T')
101
+ end
102
+ end
103
+ end
104
+
105
+ it 'is read-only' do
106
+ expect {Sampler.new(@filename, []).unique_valid_encoding_groups = 'anything'}.to raise_error NoMethodError
107
+ end
108
+
109
+ it 'is frozen' do
110
+ Sampler.new(@filename, []).unique_valid_encoding_groups.should be_frozen
111
+ end
112
+
113
+ shared_examples 'unique_valid_encoding_groups format is correct' do
114
+
115
+ it 'returns an array' do
116
+ @sampler.unique_valid_encoding_groups.should be_a Array
117
+ end
118
+
119
+ it 'each array element is an array of strings (encoding names)' do
120
+ @sampler.unique_valid_encoding_groups.each do |element|
121
+ element.should be_a Array
122
+ element.each do |encoding|
123
+ encoding.should be_a String
124
+ end
125
+ end
126
+ end
127
+
128
+ it 'array elements do not share members with other elements' do
129
+ @sampler.unique_valid_encoding_groups.flatten.size.should eq @sampler.unique_valid_encoding_groups.flatten.uniq.size
130
+ end
131
+
132
+ end
133
+
134
+ context 'when there are no lines read' do
135
+ before(:each) do
136
+ File.any_instance.stub(:eof?).and_return true
137
+ @sampler = Sampler.new(@filename, %w(ENCODING1 LIKE_ENCODING1 UNLIKE_ENCODING1))
138
+ end
139
+
140
+ it_behaves_like 'unique_valid_encoding_groups format is correct'
141
+
142
+ it 'returns all encodings in a single array element' do
143
+ @sampler.unique_valid_encoding_groups.count.should eq 1
144
+ end
145
+
146
+ it 'contains all valid encodings' do
147
+ @sampler.unique_valid_encoding_groups.flatten.size.should eq 3
148
+ end
149
+
150
+ end
151
+
152
+ context 'when all encodings work the same' do
153
+ before(:each) do
154
+ @sampler = Sampler.new(@filename, %w(ENCODING1 LIKE_ENCODING1))
155
+ end
156
+
157
+ it_behaves_like 'unique_valid_encoding_groups format is correct'
158
+
159
+ it 'returns all encodings in a single array element' do
160
+ @sampler.unique_valid_encoding_groups.count.should eq 1
161
+ end
162
+
163
+ it 'the single array element contains all valid encodings' do
164
+ @sampler.unique_valid_encoding_groups[0].should eq %w(ENCODING1 LIKE_ENCODING1)
165
+ end
166
+
167
+ end
168
+
169
+ context 'when encodings are different' do
170
+ before(:each) do
171
+ @sampler = Sampler.new(@filename, %w(ENCODING1 UNLIKE_ENCODING1))
172
+ end
173
+
174
+ it_behaves_like 'unique_valid_encoding_groups format is correct'
175
+
176
+ it 'returns all encodings in two array elements' do
177
+ @sampler.unique_valid_encoding_groups.count.should eq 2
178
+ end
179
+
180
+ it 'the first array element contains one of the valid encodings' do
181
+ %w(ENCODING1 UNLIKE_ENCODING1).should include @sampler.unique_valid_encoding_groups[0][0]
182
+ end
183
+
184
+ it 'the second array element contains one of the valid encodings' do
185
+ %w(ENCODING1 UNLIKE_ENCODING1).should include @sampler.unique_valid_encoding_groups[1][0]
186
+ end
187
+
188
+ it 'the array elements contain all valid encodings' do
189
+ @sampler.unique_valid_encoding_groups.flatten.sort.should eq %w(ENCODING1 UNLIKE_ENCODING1).sort
190
+ end
191
+
192
+ end
193
+ end
194
+
195
+ describe '#valid_encodings' do
196
+
197
+ it 'should contain all encodings in unique_valid_encoding_groups' do
198
+ @test_sampler.valid_encodings.sort.should eq @test_sampler.unique_valid_encodings.flatten.sort
199
+ end
200
+
201
+ end
202
+
203
+ describe '#sample' do
204
+ before(:each) do
205
+ Sampler.any_instance.stub(:decode_binary_string) do |*args|
206
+ if ['ENCODING1', 'LIKE_ENCODING1'].include? args[1]
207
+ args[0]
208
+ else
209
+ args[0].gsub(/t/, 'T')
210
+ end
211
+ end
212
+ end
213
+
214
+ shared_examples 'sample format is correct' do
215
+
216
+ it 'returns an array for each valid encoding' do
217
+ @sampler.valid_encodings.each do |encoding|
218
+ @sampler.sample(encoding).should be_a Array
219
+ end
220
+ end
221
+
222
+ it 'elements are strings (decoded lines)' do
223
+ @sampler.sample('ENCODING1').each do |element|
224
+ element.should be_a String
225
+ end
226
+ end
227
+
228
+ end
229
+
230
+ context 'when there are no lines read' do
231
+ before(:each) do
232
+ File.any_instance.stub(:eof?).and_return true
233
+ @sampler = Sampler.new(@filename, %w(ENCODING1 LIKE_ENCODING1 UNLIKE_ENCODING1))
234
+ end
235
+
236
+ it_behaves_like 'sample format is correct'
237
+
238
+ it 'it is empty' do
239
+ %w(ENCODING1 LIKE_ENCODING1 UNLIKE_ENCODING1).each do |encoding|
240
+ @sampler.sample(encoding).should be_empty
241
+ end
242
+ end
243
+
244
+ end
245
+
246
+
247
+ context 'when all encodings work the same' do
248
+ before(:each) do
249
+ @sampler = Sampler.new(@filename, %w(ENCODING1 LIKE_ENCODING1))
250
+ end
251
+
252
+ it_behaves_like 'sample format is correct'
253
+
254
+ it 'it is empty' do
255
+ %w(ENCODING1 LIKE_ENCODING1).each do |encoding|
256
+ @sampler.sample(encoding).should be_empty
257
+ end
258
+ end
259
+
260
+ end
261
+
262
+ context 'when encodings are different' do
263
+ before(:each) do
264
+ @sampler = Sampler.new(@filename, %w(ENCODING1 UNLIKE_ENCODING1))
265
+ end
266
+
267
+ it_behaves_like 'sample format is correct'
268
+
269
+ it 'it is not empty' do
270
+ %w(ENCODING1 UNLIKE_ENCODING1).each do |encoding|
271
+ @sampler.sample(encoding).should_not be_empty
272
+ end
273
+ end
274
+
275
+ it 'the samples values should not be equal' do
276
+ @sampler.sample('ENCODING1').should_not eq @sampler.sample('UNLIKE_ENCODING1')
277
+ end
278
+
279
+ end
280
+ end
281
+
282
+ describe '#samples' do
283
+ before(:each) do
284
+ Sampler.any_instance.stub(:decode_binary_string) do |*args|
285
+ if ['ENCODING1', 'LIKE_ENCODING1'].include? args[1]
286
+ args[0]
287
+ else
288
+ args[0].gsub(/t/, 'T')
289
+ end
290
+ end
291
+ end
292
+
293
+ context 'when there are no lines read' do
294
+ before(:each) do
295
+ File.any_instance.stub(:eof?).and_return true
296
+ @sampler = Sampler.new(@filename, %w(ENCODING1 LIKE_ENCODING1 UNLIKE_ENCODING1))
297
+ end
298
+
299
+ it 'each included sample is empty' do
300
+ @sampler.samples.each {|encoding, sample| sample.should be_empty}
301
+ end
302
+
303
+ end
304
+
305
+
306
+ context 'when all encodings work the same' do
307
+ before(:each) do
308
+ @sampler = Sampler.new(@filename, %w(ENCODING1 LIKE_ENCODING1))
309
+ end
310
+
311
+ it 'each included sample is empty' do
312
+ @sampler.samples.each {|encoding, sample| sample.should be_empty}
313
+ end
314
+
315
+ end
316
+
317
+ context 'when encodings are different' do
318
+ before(:each) do
319
+ @sampler = Sampler.new(@filename, %w(ENCODING1 UNLIKE_ENCODING1))
320
+ end
321
+
322
+ it 'it is not empty' do
323
+ %w(ENCODING1 LIKE_ENCODING1).each do |encoding|
324
+ @sampler.samples.should_not be_empty
325
+ end
326
+ end
327
+
328
+ it 'should have a sample for each valid encoding' do
329
+ (@sampler.samples.keys & @sampler.valid_encodings).sort.should eq @sampler.valid_encodings.sort
330
+ end
331
+
332
+ it 'each sample value (the string samples) should be the same size' do
333
+ sample_values = @sampler.samples.values
334
+ sample_values.each do |sample_value|
335
+ sample_value.size.should eq sample_values.first.size
336
+ end
337
+ end
338
+
339
+ # it 'the sample values should not be equal, duh' do
340
+ # samples = @sampler.samples
341
+ # samples.values.each do |string_array|
342
+ # samples['ENCODING1'][key].should_not eq samples['UNLIKE_ENCODING1'][key]
343
+ # end
344
+ # end
345
+
346
+ end
347
+
348
+ end
349
+
350
+ describe '#best_encodings' do
351
+ before(:each) do
352
+ Sampler.any_instance.stub(:decode_binary_string) do |*args|
353
+ case args[1]
354
+ when 'SHORTEST_ENCODING' then args[0]
355
+ when 'LIKE_SHORTEST_ENCODING' then args[0].reverse # same length and different is all that matters
356
+ when 'INVALID_ENCODING' then nil
357
+ else args[0].gsub(/t/, 'T&#') # force longer faked encoding for letter 't'
358
+ end
359
+ end
360
+ end
361
+
362
+ context 'no valid encodings' do
363
+ before(:each) do
364
+ @sampler = Sampler.new(@filename, %w(INVALID_ENCODING))
365
+ end
366
+ it 'returns empty array' do
367
+ @sampler.best_encodings.should eq []
368
+ end
369
+ end
370
+
371
+ context 'one valid encoding' do
372
+ before(:each) do
373
+ @sampler = Sampler.new(@filename, %w(SHORTEST_ENCODING))
374
+ end
375
+ it 'returns an array with the one shortest encoding' do
376
+ @sampler.best_encodings.should eq ['SHORTEST_ENCODING']
377
+ end
378
+ end
379
+
380
+ context 'when one shortest encoding' do
381
+ before(:each) do
382
+ @sampler = Sampler.new(@filename, %w(SHORTEST_ENCODING LONGER_ENCODING))
383
+ end
384
+ it 'returns an array with the one shortest encoding' do
385
+ @sampler.best_encodings.should eq ['SHORTEST_ENCODING']
386
+ end
387
+ end
388
+
389
+ context 'when more than one shortest encoding' do
390
+ before(:each) do
391
+ @sampler = Sampler.new(@filename, %w(SHORTEST_ENCODING LIKE_SHORTEST_ENCODING LONGER_ENCODING))
392
+ end
393
+ it 'returns an array with the shortest encodings' do
394
+ @sampler.best_encodings.should eq ['SHORTEST_ENCODING', 'LIKE_SHORTEST_ENCODING']
395
+ end
396
+ end
397
+
398
+ end
399
+
400
+ describe '#unique_samples' do
401
+
402
+ it 'should return a Hash' do
403
+ @test_sampler.unique_samples.should be_a Hash
404
+ end
405
+
406
+ it 'should have keys equal to first item from each valid_encoding_group' do
407
+ @test_sampler.unique_samples.keys.should eq @test_sampler.unique_valid_encoding_groups.collect {|group| group.first}
408
+ end
409
+
410
+ it 'should provide the right sample value for each key' do
411
+ @test_sampler.unique_samples.keys.each do |encoding|
412
+ @test_sampler.unique_samples[encoding].should eq @test_sampler.sample(encoding)
413
+ end
414
+ end
415
+
416
+ end
417
+
418
+ describe '#diffed_sample' do
419
+ before(:each) do
420
+ Sampler.any_instance.stub(:decode_binary_string) do |*args|
421
+ if ['ENCODING1', 'LIKE_ENCODING1'].include? args[1]
422
+ args[0]
423
+ else
424
+ args[0].gsub(/t/, 'T')
425
+ end
426
+ end
427
+ @sampler = Sampler.new(@filename, %w(ENCODING1 LIKE_ENCODING1 UNLIKE_ENCODING1))
428
+ end
429
+
430
+ it 'works' do
431
+ @sampler.diffed_sample('ENCODING1')
432
+ end
433
+
434
+ it 'returns an array' do
435
+ # note: two different encodings that express different results only take one sample
436
+ @sampler.diffed_sample('ENCODING1').should be_a Array
437
+ end
438
+
439
+ it 'has one line for each sample' do
440
+ # note: two different encodings that express different results only take one sample
441
+ @sampler.diffed_sample('ENCODING1').size.should eq 1
442
+ end
443
+
444
+ it 'returns identical results when the decoded strings are the same' do
445
+ @sampler.diffed_sample('ENCODING1').should eq @sampler.diffed_sample('LIKE_ENCODING1')
446
+ end
447
+
448
+ it 'returns different results when the decoded strings are different' do
449
+ @sampler.diffed_sample('ENCODING1').should_not eq @sampler.diffed_sample('UNLIKE_ENCODING1')
450
+ end
451
+
452
+ end
453
+
454
+ describe '#diffed_samples' do
455
+ before(:each) do
456
+ Sampler.any_instance.stub(:decode_binary_string) do |*args|
457
+ if ['ENCODING1', 'LIKE_ENCODING1'].include? args[1]
458
+ args[0]
459
+ else
460
+ args[0].gsub(/t/, 'T')
461
+ end
462
+ end
463
+ end
464
+
465
+ context 'with default options' do
466
+ before(:each) do
467
+ @sampler = Sampler.new(@filename, %w(ENCODING1 LIKE_ENCODING1 UNLIKE_ENCODING1))
468
+ end
469
+
470
+ it 'works' do
471
+ @sampler.diffed_samples(['ENCODING1'])
472
+ end
473
+
474
+ it 'returns a hash' do
475
+ # note: two different encodings that express different results only take one sample
476
+ @sampler.diffed_samples(['ENCODING1']).should be_a Hash
477
+ end
478
+
479
+ it 'keys match encodings in argument' do
480
+ # note: two different encodings that express different results only take one sample
481
+ @sampler.diffed_samples(['ENCODING1','UNLIKE_ENCODING1']).keys.should eq ['ENCODING1','UNLIKE_ENCODING1']
482
+ end
483
+
484
+ it 'values match the values from diffed_sample for the same encoding' do
485
+ @sampler.diffed_samples(['ENCODING1'])['ENCODING1'].should eq @sampler.diffed_sample('ENCODING1')
486
+ end
487
+ end
488
+
489
+ context 'with custom :difference_start, :difference_end options' do
490
+ before(:each) do
491
+ @sampler = Sampler.new(@filename, %w(ENCODING1 LIKE_ENCODING1 UNLIKE_ENCODING1), difference_start: '<start>', difference_end: '<end>')
492
+ end
493
+
494
+ it 'uses difference_start value specified in options hash' do
495
+ @sampler.diffed_sample('ENCODING1').join.should include '<start>'
496
+ end
497
+
498
+ it 'uses difference_end value specified in options hash' do
499
+ @sampler.diffed_sample('ENCODING1').join.should include '<end>'
500
+ end
501
+
502
+ end
503
+
504
+ end
505
+
506
+ describe '#unique_diffed_samples' do
507
+
508
+ it 'should return a Hash' do
509
+ @test_sampler.unique_diffed_samples.should be_a Hash
510
+ end
511
+
512
+ it 'should have keys equal to first item from each valid_encoding_group' do
513
+ @test_sampler.unique_diffed_samples.keys.should eq @test_sampler.unique_valid_encoding_groups.collect {|group| group.first}
514
+ end
515
+
516
+ it 'should provide the right sample value for each key' do
517
+ @test_sampler.unique_diffed_samples.keys.each do |encoding|
518
+ @test_sampler.unique_diffed_samples[encoding].should eq @test_sampler.diffed_sample(encoding)
519
+ end
520
+ end
521
+
522
+ end
523
+
524
+ end
525
+ end
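The specs above key their results on the first member of each group of valid encodings, which suggests that encodings producing the same decoded text are grouped together. A minimal sketch of that idea follows, using only core String methods. The helper name `group_encodings_by_result` is hypothetical and is not taken from the gem; the gem's own grouping logic may differ.

```ruby
# Hypothetical helper (not the gem's API): group encoding names by the
# UTF-8 text they produce when used to interpret the same raw bytes.
def group_encodings_by_result(binary, encodings)
  groups = Hash.new { |hash, key| hash[key] = [] }

  encodings.each do |name|
    # Reinterpret the bytes under the candidate encoding without changing them.
    candidate = binary.dup.force_encoding(name)
    next unless candidate.valid_encoding?

    begin
      decoded = candidate.encode('UTF-8')
    rescue EncodingError
      # The bytes have no mapping to UTF-8 under this encoding; treat as invalid.
      next
    end

    groups[decoded] << name
  end

  # Each inner array holds the encodings that decoded to identical text.
  groups.values
end

# Byte 0xA4 is the currency sign in ISO-8859-1 and Windows-1252,
# the euro sign in ISO-8859-15, and an invalid sequence in UTF-8.
groups = group_encodings_by_result("\xA4".b, %w(UTF-8 ISO-8859-1 WINDOWS-1252 ISO-8859-15))
p groups
```

Under this sketch only one sample per group would need to be shown to a user, since every encoding in a group yields the same text.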
@@ -0,0 +1,42 @@
1
+ require "spec_helper"
2
+
3
+ include EncodingSampler
4
+
5
+ describe Sampler do
6
+
7
+ context 'with real files' do
8
+ before(:all) do
9
+ # create some encoded strings
10
+ @encodings = %w(ASCII-8BIT UTF-8 WINDOWS-1252 ISO-8859-1 ISO-8859-2 ISO-8859-15)
11
+ @special_chars = "\u20AC\u201C\u201d\u00A1\u00A2\u00A3\u00A9\u00AE\u00C4\u00C5\u00E4\u00E5"
12
+ @ascii_chars = "ABCDEFabcdef0123456789"
13
+ @mixed_lines = []
14
+ 3.times do
15
+ @mixed_lines << @ascii_chars # first three lines are the same for all encodings
16
+ end
17
+ (0..(@special_chars.length - 1)).each do |i|
18
+ @mixed_lines << @special_chars.chars.to_a[i] + @ascii_chars + @special_chars.chars.to_a[i] + @ascii_chars + @special_chars.chars.to_a[i]
19
+ end
20
+ # create temp files
21
+ @encoding_file_dir = './spec/files/'
22
+ Dir.mkdir(@encoding_file_dir) unless Dir.exist? @encoding_file_dir
23
+ @file_names = {}
24
+ @encodings.each do |encoding|
25
+ file_name = "#{@encoding_file_dir}#{encoding}.txt"
26
+ @file_names[encoding] = file_name
27
+ File.open(file_name, "w:#{encoding}") do |file|
28
+ # replace: '' to omit characters unavailable for the selected encoding, creating clean valid files
29
+ file.write @mixed_lines.join("\n").encode(encoding, invalid: :replace, undef: :replace, replace: '')
30
+ end
31
+ end
32
+ end
33
+
34
+ it 'can be created for each file encoding' do
35
+ @encodings.each do |encoding|
36
+ expect { Sampler.new(@file_names[encoding], @encodings) }.to_not raise_error
37
+ end
38
+ end
39
+
40
+ end
41
+
42
+ end
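The fixture setup above relies on `String#encode` with `invalid: :replace, undef: :replace, replace: ''` to drop characters the target encoding cannot represent. A small standalone sketch of that behavior follows. The helper name `encode_dropping_unavailable` is made up for illustration, and the sample string uses two characters from the spec's `@special_chars`.

```ruby
# Hypothetical helper: transcode text to the target encoding, silently
# omitting any character that has no representation there.
def encode_dropping_unavailable(text, encoding)
  text.encode(encoding, invalid: :replace, undef: :replace, replace: '')
end

# Euro sign, three ASCII letters, then a-umlaut.
text = "\u20ACABC\u00E4"

latin1  = encode_dropping_unavailable(text, 'ISO-8859-1')  # no euro sign in Latin-1
latin15 = encode_dropping_unavailable(text, 'ISO-8859-15') # euro sign is byte 0xA4
ascii   = encode_dropping_unavailable(text, 'US-ASCII')    # only the ASCII letters survive

[latin1, latin15, ascii].each do |encoded|
  puts "#{encoded.encoding}: #{encoded.bytes.map { |b| format('%02X', b) }.join(' ')}"
end
```

Because unrepresentable characters are removed rather than replaced with a marker, files written this way are valid in their own encoding but can differ in length from one encoding to another.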
@@ -0,0 +1,32 @@
1
+ require "spec_helper"
2
+
3
+ include EncodingSampler
4
+
5
+ # For ad-hoc testing using local file.
6
+ # Set env var FILENAME='filename'
7
+ # Optionally set ENCODINGS='encoding1 encoding2' etc
8
+ describe Sampler do
9
+
10
+ context "when ENV['FILENAME'] is set to a selected filename" do
11
+ let(:default_encodings) {%w(ASCII-8BIT UTF-8 WINDOWS-1252 ISO-8859-1 ISO-8859-2 ISO-8859-15)}
12
+
13
+ it 'works and displays the results' do
14
+ sampler, filename = nil, nil
15
+ filename = ENV['FILENAME']
16
+ encodings = ENV['ENCODINGS'] ? ENV['ENCODINGS'].split : default_encodings # ENCODINGS is a space-separated list
17
+
18
+ if filename.nil?
19
+ p "ENV['FILENAME'] is nil, skipping ad-hoc test."
20
+ else
21
+ filename.should_not be_nil
22
+ expect { sampler = Sampler.new(filename, encodings) }.to_not raise_error
23
+ p ''
24
+ p "Results for #{filename}:"
25
+ pp sampler.inspect
26
+ pp sampler.unique_diffed_samples
27
+ end
28
+ end
29
+
30
+ end
31
+
32
+ end
@@ -0,0 +1,15 @@
1
+ if ENV['COVERAGE']
2
+ require 'simplecov'
3
+ SimpleCov.start { add_filter '/spec/' }
4
+ end
5
+
6
+ require 'encoding_sampler'
7
+ require 'fakefs/spec_helpers'
8
+
9
+ RSpec.configure do |config|
10
+ config.treat_symbols_as_metadata_keys_with_true_values = true
11
+ config.run_all_when_everything_filtered = true
12
+ config.filter_run :focus
13
+
14
+ config.include FakeFS::SpecHelpers, fakefs: true
15
+ end
metadata ADDED
@@ -0,0 +1,209 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: encoding_sampler
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.3.0
5
+ prerelease:
6
+ platform: ruby
7
+ authors:
8
+ - Tom Wilson
9
+ autorequire:
10
+ bindir: bin
11
+ cert_chain: []
12
+ date: 2013-03-19 00:00:00.000000000 Z
13
+ dependencies:
14
+ - !ruby/object:Gem::Dependency
15
+ name: diff-lcs
16
+ requirement: !ruby/object:Gem::Requirement
17
+ none: false
18
+ requirements:
19
+ - - '='
20
+ - !ruby/object:Gem::Version
21
+ version: 1.1.3
22
+ type: :runtime
23
+ prerelease: false
24
+ version_requirements: !ruby/object:Gem::Requirement
25
+ none: false
26
+ requirements:
27
+ - - '='
28
+ - !ruby/object:Gem::Version
29
+ version: 1.1.3
30
+ - !ruby/object:Gem::Dependency
31
+ name: rake
32
+ requirement: !ruby/object:Gem::Requirement
33
+ none: false
34
+ requirements:
35
+ - - ! '>='
36
+ - !ruby/object:Gem::Version
37
+ version: '0'
38
+ type: :development
39
+ prerelease: false
40
+ version_requirements: !ruby/object:Gem::Requirement
41
+ none: false
42
+ requirements:
43
+ - - ! '>='
44
+ - !ruby/object:Gem::Version
45
+ version: '0'
46
+ - !ruby/object:Gem::Dependency
47
+ name: debugger
48
+ requirement: !ruby/object:Gem::Requirement
49
+ none: false
50
+ requirements:
51
+ - - ! '>='
52
+ - !ruby/object:Gem::Version
53
+ version: '0'
54
+ type: :development
55
+ prerelease: false
56
+ version_requirements: !ruby/object:Gem::Requirement
57
+ none: false
58
+ requirements:
59
+ - - ! '>='
60
+ - !ruby/object:Gem::Version
61
+ version: '0'
62
+ - !ruby/object:Gem::Dependency
63
+ name: rspec
64
+ requirement: !ruby/object:Gem::Requirement
65
+ none: false
66
+ requirements:
67
+ - - ! '>='
68
+ - !ruby/object:Gem::Version
69
+ version: '0'
70
+ type: :development
71
+ prerelease: false
72
+ version_requirements: !ruby/object:Gem::Requirement
73
+ none: false
74
+ requirements:
75
+ - - ! '>='
76
+ - !ruby/object:Gem::Version
77
+ version: '0'
78
+ - !ruby/object:Gem::Dependency
79
+ name: fakefs
80
+ requirement: !ruby/object:Gem::Requirement
81
+ none: false
82
+ requirements:
83
+ - - ! '>='
84
+ - !ruby/object:Gem::Version
85
+ version: '0'
86
+ type: :development
87
+ prerelease: false
88
+ version_requirements: !ruby/object:Gem::Requirement
89
+ none: false
90
+ requirements:
91
+ - - ! '>='
92
+ - !ruby/object:Gem::Version
93
+ version: '0'
94
+ - !ruby/object:Gem::Dependency
95
+ name: simplecov
96
+ requirement: !ruby/object:Gem::Requirement
97
+ none: false
98
+ requirements:
99
+ - - ! '>='
100
+ - !ruby/object:Gem::Version
101
+ version: '0'
102
+ type: :development
103
+ prerelease: false
104
+ version_requirements: !ruby/object:Gem::Requirement
105
+ none: false
106
+ requirements:
107
+ - - ! '>='
108
+ - !ruby/object:Gem::Version
109
+ version: '0'
110
+ - !ruby/object:Gem::Dependency
111
+ name: yard
112
+ requirement: !ruby/object:Gem::Requirement
113
+ none: false
114
+ requirements:
115
+ - - ! '>='
116
+ - !ruby/object:Gem::Version
117
+ version: '0'
118
+ type: :development
119
+ prerelease: false
120
+ version_requirements: !ruby/object:Gem::Requirement
121
+ none: false
122
+ requirements:
123
+ - - ! '>='
124
+ - !ruby/object:Gem::Version
125
+ version: '0'
126
+ - !ruby/object:Gem::Dependency
127
+ name: redcarpet
128
+ requirement: !ruby/object:Gem::Requirement
129
+ none: false
130
+ requirements:
131
+ - - ! '>='
132
+ - !ruby/object:Gem::Version
133
+ version: '0'
134
+ type: :development
135
+ prerelease: false
136
+ version_requirements: !ruby/object:Gem::Requirement
137
+ none: false
138
+ requirements:
139
+ - - ! '>='
140
+ - !ruby/object:Gem::Version
141
+ version: '0'
142
+ description: EncodingSampler helps solve the problem of what to do when the character
143
+ encoding is unknown, for example when a user is uploading a file but has no idea
144
+ of its encoding (or typically, even what "character encoding" means.) EncodingSampler
145
+ extracts a concise set of samples from the selected file for display so the user
146
+ can choose wisely.
147
+ email:
148
+ - tom@rollnorocks.com
149
+ executables: []
150
+ extensions: []
151
+ extra_rdoc_files:
152
+ - README.md
153
+ - CHANGELOG.md
154
+ - LICENSE
155
+ files:
156
+ - .gitignore
157
+ - .rspec
158
+ - .yardopts
159
+ - CHANGELOG.md
160
+ - Gemfile
161
+ - LICENSE
162
+ - README.md
163
+ - Rakefile
164
+ - encoding_sampler.gemspec
165
+ - lib/encoding_sampler.rb
166
+ - lib/encoding_sampler/diff_callbacks.rb
167
+ - lib/encoding_sampler/sampler.rb
168
+ - lib/encoding_sampler/version.rb
169
+ - spec/sampler_spec.rb
170
+ - spec/sampler_with_real_files_spec.rb
171
+ - spec/sampler_with_selected_file_spec.rb
172
+ - spec/spec_helper.rb
173
+ homepage: ''
174
+ licenses: []
175
+ post_install_message:
176
+ rdoc_options:
177
+ - --line-numbers
178
+ - --inline-source
179
+ - --title
180
+ - encoding_sampler
181
+ - --main
182
+ - README.md
183
+ require_paths:
184
+ - lib
185
+ required_ruby_version: !ruby/object:Gem::Requirement
186
+ none: false
187
+ requirements:
188
+ - - ! '>='
189
+ - !ruby/object:Gem::Version
190
+ version: '0'
191
+ required_rubygems_version: !ruby/object:Gem::Requirement
192
+ none: false
193
+ requirements:
194
+ - - ! '>='
195
+ - !ruby/object:Gem::Version
196
+ version: '0'
197
+ requirements: []
198
+ rubyforge_project:
199
+ rubygems_version: 1.8.24
200
+ signing_key:
201
+ specification_version: 3
202
+ summary: Encoding Sampler extracts a concise sample from a text file to simplify selecting
203
+ the right encoding.
204
+ test_files:
205
+ - spec/sampler_spec.rb
206
+ - spec/sampler_with_real_files_spec.rb
207
+ - spec/sampler_with_selected_file_spec.rb
208
+ - spec/spec_helper.rb
209
+ has_rdoc: