gigo 1.2.0 → 1.3.0
- data/README.md +15 -4
- data/gemfiles/activesupport30.gemfile.lock +6 -6
- data/gemfiles/activesupport31.gemfile.lock +6 -6
- data/gemfiles/activesupport32.gemfile.lock +9 -10
- data/gemfiles/activesupport40.gemfile.lock +11 -12
- data/gigo.gemspec +2 -2
- data/lib/gigo.rb +17 -19
- data/lib/gigo/transcoders.rb +7 -0
- data/lib/gigo/transcoders/active_support.rb +13 -0
- data/lib/gigo/transcoders/blind.rb +13 -0
- data/lib/gigo/transcoders/rchardet.rb +22 -0
- data/lib/gigo/version.rb +1 -1
- data/test/cases/gigo_test.rb +24 -3
- data/test/test_helper.rb +16 -4
- metadata +15 -13
- data/test/support/minitest.rb +0 -7
data/README.md
CHANGED
@@ -5,9 +5,7 @@ Or better yet, Garbage In, Gold Out! - The GIGO gem aims to fix ruby string enco
|
|
5
5
|
|
6
6
|
The GIGO gem is likely not the proper solution. If you have bad encodings in your database, you should fix them and write consistent encodings. That said, if you have no other choice, GIGO can help.
|
7
7
|
|
8
|
-
This gem depends on one of the many public forks of `CharDet` for Ruby. Since `CharDet` is not a public gem that follows proper semantic versioning, we have decided to vendor the [kirillrdy/rchardet](http://github.com/kirillrdy/rchardet) repo. We have even made sure that our vendored version stays in our namespace by using `GIGO::CharDet`. So if you have another version bundled, feel confident that the two will not conflict.
|
9
|
-
|
10
|
-
We use `GIGO::CharDet` to do the grunt work of finding the proper encoding of an untrusted string. Once found, we use the [EnsureValidEncoding](http://github.com/jrochkind/ensure_valid_encoding) gem to force an encoding while removing any non-convertible characters.
|
8
|
+
This gem depends on a series of transcoders, including `ActiveSupport::Multibyte#tidy_bytes`, along with one of the many public forks of `CharDet` for Ruby. Since `CharDet` is not a public gem that follows proper semantic versioning, we have decided to vendor the [kirillrdy/rchardet](http://github.com/kirillrdy/rchardet) repo. We have even made sure that our vendored version stays in our namespace by using `GIGO::CharDet`. So if you have another version bundled, feel confident that the two will not conflict.
|
11
9
|
|
12
10
|
|
13
11
|
## Usage
|
@@ -26,6 +24,19 @@ def comments
|
|
26
24
|
end
|
27
25
|
```
|
28
26
|
|
27
|
+
GIGO's encoding can be configured using the `GIGO.encoding` accessor. By default this is `Encoding.default_internal`, with a fallback to `Encoding::UTF_8`.
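The default described above can be sketched in plain Ruby. `GigoSketch` is a hypothetical stand-in module, not the gem itself; it only models a module-level accessor that starts at `Encoding.default_internal` and falls back to UTF-8 when that is `nil`.

```ruby
# Hypothetical stand-in for the GIGO module; only the accessor is modelled.
module GigoSketch
  class << self
    attr_accessor :encoding
  end

  # Encoding.default_internal is nil unless the process was configured
  # with one, in which case UTF-8 is used instead.
  self.encoding = Encoding.default_internal || Encoding::UTF_8
end

default_encoding = GigoSketch.encoding
GigoSketch.encoding = Encoding::CP1252 # override the target encoding
puts "default: #{default_encoding.name}, now: #{GigoSketch.encoding.name}"
```

Because the setting is global, anything that changes it temporarily (a test, for example) should restore the previous value afterwards.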
|
28
|
+
|
29
|
+
|
30
|
+
## Transcoders
|
31
|
+
|
32
|
+
GIGO transcoders can be any module or class that implements the `transcode` method. This method takes one argument, the string to transcode, and can hook into `GIGO.encoding` if needed. The default list of transcoders is:
|
33
|
+
|
34
|
+
* GIGO::Transcoders::ActiveSupport
|
35
|
+
* GIGO::Transcoders::CharDet
|
36
|
+
* GIGO::Transcoders::Blind
|
37
|
+
|
38
|
+
GIGO attempts to use each in that order. Upon successful transcoding, we use the [EnsureValidEncoding](http://github.com/jrochkind/ensure_valid_encoding) gem to force an encoding to match the `GIGO.encoding` while removing any non-convertible characters.
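As an illustration of the `transcode` contract, here is a minimal sketch of a custom transcoder. `Latin1Transcoder` and `TARGET_ENCODING` are hypothetical names; the constant stands in for `GIGO.encoding` so that the example runs without the gem installed.

```ruby
# Stand-in for GIGO.encoding (an assumption made for this sketch).
TARGET_ENCODING = Encoding::UTF_8

# Hypothetical transcoder: treats every input as ISO-8859-1 bytes.
module Latin1Transcoder
  # Takes one argument, the string to transcode, and returns a new string.
  def self.transcode(data)
    data.dup.force_encoding(Encoding::ISO_8859_1)
        .encode(TARGET_ENCODING, undef: :replace, invalid: :replace)
  end
end

raw = "caf\xE9".b # "café" as stored in ISO-8859-1, tagged as binary
puts Latin1Transcoder.transcode(raw)
```

The `lib/gigo.rb` hunk in this diff reads `options[:transcoders]`, so a module like this could presumably be supplied as `GIGO.load(data, :transcoders => [Latin1Transcoder])`; that call is inferred from the diff and not verified here.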
|
39
|
+
|
29
40
|
|
30
41
|
## Toe Dough List
|
31
42
|
|
@@ -45,6 +56,6 @@ $ bundle exec rake appraisal test
|
|
45
56
|
We use the [appraisal](https://github.com/thoughtbot/appraisal) gem from Thoughtbot to help us generate the individual gemfiles for each ActiveSupport version and to run the tests locally against each generated Gemfile. The `rake appraisal test` command actually runs our test suite against all Rails versions in our `Appraisal` file. If you want to run the tests for a specific Rails version, use `rake -T` for a list. For example, the following command will run the tests for Rails 3.2 only.
|
46
57
|
|
47
58
|
```shell
|
48
|
-
$ bundle exec rake appraisal:
|
59
|
+
$ bundle exec rake appraisal:activesupport32 test
|
49
60
|
```
|
50
61
|
|
@@ -1,7 +1,7 @@
|
|
1
1
|
PATH
|
2
2
|
remote: /Users/kencollins/Repositories/customink/gigo
|
3
3
|
specs:
|
4
|
-
gigo (1.
|
4
|
+
gigo (1.2.0)
|
5
5
|
activesupport (>= 3.0)
|
6
6
|
ensure_valid_encoding (~> 0.5.3)
|
7
7
|
|
@@ -9,12 +9,12 @@ GEM
|
|
9
9
|
remote: https://rubygems.org/
|
10
10
|
specs:
|
11
11
|
activesupport (3.0.20)
|
12
|
-
appraisal (0.5.
|
12
|
+
appraisal (0.5.2)
|
13
13
|
bundler
|
14
14
|
rake
|
15
15
|
ensure_valid_encoding (0.5.3)
|
16
|
-
|
17
|
-
minitest
|
16
|
+
i18n (0.6.4)
|
17
|
+
minitest (5.0.1)
|
18
18
|
rake (10.0.4)
|
19
19
|
|
20
20
|
PLATFORMS
|
@@ -24,6 +24,6 @@ DEPENDENCIES
|
|
24
24
|
activesupport (~> 3.0.0)
|
25
25
|
appraisal
|
26
26
|
gigo!
|
27
|
-
|
28
|
-
minitest
|
27
|
+
i18n
|
28
|
+
minitest (~> 5.0)
|
29
29
|
rake
|
@@ -1,7 +1,7 @@
|
|
1
1
|
PATH
|
2
2
|
remote: /Users/kencollins/Repositories/customink/gigo
|
3
3
|
specs:
|
4
|
-
gigo (1.
|
4
|
+
gigo (1.2.0)
|
5
5
|
activesupport (>= 3.0)
|
6
6
|
ensure_valid_encoding (~> 0.5.3)
|
7
7
|
|
@@ -10,12 +10,12 @@ GEM
|
|
10
10
|
specs:
|
11
11
|
activesupport (3.1.10)
|
12
12
|
multi_json (>= 1.0, < 1.3)
|
13
|
-
appraisal (0.5.
|
13
|
+
appraisal (0.5.2)
|
14
14
|
bundler
|
15
15
|
rake
|
16
16
|
ensure_valid_encoding (0.5.3)
|
17
|
-
|
18
|
-
minitest
|
17
|
+
i18n (0.6.4)
|
18
|
+
minitest (5.0.1)
|
19
19
|
multi_json (1.2.0)
|
20
20
|
rake (10.0.4)
|
21
21
|
|
@@ -26,6 +26,6 @@ DEPENDENCIES
|
|
26
26
|
activesupport (~> 3.1.0)
|
27
27
|
appraisal
|
28
28
|
gigo!
|
29
|
-
|
30
|
-
minitest
|
29
|
+
i18n
|
30
|
+
minitest (~> 5.0)
|
31
31
|
rake
|
@@ -1,24 +1,23 @@
|
|
1
1
|
PATH
|
2
2
|
remote: /Users/kencollins/Repositories/customink/gigo
|
3
3
|
specs:
|
4
|
-
gigo (1.
|
4
|
+
gigo (1.2.0)
|
5
5
|
activesupport (>= 3.0)
|
6
6
|
ensure_valid_encoding (~> 0.5.3)
|
7
7
|
|
8
8
|
GEM
|
9
9
|
remote: https://rubygems.org/
|
10
10
|
specs:
|
11
|
-
activesupport (3.2.
|
12
|
-
i18n (
|
11
|
+
activesupport (3.2.12)
|
12
|
+
i18n (~> 0.6)
|
13
13
|
multi_json (~> 1.0)
|
14
|
-
appraisal (0.5.
|
14
|
+
appraisal (0.5.2)
|
15
15
|
bundler
|
16
16
|
rake
|
17
17
|
ensure_valid_encoding (0.5.3)
|
18
|
-
i18n (0.6.
|
19
|
-
minitest (
|
20
|
-
|
21
|
-
multi_json (1.7.2)
|
18
|
+
i18n (0.6.4)
|
19
|
+
minitest (5.0.1)
|
20
|
+
multi_json (1.7.3)
|
22
21
|
rake (10.0.4)
|
23
22
|
|
24
23
|
PLATFORMS
|
@@ -28,6 +27,6 @@ DEPENDENCIES
|
|
28
27
|
activesupport (~> 3.2.0)
|
29
28
|
appraisal
|
30
29
|
gigo!
|
31
|
-
|
32
|
-
minitest
|
30
|
+
i18n
|
31
|
+
minitest (~> 5.0)
|
33
32
|
rake
|
@@ -1,33 +1,32 @@
|
|
1
1
|
GIT
|
2
2
|
remote: git://github.com/rails/rails.git
|
3
|
-
revision:
|
3
|
+
revision: d3d8cfd5689188f48714f49ad000a1c1fbd9edcd
|
4
4
|
specs:
|
5
|
-
activesupport (4.
|
5
|
+
activesupport (4.1.0.beta)
|
6
6
|
i18n (~> 0.6, >= 0.6.4)
|
7
|
-
|
8
|
-
|
7
|
+
json (~> 1.7)
|
8
|
+
minitest (~> 5.0)
|
9
9
|
thread_safe (~> 0.1)
|
10
10
|
tzinfo (~> 0.3.37)
|
11
11
|
|
12
12
|
PATH
|
13
13
|
remote: /Users/kencollins/Repositories/customink/gigo
|
14
14
|
specs:
|
15
|
-
gigo (1.
|
15
|
+
gigo (1.2.0)
|
16
16
|
activesupport (>= 3.0)
|
17
17
|
ensure_valid_encoding (~> 0.5.3)
|
18
18
|
|
19
19
|
GEM
|
20
20
|
remote: https://rubygems.org/
|
21
21
|
specs:
|
22
|
-
appraisal (0.5.
|
22
|
+
appraisal (0.5.2)
|
23
23
|
bundler
|
24
24
|
rake
|
25
|
-
atomic (1.
|
25
|
+
atomic (1.1.9)
|
26
26
|
ensure_valid_encoding (0.5.3)
|
27
27
|
i18n (0.6.4)
|
28
|
-
|
29
|
-
minitest
|
30
|
-
multi_json (1.7.2)
|
28
|
+
json (1.8.0)
|
29
|
+
minitest (5.0.1)
|
31
30
|
rake (10.0.4)
|
32
31
|
thread_safe (0.1.0)
|
33
32
|
atomic
|
@@ -40,6 +39,6 @@ DEPENDENCIES
|
|
40
39
|
activesupport!
|
41
40
|
appraisal
|
42
41
|
gigo!
|
43
|
-
|
44
|
-
minitest
|
42
|
+
i18n
|
43
|
+
minitest (~> 5.0)
|
45
44
|
rake
|
data/gigo.gemspec
CHANGED
@@ -18,7 +18,7 @@ Gem::Specification.new do |gem|
|
|
18
18
|
gem.add_runtime_dependency 'activesupport', '>= 3.0'
|
19
19
|
gem.add_runtime_dependency 'ensure_valid_encoding', '~> 0.5.3'
|
20
20
|
gem.add_development_dependency 'appraisal'
|
21
|
+
gem.add_development_dependency 'i18n' # Older ActiveSupport does not have a proper dep.
|
21
22
|
gem.add_development_dependency 'rake'
|
22
|
-
gem.add_development_dependency 'minitest'
|
23
|
-
gem.add_development_dependency 'minitest-emoji'
|
23
|
+
gem.add_development_dependency 'minitest', '~> 5.0'
|
24
24
|
end
|
data/lib/gigo.rb
CHANGED
@@ -1,40 +1,38 @@
|
|
1
|
-
require 'active_support/
|
2
|
-
require 'active_support/core_ext/object/acts_like'
|
3
|
-
require 'active_support/core_ext/string/behavior'
|
1
|
+
require 'active_support/all'
|
4
2
|
require 'ensure_valid_encoding'
|
5
|
-
require 'gigo/
|
3
|
+
require 'gigo/transcoders'
|
4
|
+
require 'gigo/transcoders/active_support'
|
5
|
+
require 'gigo/transcoders/rchardet'
|
6
|
+
require 'gigo/transcoders/blind'
|
6
7
|
require 'gigo/version'
|
7
8
|
|
8
9
|
module GIGO
|
9
10
|
|
10
|
-
|
11
|
+
mattr_accessor :encoding
|
12
|
+
self.encoding = Encoding.default_internal || Encoding::UTF_8
|
13
|
+
|
14
|
+
def self.load(data, options = {})
|
11
15
|
return data if data.nil? || !data.acts_like?(:string)
|
12
|
-
|
13
|
-
|
16
|
+
tcoders = options[:transcoders] || transcoders
|
17
|
+
encoded_string = transcode(data, tcoders)
|
18
|
+
return data if data.encoding == GIGO.encoding && data == encoded_string
|
14
19
|
encoded_string
|
15
20
|
end
|
16
21
|
|
17
22
|
|
18
23
|
protected
|
19
24
|
|
20
|
-
def self.
|
25
|
+
def self.transcode(data, tcoders)
|
21
26
|
string = data
|
22
|
-
|
23
|
-
string = ActiveSupport::Multibyte.proxy_class.new(string).tidy_bytes
|
24
|
-
rescue Exception => e
|
27
|
+
tcoders.detect do |t|
|
25
28
|
begin
|
26
|
-
|
27
|
-
string = string.force_encoding(encoding).encode forced_encoding, :undef => :replace, :invalid => :replace
|
29
|
+
string = t.transcode(string)
|
28
30
|
rescue Exception => e
|
29
|
-
|
31
|
+
false
|
30
32
|
end
|
31
33
|
end
|
32
34
|
string = EnsureValidEncoding.ensure_valid_encoding string, invalid: :replace, replace: "?"
|
33
|
-
string
|
35
|
+
string
|
34
36
|
end
|
35
37
|
|
36
|
-
def self.forced_encoding
|
37
|
-
Encoding.default_internal || Encoding::UTF_8
|
38
|
-
end
|
39
|
-
|
40
38
|
end
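The fall-through in `GIGO.transcode` relies on `Enumerable#detect` stopping at the first block that returns a truthy value. A minimal sketch of that control flow, using two hypothetical transcoders and rescuing `StandardError` rather than `Exception` as the diff does:

```ruby
# Hypothetical transcoder that always fails.
module AlwaysFails
  def self.transcode(_data)
    raise ArgumentError, "cannot handle this input"
  end
end

# Hypothetical transcoder that always succeeds.
module Upcaser
  def self.transcode(data)
    data.upcase
  end
end

# Try each transcoder in order and keep the first result that does not raise.
def run_chain(data, transcoders)
  string = data
  winner = transcoders.detect do |t|
    begin
      string = t.transcode(string) # a truthy result stops the iteration
    rescue StandardError
      false # a failure moves on to the next transcoder
    end
  end
  [string, winner]
end

result, winner = run_chain("abc", [AlwaysFails, Upcaser])
puts "#{result} via #{winner}"
```

If every transcoder raises, `detect` returns `nil` and the input comes back unchanged. The final `EnsureValidEncoding` pass from the real method is left out of this sketch.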
|
@@ -0,0 +1,22 @@
|
|
1
|
+
require 'gigo/rchardet'
|
2
|
+
|
3
|
+
module GIGO
|
4
|
+
module Transcoders
|
5
|
+
module CharDet
|
6
|
+
|
7
|
+
GIGO.transcoders << self
|
8
|
+
|
9
|
+
def self.transcode(data)
|
10
|
+
source_encoding = detect_encoding(data) || data.encoding || Encoding.default_internal || Encoding::UTF_8
|
11
|
+
data.force_encoding(source_encoding).encode GIGO.encoding, :undef => :replace, :invalid => :replace
|
12
|
+
end
|
13
|
+
|
14
|
+
private
|
15
|
+
|
16
|
+
def self.detect_encoding(data)
|
17
|
+
CharDet.detect(data.dup)['encoding']
|
18
|
+
end
|
19
|
+
|
20
|
+
end
|
21
|
+
end
|
22
|
+
end
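The core of the CharDet transcoder is the `force_encoding(...).encode(...)` pair: relabel the bytes with their believed source encoding, then convert to the target. A minimal sketch with the detection step replaced by a hard-coded Windows-1252 guess (an assumption made for illustration; the real transcoder asks `CharDet.detect`):

```ruby
# Bytes for "€20" as stored in Windows-1252, tagged as binary.
raw = "\x8020".b

# Relabel the bytes with the guessed source encoding, then convert to UTF-8.
# Characters that cannot be converted are replaced instead of raising.
fixed = raw.dup.force_encoding("Windows-1252")
           .encode(Encoding::UTF_8, undef: :replace, invalid: :replace)

puts fixed # prints "€20"
```

Note that the version in the diff calls `force_encoding` on `data` directly, which changes the encoding label of the caller's string in place; the sketch works on a copy to avoid that side effect.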
|
data/lib/gigo/version.rb
CHANGED
data/test/cases/gigo_test.rb
CHANGED
@@ -4,8 +4,6 @@ require 'test_helper'
|
|
4
4
|
module GIGO
|
5
5
|
class BaseTest < TestCase
|
6
6
|
|
7
|
-
include ERB::Util
|
8
|
-
|
9
7
|
let(:data_utf8_emoji) { "💖" }
|
10
8
|
let(:data_utf8) { "€20 – “Woohoo”" }
|
11
9
|
let(:data_bad_readin) { "�20 � �Woohoo�" }
|
@@ -14,6 +12,19 @@ module GIGO
|
|
14
12
|
let(:data_really_bad) { "ed.Ã\u0083Ã\u0083\xC3" }
|
15
13
|
|
16
14
|
|
15
|
+
describe '.encoding' do
|
16
|
+
|
17
|
+
it 'defaults to UTF-8 encoding' do
|
18
|
+
GIGO.encoding.must_equal Encoding::UTF_8
|
19
|
+
end
|
20
|
+
|
21
|
+
it 'can be set to any encoding' do
|
22
|
+
GIGO.encoding = Encoding::CP1252
|
23
|
+
GIGO.encoding.must_equal Encoding::CP1252
|
24
|
+
end
|
25
|
+
|
26
|
+
end
|
27
|
+
|
17
28
|
describe '.load' do
|
18
29
|
|
19
30
|
it 'ignores if string is not present' do
|
@@ -61,7 +72,17 @@ module GIGO
|
|
61
72
|
|
62
73
|
end
|
63
74
|
|
64
|
-
|
75
|
+
describe '.transcoders' do
|
76
|
+
|
77
|
+
it 'is an array of default transcoders' do
|
78
|
+
GIGO.transcoders.must_equal [
|
79
|
+
GIGO::Transcoders::ActiveSupport,
|
80
|
+
GIGO::Transcoders::CharDet,
|
81
|
+
GIGO::Transcoders::Blind
|
82
|
+
]
|
83
|
+
end
|
84
|
+
|
85
|
+
end
|
65
86
|
|
66
87
|
end
|
67
88
|
end
|
data/test/test_helper.rb
CHANGED
@@ -1,14 +1,26 @@
|
|
1
|
-
require 'bundler'
|
2
|
-
require 'minitest/autorun'
|
3
|
-
Bundler.require :development, :test
|
1
|
+
require 'bundler' ; Bundler.require :development, :test
|
4
2
|
require 'gigo'
|
5
|
-
require '
|
3
|
+
require 'minitest/autorun'
|
6
4
|
require 'erb'
|
7
5
|
|
8
6
|
module GIGO
|
9
7
|
class TestCase < MiniTest::Spec
|
10
8
|
|
9
|
+
include ERB::Util
|
10
|
+
|
11
|
+
before { setup_gigo }
|
12
|
+
after { teardown_gigo }
|
13
|
+
|
14
|
+
|
15
|
+
private
|
16
|
+
|
17
|
+
def setup_gigo
|
18
|
+
@_default_gigo_encoding = GIGO.encoding
|
19
|
+
end
|
11
20
|
|
21
|
+
def teardown_gigo
|
22
|
+
GIGO.encoding = @_default_gigo_encoding
|
23
|
+
end
|
12
24
|
|
13
25
|
end
|
14
26
|
end
|
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: gigo
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.
|
4
|
+
version: 1.3.0
|
5
5
|
prerelease:
|
6
6
|
platform: ruby
|
7
7
|
authors:
|
@@ -9,7 +9,7 @@ authors:
|
|
9
9
|
autorequire:
|
10
10
|
bindir: bin
|
11
11
|
cert_chain: []
|
12
|
-
date: 2013-
|
12
|
+
date: 2013-05-19 00:00:00.000000000 Z
|
13
13
|
dependencies:
|
14
14
|
- !ruby/object:Gem::Dependency
|
15
15
|
name: activesupport
|
@@ -60,7 +60,7 @@ dependencies:
|
|
60
60
|
- !ruby/object:Gem::Version
|
61
61
|
version: '0'
|
62
62
|
- !ruby/object:Gem::Dependency
|
63
|
-
name:
|
63
|
+
name: i18n
|
64
64
|
requirement: !ruby/object:Gem::Requirement
|
65
65
|
none: false
|
66
66
|
requirements:
|
@@ -76,7 +76,7 @@ dependencies:
|
|
76
76
|
- !ruby/object:Gem::Version
|
77
77
|
version: '0'
|
78
78
|
- !ruby/object:Gem::Dependency
|
79
|
-
name:
|
79
|
+
name: rake
|
80
80
|
requirement: !ruby/object:Gem::Requirement
|
81
81
|
none: false
|
82
82
|
requirements:
|
@@ -92,21 +92,21 @@ dependencies:
|
|
92
92
|
- !ruby/object:Gem::Version
|
93
93
|
version: '0'
|
94
94
|
- !ruby/object:Gem::Dependency
|
95
|
-
name: minitest
|
95
|
+
name: minitest
|
96
96
|
requirement: !ruby/object:Gem::Requirement
|
97
97
|
none: false
|
98
98
|
requirements:
|
99
|
-
- -
|
99
|
+
- - ~>
|
100
100
|
- !ruby/object:Gem::Version
|
101
|
-
version: '0'
|
101
|
+
version: '5.0'
|
102
102
|
type: :development
|
103
103
|
prerelease: false
|
104
104
|
version_requirements: !ruby/object:Gem::Requirement
|
105
105
|
none: false
|
106
106
|
requirements:
|
107
|
-
- -
|
107
|
+
- - ~>
|
108
108
|
- !ruby/object:Gem::Version
|
109
|
-
version: '0'
|
109
|
+
version: '5.0'
|
110
110
|
description: Garbage in, garbage out. Fix ruby encoded strings at all costs.
|
111
111
|
email:
|
112
112
|
- kcollins@customink.com
|
@@ -165,9 +165,12 @@ files:
|
|
165
165
|
- lib/gigo/rchardet/sjisprober.rb
|
166
166
|
- lib/gigo/rchardet/universaldetector.rb
|
167
167
|
- lib/gigo/rchardet/utf8prober.rb
|
168
|
+
- lib/gigo/transcoders.rb
|
169
|
+
- lib/gigo/transcoders/active_support.rb
|
170
|
+
- lib/gigo/transcoders/blind.rb
|
171
|
+
- lib/gigo/transcoders/rchardet.rb
|
168
172
|
- lib/gigo/version.rb
|
169
173
|
- test/cases/gigo_test.rb
|
170
|
-
- test/support/minitest.rb
|
171
174
|
- test/test_helper.rb
|
172
175
|
homepage: http://github.com/customink/gigo
|
173
176
|
licenses: []
|
@@ -183,7 +186,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
|
|
183
186
|
version: '0'
|
184
187
|
segments:
|
185
188
|
- 0
|
186
|
-
hash:
|
189
|
+
hash: -1915512776307670961
|
187
190
|
required_rubygems_version: !ruby/object:Gem::Requirement
|
188
191
|
none: false
|
189
192
|
requirements:
|
@@ -192,7 +195,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
192
195
|
version: '0'
|
193
196
|
segments:
|
194
197
|
- 0
|
195
|
-
hash:
|
198
|
+
hash: -1915512776307670961
|
196
199
|
requirements: []
|
197
200
|
rubyforge_project:
|
198
201
|
rubygems_version: 1.8.25
|
@@ -203,5 +206,4 @@ summary: The gigo gem aims to solve bad data, likely from a legacy database. It
|
|
203
206
|
put in and take out of your data stores.
|
204
207
|
test_files:
|
205
208
|
- test/cases/gigo_test.rb
|
206
|
-
- test/support/minitest.rb
|
207
209
|
- test/test_helper.rb
|
data/test/support/minitest.rb
DELETED