scrub_rb 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 0048aa4b025f832fcb8cdc3d1676f097fc887239
4
+ data.tar.gz: aed4530569a2dfedc28a434a6eff778d00bae38c
5
+ SHA512:
6
+ metadata.gz: 35e024e14c9fe573d9315775785dafed01cb73a7b0c11cb68939a969c3cde8cb2cb631fd7572a579ab89a99bb87945f3cabe9c6ffe01507848cbe679d8e23d5a
7
+ data.tar.gz: e4c13c2f1e0b119fcae0361e439d0b997d12f5dfbd5efa16d43503efccf9e40ad856a9e2c89ccd42f05bb4979a9f605f600966caf3b8e447e13c04f4e245cbbb
data/.gitignore ADDED
@@ -0,0 +1,17 @@
1
+ *.gem
2
+ *.rbc
3
+ .bundle
4
+ .config
5
+ .yardoc
6
+ Gemfile.lock
7
+ InstalledFiles
8
+ _yardoc
9
+ coverage
10
+ doc/
11
+ lib/bundler/man
12
+ pkg
13
+ rdoc
14
+ spec/reports
15
+ test/tmp
16
+ test/version_tmp
17
+ tmp
data/.travis.yml ADDED
@@ -0,0 +1,6 @@
1
+ language: ruby
2
+ rvm:
3
+ - 1.9.3
4
+ - 2.0.0
5
+ - jruby-19mode
6
+ - jruby-20mode
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in scrub_rb.gemspec
4
+ gemspec
data/LICENSE.txt ADDED
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2013 Jonathan Rochkind
2
+
3
+ MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,79 @@
1
+ # ScrubRb
2
+
3
+ Pure-ruby polyfill of MRI 2.1 String#scrub, for ruby 1.9 and 2.0 on any interpreter
4
+
5
+ [![Build Status](https://travis-ci.org/jrochkind/scrub_rb.png?branch=master)](https://travis-ci.org/jrochkind/scrub_rb)
6
+
7
+ ## Installation
8
+
9
+ Add this line to your application's Gemfile:
10
+
11
+ gem 'scrub_rb'
12
+
13
+ And then execute:
14
+
15
+ $ bundle
16
+
17
+ Or install it yourself as:
18
+
19
+ $ gem install scrub_rb
20
+
21
+
22
+ ## What it is
23
+
24
+ Ruby 2.1 introduces String#scrub, a method that replaces bytes in a string that are invalid
25
+ for its specified encoding. See docs in [MRI ruby source](https://github.com/ruby/ruby/blob/1e8a05c1dfee94db9b6b825097e1d192ad32930a/string.c#L7772)
26
+
27
+ If you need String#scrub in MRI ruby 2.0, you can use the [string-scrub gem](https://github.com/hsbt/string-scrub), which provides a backport of the C code from MRI ruby 2.1 into MRI 2.0.
28
+
29
+ What if you need this functionality in ruby 1.9, in jruby in 1.9 or 2.0 modes, or in
30
+ any other ruby platform that does not (or does not yet) support String#scrub? What if
31
+ you need to write code that will work on any of these platforms?
32
+
33
+ This gem provides a pure-ruby implementation of `String#scrub` and `#scrub!` that should work
34
+ on any ruby platform. Requiring `scrub_rb/monkey_patch` monkey-patches them into String, but only
35
+ if String does not already have a #scrub method -- so it's safe to include
36
+ this gem in multi-platform code: when the code runs on ruby 2.1, String#scrub will
37
+ still be the original stdlib implementation.
38
+
39
+ ~~~ruby
40
+ # Encoding: utf-8
41
+
42
+ "abc\u3042\x81".scrub #=> "abc\u3042\uFFFD"
43
+ "abc\u3042\x81".scrub("*") #=> "abc\u3042*"
44
+ "abc\u3042\xE3\x80".scrub{|bytes| '<'+bytes.unpack('H*')[0]+'>' } #=> "abc\u3042<e380>"
45
+ ~~~
46
+
47
+ ## Performance
48
+
49
+ This pure-ruby implementation is about an order of magnitude slower than stdlib String#scrub on ruby 2.1, or than the `string-scrub` C gem on MRI 2.0. For most applications, string scrubbing will probably be a small portion of total execution time, so this implementation should still be fast enough and hopefully won't be a problem.
50
+
51
+ ## Discrepancy with MRI 2.1 String#scrub
52
+
53
+ If a string contains more than one contiguous invalid byte, should the entire run be replaced with a single replacement, or should each invalid byte get its own replacement?
54
+
55
+ I have not been able to understand the logic MRI 2.1 uses to divide contiguous invalid bytes into
56
+ certain sub-sequences for replacement, as represented in the [test suite](https://github.com/ruby/ruby/blob/3ac0ec4ecdea849143ed64e8935e6675b341e44b/test/ruby/test_m17n.rb#L1505). The test suite may be suggesting that the examples are from unicode documentation, but I wasn't able to find such documentation to see if it shed any light on the matter.
57
+
58
+ `scrub_rb` always combines contiguous invalid bytes into a single replacement. As a result, it fails several tests from the original String#scrub test suite, which want other divisions of contiguous invalid bytes. I've altered our local tests to match our current behavior.
59
+
60
+ Beware of this potential difference, especially when using the block form of #scrub -- you may get a different number of calls, with a sequence of invalid bytes divided into different substrings, with `scrub_rb` as compared to official MRI 2.1 String#scrub or `string-scrub`.
61
+
62
+ For most uses, this discrepancy is probably not of consequence.
63
+
64
+ If anyone can explain what's going on here, I'm very curious! I can't read C well enough to figure it out from the source.
65
+
66
+ ## JRuby may raise
67
+
68
+ Due to an apparent JRuby bug, some invalid strings cause an internal
69
+ exception from JRuby when scrub_rb tries to scrub them. The entire original MRI test suite
70
+ passes against scrub_rb in JRuby -- but [one test original to us, involving
71
+ input tagged 'ascii' encoding](./test/scrub_test.rb#L67), fails, raising an ArrayIndexOutOfBoundsException
72
+ from inside of JRuby. I have filed an [issue with JRuby](https://github.com/jruby/jruby/issues/1361).
73
+
74
+ I believe this problem should be rare -- so far, the only reproduction case involves an input string tagged 'ascii' encoding, which probably isn't a common use case. But it's unfortunate
75
+ that `scrub_rb` isn't reliable on JRuby. I haven't been able to figure out any workaround in ruby for the JRuby bug -- you could theoretically provide an alternate Java implementation usable in JRuby, but I'm not sure what Java tools are available and how hard it would be to match the scrub API.
76
+
77
+ ## Contributions
78
+
79
+ Pull requests or suggestions welcome, especially on performance, on the JRuby issue, and on discrepancies with official String#scrub.
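To make the README's discrepancy concrete, here is a minimal sketch, assuming a Ruby that already ships a built-in `String#scrub` (2.1 or later). `collapse_scrub` is a hypothetical helper written only for this illustration, not the gem's code; it mirrors the idea of collapsing a run of invalid bytes into one replacement and is compared against the built-in method.

```ruby
# Hypothetical helper for illustration only (not part of scrub_rb):
# replaces each run of contiguous invalid characters with ONE replacement.
def collapse_scrub(str, replacement = "\uFFFD")
  result = String.new(encoding: str.encoding)
  in_bad_run = false
  str.each_char do |c|
    if c.valid_encoding?
      in_bad_run = false
      result << c
    elsif !in_bad_run
      # First invalid character of a run: emit the replacement once.
      in_bad_run = true
      result << replacement
    end
  end
  result
end

input = "\x80\x80\x80".dup.force_encoding("UTF-8")

collapsed = collapse_scrub(input) # one U+FFFD for the whole run
builtin   = input.scrub           # MRI emits one U+FFFD per invalid byte here

puts collapsed.length #=> 1
puts builtin.length   #=> 3
```

With the block form, the same difference shows up as a different number of block calls, which is the case the README warns about.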
data/Rakefile ADDED
@@ -0,0 +1,12 @@
1
+ #!/usr/bin/env rake
2
+ require "bundler/gem_tasks"
3
+
4
+ require 'rake/testtask'
5
+
6
+ Rake::TestTask.new do |t|
7
+ t.libs.push "lib"
8
+ t.test_files = FileList['test/*_test.rb']
9
+ t.verbose = true
10
+ end
11
+
12
+ task :default => [:test]
data/benchmark/benchmark.rb ADDED
@@ -0,0 +1,49 @@
1
+ # Encoding: utf-8
2
+
3
+ # Just gives us a ballpark. Some issues with this benchmark:
4
+ # * our strings might not be representative of real work
5
+ # * we're testing against static class method, not actual monkey patch, which
6
+ # would have one more method call, which may or may not matter.
7
+
8
+ require 'benchmark'
9
+
10
+ # for MRI 2.0, let's load the C scrub gem
11
+ begin
12
+ require 'string/scrub'
13
+ rescue LoadError
14
+ puts "(Could not load scrub gem C backfill)"
15
+ end
16
+
17
+ require 'scrub_rb'
18
+
19
+ test_strings = [
20
+ "abc\u3042\x81",
21
+ "good string",
22
+ "abc\u3042\xE3\x80",
23
+ "another good string",
24
+ "M\xE9xico",
25
+ "More good string"
26
+ ]
27
+
28
+ n = 10000
29
+ Benchmark.bmbm do |x|
30
+ x.report("built-in") do
31
+ n.times do
32
+ test_strings.each do |str|
33
+ str.scrub
34
+ str.scrub("*")
35
+ str.scrub {|bytes| '<'+bytes.unpack('H*')[0]+'>'}
36
+ end
37
+ end
38
+ end
39
+
40
+ x.report("ScrubRb") do
41
+ n.times do
42
+ test_strings.each do |str|
43
+ ScrubRb.scrub(str)
44
+ ScrubRb.scrub(str, "*")
45
+ ScrubRb.scrub(str) {|bytes| '<'+bytes.unpack('H*')[0]+'>'}
46
+ end
47
+ end
48
+ end
49
+ end
data/lib/scrub_rb.rb ADDED
@@ -0,0 +1,69 @@
1
+ require "scrub_rb/version"
2
+
3
+ module ScrubRb
4
+
5
+ # static function implementation of String#scrub, where
6
+ # first arg is the string.
7
+ #
8
+ # ScrubRb.scrub("abc\u3042\x81") #=> "abc\u3042\uFFFD"
9
+ # ScrubRb.scrub("abc\u3042\x81", "*") #=> "abc\u3042*"
10
+ # ScrubRb.scrub("abc\u3042\xE3\x80") {|bytes| '<'+bytes.unpack('H*')[0]+'>' } #=> "abc\u3042<e380>"
11
+ def self.scrub(str, replacement=nil, &block)
12
+ return str if str.nil?
13
+
14
+ if replacement.nil? && ! block_given?
15
+ replacement =
16
+ # UTF-8 for unicode replacement char \uFFFD, encode in
17
+ # encoding of input string, using '?' as a fallback where
18
+ # it can't be (which should be non-unicode encodings)
19
+ "\xEF\xBF\xBD".force_encoding("UTF-8").encode( str.encoding,
20
+ :undef => :replace,
21
+ :replace => '?' )
22
+ end
23
+
24
+ result = ""
25
+ bad_chars = ""
26
+ bad_char_flag = false # weirdly, optimization to use flag
27
+
28
+ str.chars.each do |c|
29
+ if c.valid_encoding?
30
+ if bad_char_flag
31
+ scrub_replace(result, bad_chars, replacement, block)
32
+ bad_char_flag = false
33
+ end
34
+ result << c
35
+ else
36
+ bad_char_flag = true
37
+ bad_chars << c
38
+ end
39
+ end
40
+ if bad_char_flag
41
+ scrub_replace(result, bad_chars, replacement, block)
42
+ end
43
+
44
+ return result
45
+ end
46
+
47
+ private
48
+ def self.scrub_replace(result, bad_chars, replacement, block)
49
+ if block
50
+ r = block.call(bad_chars)
51
+ else
52
+ r = replacement
53
+ end
54
+
55
+ if r.respond_to?(:to_str)
56
+ r = r.to_str
57
+ else
58
+ raise TypeError, "no implicit conversion of #{r.class} into String"
59
+ end
60
+
61
+ unless r.valid_encoding?
62
+ raise ArgumentError, "replacement must be valid byte sequence '#{r}'"
63
+ end
64
+
65
+ result << r
66
+ bad_chars.clear
67
+ end
68
+
69
+ end
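One Ruby detail worth knowing when reading the module above: a bare `private` applies to instance methods defined with `def name`, not to singleton methods defined with `def self.name`, so `ScrubRb.scrub_replace` is likely still callable from outside the module. The sketch below shows the general behaviour on two hypothetical modules (`LooksPrivate` and `ReallyPrivate` are illustration names, not part of the gem), along with the `private_class_method` alternative.

```ruby
module LooksPrivate
  def self.run
    helper
  end

  private

  # `private` above does not apply to `def self.` methods.
  def self.helper
    :reachable
  end
end

module ReallyPrivate
  def self.run
    helper
  end

  def self.helper
    :hidden
  end
  # Explicitly hide the singleton method.
  private_class_method :helper
end

LooksPrivate.helper # callable from outside despite `private`
ReallyPrivate.run   # internal calls with an implicit receiver still work

# Calling the hidden helper from outside raises NoMethodError.
hidden = begin
  ReallyPrivate.helper
  false
rescue NoMethodError
  true
end
```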
data/lib/scrub_rb/monkey_patch.rb ADDED
@@ -0,0 +1,20 @@
1
+ # You have to explicitly require this file to get the monkey
2
+ # patching of String#scrub; this file won't and shouldn't
3
+ # be 'require'd automatically.
4
+ #
5
+ # However, if there's already a String#scrub defined, requiring
6
+ # this file will do nothing.
7
+
8
+ class String
9
+ # Only monkey patch if not already defined....
10
+ unless instance_methods.include? :scrub
11
+ def scrub(replacement=nil, &block)
12
+ ScrubRb.scrub(self, replacement, &block)
13
+ end
14
+
15
+ def scrub!(*args, &block)
16
+ self.replace( self.scrub(*args, &block) )
17
+ end
18
+ end
19
+
20
+ end
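The guard in this file is a general polyfill pattern: define a method only when the class does not already have it. Here is a minimal sketch on a hypothetical `Greeter` class (not part of the gem), chosen so the effect is visible on any Ruby version regardless of whether `String#scrub` is built in.

```ruby
class Greeter
  def hello
    "native hello"
  end
end

# Reopen the class the way a polyfill would.
class Greeter
  # Skipped: a native #hello already exists.
  unless method_defined?(:hello)
    def hello
      "polyfill hello"
    end
  end

  # Applied: there is no native #goodbye.
  unless method_defined?(:goodbye)
    def goodbye
      "polyfill goodbye"
    end
  end
end

greeter = Greeter.new
puts greeter.hello   #=> native hello
puts greeter.goodbye #=> polyfill goodbye
```

The native method wins whenever it exists, which is why requiring the patch on Ruby 2.1 or later leaves the built-in `String#scrub` untouched.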
data/lib/scrub_rb/version.rb ADDED
@@ -0,0 +1,3 @@
1
+ module ScrubRb
2
+ VERSION = "0.1.0"
3
+ end
data/scrub_rb.gemspec ADDED
@@ -0,0 +1,23 @@
1
+ # coding: utf-8
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'scrub_rb/version'
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = "scrub_rb"
8
+ spec.version = ScrubRb::VERSION
9
+ spec.authors = ["Jonathan Rochkind"]
10
+ spec.email = ["jonathan@dnil.net"]
11
+ spec.summary = %q{Pure-ruby polyfill of MRI 2.1 String#scrub, for ruby 1.9 and 2.0 any interpreter
12
+ }
13
+ spec.homepage = "https://github.com/jrochkind/scrub_rb"
14
+ spec.license = "MIT"
15
+
16
+ spec.files = `git ls-files`.split($/)
17
+ spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
18
+ spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
19
+ spec.require_paths = ["lib"]
20
+
21
+ spec.add_development_dependency "bundler", "~> 1.3"
22
+ spec.add_development_dependency "rake"
23
+ end
data/test/borrowed_string_scrub_test.rb ADDED
@@ -0,0 +1,116 @@
1
+ # coding: US-ASCII
2
+
3
+ # This whole file borrowed from string-scrub:
4
+ # https://raw.github.com/hsbt/string-scrub/master/test/test_scrub.rb
5
+ # Actually adapted originally from MRI test suite:
6
+ # https://github.com/ruby/ruby/blob/3ac0ec4ecdea849143ed64e8935e6675b341e44b/test/ruby/test_m17n.rb#L1505
7
+ # We want to make sure we pass the same tests.
8
+ #
9
+ # NOTE: Some tests of multiple contiguous illegal bytes, we've had
10
+ # to change to match scrub_rb behavior.
11
+ # See README under 'Discrepancy'; search this source for 'SKIPPED'
12
+ # and 'OWN'
13
+
14
+ require 'scrub_rb'
15
+ require 'scrub_rb/monkey_patch'
16
+ require 'test/unit'
17
+
18
+
19
+ class BorrowedStringScrubTest < Test::Unit::TestCase
20
+ module AESU
21
+ def ua(str) str.dup.force_encoding("US-ASCII") end
22
+ def a(str) str.dup.force_encoding("ASCII-8BIT") end
23
+ def e(str) str.dup.force_encoding("EUC-JP") end
24
+ def s(str) str.dup.force_encoding("Windows-31J") end
25
+ def u(str) str.dup.force_encoding("UTF-8") end
26
+ end
27
+ include AESU
28
+
29
+ def test_scrub
30
+ str = "\u3042\u3044"
31
+ assert_not_same(str, str.scrub)
32
+ str.force_encoding(Encoding::ISO_2022_JP) # dummy encoding
33
+ assert_not_same(str, str.scrub)
34
+
35
+ # SKIPPED, discrepancy
36
+ #assert_equal("\uFFFD\uFFFD\uFFFD", u("\x80\x80\x80").scrub)
37
+ # OWN equivalent
38
+ assert_equal("\uFFFD", u("\x80\x80\x80").scrub)
39
+
40
+ #assert_equal("\uFFFDA", u("\xF4\x80\x80A").scrub)
41
+
42
+ # examples in Unicode 6.1.0 D93b
43
+ # SKIPPED, discrepancy
44
+ #assert_equal("\x41\uFFFD\uFFFD\x41\uFFFD\x41",
45
+ # u("\x41\xC0\xAF\x41\xF4\x80\x80\x41").scrub)
46
+ # OWN equivalent
47
+ assert_equal("\x41\uFFFD\x41\uFFFD\x41",
48
+ u("\x41\xC0\xAF\x41\xF4\x80\x80\x41").scrub)
49
+
50
+ # SKIPPED, discrepancy
51
+ #assert_equal("\x41\uFFFD\uFFFD\uFFFD\x41",
52
+ # u("\x41\xE0\x9F\x80\x41").scrub)
53
+ # OWN equivalent
54
+ assert_equal("\x41\uFFFD\x41",
55
+ u("\x41\xE0\x9F\x80\x41").scrub)
56
+
57
+ # SKIPPED, discrepancy
58
+ #assert_equal("\u0061\uFFFD\uFFFD\uFFFD\u0062\uFFFD\u0063\uFFFD\uFFFD\u0064",
59
+ # u("\x61\xF1\x80\x80\xE1\x80\xC2\x62\x80\x63\x80\xBF\x64").scrub)
60
+ # OWN equivalent
61
+ assert_equal("\u0061\uFFFD\u0062\uFFFD\u0063\uFFFD\u0064",
62
+ u("\x61\xF1\x80\x80\xE1\x80\xC2\x62\x80\x63\x80\xBF\x64").scrub)
63
+ # SKIPPED, discrepancy
64
+ #assert_equal("abcdefghijklmnopqrstuvwxyz\u0061\uFFFD\uFFFD\uFFFD\u0062\uFFFD\u0063\uFFFD\uFFFD\u0064",
65
+ # u("abcdefghijklmnopqrstuvwxyz\x61\xF1\x80\x80\xE1\x80\xC2\x62\x80\x63\x80\xBF\x64").scrub)
66
+ # OWN equivalent
67
+ assert_equal("abcdefghijklmnopqrstuvwxyz\u0061\uFFFD\u0062\uFFFD\u0063\uFFFD\u0064",
68
+ u("abcdefghijklmnopqrstuvwxyz\x61\xF1\x80\x80\xE1\x80\xC2\x62\x80\x63\x80\xBF\x64").scrub)
69
+
70
+
71
+ assert_equal("\u3042\u3013", u("\xE3\x81\x82\xE3\x81").scrub("\u3013"))
72
+ assert_raise(Encoding::CompatibilityError){ u("\xE3\x81\x82\xE3\x81").scrub(e("\xA4\xA2")) }
73
+ assert_raise(TypeError){ u("\xE3\x81\x82\xE3\x81").scrub(1) }
74
+ assert_raise(ArgumentError){ u("\xE3\x81\x82\xE3\x81\x82\xE3\x81").scrub(u("\x81")) }
75
+ assert_equal(e("\xA4\xA2\xA2\xAE"), e("\xA4\xA2\xA4").scrub(e("\xA2\xAE")))
76
+
77
+ assert_equal("\u3042<e381>", u("\xE3\x81\x82\xE3\x81").scrub{|x|'<'+x.unpack('H*')[0]+'>'})
78
+ assert_raise(Encoding::CompatibilityError){ u("\xE3\x81\x82\xE3\x81").scrub{e("\xA4\xA2")} }
79
+
80
+
81
+ assert_raise(TypeError){ u("\xE3\x81\x82\xE3\x81").scrub{1} }
82
+ assert_raise(ArgumentError){ u("\xE3\x81\x82\xE3\x81\x82\xE3\x81").scrub{u("\x81")} }
83
+ assert_equal(e("\xA4\xA2\xA2\xAE"), e("\xA4\xA2\xA4").scrub{e("\xA2\xAE")})
84
+
85
+ assert_equal(u("\x81"), u("a\x81").scrub {|c| break c})
86
+ assert_raise(ArgumentError) {u("a\x81").scrub {|c| c}}
87
+
88
+ assert_equal("\uFFFD\u3042".encode("UTF-16BE"),
89
+ "\xD8\x00\x30\x42".force_encoding(Encoding::UTF_16BE).
90
+ scrub)
91
+ assert_equal("\uFFFD\u3042".encode("UTF-16LE"),
92
+ "\x00\xD8\x42\x30".force_encoding(Encoding::UTF_16LE).
93
+ scrub)
94
+ assert_equal("\uFFFD".encode("UTF-32BE"),
95
+ "\xff".force_encoding(Encoding::UTF_32BE).
96
+ scrub)
97
+ assert_equal("\uFFFD".encode("UTF-32LE"),
98
+ "\xff".force_encoding(Encoding::UTF_32LE).
99
+ scrub)
100
+ end
101
+
102
+ def test_scrub_bang
103
+ str = "\u3042\u3044"
104
+ assert_same(str, str.scrub!)
105
+ str.force_encoding(Encoding::ISO_2022_JP) # dummy encoding
106
+ assert_same(str, str.scrub!)
107
+
108
+ str = u("\x80\x80\x80")
109
+ str.scrub!
110
+ assert_same(str, str.scrub!)
111
+ # SKIPPED, discrepancy
112
+ #assert_equal("\uFFFD\uFFFD\uFFFD", str)
113
+ # OWN, single replacement
114
+ assert_equal("\uFFFD", str)
115
+ end
116
+ end
data/test/monkey_patch_test.rb ADDED
@@ -0,0 +1,35 @@
1
+ # Encoding: utf-8
2
+
3
+ require 'minitest/spec'
4
+ require 'minitest/autorun'
5
+
6
+ require 'scrub_rb'
7
+
8
+ # Going to require the monkey-patch, which will end up
9
+ # monkey-patching String for entire program execution, don't
10
+ # know any way to monkey patch just for this test, sorry.
11
+
12
+ require 'scrub_rb/monkey_patch'
13
+
14
+ describe "Monkey-patched String#scrub does same thing as ScrubRb.scrub" do
15
+ it "abc\\u3042\\x81" do
16
+ "abc\u3042\x81".scrub.must_equal ScrubRb.scrub("abc\u3042\x81")
17
+ end
18
+
19
+ it "abc\\u3042\\x81, *" do
20
+ "abc\u3042\x81".scrub("*").must_equal ScrubRb.scrub("abc\u3042\x81", "*")
21
+ end
22
+
23
+ it "abc\\u3042\\xE3\\x80 with block" do
24
+ block = lambda do |bytes|
25
+ '<'+bytes.unpack('H*')[0]+'>'
26
+ end
27
+
28
+ "abc\u3042\xE3\x80".scrub(&block).must_equal ScrubRb.scrub("abc\u3042\xE3\x80", &block)
29
+ end
30
+
31
+ it "no bad bytes" do
32
+ "no bad bytes".scrub.must_equal ScrubRb.scrub("no bad bytes")
33
+ end
34
+
35
+ end
data/test/scrub_test.rb ADDED
@@ -0,0 +1,88 @@
1
+ # Encoding: UTF-8
2
+
3
+ require 'minitest/spec'
4
+ require 'minitest/autorun'
5
+
6
+ require 'scrub_rb'
7
+
8
+ describe "ScrubRb" do
9
+ describe "examples from ruby 2.1 String#scrub" do
10
+ it '"abc\u3042\x81".scrub #=> "abc\u3042\uFFFD"' do
11
+ ScrubRb.scrub("abc\u3042\x81").must_equal("abc\u3042\uFFFD")
12
+ end
13
+
14
+ it '"abc\u3042\x81".scrub("*") #=> "abc\u3042*"' do
15
+ ScrubRb.scrub("abc\u3042\x81", "*").must_equal("abc\u3042*")
16
+ end
17
+
18
+ it 'block' do
19
+ ScrubRb.scrub("abc\u3042\xE3\x80") do |bytes|
20
+ '<'+bytes.unpack('H*')[0]+'>'
21
+ end.must_equal("abc\u3042<e380>")
22
+ end
23
+ end
24
+
25
+ # Things investigated in ruby 2.1 String#scrub to make sure
26
+ # we're doing the same things.
27
+ describe "compatible with ruby 2.1 String#scrub edge cases" do
28
+ it "returns copy even on legal string" do
29
+ original = "perfectly legal"
30
+ scrubbed = ScrubRb.scrub(original)
31
+
32
+ # not identity
33
+ refute scrubbed.equal? original
34
+ # yes equality
35
+ assert_equal original, scrubbed
36
+ end
37
+ it "collapses multiple bad bytes into one replacement" do
38
+ ScrubRb.scrub("abc\u3042\xE3\x80").must_equal("abc\u3042\uFFFD")
39
+ end
40
+ end
41
+
42
+
43
+ before do
44
+ @bad_bytes_utf8 = "M\xE9xico".force_encoding("UTF-8")
45
+ @bad_bytes_utf16 = "M\x00\xDFxico".force_encoding( Encoding::UTF_16LE )
46
+ @bad_bytes_ascii = "M\xA1xico".force_encoding("ASCII")
47
+ end
48
+
49
+
50
+ it "replaces with unicode replacement string" do
51
+ scrubbed = ScrubRb.scrub(@bad_bytes_utf8)
52
+
53
+ assert scrubbed.valid_encoding?
54
+ assert_equal "M\uFFFDxico", scrubbed
55
+ end
56
+
57
+ it "replaces with chosen replacement string" do
58
+ ScrubRb.scrub(@bad_bytes_utf8, "*").must_equal("M*xico")
59
+ end
60
+
61
+ it "replaces with empty string" do
62
+ ScrubRb.scrub(@bad_bytes_utf8, '').must_equal("Mxico")
63
+ end
64
+
65
+
66
+ it "replaces non-unicode encoding with ? replacement str" do
67
+ if RUBY_PLATFORM == "java"
68
+ skip("known not to pass on JRuby, reported to JRuby github #1361")
69
+ end
70
+ ScrubRb.scrub(@bad_bytes_ascii).must_equal("M?xico")
71
+ end
72
+
73
+
74
+ it "works with first byte bad" do
75
+ str = "\xE9xico".force_encoding("UTF-8")
76
+ ScrubRb.scrub(str, "?").must_equal("?xico")
77
+ end
78
+
79
+ it "works with last byte bad" do
80
+ str = "Mexico\xE9".force_encoding("UTF-8")
81
+ ScrubRb.scrub(str, "?").must_equal("Mexico?")
82
+ end
83
+
84
+ it "works with nil input" do
85
+ ScrubRb.scrub(nil).must_be_nil
86
+ end
87
+
88
+ end
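The "replaces non-unicode encoding with ? replacement str" test relies on how the default replacement is derived: U+FFFD is transcoded into the input's encoding, falling back to `?` when that encoding cannot represent it. Below is a small sketch of just that step, using only core `String#encode`; `default_replacement` is a hypothetical helper name introduced for illustration.

```ruby
# Hypothetical helper: derive the default replacement for a given encoding.
def default_replacement(encoding)
  # Transcode U+FFFD; characters the target cannot represent become "?".
  "\uFFFD".encode(encoding, undef: :replace, replace: "?")
end

ascii = default_replacement(Encoding::US_ASCII) # falls back to "?"
utf16 = default_replacement(Encoding::UTF_16LE) # U+FFFD as UTF-16LE bytes
utf8  = default_replacement(Encoding::UTF_8)    # U+FFFD unchanged

puts ascii
puts utf16.bytes.inspect
```

Because the replacement is produced in the input's own encoding, it can be appended to the result without raising `Encoding::CompatibilityError`.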
metadata ADDED
@@ -0,0 +1,89 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: scrub_rb
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - Jonathan Rochkind
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2013-12-26 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: bundler
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - ~>
18
+ - !ruby/object:Gem::Version
19
+ version: '1.3'
20
+ type: :development
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - ~>
25
+ - !ruby/object:Gem::Version
26
+ version: '1.3'
27
+ - !ruby/object:Gem::Dependency
28
+ name: rake
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - '>='
32
+ - !ruby/object:Gem::Version
33
+ version: '0'
34
+ type: :development
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - '>='
39
+ - !ruby/object:Gem::Version
40
+ version: '0'
41
+ description:
42
+ email:
43
+ - jonathan@dnil.net
44
+ executables: []
45
+ extensions: []
46
+ extra_rdoc_files: []
47
+ files:
48
+ - .gitignore
49
+ - .travis.yml
50
+ - Gemfile
51
+ - LICENSE.txt
52
+ - README.md
53
+ - Rakefile
54
+ - benchmark/benchmark.rb
55
+ - lib/scrub_rb.rb
56
+ - lib/scrub_rb/monkey_patch.rb
57
+ - lib/scrub_rb/version.rb
58
+ - scrub_rb.gemspec
59
+ - test/borrowed_string_scrub_test.rb
60
+ - test/monkey_patch_test.rb
61
+ - test/scrub_test.rb
62
+ homepage: https://github.com/jrochkind/scrub_rb
63
+ licenses:
64
+ - MIT
65
+ metadata: {}
66
+ post_install_message:
67
+ rdoc_options: []
68
+ require_paths:
69
+ - lib
70
+ required_ruby_version: !ruby/object:Gem::Requirement
71
+ requirements:
72
+ - - '>='
73
+ - !ruby/object:Gem::Version
74
+ version: '0'
75
+ required_rubygems_version: !ruby/object:Gem::Requirement
76
+ requirements:
77
+ - - '>='
78
+ - !ruby/object:Gem::Version
79
+ version: '0'
80
+ requirements: []
81
+ rubyforge_project:
82
+ rubygems_version: 2.0.3
83
+ signing_key:
84
+ specification_version: 4
85
+ summary: Pure-ruby polyfill of MRI 2.1 String#scrub, for ruby 1.9 and 2.0 any interpreter
86
+ test_files:
87
+ - test/borrowed_string_scrub_test.rb
88
+ - test/monkey_patch_test.rb
89
+ - test/scrub_test.rb