scrub_rb 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 0048aa4b025f832fcb8cdc3d1676f097fc887239
4
+ data.tar.gz: aed4530569a2dfedc28a434a6eff778d00bae38c
5
+ SHA512:
6
+ metadata.gz: 35e024e14c9fe573d9315775785dafed01cb73a7b0c11cb68939a969c3cde8cb2cb631fd7572a579ab89a99bb87945f3cabe9c6ffe01507848cbe679d8e23d5a
7
+ data.tar.gz: e4c13c2f1e0b119fcae0361e439d0b997d12f5dfbd5efa16d43503efccf9e40ad856a9e2c89ccd42f05bb4979a9f605f600966caf3b8e447e13c04f4e245cbbb
data/.gitignore ADDED
@@ -0,0 +1,17 @@
1
+ *.gem
2
+ *.rbc
3
+ .bundle
4
+ .config
5
+ .yardoc
6
+ Gemfile.lock
7
+ InstalledFiles
8
+ _yardoc
9
+ coverage
10
+ doc/
11
+ lib/bundler/man
12
+ pkg
13
+ rdoc
14
+ spec/reports
15
+ test/tmp
16
+ test/version_tmp
17
+ tmp
data/.travis.yml ADDED
@@ -0,0 +1,6 @@
1
+ language: ruby
2
+ rvm:
3
+ - 1.9.3
4
+ - 2.0.0
5
+ - jruby-19mode
6
+ - jruby-20mode
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in scrub_rb.gemspec
4
+ gemspec
data/LICENSE.txt ADDED
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2013 Jonathan Rochkind
2
+
3
+ MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,79 @@
1
+ # ScrubRb
2
+
3
+ Pure-ruby polyfill of MRI 2.1 String#scrub, for ruby 1.9 and 2.0 on any interpreter
4
+
5
+ [![Build Status](https://travis-ci.org/jrochkind/scrub_rb.png?branch=master)](https://travis-ci.org/jrochkind/scrub_rb)
6
+
7
+ ## Installation
8
+
9
+ Add this line to your application's Gemfile:
10
+
11
+ gem 'scrub_rb'
12
+
13
+ And then execute:
14
+
15
+ $ bundle
16
+
17
+ Or install it yourself as:
18
+
19
+ $ gem install scrub_rb
20
+
21
+
22
+ ## What it is
23
+
24
+ Ruby 2.1 introduces String#scrub, a method that replaces bytes that are invalid in a
25
+ string's specified encoding. See the docs in the [MRI ruby source](https://github.com/ruby/ruby/blob/1e8a05c1dfee94db9b6b825097e1d192ad32930a/string.c#L7772).
26
+
27
+ If you need String#scrub in MRI ruby 2.0, you can use the [string-scrub gem](https://github.com/hsbt/string-scrub), which provides a backport of the C code from MRI ruby 2.1 into MRI 2.0.
28
+
29
+ What if you need this functionality in ruby 1.9, in jruby in 1.9 or 2.0 modes, or in
30
+ any other ruby platform that does not (or does not yet) support String#scrub? What if
31
+ you need to write code that will work on any of these platforms?
32
+
33
+ This gem provides a pure-ruby implementation of `String#scrub` and `#scrub!` that should
34
+ work on any ruby platform. Requiring `scrub_rb/monkey_patch` monkey-patches it into String,
35
+ but only if String does not already have a #scrub method -- so it's safe to include
36
+ this gem in multi-platform code: when the code runs on ruby 2.1, String#scrub will
37
+ still be the original built-in implementation.
38
+
39
+ ~~~ruby
40
+ # Encoding: utf-8
41
+
42
+ "abc\u3042\x81".scrub #=> "abc\u3042\uFFFD"
43
+ "abc\u3042\x81".scrub("*") #=> "abc\u3042*"
44
+ "abc\u3042\xE3\x80".scrub{|bytes| '<'+bytes.unpack('H*')[0]+'>' } #=> "abc\u3042<e380>"
45
+ ~~~
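
The "patch only if missing" guard described above can be sketched in isolation. This is a minimal illustration rather than the gem's own code: `Greeter`, `hello`, and `goodbye` are hypothetical names standing in for String and its methods, and `method_defined?` is used here as an assumed equivalent of the `instance_methods.include?` check in the gem's monkey-patch file.

```ruby
# Hypothetical class standing in for String; `hello` plays the role of a
# method that the platform already provides.
class Greeter
  def hello
    "native hello"
  end
end

# Reopen the class the way a polyfill would, defining each method only
# when it is missing.
class Greeter
  unless method_defined?(:hello)
    def hello
      "polyfill hello"
    end
  end

  unless method_defined?(:goodbye)
    def goodbye
      "polyfill goodbye"
    end
  end
end

greeter = Greeter.new
puts greeter.hello    # the pre-existing method is left untouched
puts greeter.goodbye  # the missing method is filled in
```

Because the check runs when the file is loaded, load order matters: a guard like this only protects an implementation that already exists at that moment.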
46
+
47
+ ## Performance
48
+
49
+ This pure ruby implementation is about an order of magnitude slower than the built-in String#scrub on ruby 2.1, or than the `string-scrub` C gem on MRI 2.0. For most applications, string scrubbing is probably a small portion of total execution time and is still fairly fast, so this hopefully won't be a problem.
50
+
51
+ ## Discrepancy with MRI 2.1 String#scrub
52
+
53
+ If a string contains more than one consecutive invalid byte, should the entire run be replaced with a single replacement, or should each invalid byte get its own replacement?
54
+
55
+ I have not been able to understand the logic MRI 2.1 uses to divide contiguous invalid bytes into
56
+ certain sub-sequences for replacement, as represented in the [test suite](https://github.com/ruby/ruby/blob/3ac0ec4ecdea849143ed64e8935e6675b341e44b/test/ruby/test_m17n.rb#L1505). The test suite may be suggesting that the examples are from unicode documentation, but I wasn't able to find such documentation to see if it shed any light on the matter.
57
+
58
+ `scrub_rb` always combines contiguous invalid bytes into a single replacement. As a result, it fails several tests from the original String#scrub test suite, which expect other divisions of contiguous invalid bytes. I've altered our local copies of those tests to match our current behavior.
59
+
60
+ Beware of this potential difference, especially when using the block form of #scrub -- the block may be called a different number of times, with a sequence of invalid bytes divided into different substrings, under `scrub_rb` than under official MRI 2.1 String#scrub or `string-scrub`.
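
To make the difference concrete, here is a minimal sketch of the "collapse a whole run" policy, next to a block-call count for whatever `String#scrub` the running interpreter provides. `collapse_invalid_runs` is a hypothetical helper written for this illustration; it is a simplified stand-in and not the gem's `ScrubRb.scrub`.

```ruby
# Hypothetical helper, not the gem's API: replace every run of contiguous
# invalid characters with a single replacement string.
def collapse_invalid_runs(str, replacement = "\uFFFD")
  result = String.new("", encoding: str.encoding)
  in_bad_run = false

  str.each_char do |char|
    if char.valid_encoding?
      in_bad_run = false
      result << char
    else
      # Emit the replacement once, at the start of each invalid run.
      result << replacement unless in_bad_run
      in_bad_run = true
    end
  end

  result
end

input = "a\x80\x80\x80b".dup.force_encoding("UTF-8")

collapsed = collapse_invalid_runs(input)
puts collapsed.each_char.count { |c| c == "\uFFFD" }  # prints 1

# Count how often the interpreter's own String#scrub (if it has one)
# invokes the block for the same input.
if input.respond_to?(:scrub)
  block_calls = 0
  input.scrub { |bytes| block_calls += 1; "?" }
  puts block_calls
end
```

Judging by the MRI test expectation for `"\x80\x80\x80"` cited above, the built-in count should be 3 on MRI 2.1 and later, against a single replacement for the collapsing policy.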
61
+
62
+ For most uses, this discrepancy is probably not of consequence.
63
+
64
+ If anyone can explain what's going on here, I'm very curious! I can't read C well enough to figure it out from the source.
65
+
66
+ ## JRuby may raise
67
+
68
+ Due to an apparent JRuby bug, some invalid strings cause an internal
69
+ exception from JRuby when trying to scrub_rb. The entire original MRI test suite
70
+ passes against scrub_rb in JRuby -- but [one test original to us, involving
71
+ input tagged 'ascii' encoding](./test/scrub_test.rb#L67), fails raising an ArrayIndexOutOfBoundsException
72
+ from inside of JRuby. I have filed an [issue with JRuby](https://github.com/jruby/jruby/issues/1361).
73
+
74
+ I believe this problem should be rare -- so far, the only reproduction case involves an input string tagged 'ascii' encoding, which probably isn't a common use case. But it's unfortunate
75
+ that `scrub_rb` isn't reliable on jruby. I haven't been able to figure out any workaround in ruby to the jruby bug -- you could theoretically provide a Java alternate implementation usable in jruby, but I'm not sure what Java tools are available and how hard it would be to match the scrub api.
76
+
77
+ ## Contributions
78
+
79
+ Pull requests or suggestions welcome, especially on performance, on the JRuby issue, and on discrepancies with official String#scrub.
data/Rakefile ADDED
@@ -0,0 +1,12 @@
1
+ #!/usr/bin/env rake
2
+ require "bundler/gem_tasks"
3
+
4
+ require 'rake/testtask'
5
+
6
+ Rake::TestTask.new do |t|
7
+ t.libs.push "lib"
8
+ t.test_files = FileList['test/*_test.rb']
9
+ t.verbose = true
10
+ end
11
+
12
+ task :default => [:test]
data/benchmark/benchmark.rb ADDED
@@ -0,0 +1,49 @@
1
+ # Encoding: utf-8
2
+
3
+ # Just gives us a ballpark. Some issues with this benchmark:
4
+ # * our strings might not be representative of real work
5
+ # * we're testing against static class method, not actual monkey patch, which
6
+ # would have one more method call, which may or may not matter.
7
+
8
+ require 'benchmark'
9
+
10
+ # for MRI 2.0, let's load the C scrub gem
11
+ begin
12
+ require 'string/scrub'
13
+ rescue LoadError
14
+ puts "(Could not load scrub gem C backfill)"
15
+ end
16
+
17
+ require 'scrub_rb'
18
+
19
+ test_strings = [
20
+ "abc\u3042\x81",
21
+ "good string",
22
+ "abc\u3042\xE3\x80",
23
+ "another good string",
24
+ "M\xE9xico",
25
+ "More good string"
26
+ ]
27
+
28
+ n = 10000
29
+ Benchmark.bmbm do |x|
30
+ x.report("built-in") do
31
+ n.times do
32
+ test_strings.each do |str|
33
+ str.scrub
34
+ str.scrub("*")
35
+ str.scrub {|bytes| '<'+bytes.unpack('H*')[0]+'>'}
36
+ end
37
+ end
38
+ end
39
+
40
+ x.report("ScrubRb") do
41
+ n.times do
42
+ test_strings.each do |str|
43
+ ScrubRb.scrub(str)
44
+ ScrubRb.scrub(str, "*")
45
+ ScrubRb.scrub(str) {|bytes| '<'+bytes.unpack('H*')[0]+'>'}
46
+ end
47
+ end
48
+ end
49
+ end
data/lib/scrub_rb.rb ADDED
@@ -0,0 +1,69 @@
1
+ require "scrub_rb/version"
2
+
3
+ module ScrubRb
4
+
5
+ # static function implementation of String#scrub, where
6
+ # first arg is the string.
7
+ #
8
+ # ScrubRb.scrub("abc\u3042\x81") #=> "abc\u3042\uFFFD"
9
+ # ScrubRb.scrub("abc\u3042\x81", "*") #=> "abc\u3042*"
10
+ # ScrubRb.scrub("abc\u3042\xE3\x80") {|bytes| '<'+bytes.unpack('H*')[0]+'>' } #=> "abc\u3042<e380>"
11
+ def self.scrub(str, replacement=nil, &block)
12
+ return str if str.nil?
13
+
14
+ if replacement.nil? && ! block_given?
15
+ replacement =
16
+ # UTF-8 for unicode replacement char \uFFFD, encode in
17
+ # encoding of input string, using '?' as a fallback where
18
+ # it can't be (which should be non-unicode encodings)
19
+ "\xEF\xBF\xBD".force_encoding("UTF-8").encode( str.encoding,
20
+ :undef => :replace,
21
+ :replace => '?' )
22
+ end
23
+
24
+ result = ""
25
+ bad_chars = ""
26
+ bad_char_flag = false # weirdly, optimization to use flag
27
+
28
+ str.chars.each do |c|
29
+ if c.valid_encoding?
30
+ if bad_char_flag
31
+ scrub_replace(result, bad_chars, replacement, block)
32
+ bad_char_flag = false
33
+ end
34
+ result << c
35
+ else
36
+ bad_char_flag = true
37
+ bad_chars << c
38
+ end
39
+ end
40
+ if bad_char_flag
41
+ scrub_replace(result, bad_chars, replacement, block)
42
+ end
43
+
44
+ return result
45
+ end
46
+
47
+ private # NOTE: has no effect on `def self.` methods, so scrub_replace stays publicly callable
48
+ def self.scrub_replace(result, bad_chars, replacement, block)
49
+ if block
50
+ r = block.call(bad_chars)
51
+ else
52
+ r = replacement
53
+ end
54
+
55
+ if r.respond_to?(:to_str)
56
+ r = r.to_str
57
+ else
58
+ raise TypeError, "no implicit conversion of #{r.class} into String"
59
+ end
60
+
61
+ unless r.valid_encoding?
62
+ raise ArgumentError, "replacement must be valid byte sequence '#{r}'"
63
+ end
64
+
65
+ result << r
66
+ bad_chars.clear
67
+ end
68
+
69
+ end
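
The default-replacement logic in `ScrubRb.scrub` relies on `String#encode` with the `:undef` and `:replace` options. The sketch below isolates that step; `default_replacement_for` is a hypothetical helper name introduced only for this illustration.

```ruby
# Hypothetical helper: transcode U+FFFD (written as raw UTF-8 bytes) into
# the target encoding, falling back to "?" where U+FFFD has no mapping.
def default_replacement_for(encoding)
  "\xEF\xBF\xBD".dup.force_encoding("UTF-8").encode(
    encoding,
    :undef   => :replace,
    :replace => "?"
  )
end

puts default_replacement_for(Encoding::UTF_8) == "\uFFFD"       # prints true
puts default_replacement_for(Encoding::US_ASCII)                 # prints ?
puts default_replacement_for(Encoding::UTF_16LE).bytes.inspect   # prints [253, 255]
```

Unicode encodings get a real U+FFFD in their own byte layout, while encodings that cannot represent it fall back to a plain question mark.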
data/lib/scrub_rb/monkey_patch.rb ADDED
@@ -0,0 +1,20 @@
1
+ # Have to explicitly require this file to get the monkey
2
+ # patching of String#scrub in there, this file won't and shouldn't
3
+ # be 'require'd in automatically.
4
+ #
5
+ # However if there's already a String#scrub defined, requiring
6
+ # this file will do nothing.
7
+
8
+ class String
9
+ # Only monkey patch if not already defined....
10
+ unless instance_methods.include? :scrub
11
+ def scrub(replacement=nil, &block)
12
+ ScrubRb.scrub(self, replacement, &block)
13
+ end
14
+
15
+ def scrub!(*args, &block)
16
+ self.replace( self.scrub(*args, &block) )
17
+ end
18
+ end
19
+
20
+ end
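
A general Ruby detail that matters for delegating wrappers such as a bang method: a block is not part of `*args`, so it only reaches the callee if it is captured with `&block` and forwarded explicitly. The sketch below shows both behaviours with a hypothetical `Relay` class that is unrelated to the gem's API.

```ruby
# Hypothetical class used only to show how blocks travel through
# delegating methods.
class Relay
  def target(suffix = "!")
    block_given? ? yield("x") : "x#{suffix}"
  end

  # Splat only: a block given to this method is not passed along.
  def forward_args_only(*args)
    target(*args)
  end

  # Splat plus &block: positional arguments and the block are both passed along.
  def forward_args_and_block(*args, &block)
    target(*args, &block)
  end
end

relay = Relay.new
puts relay.forward_args_only { |s| s.upcase }       # prints x!
puts relay.forward_args_and_block { |s| s.upcase }  # prints X
```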
data/lib/scrub_rb/version.rb ADDED
@@ -0,0 +1,3 @@
1
+ module ScrubRb
2
+ VERSION = "0.1.0"
3
+ end
data/scrub_rb.gemspec ADDED
@@ -0,0 +1,23 @@
1
+ # coding: utf-8
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'scrub_rb/version'
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = "scrub_rb"
8
+ spec.version = ScrubRb::VERSION
9
+ spec.authors = ["Jonathan Rochkind"]
10
+ spec.email = ["jonathan@dnil.net"]
11
+ spec.summary = %q{Pure-ruby polyfill of MRI 2.1 String#scrub, for ruby 1.9 and 2.0 any interpreter
12
+ }
13
+ spec.homepage = "https://github.com/jrochkind/scrub_rb"
14
+ spec.license = "MIT"
15
+
16
+ spec.files = `git ls-files`.split($/)
17
+ spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
18
+ spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
19
+ spec.require_paths = ["lib"]
20
+
21
+ spec.add_development_dependency "bundler", "~> 1.3"
22
+ spec.add_development_dependency "rake"
23
+ end
data/test/borrowed_string_scrub_test.rb ADDED
@@ -0,0 +1,116 @@
1
+ # coding: US-ASCII
2
+
3
+ # This whole file borrowed from string-scrub:
4
+ # https://raw.github.com/hsbt/string-scrub/master/test/test_scrub.rb
5
+ # Actually adapted originally from MRI test suite:
6
+ # https://github.com/ruby/ruby/blob/3ac0ec4ecdea849143ed64e8935e6675b341e44b/test/ruby/test_m17n.rb#L1505
7
+ # We want to make sure we pass the same tests.
8
+ #
9
+ # NOTE: Some tests of multiple contiguous illegal bytes, we've had
10
+ # to change to match scrub_rb behavior.
11
+ # See the README section on discrepancies with MRI 2.1 String#scrub; search this source for 'SKIPPED'
12
+ # and 'OWN'
13
+
14
+ require 'scrub_rb'
15
+ require 'scrub_rb/monkey_patch'
16
+ require 'test/unit'
17
+
18
+
19
+ class BorrowedStringScrubTest < Test::Unit::TestCase
20
+ module AESU
21
+ def ua(str) str.dup.force_encoding("US-ASCII") end
22
+ def a(str) str.dup.force_encoding("ASCII-8BIT") end
23
+ def e(str) str.dup.force_encoding("EUC-JP") end
24
+ def s(str) str.dup.force_encoding("Windows-31J") end
25
+ def u(str) str.dup.force_encoding("UTF-8") end
26
+ end
27
+ include AESU
28
+
29
+ def test_scrub
30
+ str = "\u3042\u3044"
31
+ assert_not_same(str, str.scrub)
32
+ str.force_encoding(Encoding::ISO_2022_JP) # dummy encoding
33
+ assert_not_same(str, str.scrub)
34
+
35
+ # SKIPPED, discrepancy
36
+ #assert_equal("\uFFFD\uFFFD\uFFFD", u("\x80\x80\x80").scrub)
37
+ # OWN equivalent
38
+ assert_equal("\uFFFD", u("\x80\x80\x80").scrub)
39
+
40
+ #assert_equal("\uFFFDA", u("\xF4\x80\x80A").scrub)
41
+
42
+ # examples in Unicode 6.1.0 D93b
43
+ # SKIPPED, discrepancy
44
+ #assert_equal("\x41\uFFFD\uFFFD\x41\uFFFD\x41",
45
+ # u("\x41\xC0\xAF\x41\xF4\x80\x80\x41").scrub)
46
+ # OWN equivalent
47
+ assert_equal("\x41\uFFFD\x41\uFFFD\x41",
48
+ u("\x41\xC0\xAF\x41\xF4\x80\x80\x41").scrub)
49
+
50
+ # SKIPPED, discrepancy
51
+ #assert_equal("\x41\uFFFD\uFFFD\uFFFD\x41",
52
+ # u("\x41\xE0\x9F\x80\x41").scrub)
53
+ # OWN equivalent
54
+ assert_equal("\x41\uFFFD\x41",
55
+ u("\x41\xE0\x9F\x80\x41").scrub)
56
+
57
+ # SKIPPED, discrepancy
58
+ #assert_equal("\u0061\uFFFD\uFFFD\uFFFD\u0062\uFFFD\u0063\uFFFD\uFFFD\u0064",
59
+ # u("\x61\xF1\x80\x80\xE1\x80\xC2\x62\x80\x63\x80\xBF\x64").scrub)
60
+ # OWN equivalent
61
+ assert_equal("\u0061\uFFFD\u0062\uFFFD\u0063\uFFFD\u0064",
62
+ u("\x61\xF1\x80\x80\xE1\x80\xC2\x62\x80\x63\x80\xBF\x64").scrub)
63
+ # SKIPPED, discrepancy
64
+ #assert_equal("abcdefghijklmnopqrstuvwxyz\u0061\uFFFD\uFFFD\uFFFD\u0062\uFFFD\u0063\uFFFD\uFFFD\u0064",
65
+ # u("abcdefghijklmnopqrstuvwxyz\x61\xF1\x80\x80\xE1\x80\xC2\x62\x80\x63\x80\xBF\x64").scrub)
66
+ # OWN equivalent
67
+ assert_equal("abcdefghijklmnopqrstuvwxyz\u0061\uFFFD\u0062\uFFFD\u0063\uFFFD\u0064",
68
+ u("abcdefghijklmnopqrstuvwxyz\x61\xF1\x80\x80\xE1\x80\xC2\x62\x80\x63\x80\xBF\x64").scrub)
69
+
70
+
71
+ assert_equal("\u3042\u3013", u("\xE3\x81\x82\xE3\x81").scrub("\u3013"))
72
+ assert_raise(Encoding::CompatibilityError){ u("\xE3\x81\x82\xE3\x81").scrub(e("\xA4\xA2")) }
73
+ assert_raise(TypeError){ u("\xE3\x81\x82\xE3\x81").scrub(1) }
74
+ assert_raise(ArgumentError){ u("\xE3\x81\x82\xE3\x81\x82\xE3\x81").scrub(u("\x81")) }
75
+ assert_equal(e("\xA4\xA2\xA2\xAE"), e("\xA4\xA2\xA4").scrub(e("\xA2\xAE")))
76
+
77
+ assert_equal("\u3042<e381>", u("\xE3\x81\x82\xE3\x81").scrub{|x|'<'+x.unpack('H*')[0]+'>'})
78
+ assert_raise(Encoding::CompatibilityError){ u("\xE3\x81\x82\xE3\x81").scrub{e("\xA4\xA2")} }
79
+
80
+
81
+ assert_raise(TypeError){ u("\xE3\x81\x82\xE3\x81").scrub{1} }
82
+ assert_raise(ArgumentError){ u("\xE3\x81\x82\xE3\x81\x82\xE3\x81").scrub{u("\x81")} }
83
+ assert_equal(e("\xA4\xA2\xA2\xAE"), e("\xA4\xA2\xA4").scrub{e("\xA2\xAE")})
84
+
85
+ assert_equal(u("\x81"), u("a\x81").scrub {|c| break c})
86
+ assert_raise(ArgumentError) {u("a\x81").scrub {|c| c}}
87
+
88
+ assert_equal("\uFFFD\u3042".encode("UTF-16BE"),
89
+ "\xD8\x00\x30\x42".force_encoding(Encoding::UTF_16BE).
90
+ scrub)
91
+ assert_equal("\uFFFD\u3042".encode("UTF-16LE"),
92
+ "\x00\xD8\x42\x30".force_encoding(Encoding::UTF_16LE).
93
+ scrub)
94
+ assert_equal("\uFFFD".encode("UTF-32BE"),
95
+ "\xff".force_encoding(Encoding::UTF_32BE).
96
+ scrub)
97
+ assert_equal("\uFFFD".encode("UTF-32LE"),
98
+ "\xff".force_encoding(Encoding::UTF_32LE).
99
+ scrub)
100
+ end
101
+
102
+ def test_scrub_bang
103
+ str = "\u3042\u3044"
104
+ assert_same(str, str.scrub!)
105
+ str.force_encoding(Encoding::ISO_2022_JP) # dummy encoding
106
+ assert_same(str, str.scrub!)
107
+
108
+ str = u("\x80\x80\x80")
109
+ str.scrub!
110
+ assert_same(str, str.scrub!)
111
+ # SKIPPED, discrepancy
112
+ #assert_equal("\uFFFD\uFFFD\uFFFD", str)
113
+ # OWN, single replacement
114
+ assert_equal("\uFFFD", str)
115
+ end
116
+ end
data/test/monkey_patch_test.rb ADDED
@@ -0,0 +1,35 @@
1
+ # Encoding: utf-8
2
+
3
+ require 'minitest/spec'
4
+ require 'minitest/autorun'
5
+
6
+ require 'scrub_rb'
7
+
8
+ # Going to require the monkey-patch, which will end up
9
+ # monkey-patching String for entire program execution, don't
10
+ # know any way to monkey patch just for this test, sorry.
11
+
12
+ require 'scrub_rb/monkey_patch'
13
+
14
+ describe "Monkey-patched String#scrub does same thing as ScrubRb.scrub" do
15
+ it "abc\\u3042\\x81" do
16
+ "abc\u3042\x81".scrub.must_equal ScrubRb.scrub("abc\u3042\x81")
17
+ end
18
+
19
+ it "abc\\u3042\\x81, *" do
20
+ "abc\u3042\x81".scrub("*").must_equal ScrubRb.scrub("abc\u3042\x81", "*")
21
+ end
22
+
23
+ it "abc\\u3042\\xE3\\x80 with block" do
24
+ block = lambda do |bytes|
25
+ '<'+bytes.unpack('H*')[0]+'>'
26
+ end
27
+
28
+ "abc\u3042\xE3\x80".scrub(&block).must_equal ScrubRb.scrub("abc\u3042\xE3\x80", &block)
29
+ end
30
+
31
+ it "no bad bytes" do
32
+ "no bad bytes".scrub.must_equal ScrubRb.scrub("no bad bytes")
33
+ end
34
+
35
+ end
data/test/scrub_test.rb ADDED
@@ -0,0 +1,88 @@
1
+ # Encoding: UTF-8
2
+
3
+ require 'minitest/spec'
4
+ require 'minitest/autorun'
5
+
6
+ require 'scrub_rb'
7
+
8
+ describe "ScrubRb" do
9
+ describe "examples from ruby 2.1 String#scrub" do
10
+ it '"abc\u3042\x81".scrub #=> "abc\u3042\uFFFD"' do
11
+ ScrubRb.scrub("abc\u3042\x81").must_equal("abc\u3042\uFFFD")
12
+ end
13
+
14
+ it '"abc\u3042\x81".scrub("*") #=> "abc\u3042*"' do
15
+ ScrubRb.scrub("abc\u3042\x81", "*").must_equal("abc\u3042*")
16
+ end
17
+
18
+ it 'block' do
19
+ ScrubRb.scrub("abc\u3042\xE3\x80") do |bytes|
20
+ '<'+bytes.unpack('H*')[0]+'>'
21
+ end.must_equal("abc\u3042<e380>")
22
+ end
23
+ end
24
+
25
+ # Things investigated in ruby 2.1 String#scrub to make sure
26
+ # we're doing the same things.
27
+ describe "compatible with ruby 2.1 String#scrub edge cases" do
28
+ it "returns copy even on legal string" do
29
+ original = "perfectly legal"
30
+ scrubbed = ScrubRb.scrub(original)
31
+
32
+ # not identity
33
+ refute scrubbed.equal? original
34
+ # yes equality
35
+ assert_equal original, scrubbed
36
+ end
37
+ it "collapses multiple bad bytes into one replacement" do
38
+ ScrubRb.scrub("abc\u3042\xE3\x80").must_equal("abc\u3042\uFFFD")
39
+ end
40
+ end
41
+
42
+
43
+ before do
44
+ @bad_bytes_utf8 = "M\xE9xico".force_encoding("UTF-8")
45
+ @bad_bytes_utf16 = "M\x00\xDFxico".force_encoding( Encoding::UTF_16LE )
46
+ @bad_bytes_ascii = "M\xA1xico".force_encoding("ASCII")
47
+ end
48
+
49
+
50
+ it "replaces with unicode replacement string" do
51
+ scrubbed = ScrubRb.scrub(@bad_bytes_utf8)
52
+
53
+ assert scrubbed.valid_encoding?
54
+ assert_equal "M\uFFFDxico", scrubbed
55
+ end
56
+
57
+ it "replaces with chosen replacement string" do
58
+ ScrubRb.scrub(@bad_bytes_utf8, "*").must_equal("M*xico")
59
+ end
60
+
61
+ it "replaces with empty string" do
62
+ ScrubRb.scrub(@bad_bytes_utf8, '').must_equal("Mxico")
63
+ end
64
+
65
+
66
+ it "replaces non-unicode encoding with ? replacement str" do
67
+ if RUBY_PLATFORM == "java"
68
+ skip("known not to pass on JRuby, reported to JRuby github #1361")
69
+ end
70
+ ScrubRb.scrub(@bad_bytes_ascii).must_equal("M?xico")
71
+ end
72
+
73
+
74
+ it "works with first byte bad" do
75
+ str = "\xE9xico".force_encoding("UTF-8")
76
+ ScrubRb.scrub(str, "?").must_equal("?xico")
77
+ end
78
+
79
+ it "works with last bad byte" do
80
+ str = "Mexico\xE9".force_encoding("UTF-8")
81
+ ScrubRb.scrub(str, "?").must_equal("Mexico?")
82
+ end
83
+
84
+ it "works with nil input" do
85
+ ScrubRb.scrub(nil).must_be_nil
86
+ end
87
+
88
+ end
metadata ADDED
@@ -0,0 +1,89 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: scrub_rb
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - Jonathan Rochkind
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2013-12-26 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: bundler
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - ~>
18
+ - !ruby/object:Gem::Version
19
+ version: '1.3'
20
+ type: :development
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - ~>
25
+ - !ruby/object:Gem::Version
26
+ version: '1.3'
27
+ - !ruby/object:Gem::Dependency
28
+ name: rake
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - '>='
32
+ - !ruby/object:Gem::Version
33
+ version: '0'
34
+ type: :development
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - '>='
39
+ - !ruby/object:Gem::Version
40
+ version: '0'
41
+ description:
42
+ email:
43
+ - jonathan@dnil.net
44
+ executables: []
45
+ extensions: []
46
+ extra_rdoc_files: []
47
+ files:
48
+ - .gitignore
49
+ - .travis.yml
50
+ - Gemfile
51
+ - LICENSE.txt
52
+ - README.md
53
+ - Rakefile
54
+ - benchmark/benchmark.rb
55
+ - lib/scrub_rb.rb
56
+ - lib/scrub_rb/monkey_patch.rb
57
+ - lib/scrub_rb/version.rb
58
+ - scrub_rb.gemspec
59
+ - test/borrowed_string_scrub_test.rb
60
+ - test/monkey_patch_test.rb
61
+ - test/scrub_test.rb
62
+ homepage: https://github.com/jrochkind/scrub_rb
63
+ licenses:
64
+ - MIT
65
+ metadata: {}
66
+ post_install_message:
67
+ rdoc_options: []
68
+ require_paths:
69
+ - lib
70
+ required_ruby_version: !ruby/object:Gem::Requirement
71
+ requirements:
72
+ - - '>='
73
+ - !ruby/object:Gem::Version
74
+ version: '0'
75
+ required_rubygems_version: !ruby/object:Gem::Requirement
76
+ requirements:
77
+ - - '>='
78
+ - !ruby/object:Gem::Version
79
+ version: '0'
80
+ requirements: []
81
+ rubyforge_project:
82
+ rubygems_version: 2.0.3
83
+ signing_key:
84
+ specification_version: 4
85
+ summary: Pure-ruby polyfill of MRI 2.1 String#scrub, for ruby 1.9 and 2.0 any interpreter
86
+ test_files:
87
+ - test/borrowed_string_scrub_test.rb
88
+ - test/monkey_patch_test.rb
89
+ - test/scrub_test.rb