scrub_rb 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/.gitignore +17 -0
- data/.travis.yml +6 -0
- data/Gemfile +4 -0
- data/LICENSE.txt +22 -0
- data/README.md +79 -0
- data/Rakefile +12 -0
- data/benchmark/benchmark.rb +49 -0
- data/lib/scrub_rb.rb +69 -0
- data/lib/scrub_rb/monkey_patch.rb +20 -0
- data/lib/scrub_rb/version.rb +3 -0
- data/scrub_rb.gemspec +23 -0
- data/test/borrowed_string_scrub_test.rb +116 -0
- data/test/monkey_patch_test.rb +35 -0
- data/test/scrub_test.rb +88 -0
- metadata +89 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 0048aa4b025f832fcb8cdc3d1676f097fc887239
+  data.tar.gz: aed4530569a2dfedc28a434a6eff778d00bae38c
+SHA512:
+  metadata.gz: 35e024e14c9fe573d9315775785dafed01cb73a7b0c11cb68939a969c3cde8cb2cb631fd7572a579ab89a99bb87945f3cabe9c6ffe01507848cbe679d8e23d5a
+  data.tar.gz: e4c13c2f1e0b119fcae0361e439d0b997d12f5dfbd5efa16d43503efccf9e40ad856a9e2c89ccd42f05bb4979a9f605f600966caf3b8e447e13c04f4e245cbbb
data/.gitignore
ADDED
data/.travis.yml
ADDED
data/Gemfile
ADDED
data/LICENSE.txt
ADDED
@@ -0,0 +1,22 @@
+Copyright (c) 2013 Jonathan Rochkind
+
+MIT License
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,79 @@
+# ScrubRb
+
+Pure-ruby polyfill of MRI 2.1 String#scrub, for ruby 1.9 and 2.0 on any interpreter
+
+[Build Status](https://travis-ci.org/jrochkind/scrub_rb)
+
+## Installation
+
+Add this line to your application's Gemfile:
+
+    gem 'scrub_rb'
+
+And then execute:
+
+    $ bundle
+
+Or install it yourself as:
+
+    $ gem install scrub_rb
+
+
+## What it is
+
+Ruby 2.1 introduces String#scrub, a method to replace bytes in a given string that are invalid
+for its specified encoding. See docs in [MRI ruby source](https://github.com/ruby/ruby/blob/1e8a05c1dfee94db9b6b825097e1d192ad32930a/string.c#L7772)
+
+If you need String#scrub in MRI ruby 2.0, you can use the [string-scrub gem](https://github.com/hsbt/string-scrub), which provides a backport of the C code from MRI ruby 2.1 into MRI 2.0.
+
+What if you need this functionality in ruby 1.9, in JRuby in 1.9 or 2.0 modes, or in
+any other ruby platform that does not (or does not yet) support String#scrub? What if
+you need to write code that will work on any of these platforms?
+
+This gem provides a pure-ruby implementation of `String#scrub` and `#scrub!` that should work on
+any ruby platform, monkey-patched into String when you `require 'scrub_rb/monkey_patch'`. It will only monkey-patch String
+if String does not already have a #scrub method -- so it's safe to include
+this gem in multi-platform code: when the code runs on ruby 2.1, String#scrub will
+still be the original stdlib implementation.
+
+~~~ruby
+# Encoding: utf-8
+
+"abc\u3042\x81".scrub #=> "abc\u3042\uFFFD"
+"abc\u3042\x81".scrub("*") #=> "abc\u3042*"
+"abc\u3042\xE3\x80".scrub{|bytes| '<'+bytes.unpack('H*')[0]+'>' } #=> "abc\u3042<e380>"
+~~~
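The three call forms in the README example can be tried directly on Ruby 2.1 or later, where `String#scrub` is built in. The sketch below is only an illustration and is not part of the gem. On an older interpreter the same calls would presumably need `require 'scrub_rb'` followed by `require 'scrub_rb/monkey_patch'`, since in this release the patch lives in a separate file that is not loaded automatically.

```ruby
# Assumes Ruby >= 2.1 (built-in String#scrub) and a UTF-8 source file.
# With scrub_rb on an older interpreter, the assumption is that you would
# first run:
#   require 'scrub_rb'
#   require 'scrub_rb/monkey_patch'

bad = "abc\u3042\x81"            # the trailing \x81 is not valid UTF-8
puts bad.valid_encoding?         # the input is reported as invalid

default_form = bad.scrub         # invalid byte replaced with U+FFFD
string_form  = bad.scrub("*")    # invalid byte replaced with the given string
block_form   = "abc\u3042\xE3\x80".scrub { |bytes| "<#{bytes.unpack('H*')[0]}>" }

puts default_form.valid_encoding?
puts string_form
puts block_form
```

The block form receives the invalid bytes themselves, which is handy for logging what was dropped rather than silently substituting a marker.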
+
+## Performance
+
+This pure-ruby implementation is about an order of magnitude slower than the stdlib String#scrub on ruby 2.1, or than the `string-scrub` C gem on MRI 2.0. For most applications, string scrubbing will probably be a small portion of total execution time; it is still fairly fast and hopefully won't be a problem.
+
+## Discrepancy with MRI 2.1 String#scrub
+
+If there is more than one contiguous invalid byte in a string, should the entire run be replaced with a single replacement, or should each invalid byte get its own replacement?
+
+I have not been able to understand the logic MRI 2.1 uses to divide contiguous invalid bytes into
+certain sub-sequences for replacement, as represented in the [test suite](https://github.com/ruby/ruby/blob/3ac0ec4ecdea849143ed64e8935e6675b341e44b/test/ruby/test_m17n.rb#L1505). The test suite may be suggesting that the examples are from Unicode documentation, but I wasn't able to find such documentation to see if it shed any light on the matter.
+
+`scrub_rb` always combines contiguous invalid bytes into a single replacement. As a result, it fails several tests from the original String#scrub test suite, which expect other divisions of contiguous invalid bytes. I've altered our local tests to match our current behavior.
+
+Beware of this potential difference when using the block form of #scrub especially -- you may get a different number of calls, with a sequence of invalid bytes divided into different substrings, with `scrub_rb` as compared to official MRI 2.1 String#scrub or `string-scrub`.
+
+For most uses, this discrepancy is probably not of consequence.
+
+If anyone can explain what's going on here, I'm very curious! I can't read C very well to try and figure it out from source.
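To make the difference concrete, the sketch below counts how the block form divides a run of invalid bytes. `collapsing_scrub` is a hypothetical stand-in written for this illustration: it mirrors the collapse-everything approach described above, but it is not the gem's code. The comparison assumes a Ruby with built-in `String#scrub` (2.1 or later). MRI's divisions appear to line up with the "maximal subpart" practice the Unicode Standard describes for U+FFFD substitution, which the borrowed test comments cite as D93b; treat that reading as an inference, not as something the README confirms.

```ruby
# Hypothetical helper, for illustration only: replaces each maximal run of
# invalid characters with ONE replacement and yields the whole run to the block.
def collapsing_scrub(str)
  result = String.new(encoding: str.encoding)
  run    = String.new(encoding: str.encoding)
  str.each_char do |c|
    if c.valid_encoding?
      unless run.empty?
        result << yield(run.dup)
        run.clear
      end
      result << c
    else
      run << c
    end
  end
  result << yield(run.dup) unless run.empty?
  result
end

# Collect the byte chunks the built-in String#scrub hands to its block.
def block_chunks_builtin(str)
  chunks = []
  str.scrub { |bytes| chunks << bytes; "?" }
  chunks
end

# Collect the byte chunks the collapsing variant hands to its block.
def block_chunks_collapsing(str)
  chunks = []
  collapsing_scrub(str) { |bytes| chunks << bytes; "?" }
  chunks
end

input = "a\x80\x80\x80b".dup.force_encoding(Encoding::UTF_8)
puts "built-in block calls:   #{block_chunks_builtin(input).length}"
puts "collapsing block calls: #{block_chunks_collapsing(input).length}"
```

If a block has side effects (counting, logging, raising on the first bad byte), this difference in call count is where the two implementations would diverge visibly.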
+
+## JRuby may raise
+
+Due to an apparent JRuby bug, some invalid strings cause an internal
+exception from JRuby when scrubbing with scrub_rb. The entire original MRI test suite
+does pass against scrub_rb in JRuby -- but [one test original to us, involving
+input tagged 'ascii' encoding](./test/scrub_test.rb#L67), fails raising an ArrayIndexOutOfBoundsException
+from inside of JRuby. I have filed an [issue with JRuby](https://github.com/jruby/jruby/issues/1361).
+
+I believe this problem should be rare -- so far, the only reproduction case involves an input string tagged 'ascii' encoding, which probably isn't a common use case. But it's unfortunate
+that `scrub_rb` isn't reliable on JRuby. I haven't been able to figure out any workaround in ruby for the JRuby bug -- you could theoretically provide a Java alternate implementation usable in JRuby, but I'm not sure what Java tools are available and how hard it would be to match the scrub api.
+
+## Contributions
+
+Pull requests or suggestions welcome, especially on performance, on the JRuby issue, and on discrepancies with official String#scrub.
data/Rakefile
ADDED
data/benchmark/benchmark.rb
ADDED
@@ -0,0 +1,49 @@
+# Encoding: utf-8
+
+# Just gives us a ballpark. Some issues with this benchmark:
+# * our strings might not be representative of real work
+# * we're testing against static class method, not actual monkey patch, which
+#   would have one more method call, which may or may not matter.
+
+require 'benchmark'
+
+# for MRI 2.0, let's load the C scrub gem
+begin
+  require 'string/scrub'
+rescue LoadError
+  puts "(Could not load scrub gem C backfill)"
+end
+
+require 'scrub_rb'
+
+test_strings = [
+  "abc\u3042\x81",
+  "good string",
+  "abc\u3042\xE3\x80",
+  "another good string",
+  "M\xE9xico",
+  "More good string"
+]
+
+n = 10000
+Benchmark.bmbm do |x|
+  x.report("built-in") do
+    n.times do
+      test_strings.each do |str|
+        str.scrub
+        str.scrub("*")
+        str.scrub {|bytes| '<'+bytes.unpack('H*')[0]+'>'}
+      end
+    end
+  end
+
+  x.report("ScrubRb") do
+    n.times do
+      test_strings.each do |str|
+        ScrubRb.scrub(str)
+        ScrubRb.scrub(str, "*")
+        ScrubRb.scrub(str) {|bytes| '<'+bytes.unpack('H*')[0]+'>'}
+      end
+    end
+  end
+end
data/lib/scrub_rb.rb
ADDED
@@ -0,0 +1,69 @@
+require "scrub_rb/version"
+
+module ScrubRb
+
+  # static function implementation of String#scrub, where
+  # first arg is the string.
+  #
+  #   ScrubRb.scrub("abc\u3042\x81") #=> "abc\u3042\uFFFD"
+  #   ScrubRb.scrub("abc\u3042\x81", "*") #=> "abc\u3042*"
+  #   ScrubRb.scrub("abc\u3042\xE3\x80") {|bytes| '<'+bytes.unpack('H*')[0]+'>' } #=> "abc\u3042<e380>"
+  def self.scrub(str, replacement=nil, &block)
+    return str if str.nil?
+
+    if replacement.nil? && ! block_given?
+      replacement =
+        # UTF-8 for unicode replacement char \uFFFD, encode in
+        # encoding of input string, using '?' as a fallback where
+        # it can't be (which should be non-unicode encodings)
+        "\xEF\xBF\xBD".force_encoding("UTF-8").encode( str.encoding,
+                                                      :undef => :replace,
+                                                      :replace => '?' )
+    end
+
+    result = ""
+    bad_chars = ""
+    bad_char_flag = false # weirdly, optimization to use flag
+
+    str.chars.each do |c|
+      if c.valid_encoding?
+        if bad_char_flag
+          scrub_replace(result, bad_chars, replacement, block)
+          bad_char_flag = false
+        end
+        result << c
+      else
+        bad_char_flag = true
+        bad_chars << c
+      end
+    end
+    if bad_char_flag
+      scrub_replace(result, bad_chars, replacement, block)
+    end
+
+    return result
+  end
+
+  # a bare `private` has no effect on `def self.` methods; see private_class_method below
+  def self.scrub_replace(result, bad_chars, replacement, block)
+    if block
+      r = block.call(bad_chars)
+    else
+      r = replacement
+    end
+
+    if r.respond_to?(:to_str)
+      r = r.to_str
+    else
+      raise TypeError, "no implicit conversion of #{r.class} into String"
+    end
+
+    unless r.valid_encoding?
+      raise ArgumentError, "replacement must be valid byte sequence '#{r}'"
+    end
+
+    result << r
+    bad_chars.clear
+  end
+  private_class_method :scrub_replace
+end
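The default-replacement step near the top of `ScrubRb.scrub` can be tried in isolation. The sketch below is an illustration, not the gem's code: `default_replacement` is a hypothetical helper name, and it assumes that transcoding U+FFFD with `String#encode` behaves as in current MRI.

```ruby
# Hypothetical helper mirroring the fallback described in the comments above:
# transcode U+FFFD into the target encoding, substituting "?" when that
# encoding has no such character.
def default_replacement(encoding)
  "\uFFFD".encode(encoding, undef: :replace, replace: "?")
end

# Unicode encodings keep the replacement character, in their own byte layout.
puts default_replacement(Encoding::UTF_8).unpack("U*").inspect
puts default_replacement(Encoding::UTF_16LE).bytes.inspect

# Non-Unicode encodings fall back to "?".
puts default_replacement(Encoding::US_ASCII)
puts default_replacement(Encoding::ISO_8859_1)
```

Choosing the replacement in the input's own encoding is what lets it be appended to the result without an encoding compatibility error.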
data/lib/scrub_rb/monkey_patch.rb
ADDED
@@ -0,0 +1,20 @@
+# Have to explicitly require this file to get the monkey
+# patching of String#scrub in there, this file won't and shouldn't
+# be 'require'd in automatically.
+#
+# However if there's already a String#scrub defined, requiring
+# this file will do nothing.
+
+class String
+  # Only monkey patch if not already defined....
+  unless instance_methods.include? :scrub
+    def scrub(replacement=nil, &block)
+      ScrubRb.scrub(self, replacement, &block)
+    end
+
+    def scrub!(*args, &block)
+      self.replace( self.scrub(*args, &block) )
+    end
+  end
+
+end
data/scrub_rb.gemspec
ADDED
@@ -0,0 +1,23 @@
+# coding: utf-8
+lib = File.expand_path('../lib', __FILE__)
+$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+require 'scrub_rb/version'
+
+Gem::Specification.new do |spec|
+  spec.name = "scrub_rb"
+  spec.version = ScrubRb::VERSION
+  spec.authors = ["Jonathan Rochkind"]
+  spec.email = ["jonathan@dnil.net"]
+  spec.summary = %q{Pure-ruby polyfill of MRI 2.1 String#scrub, for ruby 1.9 and 2.0 on any interpreter
+}
+  spec.homepage = "https://github.com/jrochkind/scrub_rb"
+  spec.license = "MIT"
+
+  spec.files = `git ls-files`.split($/)
+  spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
+  spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
+  spec.require_paths = ["lib"]
+
+  spec.add_development_dependency "bundler", "~> 1.3"
+  spec.add_development_dependency "rake"
+end
data/test/borrowed_string_scrub_test.rb
ADDED
@@ -0,0 +1,116 @@
+# coding: US-ASCII
+
+# This whole file borrowed from string-scrub:
+# https://raw.github.com/hsbt/string-scrub/master/test/test_scrub.rb
+# Actually adapted originally from MRI test suite:
+# https://github.com/ruby/ruby/blob/3ac0ec4ecdea849143ed64e8935e6675b341e44b/test/ruby/test_m17n.rb#L1505
+# We want to make sure we pass the same tests.
+#
+# NOTE: Some tests of multiple contiguous illegal bytes, we've had
+# to change to match scrub_rb behavior.
+# See README under 'Discrepancy'; search this source for 'SKIPPED'
+# and 'OWN'
+
+require 'scrub_rb'
+require 'scrub_rb/monkey_patch'
+require 'test/unit'
+
+
+class BorrowedStringScrubTest < Test::Unit::TestCase
+  module AESU
+    def ua(str) str.dup.force_encoding("US-ASCII") end
+    def a(str) str.dup.force_encoding("ASCII-8BIT") end
+    def e(str) str.dup.force_encoding("EUC-JP") end
+    def s(str) str.dup.force_encoding("Windows-31J") end
+    def u(str) str.dup.force_encoding("UTF-8") end
+  end
+  include AESU
+
+  def test_scrub
+    str = "\u3042\u3044"
+    assert_not_same(str, str.scrub)
+    str.force_encoding(Encoding::ISO_2022_JP) # dummy encoding
+    assert_not_same(str, str.scrub)
+
+    # SKIPPED, discrepancy
+    #assert_equal("\uFFFD\uFFFD\uFFFD", u("\x80\x80\x80").scrub)
+    # OWN equivalent
+    assert_equal("\uFFFD", u("\x80\x80\x80").scrub)
+
+    #assert_equal("\uFFFDA", u("\xF4\x80\x80A").scrub)
+
+    # examples in Unicode 6.1.0 D93b
+    # SKIPPED, discrepancy
+    #assert_equal("\x41\uFFFD\uFFFD\x41\uFFFD\x41",
+    #             u("\x41\xC0\xAF\x41\xF4\x80\x80\x41").scrub)
+    # OWN equivalent
+    assert_equal("\x41\uFFFD\x41\uFFFD\x41",
+                 u("\x41\xC0\xAF\x41\xF4\x80\x80\x41").scrub)
+
+    # SKIPPED, discrepancy
+    #assert_equal("\x41\uFFFD\uFFFD\uFFFD\x41",
+    #             u("\x41\xE0\x9F\x80\x41").scrub)
+    # OWN equivalent
+    assert_equal("\x41\uFFFD\x41",
+                 u("\x41\xE0\x9F\x80\x41").scrub)
+
+    # SKIPPED, discrepancy
+    #assert_equal("\u0061\uFFFD\uFFFD\uFFFD\u0062\uFFFD\u0063\uFFFD\uFFFD\u0064",
+    #             u("\x61\xF1\x80\x80\xE1\x80\xC2\x62\x80\x63\x80\xBF\x64").scrub)
+    # OWN equivalent
+    assert_equal("\u0061\uFFFD\u0062\uFFFD\u0063\uFFFD\u0064",
+                 u("\x61\xF1\x80\x80\xE1\x80\xC2\x62\x80\x63\x80\xBF\x64").scrub)
+    # SKIPPED, discrepancy
+    #assert_equal("abcdefghijklmnopqrstuvwxyz\u0061\uFFFD\uFFFD\uFFFD\u0062\uFFFD\u0063\uFFFD\uFFFD\u0064",
+    #             u("abcdefghijklmnopqrstuvwxyz\x61\xF1\x80\x80\xE1\x80\xC2\x62\x80\x63\x80\xBF\x64").scrub)
+    # OWN equivalent
+    assert_equal("abcdefghijklmnopqrstuvwxyz\u0061\uFFFD\u0062\uFFFD\u0063\uFFFD\u0064",
+                 u("abcdefghijklmnopqrstuvwxyz\x61\xF1\x80\x80\xE1\x80\xC2\x62\x80\x63\x80\xBF\x64").scrub)
+
+
+    assert_equal("\u3042\u3013", u("\xE3\x81\x82\xE3\x81").scrub("\u3013"))
+    assert_raise(Encoding::CompatibilityError){ u("\xE3\x81\x82\xE3\x81").scrub(e("\xA4\xA2")) }
+    assert_raise(TypeError){ u("\xE3\x81\x82\xE3\x81").scrub(1) }
+    assert_raise(ArgumentError){ u("\xE3\x81\x82\xE3\x81\x82\xE3\x81").scrub(u("\x81")) }
+    assert_equal(e("\xA4\xA2\xA2\xAE"), e("\xA4\xA2\xA4").scrub(e("\xA2\xAE")))
+
+    assert_equal("\u3042<e381>", u("\xE3\x81\x82\xE3\x81").scrub{|x|'<'+x.unpack('H*')[0]+'>'})
+    assert_raise(Encoding::CompatibilityError){ u("\xE3\x81\x82\xE3\x81").scrub{e("\xA4\xA2")} }
+
+
+    assert_raise(TypeError){ u("\xE3\x81\x82\xE3\x81").scrub{1} }
+    assert_raise(ArgumentError){ u("\xE3\x81\x82\xE3\x81\x82\xE3\x81").scrub{u("\x81")} }
+    assert_equal(e("\xA4\xA2\xA2\xAE"), e("\xA4\xA2\xA4").scrub{e("\xA2\xAE")})
+
+    assert_equal(u("\x81"), u("a\x81").scrub {|c| break c})
+    assert_raise(ArgumentError) {u("a\x81").scrub {|c| c}}
+
+    assert_equal("\uFFFD\u3042".encode("UTF-16BE"),
+                 "\xD8\x00\x30\x42".force_encoding(Encoding::UTF_16BE).
+                 scrub)
+    assert_equal("\uFFFD\u3042".encode("UTF-16LE"),
+                 "\x00\xD8\x42\x30".force_encoding(Encoding::UTF_16LE).
+                 scrub)
+    assert_equal("\uFFFD".encode("UTF-32BE"),
+                 "\xff".force_encoding(Encoding::UTF_32BE).
+                 scrub)
+    assert_equal("\uFFFD".encode("UTF-32LE"),
+                 "\xff".force_encoding(Encoding::UTF_32LE).
+                 scrub)
+  end
+
+  def test_scrub_bang
+    str = "\u3042\u3044"
+    assert_same(str, str.scrub!)
+    str.force_encoding(Encoding::ISO_2022_JP) # dummy encoding
+    assert_same(str, str.scrub!)
+
+    str = u("\x80\x80\x80")
+    str.scrub!
+    assert_same(str, str.scrub!)
+    # SKIPPED, discrepancy
+    #assert_equal("\uFFFD\uFFFD\uFFFD", str)
+    # OWN, single replacement
+    assert_equal("\uFFFD", str)
+  end
+end
data/test/monkey_patch_test.rb
ADDED
@@ -0,0 +1,35 @@
+# Encoding: utf-8
+
+require 'minitest/spec'
+require 'minitest/autorun'
+
+require 'scrub_rb'
+
+# Going to require the monkey-patch, which will end up
+# monkey-patching String for entire program execution, don't
+# know any way to monkey patch just for this test, sorry.
+
+require 'scrub_rb/monkey_patch'
+
+describe "Monkey-patched String#scrub does same thing as ScrubRb.scrub" do
+  it "abc\\u3042\\x81" do
+    "abc\u3042\x81".scrub.must_equal ScrubRb.scrub("abc\u3042\x81")
+  end
+
+  it "abc\\u3042\\x81, *" do
+    "abc\u3042\x81".scrub("*").must_equal ScrubRb.scrub("abc\u3042\x81", "*")
+  end
+
+  it "abc\\u3042\\xE3\\x80 with block" do
+    block = lambda do |bytes|
+      '<'+bytes.unpack('H*')[0]+'>'
+    end
+
+    "abc\u3042\xE3\x80".scrub(&block).must_equal ScrubRb.scrub("abc\u3042\xE3\x80", &block)
+  end
+
+  it "no bad bytes" do
+    "no bad bytes".scrub.must_equal ScrubRb.scrub("no bad bytes")
+  end
+
+end
data/test/scrub_test.rb
ADDED
@@ -0,0 +1,88 @@
+# Encoding: UTF-8
+
+require 'minitest/spec'
+require 'minitest/autorun'
+
+require 'scrub_rb'
+
+describe "ScrubRb" do
+  describe "examples from ruby 2.1 String#scrub" do
+    it '"abc\u3042\x81".scrub #=> "abc\u3042\uFFFD"' do
+      ScrubRb.scrub("abc\u3042\x81").must_equal("abc\u3042\uFFFD")
+    end
+
+    it '"abc\u3042\x81".scrub("*") #=> "abc\u3042*"' do
+      ScrubRb.scrub("abc\u3042\x81", "*").must_equal("abc\u3042*")
+    end
+
+    it 'block' do
+      ScrubRb.scrub("abc\u3042\xE3\x80") do |bytes|
+        '<'+bytes.unpack('H*')[0]+'>'
+      end.must_equal("abc\u3042<e380>")
+    end
+  end
+
+  # Things investigated in ruby 2.1 String#scrub to make sure
+  # we're doing the same things.
+  describe "compatible with ruby 2.1 String#scrub edge cases" do
+    it "returns copy even on legal string" do
+      original = "perfectly legal"
+      scrubbed = ScrubRb.scrub(original)
+
+      # not identity
+      refute scrubbed.equal? original
+      # yes equality
+      assert_equal original, scrubbed
+    end
+    it "collapses multiple bad bytes into one replacement" do
+      ScrubRb.scrub("abc\u3042\xE3\x80").must_equal("abc\u3042\uFFFD")
+    end
+  end
+
+
+  before do
+    @bad_bytes_utf8 = "M\xE9xico".force_encoding("UTF-8")
+    @bad_bytes_utf16 = "M\x00\xDFxico".force_encoding( Encoding::UTF_16LE )
+    @bad_bytes_ascii = "M\xA1xico".force_encoding("ASCII")
+  end
+
+
+  it "replaces with unicode replacement string" do
+    scrubbed = ScrubRb.scrub(@bad_bytes_utf8)
+
+    assert scrubbed.valid_encoding?
+    assert_equal "M\uFFFDxico", scrubbed
+  end
+
+  it "replaces with chosen replacement string" do
+    ScrubRb.scrub(@bad_bytes_utf8, "*").must_equal("M*xico")
+  end
+
+  it "replaces with empty string" do
+    ScrubRb.scrub(@bad_bytes_utf8, '').must_equal("Mxico")
+  end
+
+
+  it "replaces non-unicode encoding with ? replacement str" do
+    if RUBY_PLATFORM == "java"
+      skip("known not to pass on JRuby, reported to JRuby github #1361")
+    end
+    ScrubRb.scrub(@bad_bytes_ascii).must_equal("M?xico")
+  end
+
+
+  it "works with first byte bad" do
+    str = "\xE9xico".force_encoding("UTF-8")
+    ScrubRb.scrub(str, "?").must_equal("?xico")
+  end
+
+  it "works with last byte bad" do
+    str = "Mexico\xE9".force_encoding("UTF-8")
+    ScrubRb.scrub(str, "?").must_equal("Mexico?")
+  end
+
+  it "works with nil input" do
+    ScrubRb.scrub(nil).must_be_nil
+  end
+
+end
metadata
ADDED
@@ -0,0 +1,89 @@
+--- !ruby/object:Gem::Specification
+name: scrub_rb
+version: !ruby/object:Gem::Version
+  version: 0.1.0
+platform: ruby
+authors:
+- Jonathan Rochkind
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2013-12-26 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: bundler
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '1.3'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '1.3'
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+description:
+email:
+- jonathan@dnil.net
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- .gitignore
+- .travis.yml
+- Gemfile
+- LICENSE.txt
+- README.md
+- Rakefile
+- benchmark/benchmark.rb
+- lib/scrub_rb.rb
+- lib/scrub_rb/monkey_patch.rb
+- lib/scrub_rb/version.rb
+- scrub_rb.gemspec
+- test/borrowed_string_scrub_test.rb
+- test/monkey_patch_test.rb
+- test/scrub_test.rb
+homepage: https://github.com/jrochkind/scrub_rb
+licenses:
+- MIT
+metadata: {}
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 2.0.3
+signing_key:
+specification_version: 4
+summary: Pure-ruby polyfill of MRI 2.1 String#scrub, for ruby 1.9 and 2.0 on any interpreter
+test_files:
+- test/borrowed_string_scrub_test.rb
+- test/monkey_patch_test.rb
+- test/scrub_test.rb