scrub_rb 0.1.0
- checksums.yaml +7 -0
- data/.gitignore +17 -0
- data/.travis.yml +6 -0
- data/Gemfile +4 -0
- data/LICENSE.txt +22 -0
- data/README.md +79 -0
- data/Rakefile +12 -0
- data/benchmark/benchmark.rb +49 -0
- data/lib/scrub_rb.rb +69 -0
- data/lib/scrub_rb/monkey_patch.rb +20 -0
- data/lib/scrub_rb/version.rb +3 -0
- data/scrub_rb.gemspec +23 -0
- data/test/borrowed_string_scrub_test.rb +116 -0
- data/test/monkey_patch_test.rb +35 -0
- data/test/scrub_test.rb +88 -0
- metadata +89 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 0048aa4b025f832fcb8cdc3d1676f097fc887239
+  data.tar.gz: aed4530569a2dfedc28a434a6eff778d00bae38c
+SHA512:
+  metadata.gz: 35e024e14c9fe573d9315775785dafed01cb73a7b0c11cb68939a969c3cde8cb2cb631fd7572a579ab89a99bb87945f3cabe9c6ffe01507848cbe679d8e23d5a
+  data.tar.gz: e4c13c2f1e0b119fcae0361e439d0b997d12f5dfbd5efa16d43503efccf9e40ad856a9e2c89ccd42f05bb4979a9f605f600966caf3b8e447e13c04f4e245cbbb
data/.gitignore
ADDED
data/.travis.yml
ADDED
data/Gemfile
ADDED
data/LICENSE.txt
ADDED
@@ -0,0 +1,22 @@
+Copyright (c) 2013 Jonathan Rochkind
+
+MIT License
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,79 @@
+# ScrubRb
+
+Pure-ruby polyfill of MRI 2.1 String#scrub, for ruby 1.9 and 2.0 on any interpreter
+
+[![Build Status](https://travis-ci.org/jrochkind/scrub_rb.png?branch=master)](https://travis-ci.org/jrochkind/scrub_rb)
+
+## Installation
+
+Add this line to your application's Gemfile:
+
+    gem 'scrub_rb'
+
+And then execute:
+
+    $ bundle
+
+Or install it yourself as:
+
+    $ gem install scrub_rb
+
+
+## What it is
+
+Ruby 2.1 introduces String#scrub, a method that replaces the bytes in a string that are
+invalid for its specified encoding. See docs in [MRI ruby source](https://github.com/ruby/ruby/blob/1e8a05c1dfee94db9b6b825097e1d192ad32930a/string.c#L7772)
+
+If you need String#scrub in MRI ruby 2.0, you can use the [string-scrub gem](https://github.com/hsbt/string-scrub), which provides a backport of the C code from MRI ruby 2.1 into MRI 2.0.
+
+What if you need this functionality in ruby 1.9, in jruby in 1.9 or 2.0 modes, or in
+any other ruby platform that does not (or does not yet) support String#scrub? What if
+you need to write code that will work on any of these platforms?
+
+This gem provides a pure-ruby implementation of `String#scrub` and `#scrub!` that should work on any
+ruby platform. Requiring `scrub_rb/monkey_patch` monkey-patches them into String, but only if String
+does not already have a #scrub method -- so it's safe to include this gem in multi-platform code:
+when the code runs on ruby 2.1, String#scrub will still be the original built-in implementation.
+
+~~~ruby
+# Encoding: utf-8
+require 'scrub_rb'
+require 'scrub_rb/monkey_patch'
+"abc\u3042\x81".scrub #=> "abc\u3042\uFFFD"
+"abc\u3042\x81".scrub("*") #=> "abc\u3042*"
+"abc\u3042\xE3\x80".scrub{|bytes| '<'+bytes.unpack('H*')[0]+'>' } #=> "abc\u3042<e380>"
+~~~
+
+## Performance
+
+This pure ruby implementation is about an order of magnitude slower than the built-in String#scrub on ruby 2.1, or than the `string-scrub` C gem on MRI 2.0. For most applications, string scrubbing will probably be a small portion of total execution time, so this implementation should still be fast enough not to be a problem.
+
+## Discrepancy with MRI 2.1 String#scrub
+
+If a string contains more than one consecutive invalid byte, should the entire run be replaced with only one replacement, or should each invalid byte get its own replacement?
+
+I have not been able to understand the logic MRI 2.1 uses to divide contiguous invalid bytes into
+certain sub-sequences for replacement, as represented in the [test suite](https://github.com/ruby/ruby/blob/3ac0ec4ecdea849143ed64e8935e6675b341e44b/test/ruby/test_m17n.rb#L1505). The test suite may be suggesting that the examples are from unicode documentation, but I wasn't able to find such documentation to see if it shed any light on the matter.
+
+`scrub_rb` always combines contiguous invalid bytes into a single replacement. As a result, it fails several tests from the original String#scrub test suite, which expect other divisions of contiguous invalid bytes. I've altered our local copies of those tests to match our current behavior.
+
+Beware of this potential difference especially when using the block form of #scrub -- with `scrub_rb` the block may be called a different number of times, with the sequence of invalid bytes divided into different substrings, than with official MRI 2.1 String#scrub or `string-scrub`.
+
+For most uses, this discrepancy is probably not of consequence.
+
+If anyone can explain what's going on here, I'm very curious! I can't read C well enough to figure it out from the source.
+
+## JRuby may raise
+
+Due to an apparent JRuby bug, some invalid strings cause an internal
+exception from JRuby when scrub_rb tries to scrub them. The entire original MRI test suite
+passes against scrub_rb in JRuby -- but [one test original to us, involving
+input tagged 'ascii' encoding](./test/scrub_test.rb#L67), fails, raising an ArrayIndexOutOfBoundsException
+from inside of JRuby. I have filed an [issue with JRuby](https://github.com/jruby/jruby/issues/1361).
+
+I believe this problem should be rare -- so far, the only reproduction case involves an input string tagged 'ascii' encoding, which probably isn't a common use case. But it's unfortunate
+that `scrub_rb` isn't reliable on JRuby. I haven't been able to figure out any workaround in ruby for the JRuby bug -- you could theoretically provide an alternate Java implementation usable in JRuby, but I'm not sure what Java tools are available or how hard it would be to match the scrub API.
+
+## Contributions
+
+Pull requests or suggestions welcome, especially on performance, on the JRuby issue, and on discrepancies with official String#scrub.
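The README's section on the discrepancy with MRI 2.1 is easier to follow with the runs of invalid bytes listed explicitly. The sketch below is not part of the gem: `invalid_runs` is a hypothetical helper that groups contiguous invalid characters the way the README says `scrub_rb` does, so each element it returns corresponds to one replacement (or one block call). It assumes `String#chars` hands back every invalid byte as its own invalid one-character string, which is also what `lib/scrub_rb.rb` relies on.

```ruby
# Hypothetical helper (not part of scrub_rb): returns each maximal run of
# contiguous invalid characters in +str+ as a binary string.
def invalid_runs(str)
  runs = []
  current = nil
  str.chars.each do |c|
    if c.valid_encoding?
      current = nil      # a valid character ends the run in progress
    elsif current
      current << c.b     # extend the run in progress
    else
      current = c.b      # start a new run
      runs << current
    end
  end
  runs
end

# Input taken from the borrowed MRI test case.
input = "\x61\xF1\x80\x80\xE1\x80\xC2\x62\x80\x63\x80\xBF\x64".dup.force_encoding("UTF-8")
runs  = invalid_runs(input)

puts runs.length                                  # 3
puts runs.map { |r| r.unpack('H*')[0] }.inspect   # ["f18080e180c2", "80", "80bf"]
```

Under the combining rule this input gets three replacements, whereas the original MRI assertion quoted in the borrowed test file expects six, so a block passed to `#scrub` would see differently sized chunks on the two implementations.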
data/Rakefile
ADDED
data/benchmark/benchmark.rb
ADDED
@@ -0,0 +1,49 @@
+# Encoding: utf-8
+
+# Just gives us a ballpark. Some issues with this benchmark:
+# * our strings might not be representative of real work
+# * we're testing against static class method, not actual monkey patch, which
+#   would have one more method call, which may or may not matter.
+
+require 'benchmark'
+
+# for MRI 2.0, let's load the C scrub gem
+begin
+  require 'string/scrub'
+rescue LoadError
+  puts "(Could not load scrub gem C backfill)"
+end
+
+require 'scrub_rb'
+
+test_strings = [
+  "abc\u3042\x81",
+  "good string",
+  "abc\u3042\xE3\x80",
+  "another good string",
+  "M\xE9xico",
+  "More good string"
+]
+
+n = 10000
+Benchmark.bmbm do |x|
+  x.report("built-in") do
+    n.times do
+      test_strings.each do |str|
+        str.scrub
+        str.scrub("*")
+        str.scrub {|bytes| '<'+bytes.unpack('H*')[0]+'>'}
+      end
+    end
+  end
+
+  x.report("ScrubRb") do
+    n.times do
+      test_strings.each do |str|
+        ScrubRb.scrub(str)
+        ScrubRb.scrub(str, "*")
+        ScrubRb.scrub(str) {|bytes| '<'+bytes.unpack('H*')[0]+'>'}
+      end
+    end
+  end
+end
data/lib/scrub_rb.rb
ADDED
@@ -0,0 +1,69 @@
+require "scrub_rb/version"
+
+module ScrubRb
+
+  # static function implementation of String#scrub, where
+  # first arg is the string.
+  #
+  #   ScrubRb.scrub("abc\u3042\x81") #=> "abc\u3042\uFFFD"
+  #   ScrubRb.scrub("abc\u3042\x81", "*") #=> "abc\u3042*"
+  #   ScrubRb.scrub("abc\u3042\xE3\x80") {|bytes| '<'+bytes.unpack('H*')[0]+'>' } #=> "abc\u3042<e380>"
+  def self.scrub(str, replacement=nil, &block)
+    return str if str.nil?
+
+    if replacement.nil? && ! block_given?
+      replacement =
+        # UTF-8 for unicode replacement char \uFFFD, encode in
+        # encoding of input string, using '?' as a fallback where
+        # it can't be (which should be non-unicode encodings)
+        "\xEF\xBF\xBD".force_encoding("UTF-8").encode( str.encoding,
+                                                       :undef => :replace,
+                                                       :replace => '?' )
+    end
+
+    result = ""
+    bad_chars = ""
+    bad_char_flag = false # weirdly, using a flag is an optimization
+
+    str.chars.each do |c|
+      if c.valid_encoding?
+        if bad_char_flag
+          scrub_replace(result, bad_chars, replacement, block)
+          bad_char_flag = false
+        end
+        result << c
+      else
+        bad_char_flag = true
+        bad_chars << c
+      end
+    end
+    if bad_char_flag
+      scrub_replace(result, bad_chars, replacement, block)
+    end
+
+    return result
+  end
+
+  # internal helper, hidden via private_class_method below
+  def self.scrub_replace(result, bad_chars, replacement, block)
+    if block
+      r = block.call(bad_chars)
+    else
+      r = replacement
+    end
+
+    if r.respond_to?(:to_str)
+      r = r.to_str
+    else
+      raise TypeError, "no implicit conversion of #{r.class} into String"
+    end
+
+    unless r.valid_encoding?
+      raise ArgumentError, "replacement must be valid byte sequence '#{r}'"
+    end
+
+    result << r
+    bad_chars.clear
+  end
+  private_class_method :scrub_replace
+end
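The default-replacement step in `ScrubRb.scrub` can be tried in isolation. This is a minimal sketch, not code from the gem: `default_replacement` is a hypothetical name, and it starts from the `"\uFFFD"` literal rather than the raw UTF-8 bytes, which should be equivalent. It shows U+FFFD being transcoded into the input's encoding, with `'?'` substituted where the target encoding has no such character.

```ruby
# Hypothetical helper mirroring the default-replacement step: transcode
# U+FFFD into +encoding+, using '?' where that character is undefined.
def default_replacement(encoding)
  "\uFFFD".encode(encoding, :undef => :replace, :replace => "?")
end

p default_replacement(Encoding::UTF_8)           # U+FFFD itself
p default_replacement(Encoding::UTF_16LE).bytes  # U+FFFD as little-endian UTF-16
p default_replacement(Encoding::US_ASCII)        # falls back to "?"
```

This is why the gem's own test for input tagged 'ascii' expects `"M?xico"` while the UTF-8 tests expect `"M\uFFFDxico"`.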
data/lib/scrub_rb/monkey_patch.rb
ADDED
@@ -0,0 +1,20 @@
+# Have to explicitly require this file to get the monkey
+# patching of String#scrub in there, this file won't and shouldn't
+# be 'require'd in automatically.
+#
+# However if there's already a String#scrub defined, requiring
+# this file will do nothing.
+
+class String
+  # Only monkey patch if not already defined....
+  unless instance_methods.include? :scrub
+    def scrub(replacement=nil, &block)
+      ScrubRb.scrub(self, replacement, &block)
+    end
+
+    # forward the block too, so the block form of scrub! works
+    def scrub!(replacement=nil, &block)
+      self.replace( self.scrub(replacement, &block) )
+    end
+  end
+
+end
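The guard used in the monkey patch, defining a method only when the class does not already respond to it, can be sketched on its own. Everything here is illustrative and assumed: `define_unless_present` and the `Legacy` class are hypothetical names, with `Legacy` standing in for a String class that lacks `#scrub`.

```ruby
# Hypothetical helper: define +name+ on +klass+ only when it is missing.
# Returns true if the method was added, false if one already existed.
def define_unless_present(klass, name, &body)
  return false if klass.method_defined?(name)
  klass.send(:define_method, name, &body)
  true
end

class Legacy; end   # stands in for a String without #scrub

first  = define_unless_present(Legacy, :scrub) { :polyfill }
second = define_unless_present(Legacy, :scrub) { :overwritten }

p first               # true: the polyfill was installed
p second              # false: the existing method was left alone
p Legacy.new.scrub    # :polyfill
```

The second call is the case that matters on ruby 2.1 and later: an existing `#scrub` is never replaced, so requiring the patch there is harmless.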
data/scrub_rb.gemspec
ADDED
@@ -0,0 +1,23 @@
+# coding: utf-8
+lib = File.expand_path('../lib', __FILE__)
+$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+require 'scrub_rb/version'
+
+Gem::Specification.new do |spec|
+  spec.name          = "scrub_rb"
+  spec.version       = ScrubRb::VERSION
+  spec.authors       = ["Jonathan Rochkind"]
+  spec.email         = ["jonathan@dnil.net"]
+  spec.summary       = %q{Pure-ruby polyfill of MRI 2.1 String#scrub, for ruby 1.9 and 2.0 any interpreter
+}
+  spec.homepage      = "https://github.com/jrochkind/scrub_rb"
+  spec.license       = "MIT"
+
+  spec.files         = `git ls-files`.split($/)
+  spec.executables   = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
+  spec.test_files    = spec.files.grep(%r{^(test|spec|features)/})
+  spec.require_paths = ["lib"]
+
+  spec.add_development_dependency "bundler", "~> 1.3"
+  spec.add_development_dependency "rake"
+end
data/test/borrowed_string_scrub_test.rb
ADDED
@@ -0,0 +1,116 @@
+# coding: US-ASCII
+
+# This whole file borrowed from string-scrub:
+# https://raw.github.com/hsbt/string-scrub/master/test/test_scrub.rb
+# Actually adapted originally from MRI test suite:
+# https://github.com/ruby/ruby/blob/3ac0ec4ecdea849143ed64e8935e6675b341e44b/test/ruby/test_m17n.rb#L1505
+# We want to make sure we pass the same tests.
+#
+# NOTE: We've had to change some tests of multiple contiguous illegal
+# bytes to match scrub_rb behavior.
+# See the README section on the discrepancy with MRI 2.1; search this
+# source for 'SKIPPED' and 'OWN'
+
+require 'scrub_rb'
+require 'scrub_rb/monkey_patch'
+require 'test/unit'
+
+
+class BorrowedStringScrubTest < Test::Unit::TestCase
+  module AESU
+    def ua(str) str.dup.force_encoding("US-ASCII") end
+    def a(str) str.dup.force_encoding("ASCII-8BIT") end
+    def e(str) str.dup.force_encoding("EUC-JP") end
+    def s(str) str.dup.force_encoding("Windows-31J") end
+    def u(str) str.dup.force_encoding("UTF-8") end
+  end
+  include AESU
+
+  def test_scrub
+    str = "\u3042\u3044"
+    assert_not_same(str, str.scrub)
+    str.force_encoding(Encoding::ISO_2022_JP) # dummy encoding
+    assert_not_same(str, str.scrub)
+
+    # SKIPPED, discrepancy
+    #assert_equal("\uFFFD\uFFFD\uFFFD", u("\x80\x80\x80").scrub)
+    # OWN equivalent
+    assert_equal("\uFFFD", u("\x80\x80\x80").scrub)
+
+    #assert_equal("\uFFFDA", u("\xF4\x80\x80A").scrub)
+
+    # examples in Unicode 6.1.0 D93b
+    # SKIPPED, discrepancy
+    #assert_equal("\x41\uFFFD\uFFFD\x41\uFFFD\x41",
+    #             u("\x41\xC0\xAF\x41\xF4\x80\x80\x41").scrub)
+    # OWN equivalent
+    assert_equal("\x41\uFFFD\x41\uFFFD\x41",
+                 u("\x41\xC0\xAF\x41\xF4\x80\x80\x41").scrub)
+
+    # SKIPPED, discrepancy
+    #assert_equal("\x41\uFFFD\uFFFD\uFFFD\x41",
+    #             u("\x41\xE0\x9F\x80\x41").scrub)
+    # OWN equivalent
+    assert_equal("\x41\uFFFD\x41",
+                 u("\x41\xE0\x9F\x80\x41").scrub)
+
+    # SKIPPED, discrepancy
+    #assert_equal("\u0061\uFFFD\uFFFD\uFFFD\u0062\uFFFD\u0063\uFFFD\uFFFD\u0064",
+    #             u("\x61\xF1\x80\x80\xE1\x80\xC2\x62\x80\x63\x80\xBF\x64").scrub)
+    # OWN equivalent
+    assert_equal("\u0061\uFFFD\u0062\uFFFD\u0063\uFFFD\u0064",
+                 u("\x61\xF1\x80\x80\xE1\x80\xC2\x62\x80\x63\x80\xBF\x64").scrub)
+    # SKIPPED, discrepancy
+    #assert_equal("abcdefghijklmnopqrstuvwxyz\u0061\uFFFD\uFFFD\uFFFD\u0062\uFFFD\u0063\uFFFD\uFFFD\u0064",
+    #             u("abcdefghijklmnopqrstuvwxyz\x61\xF1\x80\x80\xE1\x80\xC2\x62\x80\x63\x80\xBF\x64").scrub)
+    # OWN equivalent
+    assert_equal("abcdefghijklmnopqrstuvwxyz\u0061\uFFFD\u0062\uFFFD\u0063\uFFFD\u0064",
+                 u("abcdefghijklmnopqrstuvwxyz\x61\xF1\x80\x80\xE1\x80\xC2\x62\x80\x63\x80\xBF\x64").scrub)
+
+
+    assert_equal("\u3042\u3013", u("\xE3\x81\x82\xE3\x81").scrub("\u3013"))
+    assert_raise(Encoding::CompatibilityError){ u("\xE3\x81\x82\xE3\x81").scrub(e("\xA4\xA2")) }
+    assert_raise(TypeError){ u("\xE3\x81\x82\xE3\x81").scrub(1) }
+    assert_raise(ArgumentError){ u("\xE3\x81\x82\xE3\x81\x82\xE3\x81").scrub(u("\x81")) }
+    assert_equal(e("\xA4\xA2\xA2\xAE"), e("\xA4\xA2\xA4").scrub(e("\xA2\xAE")))
+
+    assert_equal("\u3042<e381>", u("\xE3\x81\x82\xE3\x81").scrub{|x|'<'+x.unpack('H*')[0]+'>'})
+    assert_raise(Encoding::CompatibilityError){ u("\xE3\x81\x82\xE3\x81").scrub{e("\xA4\xA2")} }
+
+
+    assert_raise(TypeError){ u("\xE3\x81\x82\xE3\x81").scrub{1} }
+    assert_raise(ArgumentError){ u("\xE3\x81\x82\xE3\x81\x82\xE3\x81").scrub{u("\x81")} }
+    assert_equal(e("\xA4\xA2\xA2\xAE"), e("\xA4\xA2\xA4").scrub{e("\xA2\xAE")})
+
+    assert_equal(u("\x81"), u("a\x81").scrub {|c| break c})
+    assert_raise(ArgumentError) {u("a\x81").scrub {|c| c}}
+
+    assert_equal("\uFFFD\u3042".encode("UTF-16BE"),
+                 "\xD8\x00\x30\x42".force_encoding(Encoding::UTF_16BE).
+                 scrub)
+    assert_equal("\uFFFD\u3042".encode("UTF-16LE"),
+                 "\x00\xD8\x42\x30".force_encoding(Encoding::UTF_16LE).
+                 scrub)
+    assert_equal("\uFFFD".encode("UTF-32BE"),
+                 "\xff".force_encoding(Encoding::UTF_32BE).
+                 scrub)
+    assert_equal("\uFFFD".encode("UTF-32LE"),
+                 "\xff".force_encoding(Encoding::UTF_32LE).
+                 scrub)
+  end
+
+  def test_scrub_bang
+    str = "\u3042\u3044"
+    assert_same(str, str.scrub!)
+    str.force_encoding(Encoding::ISO_2022_JP) # dummy encoding
+    assert_same(str, str.scrub!)
+
+    str = u("\x80\x80\x80")
+    str.scrub!
+    assert_same(str, str.scrub!)
+    # SKIPPED, discrepancy
+    #assert_equal("\uFFFD\uFFFD\uFFFD", str)
+    # OWN, single replacement
+    assert_equal("\uFFFD", str)
+  end
+end
data/test/monkey_patch_test.rb
ADDED
@@ -0,0 +1,35 @@
+# Encoding: utf-8
+
+require 'minitest/spec'
+require 'minitest/autorun'
+
+require 'scrub_rb'
+
+# Going to require the monkey-patch, which will end up
+# monkey-patching String for entire program execution, don't
+# know any way to monkey patch just for this test, sorry.
+
+require 'scrub_rb/monkey_patch'
+
+describe "Monkey-patched String#scrub does same thing as ScrubRb.scrub" do
+  it "abc\\u3042\\x81" do
+    "abc\u3042\x81".scrub.must_equal ScrubRb.scrub("abc\u3042\x81")
+  end
+
+  it "abc\\u3042\\x81, *" do
+    "abc\u3042\x81".scrub("*").must_equal ScrubRb.scrub("abc\u3042\x81", "*")
+  end
+
+  it "abc\\u3042\\xE3\\x80 with block" do
+    block = lambda do |bytes|
+      '<'+bytes.unpack('H*')[0]+'>'
+    end
+
+    "abc\u3042\xE3\x80".scrub(&block).must_equal ScrubRb.scrub("abc\u3042\xE3\x80", &block)
+  end
+
+  it "no bad bytes" do
+    "no bad bytes".scrub.must_equal ScrubRb.scrub("no bad bytes")
+  end
+
+end
data/test/scrub_test.rb
ADDED
@@ -0,0 +1,88 @@
+# Encoding: UTF-8
+
+require 'minitest/spec'
+require 'minitest/autorun'
+
+require 'scrub_rb'
+
+describe "ScrubRb" do
+  describe "examples from ruby 2.1 String#scrub" do
+    it '"abc\u3042\x81".scrub #=> "abc\u3042\uFFFD"' do
+      ScrubRb.scrub("abc\u3042\x81").must_equal("abc\u3042\uFFFD")
+    end
+
+    it '"abc\u3042\x81".scrub("*") #=> "abc\u3042*"' do
+      ScrubRb.scrub("abc\u3042\x81", "*").must_equal("abc\u3042*")
+    end
+
+    it 'block' do
+      ScrubRb.scrub("abc\u3042\xE3\x80") do |bytes|
+        '<'+bytes.unpack('H*')[0]+'>'
+      end.must_equal("abc\u3042<e380>")
+    end
+  end
+
+  # Things investigated in ruby 2.1 String#scrub to make sure
+  # we're doing the same things.
+  describe "compatible with ruby 2.1 String#scrub edge cases" do
+    it "returns copy even on legal string" do
+      original = "perfectly legal"
+      scrubbed = ScrubRb.scrub(original)
+
+      # not identity
+      refute scrubbed.equal? original
+      # yes equality
+      assert_equal original, scrubbed
+    end
+    it "collapses multiple bad bytes into one replacement" do
+      ScrubRb.scrub("abc\u3042\xE3\x80").must_equal("abc\u3042\uFFFD")
+    end
+  end
+
+
+  before do
+    @bad_bytes_utf8  = "M\xE9xico".force_encoding("UTF-8")
+    @bad_bytes_utf16 = "M\x00\xDFxico".force_encoding( Encoding::UTF_16LE )
+    @bad_bytes_ascii = "M\xA1xico".force_encoding("ASCII")
+  end
+
+
+  it "replaces with unicode replacement string" do
+    scrubbed = ScrubRb.scrub(@bad_bytes_utf8)
+
+    assert scrubbed.valid_encoding?
+    assert_equal scrubbed, "M\uFFFDxico"
+  end
+
+  it "replaces with chosen replacement string" do
+    ScrubRb.scrub(@bad_bytes_utf8, "*").must_equal("M*xico")
+  end
+
+  it "replaces with empty string" do
+    ScrubRb.scrub(@bad_bytes_utf8, '').must_equal("Mxico")
+  end
+
+
+  it "replaces non-unicode encoding with ? replacement str" do
+    if RUBY_PLATFORM == "java"
+      skip("known not to pass on JRuby, reported to JRuby github #1361")
+    end
+    ScrubRb.scrub(@bad_bytes_ascii).must_equal("M?xico")
+  end
+
+
+  it "works with first byte bad" do
+    str = "\xE9xico".force_encoding("UTF-8")
+    ScrubRb.scrub(str, "?").must_equal("?xico")
+  end
+
+  it "works with last byte bad" do
+    str = "Mexico\xE9".force_encoding("UTF-8")
+    ScrubRb.scrub(str, "?").must_equal("Mexico?")
+  end
+
+  it "works with nil input" do
+    ScrubRb.scrub(nil).must_be_nil
+  end
+
+end
metadata
ADDED
@@ -0,0 +1,89 @@
+--- !ruby/object:Gem::Specification
+name: scrub_rb
+version: !ruby/object:Gem::Version
+  version: 0.1.0
+platform: ruby
+authors:
+- Jonathan Rochkind
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2013-12-26 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: bundler
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '1.3'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '1.3'
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+description:
+email:
+- jonathan@dnil.net
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- .gitignore
+- .travis.yml
+- Gemfile
+- LICENSE.txt
+- README.md
+- Rakefile
+- benchmark/benchmark.rb
+- lib/scrub_rb.rb
+- lib/scrub_rb/monkey_patch.rb
+- lib/scrub_rb/version.rb
+- scrub_rb.gemspec
+- test/borrowed_string_scrub_test.rb
+- test/monkey_patch_test.rb
+- test/scrub_test.rb
+homepage: https://github.com/jrochkind/scrub_rb
+licenses:
+- MIT
+metadata: {}
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 2.0.3
+signing_key:
+specification_version: 4
+summary: Pure-ruby polyfill of MRI 2.1 String#scrub, for ruby 1.9 and 2.0 any interpreter
+test_files:
+- test/borrowed_string_scrub_test.rb
+- test/monkey_patch_test.rb
+- test/scrub_test.rb