ensure-encoding 0.1
- data/LICENSE +18 -0
- data/README.rdoc +95 -0
- data/lib/ensure.rb +3 -0
- data/lib/ensure/encoding.rb +132 -0
- metadata +61 -0
data/LICENSE
ADDED
@@ -0,0 +1,18 @@
+Copyright (c) 2009 Manfred Stienstra, Fingertips <manfred@fngtps.com>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of
+this software and associated documentation files (the "Software"), to deal in
+the Software without restriction, including without limitation the rights to
+use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
+the Software, and to permit persons to whom the Software is furnished to do so,
+subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
+FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
+COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
+IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.rdoc
ADDED
@@ -0,0 +1,95 @@
+= Ensure Encoding
+
+Experimental project to find the best way to ensure a preferred encoding in
+Strings coming from untrusted sources.
+
+== Algorithms
+
+The most sane way of dealing with character data is choosing an internal
+representation for your application and converting all incoming and outgoing
+data when necessary.
+
+To ensure that our internal encoding is always at least valid we can choose to
+do one of two things.
+
+1. Throw an exception when we receive invalid character data. (i.e. our internal
+   encoding is UTF-8 and we receive Latin-1 data or invalid UTF-8)
+2. Accept whatever we get and try to mold it in such a way that it becomes
+   usable in our application.
+
+It is generally accepted to go for the second option. Most of the time the
+end-user has no way of solving these problems because a vendor made a mistake.
+It's not very nice to shut them out.
+
+There are a number of techniques for molding the character data to our needs.
+Sniffing encodings, transcoding, and dropping invalid characters are just a few
+examples. We implement a number of these techniques so you can easily protect
+your application from bad data.
+
+== Ensure encoding
+
+We've crammed a lot of functionality into the ensure_encoding method because
+we want to keep the number of new methods on String to a minimum. We'll walk
+through an example to show how it works.
+
+  example = 'Café'
+  example.encoding => #<Encoding:ISO-8859-1>
+
+After ensuring the encoding of a string you can at least assume that you can
+concatenate the string to another string with the same encoding. In other
+words, it contains data valid for the specified encoding.
+
+  example.ensure_encoding!('UTF-8')
+  example.encoding => #<Encoding:UTF-8>
+
+Beyond this you can specify a number of options to perform more operations
+to make sure the data in the string didn't become unreadable garbage. Let's
+look at a number of situations.
+
+=== Untrusted source with known encoding
+
+For instance, when you're accepting data from browsers you can be pretty sure
+the character data is properly encoded. When someone does send bad data it's
+probably a hacker and you can discard the request.
+
+  example.ensure_encoding('UTF-8',
+    :external_encoding => 'UTF-8',
+    :invalid_characters => :raise
+  )
+
+=== Friendly source with known encoding
+
+In this scenario you're connecting to a web API through a REST library and you
+know the encoding of the source data because it's in the headers. However,
+you're not sure the encoding of the strings is valid.
+
+  example.ensure_encoding(Encoding::UTF_8,
+    :external_encoding => Encoding::UTF_8,
+    :invalid_characters => :drop
+  )
+
+=== Untrusted source with variable encoding
+
+Assume we have a legacy database and some of the fields contain Shift JIS,
+while some of the newer fields contain UTF-8 because someone screwed up the
+server configuration. You're not even sure the encoding property on the
+strings you got makes any sense because your database adapter is confused too.
+
+  example.ensure_encoding('UTF-8',
+    :external_encoding => [Encoding::Shift_JIS, Encoding::UTF_8],
+    :invalid_characters => :transcode
+  )
+
+=== Untrusted source with unknown encoding
+
+As a last resort you're trying to read some random files from disk and you
+have no idea what the external encoding is. You've just read them as binary
+and are hoping to make some sense from the data.
+
+  example.ensure_encoding('UTF-8',
+    :external_encoding => :sniff,
+    :invalid_characters => :transcode
+  )
+
+Note that the encoding sniffer is currently very naive and might not always be
+of any help.
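The README scenarios above come down to three operations: reject invalid bytes, drop them, or pick a source encoding from a list and transcode. The following is a minimal sketch of those operations using only core String methods, so it runs without the gem. The helper names (`strict_utf8`, `drop_invalid`, `transcode_from_first_valid`) are hypothetical and not part of ensure-encoding, and `String#scrub` assumes Ruby 2.1 or newer.

```ruby
# Hypothetical helpers, not part of ensure-encoding. They only use core
# String methods; String#scrub assumes Ruby 2.1 or newer.

# "Untrusted source with known encoding": reject invalid data outright.
def strict_utf8(bytes)
  string = bytes.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, 'input is not valid UTF-8' unless string.valid_encoding?
  string
end

# "Friendly source with known encoding": keep what is valid, drop the rest.
def drop_invalid(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).scrub('')
end

# "Untrusted source with variable encoding": try each candidate in order and
# transcode from the first one in which the bytes form a valid sequence.
# Returns nil when no candidate fits.
def transcode_from_first_valid(bytes, candidates, target = Encoding::UTF_8)
  source = candidates.find do |candidate|
    bytes.dup.force_encoding(candidate).valid_encoding?
  end
  return nil unless source
  bytes.dup.force_encoding(source).encode(target)
end

# "Café" encoded as ISO-8859-1, held as raw bytes.
latin1_cafe = "Caf\xE9".b

drop_invalid(latin1_cafe)
transcode_from_first_valid(latin1_cafe, [Encoding::UTF_8, Encoding::ISO_8859_1])
```

Candidate order matters in the last helper: a single-byte encoding such as ISO-8859-1 accepts any byte sequence, so it is best listed last.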
data/lib/ensure/encoding.rb
ADDED
@@ -0,0 +1,132 @@
+# encoding: utf-8
+
+module Ensure
+  module Encoding
+    BYTE_ORDER_MARKS = {
+      ::Encoding::UTF_16BE => [0xfe, 0xff],
+      ::Encoding::UTF_16LE => [0xff, 0xfe],
+      ::Encoding::UTF_8    => [0xef, 0xbb, 0xbf]
+    }
+
+    # Tries to guess the encoding of the string and returns the most likely
+    # encoding.
+    def self.sniff_encoding(string)
+      first_bytes = string.unpack('C3')
+      BYTE_ORDER_MARKS.each do |encoding, bytes|
+        if first_bytes[0...bytes.length] == bytes
+          return encoding
+        end
+      end
+      ::Encoding::UTF_8
+    end
+
+    # Checks the encodings in +guesses+ from front to back and returns the
+    # first encoding in which the character data is a valid sequence.
+    def self.guess_encoding(string, guesses)
+      original_encoding = string.encoding
+      guessed_encoding = nil
+
+      guesses.each do |guess|
+        string.force_encoding(guess)
+        if string.valid_encoding?
+          guessed_encoding = string.encoding
+          break
+        end
+      end
+
+      string.force_encoding(original_encoding)
+      guessed_encoding
+    end
+
+    # Forces the encoding of +string+ to +target_encoding+ using a number
+    # of smart tricks. See String#ensure_encoding for more details.
+    def self.force_encoding(string, target_encoding, options={})
+      target_string = string.dup
+      force_encoding!(target_string, target_encoding, options)
+      target_string
+    end
+
+    # Performs just like +force_encoding+, only it changes the string
+    # in place instead of returning it.
+    def self.force_encoding!(string, target_encoding, options={})
+      if options[:external_encoding] == :sniff
+        external_encoding = sniff_encoding(string)
+      else
+        external_encoding = options[:external_encoding] || [target_encoding, string.encoding]
+      end
+
+      if external_encoding.respond_to?(:each)
+        external_encoding = guess_encoding(string, external_encoding) || target_encoding
+      end
+
+      if options[:invalid_characters] == :raise
+        string.force_encoding(target_encoding)
+        raise ::Encoding::InvalidByteSequenceError, "String is not encoded as `#{target_encoding}'" unless string.valid_encoding?
+      else
+        filters = (options[:invalid_characters] == :drop) ? { :replace => '', :undef => :replace, :invalid => :replace } : {}
+        string.encode!(target_encoding, external_encoding, filters)
+      end
+    end
+
+    module String
+      # Ensures the character encoding in a string. It employs a number of
+      # techniques to detect and transcode characters to make sure they end
+      # up in a usable form in the encoding you need.
+      #
+      # == Arguments
+      #
+      # +target_encoding+
+      #   The character encoding you want to ensure; this is usually the
+      #   internal encoding of your application. Accepts both string
+      #   constants and encoding constants. (i.e. 'UTF-8' or Encoding::UTF_8)
+      # +options+
+      #   Options to activate certain algorithms.
+      #
+      # === Options
+      #
+      # :external_encoding
+      #   Specifies both your certainty about the external encoding and what
+      #   you think it might be. Valid options are :sniff, an array of
+      #   encodings, or a single encoding. When you specify :sniff, we will
+      #   sniff around in the data to guess which encoding it is. When you
+      #   supply a list of possible encodings we will check them from beginning
+      #   to end to see if one of them matches the data. Finally, when you specify
+      #   a specific encoding we assume you know which it is and we will use
+      #   that. By default we use :external_encoding => [target_encoding,
+      #   self.encoding].
+      # :invalid_characters
+      #   Specifies what to do with invalid characters. There are three valid
+      #   values: :raise, :drop, and :transcode. The first raises an exception
+      #   on an invalid character. The second will strip all invalid characters.
+      #   The last will try to transcode them to the wanted encoding. By default
+      #   we transcode.
+      #
+      # == Example
+      #
+      #   response = REST.get('http://www.google.com')
+      #
+      #   if match = /charset=([^;]*)/.match(response.content_type)
+      #     encoding = match[1]
+      #   else
+      #     encoding = 'UTF-8'
+      #   end
+      #
+      #   body = response.body.ensure_encoding('UTF-8',
+      #     :external_encoding => encoding,
+      #     :invalid_characters => :drop)
+      def ensure_encoding(target_encoding, options={})
+        Ensure::Encoding.force_encoding(self, target_encoding, options)
+      end
+
+      # Performs just like String#ensure_encoding, only it changes the string
+      # in place instead of returning it.
+      def ensure_encoding!(target_encoding, options={})
+        Ensure::Encoding.force_encoding!(self, target_encoding, options)
+      end
+    end
+  end
+end
+
+class String
+  include Ensure::Encoding::String
+end
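The sniffer in this hunk uses the byte order mark only to name an encoding; the mark itself stays in the string, so transcoded text may carry it along as a leading U+FEFF character. The sketch below shows one way to separate the mark from the payload before transcoding. It assumes a hypothetical helper `split_bom`, which is not part of the gem, and copies the byte values from `BYTE_ORDER_MARKS` above.

```ruby
# Hypothetical helper, not part of ensure-encoding. The table mirrors
# BYTE_ORDER_MARKS from the diff above.
#
# Returns [encoding, payload]: the encoding named by a leading byte order
# mark (nil when there is none) and the bytes that follow the mark.
def split_bom(bytes)
  marks = {
    Encoding::UTF_16BE => [0xfe, 0xff],
    Encoding::UTF_16LE => [0xff, 0xfe],
    Encoding::UTF_8    => [0xef, 0xbb, 0xbf]
  }

  marks.each do |encoding, mark|
    if bytes.byteslice(0, mark.length).bytes == mark
      return [encoding, bytes.byteslice(mark.length, bytes.bytesize)]
    end
  end

  [nil, bytes]
end

# UTF-16LE byte order mark followed by "hi", held as raw bytes.
raw = "\xFF\xFEh\x00i\x00".b

encoding, payload = split_bom(raw)

# Fall back to UTF-8 when no mark was found, as sniff_encoding does.
text = payload.dup.force_encoding(encoding || Encoding::UTF_8).encode(Encoding::UTF_8)
```

Returning `nil` instead of a default keeps "no mark found" distinguishable from "UTF-8 mark found", which the caller can then handle as it sees fit.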
metadata
ADDED
@@ -0,0 +1,61 @@
+--- !ruby/object:Gem::Specification
+name: ensure-encoding
+version: !ruby/object:Gem::Version
+  version: "0.1"
+platform: ruby
+authors:
+- Manfred Stienstra
+autorequire:
+bindir: bin
+cert_chain: []
+
+date: 2010-01-11 00:00:00 +01:00
+default_executable:
+dependencies: []
+
+description: " Ensure the character encoding in Strings coming from untrusted sources.\n"
+email: manfred@fngtps.com
+executables: []
+
+extensions: []
+
+extra_rdoc_files:
+- README.rdoc
+- LICENSE
+files:
+- README.rdoc
+- LICENSE
+- lib/ensure/encoding.rb
+- lib/ensure.rb
+has_rdoc: true
+homepage:
+licenses: []
+
+post_install_message:
+rdoc_options:
+- --all
+- --charset
+- utf-8
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: "0"
+  version:
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: "0"
+  version:
+requirements: []
+
+rubyforge_project:
+rubygems_version: 1.3.5
+signing_key:
+specification_version: 3
+summary: Ensure the character encoding in Strings coming from untrusted sources.
+test_files: []
+