ensure-encoding 0.1
- data/LICENSE +18 -0
- data/README.rdoc +95 -0
- data/lib/ensure.rb +3 -0
- data/lib/ensure/encoding.rb +132 -0
- metadata +61 -0
data/LICENSE
ADDED
@@ -0,0 +1,18 @@
+Copyright (c) 2009 Manfred Stienstra, Fingertips <manfred@fngtps.com>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of
+this software and associated documentation files (the "Software"), to deal in
+the Software without restriction, including without limitation the rights to
+use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
+the Software, and to permit persons to whom the Software is furnished to do so,
+subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
+FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
+COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
+IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.rdoc
ADDED
@@ -0,0 +1,95 @@
+= Ensure Encoding
+
+Experimental project to find the best way to ensure a preferred encoding in
+Strings coming from untrusted sources.
+
+== Algorithms
+
+The most sane way of dealing with character data is choosing an internal
+representation for your application and converting all incoming and outgoing data
+when necessary.
+
+To ensure that our internal encoding is always at least valid we can choose to
+do one of two things.
+
+1. Throw an exception when we receive invalid character data. (i.e. our internal
+   encoding is UTF-8 and we receive Latin-1 data or invalid UTF-8)
+2. Accept whatever we get and try to mold it in such a way that it becomes
+   usable in our application.
+
+It is generally accepted to go for the second option; most of the time the
+end user has no way of solving these problems because a vendor made a mistake.
+It's not very nice to shut them out.
+
+There are a number of techniques for molding the character data to our needs.
+Sniffing the encoding, transcoding, and dropping invalid characters are just a few
+examples. We implement a number of these techniques so you can easily protect
+your application from bad data.
+
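The two options listed under Algorithms can be sketched in plain Ruby, without this gem. The helper names `strict_utf8` and `lenient_utf8` are hypothetical, and `String#scrub` (Ruby 2.1 and later) is only a stand-in for whatever molding the library itself performs:

```ruby
# Option 1: refuse invalid data by raising.
# (strict_utf8 is a hypothetical helper, not part of the gem.)
def strict_utf8(bytes)
  string = bytes.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "invalid UTF-8 byte sequence" unless string.valid_encoding?
  string
end

# Option 2: accept the data and replace every invalid byte with U+FFFD.
# (lenient_utf8 is a hypothetical helper, not part of the gem.)
def lenient_utf8(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).scrub
end

good = "Caf\xC3\xA9".b  # "Café" as UTF-8 bytes
bad  = "Caf\xE9".b      # "Café" as Latin-1 bytes, which is invalid UTF-8

puts strict_utf8(good)
puts lenient_utf8(bad)
```

The strict variant suits input that should never be malformed; the lenient one keeps the request alive at the cost of a visible replacement character.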
+== Ensure encoding
+
+We've crammed a lot of functionality into the ensure_encoding method because
+we want to keep the number of new methods on String to a minimum. We'll walk
+through an example to show how it works.
+
+  example = 'Café'
+  example.encoding # => #<Encoding:ISO-8859-1>
+
+After ensuring the encoding of a string you can at least assume that you can
+concatenate the string to another string with the same encoding. In other
+words, it contains data valid for the specified encoding.
+
+  example = example.ensure_encoding('UTF-8')
+  example.encoding # => #<Encoding:UTF-8>
+
+Beyond this you can specify a number of options to perform more operations
+to make sure the data in the string didn't become unreadable garbage. Let's
+look at a number of situations.
+
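The concatenation guarantee mentioned in this section can be seen with core Ruby alone. This is a minimal sketch that uses `String#encode` rather than `ensure_encoding`, so it only illustrates the problem the method is meant to prevent:

```ruby
utf8   = "Caf\u00E9"                # "Café" in UTF-8
latin1 = utf8.encode("ISO-8859-1")  # same text, different encoding

begin
  utf8 + latin1
rescue Encoding::CompatibilityError => error
  # Two strings with non-ASCII content in different encodings cannot be joined.
  puts "cannot concatenate: #{error.message}"
end

# Once both strings share an encoding the concatenation succeeds.
combined = utf8 + latin1.encode("UTF-8")
puts combined
```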
+=== Untrusted source with known encoding
+
+For instance, when you're accepting data from browsers you can be pretty sure
+the character data is properly encoded. When someone does send bad data it's
+probably a hacker and you can discard the request.
+
+  example.ensure_encoding('UTF-8',
+    :external_encoding => 'UTF-8',
+    :invalid_characters => :raise
+  )
+
+=== Friendly source with known encoding
+
+In this scenario you're connecting to a web API through a REST library and you
+know the encoding of the source data because it's in the headers. However,
+you're not sure the strings are actually valid in that encoding.
+
+  example.ensure_encoding(Encoding::UTF_8,
+    :external_encoding => Encoding::UTF_8,
+    :invalid_characters => :drop
+  )
+
+=== Untrusted source with variable encoding
+
+Assume we have a legacy database and some of the fields contain Shift JIS,
+while some of the newer fields contain UTF-8 because someone screwed up the
+server configuration. You're not even sure the encoding property on the
+strings you got makes any sense because your database adapter is confused too.
+
+  example.ensure_encoding('UTF-8',
+    :external_encoding => [Encoding::Shift_JIS, Encoding::UTF_8],
+    :invalid_characters => :transcode
+  )
+
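Checking a list of candidate encodings, as in the variable-encoding scenario, can be sketched as follows. `first_valid_encoding` is a hypothetical helper written for this note; it mirrors the idea of trying each candidate in order but is not the gem's code. Because byte sequences that are valid UTF-8 can also happen to be valid Shift JIS, the order of the candidates can change the result, and this sketch assumes UTF-8 is tried first:

```ruby
# Returns the first candidate encoding in which +bytes+ form a valid sequence,
# or nil when none fits. Works on copies, so the argument is left untouched.
# (Hypothetical helper for illustration.)
def first_valid_encoding(bytes, candidates)
  candidates.find do |candidate|
    bytes.dup.force_encoding(candidate).valid_encoding?
  end
end

candidates = [Encoding::UTF_8, Encoding::Shift_JIS]

text   = "\u65E5\u672C\u8A9E"                # "日本語"
legacy = text.encode(Encoding::Shift_JIS).b  # the text as Shift JIS bytes
modern = text.b                              # the text as UTF-8 bytes

[legacy, modern].each do |bytes|
  guess = first_valid_encoding(bytes, candidates)
  puts "#{guess}: #{bytes.dup.force_encoding(guess).encode(Encoding::UTF_8)}"
end
```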
+=== Untrusted source with unknown encoding
+
+As a last resort you're trying to read some random files from disk and you
+have no idea what the external encoding is. You've just read them as binary
+and are hoping to make some sense from the data.
+
+  example.ensure_encoding('UTF-8',
+    :external_encoding => :sniff,
+    :invalid_characters => :transcode
+  )
+
+Note that the encoding sniffer is currently very naive and might not always be
+of any help.
data/lib/ensure/encoding.rb
ADDED
@@ -0,0 +1,132 @@
+# encoding: utf-8
+
+module Ensure
+  module Encoding
+    BYTE_ORDER_MARKS = {
+      ::Encoding::UTF_16BE => [0xfe, 0xff],
+      ::Encoding::UTF_16LE => [0xff, 0xfe],
+      ::Encoding::UTF_8    => [0xef, 0xbb, 0xbf]
+    }
+
+    # Tries to guess the encoding of the string and returns the most likely
+    # encoding.
+    def self.sniff_encoding(string)
+      first_bytes = string.unpack('C3')
+      BYTE_ORDER_MARKS.each do |encoding, bytes|
+        if first_bytes[0...bytes.length] == bytes
+          return encoding
+        end
+      end
+      ::Encoding::UTF_8
+    end
+
+    # Checks the encodings in +guesses+ from front to back and returns the
+    # first encoding in which the character data is a valid sequence.
+    def self.guess_encoding(string, guesses)
+      original_encoding = string.encoding
+      guessed_encoding = nil
+
+      guesses.each do |guess|
+        string.force_encoding(guess)
+        if string.valid_encoding?
+          guessed_encoding = string.encoding
+          break
+        end
+      end
+
+      string.force_encoding(original_encoding)
+      guessed_encoding
+    end
+
+    # Forces the encoding of +string+ to +target_encoding+ using a number
+    # of smart tricks. See String#ensure_encoding for more details.
+    def self.force_encoding(string, target_encoding, options={})
+      target_string = string.dup
+      force_encoding!(target_string, target_encoding, options)
+      target_string
+    end
+
+    # Performs just like +force_encoding+, only it changes the string
+    # in place instead of returning it.
+    def self.force_encoding!(string, target_encoding, options={})
+      if options[:external_encoding] == :sniff
+        external_encoding = sniff_encoding(string)
+      else
+        external_encoding = options[:external_encoding] || [target_encoding, string.encoding]
+      end
+
+      if external_encoding.respond_to?(:each)
+        external_encoding = guess_encoding(string, external_encoding) || target_encoding
+      end
+
+      if options[:invalid_characters] == :raise
+        string.force_encoding(target_encoding)
+        raise ::Encoding::InvalidByteSequenceError, "String is not encoded as `#{target_encoding}'" unless string.valid_encoding?
+      else
+        filters = (options[:invalid_characters] == :drop) ? { :replace => '', :undef => :replace, :invalid => :replace } : {}
+        string.encode!(target_encoding, external_encoding, filters)
+      end
+    end
+
+    module String
+      # Ensures the character encoding in a string. It employs a number of
+      # techniques to detect and transcode characters to make sure they end
+      # up in a usable form in the encoding you need.
+      #
+      # == Arguments
+      #
+      # +target_encoding+
+      #   The character encoding you want to ensure; this is usually the
+      #   internal encoding of your application. Accepts both string
+      #   constants and encoding constants. (i.e. 'UTF-8' or Encoding::UTF_8)
+      # +options+
+      #   Options to activate certain algorithms.
+      #
+      # === Options
+      #
+      # :external_encoding
+      #   Specifies both your certainty about the external encoding and what
+      #   you think it might be. Valid options are :sniff, an array of
+      #   encodings, or a single encoding. When you specify :sniff, we will
+      #   sniff around in the data to guess which encoding it is. When you
+      #   supply a list of possible encodings we will check them from beginning
+      #   to end to see if one of them matches the data. Finally, when you specify
+      #   a specific encoding we assume you know which it is and we will use
+      #   that. By default we use :external_encoding => [target_encoding,
+      #   self.encoding].
+      # :invalid_characters
+      #   Specifies what to do with invalid characters. There are three valid
+      #   values: :raise, :drop, and :transcode. The first raises an exception
+      #   on an invalid character. The second will strip all invalid characters;
+      #   the last will try to transcode them to the wanted encoding. By default
+      #   we transcode.
+      #
+      # == Example
+      #
+      #   response = REST.get('http://www.google.com')
+      #
+      #   if match = /charset=([^;]*)/.match(response.content_type)
+      #     encoding = match[1]
+      #   else
+      #     encoding = 'UTF-8'
+      #   end
+      #
+      #   body = response.body.ensure_encoding('UTF-8',
+      #     :external_encoding => encoding,
+      #     :invalid_characters => :drop)
+      def ensure_encoding(target_encoding, options={})
+        Ensure::Encoding.force_encoding(self, target_encoding, options)
+      end
+
+      # Performs just like String#ensure_encoding, only it changes the string
+      # in place instead of returning it.
+      def ensure_encoding!(target_encoding, options={})
+        Ensure::Encoding.force_encoding!(self, target_encoding, options)
+      end
+    end
+  end
+end
+
+class String
+  include Ensure::Encoding::String
+end
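The `:drop` branch above hands a filter hash to `String#encode!`. The effect of those three options can be tried in isolation with core Ruby. This is a sketch under assumptions: the sample text and the US-ASCII target are chosen only so that some characters have no equivalent, and the hash is splatted with `**` because newer Ruby versions expect keyword options to be passed that way:

```ruby
# The same option hash the :drop branch builds.
drop_filters = { :replace => '', :undef => :replace, :invalid => :replace }

source = "Caf\u00E9 \u2603"  # "Café ☃" in UTF-8

# Neither é nor ☃ exists in US-ASCII, so :undef => :replace swaps each one
# for the :replace string, which is empty. The characters are dropped.
ascii = source.encode('US-ASCII', 'UTF-8', **drop_filters)

puts ascii.inspect
puts ascii.encoding
```

Without the filters the same call would raise `Encoding::UndefinedConversionError`, which is the behavior the default transcode branch leaves in place.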
metadata
ADDED
@@ -0,0 +1,61 @@
+--- !ruby/object:Gem::Specification
+name: ensure-encoding
+version: !ruby/object:Gem::Version
+  version: "0.1"
+platform: ruby
+authors:
+- Manfred Stienstra
+autorequire:
+bindir: bin
+cert_chain: []
+
+date: 2010-01-11 00:00:00 +01:00
+default_executable:
+dependencies: []
+
+description: " Ensure the character encoding in Strings coming from untrusted sources.\n"
+email: manfred@fngtps.com
+executables: []
+
+extensions: []
+
+extra_rdoc_files:
+- README.rdoc
+- LICENSE
+files:
+- README.rdoc
+- LICENSE
+- lib/ensure/encoding.rb
+- lib/ensure.rb
+has_rdoc: true
+homepage:
+licenses: []
+
+post_install_message:
+rdoc_options:
+- --all
+- --charset
+- utf-8
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: "0"
+  version:
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: "0"
+  version:
+requirements: []
+
+rubyforge_project:
+rubygems_version: 1.3.5
+signing_key:
+specification_version: 3
+summary: Ensure the character encoding in Strings coming from untrusted sources.
+test_files: []
+