ensure-encoding 0.1

Files changed (5)
  1. data/LICENSE +18 -0
  2. data/README.rdoc +95 -0
  3. data/lib/ensure.rb +3 -0
  4. data/lib/ensure/encoding.rb +132 -0
  5. metadata +61 -0
data/LICENSE ADDED
@@ -0,0 +1,18 @@
+ Copyright (c) 2009 Manfred Stienstra, Fingertips <manfred@fngtps.com>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy of
+ this software and associated documentation files (the "Software"), to deal in
+ the Software without restriction, including without limitation the rights to
+ use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
+ the Software, and to permit persons to whom the Software is furnished to do so,
+ subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
+ FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
+ COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
+ IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.rdoc ADDED
@@ -0,0 +1,95 @@
+ = Ensure Encoding
+
+ Experimental project to find the best way to ensure a preferred encoding in
+ Strings coming from untrusted sources.
+
+ == Algorithms
+
+ The sanest way of dealing with character data is to choose an internal
+ representation for your application and convert all incoming and outgoing data
+ when necessary.
+
+ To ensure that our internal encoding is always at least valid we can choose to
+ do one of two things.
+
+ 1. Throw an exception when we receive invalid character data (e.g. our internal
+    encoding is UTF-8 and we receive Latin-1 data or invalid UTF-8).
+ 2. Accept whatever we get and try to mold it in such a way that it becomes
+    usable in our application.
+
+ It is generally accepted to go for the second option. Most of the time the
+ problem is a vendor's mistake that the end-user has no way of solving, so
+ it's not very nice to shut them out.
+
+ There are a number of techniques for molding the character data to our needs.
+ Sniffing the encoding, transcoding, and dropping invalid characters are just a
+ few examples. We implement a number of these techniques so you can easily
+ protect your application from bad data.
+
+ == Ensure encoding
+
+ We've crammed a lot of functionality into the ensure_encoding method because
+ we want to keep the number of new methods on String to a minimum. We'll walk
+ through an example to show how it works.
+
+   example = 'Café'
+   example.encoding # => #<Encoding:ISO-8859-1>
+
+ After ensuring the encoding of a string you can at least assume that you can
+ concatenate the string to another string with the same encoding. In other
+ words, it contains data valid for the specified encoding.
+
+   example.ensure_encoding!('UTF-8')
+   example.encoding # => #<Encoding:UTF-8>
+
+ Beyond this you can specify a number of options to perform more operations
+ to make sure the data in the string didn't become unreadable garbage. Let's
+ look at a number of situations.
+
+ === Untrusted source with known encoding
+
+ For instance, when you're accepting data from browsers you can be pretty sure
+ the character data is properly encoded. When someone does send bad data it's
+ probably a hacker and you can discard the request.
+
+   example.ensure_encoding('UTF-8',
+     :external_encoding => 'UTF-8',
+     :invalid_characters => :raise
+   )
+
+ === Friendly source with known encoding
+
+ In this scenario you're connecting to a web API through a REST library and you
+ know the encoding of the source data because it's in the headers. However,
+ you're not sure the strings are actually valid in that encoding.
+
+   example.ensure_encoding(Encoding::UTF_8,
+     :external_encoding => Encoding::UTF_8,
+     :invalid_characters => :drop
+   )
+
+ === Untrusted source with variable encoding
+
+ Assume we have a legacy database and some of the fields contain Shift JIS,
+ while some of the newer fields contain UTF-8 because someone screwed up the
+ server configuration. You're not even sure the encoding property on the
+ strings you got makes any sense because your database adapter is confused too.
+
+   example.ensure_encoding('UTF-8',
+     :external_encoding => [Encoding::Shift_JIS, Encoding::UTF_8],
+     :invalid_characters => :transcode
+   )
+
+ === Untrusted source with unknown encoding
+
+ As a last resort you're trying to read some random files from disk and you
+ have no idea what the external encoding is. You've just read them as binary
+ and are hoping to make some sense of the data.
+
+   example.ensure_encoding('UTF-8',
+     :external_encoding => :sniff,
+     :invalid_characters => :transcode
+   )
+
+ Note that the encoding sniffer is currently very naive and might not always be
+ of any help.
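
The README's two strategies (raise on invalid data, or mold it into shape) can be sketched with Ruby's core String methods alone. This is a minimal sketch and not the gem's implementation: it assumes Ruby 2.1 or later (for String#scrub), and the helper names strict_utf8, transcode_to_utf8 and drop_invalid_utf8 are hypothetical.

```ruby
# Hypothetical helpers for illustration; none of these names come from the gem.

# Strategy 1: refuse data that is not valid UTF-8.
def strict_utf8(bytes)
  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, 'input is not valid UTF-8' unless candidate.valid_encoding?
  candidate
end

# Strategy 2a: transcode from a known external encoding to UTF-8.
def transcode_to_utf8(bytes, external_encoding)
  bytes.encode(Encoding::UTF_8, external_encoding)
end

# Strategy 2b: treat the bytes as UTF-8 and drop whatever is invalid.
def drop_invalid_utf8(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).scrub('')
end

latin1 = "Caf\xE9".b                             # "Café" as ISO-8859-1 bytes
transcode_to_utf8(latin1, Encoding::ISO_8859_1)  # => "Café"
drop_invalid_utf8(latin1)                        # => "Caf"
```

Dropping is lossy: the accented character disappears entirely, which is why transcoding from the real external encoding is preferable whenever that encoding is known.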
data/lib/ensure.rb ADDED
@@ -0,0 +1,3 @@
+ module Ensure
+   autoload :Encoding, 'ensure/encoding'
+ end
data/lib/ensure/encoding.rb ADDED
@@ -0,0 +1,132 @@
+ # encoding: utf-8
+
+ module Ensure
+   module Encoding
+     BYTE_ORDER_MARKS = {
+       ::Encoding::UTF_16BE => [0xfe, 0xff],
+       ::Encoding::UTF_16LE => [0xff, 0xfe],
+       ::Encoding::UTF_8 => [0xef, 0xbb, 0xbf]
+     }
+
+     # Tries to guess the encoding of the string and returns the most likely
+     # encoding.
+     def self.sniff_encoding(string)
+       first_bytes = string.unpack('C3')
+       BYTE_ORDER_MARKS.each do |encoding, bytes|
+         if first_bytes[0...bytes.length] == bytes
+           return encoding
+         end
+       end
+       ::Encoding::UTF_8
+     end
+
+     # Checks the encodings in +guesses+ from front to back and returns the
+     # first encoding in which the character data is a valid sequence.
+     def self.guess_encoding(string, guesses)
+       original_encoding = string.encoding
+       guessed_encoding = nil
+
+       guesses.each do |guess|
+         string.force_encoding(guess)
+         if string.valid_encoding?
+           guessed_encoding = string.encoding
+           break
+         end
+       end
+
+       string.force_encoding(original_encoding)
+       guessed_encoding
+     end
+
+     # Forces the encoding of +string+ to +target_encoding+ using a number
+     # of smart tricks. See String#ensure_encoding for more details.
+     def self.force_encoding(string, target_encoding, options={})
+       target_string = string.dup
+       force_encoding!(target_string, target_encoding, options)
+       target_string
+     end
+
+     # Performs just like +force_encoding+, only it changes the string
+     # in place instead of returning it.
+     def self.force_encoding!(string, target_encoding, options={})
+       if options[:external_encoding] == :sniff
+         external_encoding = sniff_encoding(string)
+       else
+         external_encoding = options[:external_encoding] || [target_encoding, string.encoding]
+       end
+
+       if external_encoding.respond_to?(:each)
+         external_encoding = guess_encoding(string, external_encoding) || target_encoding
+       end
+
+       if options[:invalid_characters] == :raise
+         string.force_encoding(target_encoding)
+         raise ::Encoding::InvalidByteSequenceError, "String is not encoded as `#{target_encoding}'" unless string.valid_encoding?
+       else
+         filters = (options[:invalid_characters] == :drop) ? { :replace => '', :undef => :replace, :invalid => :replace } : {}
+         string.encode!(target_encoding, external_encoding, filters)
+       end
+     end
+
+     module String
+       # Ensures the character encoding in a string. It employs a number of
+       # techniques to detect and transcode characters to make sure they end
+       # up in a usable form in the encoding you need.
+       #
+       # == Arguments
+       #
+       # +target_encoding+
+       # The character encoding you want to ensure; this is usually the
+       # internal encoding of your application. Accepts both encoding names
+       # and encoding constants (e.g. 'UTF-8' or Encoding::UTF_8).
+       # +options+
+       # Options to activate certain algorithms.
+       #
+       # === Options
+       #
+       # :external_encoding
+       # Specifies both your certainty about the external encoding and what
+       # you think it might be. Valid options are :sniff, an array of
+       # encodings, or a single encoding. When you specify :sniff, we will
+       # sniff around in the data to guess which encoding it is. When you
+       # supply a list of possible encodings we will check them from beginning
+       # to end to see if one of them matches the data. Finally, when you
+       # specify a single encoding we assume you know which it is and we will
+       # use that. By default we use :external_encoding => [target_encoding,
+       # self.encoding].
+       # :invalid_characters
+       # Specifies what to do with invalid characters. There are three valid
+       # values: :raise, :drop, and :transcode. The first raises an exception
+       # on an invalid character. The second will strip all invalid characters;
+       # the last will try to transcode them to the wanted encoding. By default
+       # we transcode.
+       #
+       # == Example
+       #
+       #   response = REST.get('http://www.google.com')
+       #
+       #   if match = /charset=([^;]*)/.match(response.content_type)
+       #     encoding = match[1]
+       #   else
+       #     encoding = 'UTF-8'
+       #   end
+       #
+       #   body = response.body.ensure_encoding('UTF-8',
+       #     :external_encoding => encoding,
+       #     :invalid_characters => :drop)
+       def ensure_encoding(target_encoding, options={})
+         Ensure::Encoding.force_encoding(self, target_encoding, options)
+       end
+
+       # Performs just like String#ensure_encoding, only it changes the string
+       # in place instead of returning it.
+       def ensure_encoding!(target_encoding, options={})
+         Ensure::Encoding.force_encoding!(self, target_encoding, options)
+       end
+     end
+   end
+ end
+
+ class String
+   include Ensure::Encoding::String
+ end
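
The two detection steps in this file, byte-order-mark sniffing and trying candidate encodings in order, can be illustrated without mutating the input. The following is a sketch under assumptions rather than the gem's code: sniff_bom and first_valid_encoding are hypothetical names, the optional fallback argument is an addition for illustration, and each probe works on a copy instead of forcing the encoding and restoring it afterwards.

```ruby
# Byte order marks for the same three encodings the file checks.
BOMS = {
  Encoding::UTF_8    => [0xEF, 0xBB, 0xBF],
  Encoding::UTF_16BE => [0xFE, 0xFF],
  Encoding::UTF_16LE => [0xFF, 0xFE]
}.freeze

# Returns the encoding whose byte order mark opens +bytes+, or +fallback+
# when no known mark is present.
def sniff_bom(bytes, fallback = Encoding::UTF_8)
  head = bytes.bytes.first(3)
  found = BOMS.find { |_encoding, mark| head.first(mark.length) == mark }
  found ? found.first : fallback
end

# Returns the first candidate in which +bytes+ form a valid sequence,
# or nil when none fits. The input string is left untouched.
def first_valid_encoding(bytes, candidates)
  candidates.each do |candidate|
    probe = bytes.dup.force_encoding(candidate)
    return probe.encoding if probe.valid_encoding?
  end
  nil
end

sniff_bom("\xFF\xFEh\x00".b)                                # => Encoding::UTF_16LE
first_valid_encoding("Caf\xE9".b, ['UTF-8', 'ISO-8859-1'])  # => Encoding::ISO_8859_1
```

The order of the candidates matters. A permissive encoding accepts many byte sequences that were written in another encoding (ISO-8859-1 accepts any byte, and the UTF-8 bytes of "é" can also be read as valid Shift_JIS half-width katakana), so it is usually safer to list the stricter encoding, such as UTF-8, first.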
metadata ADDED
@@ -0,0 +1,61 @@
+ --- !ruby/object:Gem::Specification
+ name: ensure-encoding
+ version: !ruby/object:Gem::Version
+   version: "0.1"
+ platform: ruby
+ authors:
+ - Manfred Stienstra
+ autorequire:
+ bindir: bin
+ cert_chain: []
+
+ date: 2010-01-11 00:00:00 +01:00
+ default_executable:
+ dependencies: []
+
+ description: " Ensure the character encoding in Strings coming from untrusted sources.\n"
+ email: manfred@fngtps.com
+ executables: []
+
+ extensions: []
+
+ extra_rdoc_files:
+ - README.rdoc
+ - LICENSE
+ files:
+ - README.rdoc
+ - LICENSE
+ - lib/ensure/encoding.rb
+ - lib/ensure.rb
+ has_rdoc: true
+ homepage:
+ licenses: []
+
+ post_install_message:
+ rdoc_options:
+ - --all
+ - --charset
+ - utf-8
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: "0"
+   version:
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: "0"
+   version:
+ requirements: []
+
+ rubyforge_project:
+ rubygems_version: 1.3.5
+ signing_key:
+ specification_version: 3
+ summary: Ensure the character encoding in Strings coming from untrusted sources.
+ test_files: []
+