ensure-encoding 0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (5)
  1. data/LICENSE +18 -0
  2. data/README.rdoc +95 -0
  3. data/lib/ensure.rb +3 -0
  4. data/lib/ensure/encoding.rb +132 -0
  5. metadata +61 -0
data/LICENSE ADDED
@@ -0,0 +1,18 @@
+ Copyright (c) 2009 Manfred Stienstra, Fingertips <manfred@fngtps.com>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy of
+ this software and associated documentation files (the "Software"), to deal in
+ the Software without restriction, including without limitation the rights to
+ use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
+ the Software, and to permit persons to whom the Software is furnished to do so,
+ subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
+ FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
+ COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
+ IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.rdoc ADDED
@@ -0,0 +1,95 @@
+ = Ensure Encoding
+
+ Experimental project to find the best way to ensure a preferred encoding in
+ Strings coming from untrusted sources.
+
+ == Algorithms
+
+ The sanest way of dealing with character data is to choose an internal
+ representation for your application and convert all incoming and outgoing data
+ when necessary.
+
+ To ensure that our internal encoding is always at least valid we can choose to
+ do one of two things.
+
+ 1. Throw an exception when we receive invalid character data (e.g. our internal
+    encoding is UTF-8 and we receive Latin-1 data or invalid UTF-8).
+ 2. Accept whatever we get and try to mold it in such a way that it becomes
+    usable in our application.
+
+ It is generally accepted to go for the second option; most of the time the
+ end-user has no way of solving these problems because a vendor made a mistake.
+ It's not very nice to shut them out.
+
+ There are a number of techniques for molding the character data to our needs.
+ Sniffing the encoding, transcoding, and dropping invalid characters are just a
+ few examples. We implement a number of these techniques so you can easily
+ protect your application from bad data.
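The techniques named in this paragraph can be sketched with plain Ruby core String methods. This is an illustration only, not the gem's implementation: the variable names are made up for the example, and `String#scrub` assumes Ruby 2.1 or later.

```ruby
# The bytes of "Café" in ISO-8859-1 (0xE9 is "é"), held as raw binary data.
latin1_bytes = "Caf\xE9".b

# Checking: tagged as UTF-8, these bytes do not form a valid sequence.
as_utf8 = latin1_bytes.dup.force_encoding(Encoding::UTF_8)
puts as_utf8.valid_encoding?

# Transcoding: tag the bytes with their real encoding, then convert to UTF-8.
transcoded = latin1_bytes.dup
                         .force_encoding(Encoding::ISO_8859_1)
                         .encode(Encoding::UTF_8)
puts transcoded

# Dropping: remove every byte that is invalid in the target encoding.
dropped = as_utf8.scrub('')
puts dropped
```

Transcoding keeps the "é" but only works when the source encoding is known or guessed correctly; dropping always yields valid data but loses the character.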
28
+
29
+ == Ensure encoding
30
+
31
+ We've crammed at lot of functionality into the ensure_encoding method because
32
+ we want to keep the number of new methods on String to a minimum. We'll walk
33
+ through an example to show how it works.
34
+
35
+ example = 'Café'
36
+ example.encoding => #<Encoding:ISO-8859-1>
37
+
38
+ After ensuring the encoding of a string you can at least assume that you can
39
+ concatenate the string to another string with the same encoding. In other
40
+ words, it contains data valid for the specified encoding.
41
+
42
+ example.ensure_encoding('UTF-8')
43
+ example.encoding => #<Encoding:UTF-8>
44
+
45
+ Beyond this you can specify a number of options to perform more operations
46
+ to make sure the data in the string didn't become unreadable garbage. Let's
47
+ look at a number of situations.
48
+
49
+ === Untrusted source with known encoding
50
+
51
+ For instance, when you're excepting data from browsers you can be pretty sure
52
+ the character data is properly encoded. When someone does send bad data it's
53
+ probably a hacker and you can discard the request.
54
+
55
+ example.ensure_encoding('UTF-8',
56
+ :external_encoding => 'UTF-8,
57
+ :invalid_characters => :raise
58
+ )
59
+
60
+ === Friendly source with known encoding
61
+
62
+ In this scenario you're connecting to web API through a ReST library and you
63
+ know the encoding of the source data because it's in the headers. However
64
+ you're not sure the encoding of the strings is valid.
65
+
66
+ example.ensure_encoding(Encoding::UTF_8
67
+ :external_encoding => Encoding::UTF_8,
68
+ :invalid_characters => :drop
69
+ )
70
+
71
+ === Untrusted source with variable encoding
72
+
73
+ Assume we have a legacy database and some of the fields contain Shift JIS,
74
+ while some of the newer fields contain UTF-8 because someone screwed up the
75
+ server configuration. You're not even sure the encoding property on the
76
+ strings you got make any sense because your database adapter is confused too.
77
+
78
+ example.ensure_encoding('UTF-8',
79
+ :external_encoding => [Encoding::Shift_JIS, Encoding::UTF_8],
80
+ :invalid_characters => :transcode
81
+ )
82
+
83
+ === Untrusted source with unknown encoding
84
+
85
+ As a last resort you're trying to read some random files from disk and you
86
+ have no idea what the external encoding is. You've just read them as binary
87
+ and are hoping to make some sense from the data.
88
+
89
+ example.ensure_encoding('UTF-8',
90
+ :external_encoding => :sniff,
91
+ :invalid_characters => :transcode
92
+ )
93
+
94
+ Note that the encoding sniffer is currently very naive and might not always be
95
+ of any help.
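To make "naive" concrete, here is a minimal sketch of byte order mark (BOM) sniffing, the same idea the sniffer in lib/ensure/encoding.rb uses. The helper name `sniff_bom` and the `BOMS` constant are hypothetical and not part of the gem. Unlike the gem, which falls back to UTF-8, this sketch returns nil when no BOM is present, so the "no evidence" case is explicit.

```ruby
# Hypothetical helper for illustration; not part of ensure-encoding.
BOMS = {
  Encoding::UTF_8    => [0xEF, 0xBB, 0xBF],
  Encoding::UTF_16BE => [0xFE, 0xFF],
  Encoding::UTF_16LE => [0xFF, 0xFE]
}.freeze

# Returns the encoding whose BOM starts +data+, or nil when there is none.
def sniff_bom(data)
  head = data.bytes.first(3)
  BOMS.each do |encoding, bom|
    return encoding if head.first(bom.length) == bom
  end
  nil
end

p sniff_bom("\xEF\xBB\xBFhello".b)
p sniff_bom("\xFF\xFEh\x00i\x00".b)
p sniff_bom("Caf\xE9".b)
```

Most real-world data carries no BOM at all, which is why sniffing of this kind often cannot help: the Latin-1 bytes in the last call give the sniffer nothing to work with.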
data/lib/ensure.rb ADDED
@@ -0,0 +1,3 @@
+ module Ensure
+   autoload :Encoding, 'ensure/encoding'
+ end
data/lib/ensure/encoding.rb ADDED
@@ -0,0 +1,132 @@
+ # encoding: utf-8
+
+ module Ensure
+   module Encoding
+     BYTE_ORDER_MARKS = {
+       ::Encoding::UTF_16BE => [0xfe, 0xff],
+       ::Encoding::UTF_16LE => [0xff, 0xfe],
+       ::Encoding::UTF_8    => [0xef, 0xbb, 0xbf]
+     }
+
+     # Tries to guess the encoding of the string and returns the most likely
+     # encoding.
+     def self.sniff_encoding(string)
+       first_bytes = string.unpack('C3')
+       BYTE_ORDER_MARKS.each do |encoding, bytes|
+         if first_bytes[0...bytes.length] == bytes
+           return encoding
+         end
+       end
+       ::Encoding::UTF_8
+     end
+
+     # Checks the encodings in +guesses+ from front to back and returns the
+     # first encoding in which the character data is a valid sequence.
+     def self.guess_encoding(string, guesses)
+       original_encoding = string.encoding
+       guessed_encoding = nil
+
+       guesses.each do |guess|
+         string.force_encoding(guess)
+         if string.valid_encoding?
+           guessed_encoding = string.encoding
+           break
+         end
+       end
+
+       string.force_encoding(original_encoding)
+       guessed_encoding
+     end
+
+     # Forces the encoding of +string+ to +target_encoding+ using a number
+     # of smart tricks. See String#ensure_encoding for more details.
+     def self.force_encoding(string, target_encoding, options={})
+       target_string = string.dup
+       force_encoding!(target_string, target_encoding, options)
+       target_string
+     end
+
+     # Performs just like +force_encoding+, only it changes the string
+     # in place instead of returning it.
+     def self.force_encoding!(string, target_encoding, options={})
+       if options[:external_encoding] == :sniff
+         external_encoding = sniff_encoding(string)
+       else
+         external_encoding = options[:external_encoding] || [target_encoding, string.encoding]
+       end
+
+       if external_encoding.respond_to?(:each)
+         external_encoding = guess_encoding(string, external_encoding) || target_encoding
+       end
+
+       if options[:invalid_characters] == :raise
+         string.force_encoding(target_encoding)
+         raise ::Encoding::InvalidByteSequenceError, "String is not encoded as `#{target_encoding}'" unless string.valid_encoding?
+       else
+         filters = (options[:invalid_characters] == :drop) ? { :replace => '', :undef => :replace, :invalid => :replace } : {}
+         string.encode!(target_encoding, external_encoding, filters)
+       end
+     end
+
+     module String
+       # Ensures the character encoding in a string. It employs a number of
+       # techniques to detect and transcode characters to make sure they end
+       # up in a usable form in the encoding you need.
+       #
+       # == Arguments
+       #
+       # +target_encoding+
+       #   The character encoding you want to ensure; this is usually the
+       #   internal encoding of your application. Accepts both string
+       #   constants and encoding constants (e.g. 'UTF-8' or Encoding::UTF_8).
+       # +options+
+       #   Options to activate certain algorithms.
+       #
+       # === Options
+       #
+       # :external_encoding
+       #   Specifies both your certainty about the external encoding and what
+       #   you think it might be. Valid options are :sniff, an array of
+       #   encodings, or a single encoding. When you specify :sniff, we will
+       #   sniff around in the data to guess which encoding it is. When you
+       #   supply a list of possible encodings we will check them from first
+       #   to last to see if one of them matches the data. Finally, when you
+       #   specify a single encoding we assume you know which it is and we
+       #   will use that. By default we use
+       #   :external_encoding => [target_encoding, self.encoding].
+       # :invalid_characters
+       #   Specifies what to do with invalid characters. There are three valid
+       #   values: :raise, :drop, and :transcode. The first raises an exception
+       #   on an invalid character. The second will strip all invalid
+       #   characters, and the last will try to transcode them to the wanted
+       #   encoding. By default we transcode.
+       #
+       # == Example
+       #
+       #   response = REST.get('http://www.google.com')
+       #
+       #   if match = /charset=([^;]*)/.match(response.content_type)
+       #     encoding = match[1]
+       #   else
+       #     encoding = 'UTF-8'
+       #   end
+       #
+       #   body = response.body.ensure_encoding('UTF-8',
+       #     :external_encoding  => encoding,
+       #     :invalid_characters => :drop)
+       def ensure_encoding(target_encoding, options={})
+         Ensure::Encoding.force_encoding(self, target_encoding, options)
+       end
+
+       # Performs just like String#ensure_encoding, only it changes the string
+       # in place instead of returning it.
+       def ensure_encoding!(target_encoding, options={})
+         Ensure::Encoding.force_encoding!(self, target_encoding, options)
+       end
+     end
+   end
+ end
+
+ class String
+   include Ensure::Encoding::String
+ end
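Because the list of candidate encodings is checked from front to back and the first valid one wins, the order of the list matters. The sketch below illustrates this with a hypothetical helper, `first_valid_encoding`, which is not part of the gem. It works on a copy of the data instead of retagging and restoring the original string, which is a simplification made for this example.

```ruby
# Hypothetical helper: returns the first candidate encoding in which the
# bytes of +data+ form a valid sequence, or nil when none of them fit.
def first_valid_encoding(data, candidates)
  candidates.find { |encoding| data.dup.force_encoding(encoding).valid_encoding? }
end

utf8_bytes   = "Caf\u00E9".b # "é" stored as the two bytes 0xC3 0xA9
latin1_bytes = "Caf\xE9".b   # "é" stored as the single byte 0xE9

strict_first = [Encoding::UTF_8, Encoding::ISO_8859_1]
loose_first  = [Encoding::ISO_8859_1, Encoding::UTF_8]

p first_valid_encoding(utf8_bytes, strict_first)
p first_valid_encoding(latin1_bytes, strict_first)
p first_valid_encoding(utf8_bytes, loose_first)
```

Every byte sequence is valid ISO-8859-1, so listing it first hides UTF-8 data behind it. Putting the strictest encoding first gives the later, more permissive candidates a chance to act as a fallback.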
metadata ADDED
@@ -0,0 +1,61 @@
+ --- !ruby/object:Gem::Specification
+ name: ensure-encoding
+ version: !ruby/object:Gem::Version
+   version: "0.1"
+ platform: ruby
+ authors:
+ - Manfred Stienstra
+ autorequire:
+ bindir: bin
+ cert_chain: []
+
+ date: 2010-01-11 00:00:00 +01:00
+ default_executable:
+ dependencies: []
+
+ description: " Ensure the character encoding in Strings coming from untrusted sources.\n"
+ email: manfred@fngtps.com
+ executables: []
+
+ extensions: []
+
+ extra_rdoc_files:
+ - README.rdoc
+ - LICENSE
+ files:
+ - README.rdoc
+ - LICENSE
+ - lib/ensure/encoding.rb
+ - lib/ensure.rb
+ has_rdoc: true
+ homepage:
+ licenses: []
+
+ post_install_message:
+ rdoc_options:
+ - --all
+ - --charset
+ - utf-8
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: "0"
+   version:
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: "0"
+   version:
+ requirements: []
+
+ rubyforge_project:
+ rubygems_version: 1.3.5
+ signing_key:
+ specification_version: 3
+ summary: Ensure the character encoding in Strings coming from untrusted sources.
+ test_files: []
+