sconv 0.1.0

Files changed (4)
  1. data/README +31 -0
  2. data/lib/ceslist +357 -0
  3. data/lib/sconv.rb +161 -0
  4. metadata +48 -0
data/README ADDED
@@ -0,0 +1,31 @@
+ == Sconv version 0.1.0
+
+ === Overview
+ The module Sconv provides a convenience layer for character
+ encoding conversion. It is included by class String.
+
+ It automatically selects between kconv (for Japanese encodings)
+ and iconv (for other encodings world-wide). It provides
+ conversion methods directly on Strings, including destructive
+ conversions. It provides methods with names of the form
+ String#inputEncoding_to_outputEncoding automatically
+ for any encoding supported by iconv or kconv.
+
+ For further information, please see:
+ "A Study Concerning Multilingual Processing in Ruby",
+ Takuya SHIMADA, Kazunari ITO, and Martin J. Dürst,
+ Proceedings of the 69th Information Processing Society
+ of Japan National Convention, March 2007 (in Japanese)
+
+ === Future Work
+ - Write tests and run them
+ - Provide support for error messages
+ - Provide support for other CES conversions
+ - Provide automatic update for list of
+   supported character encodings in ces.rb
+   (depends on locally installed iconv library)
+
+ === Copyright
+ Copyright (c) 2007 Takuya Shimada, Martin J. Dürst
+ Licensed under the same terms as Ruby. Absolutely no warranty.
+ (see http://www.ruby-lang.org/en/LICENSE.txt)
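The backend selection described in the README overview can be sketched as a small pure function. This is an illustration only: `backends_for` is a hypothetical helper that is not part of the gem, and the pattern is the one that lib/sconv.rb uses to decide which encodings go to kconv. When only one side is a kconv encoding, the conversion goes through UTF-8 as an intermediate step.

```ruby
# Pattern taken from lib/sconv.rb: names matching it are routed to kconv.
KCONV_NAMES = /iso-2022-jp|jis|shift_jis|sjis|euc-jp|utf-8|utf-16/i

# Hypothetical helper (not in the gem): list the libraries that would
# take part in a conversion, in the order they are applied.
def backends_for(from, to)
  from_kconv = KCONV_NAMES.match?(from)
  to_kconv   = KCONV_NAMES.match?(to)

  if from_kconv && to_kconv
    [:kconv]            # kconv converts directly
  elsif from_kconv
    [:kconv, :iconv]    # kconv to UTF-8, then iconv to the target
  elsif to_kconv
    [:iconv, :kconv]    # iconv to UTF-8, then kconv to the target
  else
    [:iconv]            # iconv converts directly
  end
end

p backends_for('EUC-JP', 'Shift_JIS')
p backends_for('KOI8-R', 'ISO-8859-5')
```

Because the pattern is unanchored, any name that merely contains one of the alternatives (for example a name containing "jis") is also routed to kconv.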
data/lib/ceslist ADDED
@@ -0,0 +1,357 @@
+ ANSI_X3.4-1968
+ ANSI_X3.4-1986
+ ASCII
+ CP367
+ IBM367
+ ISO-IR-6
+ ISO646-US
+ ISO_646.IRV:1991
+ US
+ US-ASCII
+ CSASCII
+ UTF-8
+ ISO-10646-UCS-2
+ UCS-2
+ CSUNICODE
+ UCS-2BE
+ UNICODE-1-1
+ UNICODEBIG
+ CSUNICODE11
+ UCS-2LE
+ UNICODELITTLE
+ ISO-10646-UCS-4
+ UCS-4
+ CSUCS4
+ UCS-4BE
+ UCS-4LE
+ UTF-16
+ UTF-16BE
+ UTF-16LE
+ UTF-32
+ UTF-32BE
+ UTF-32LE
+ UNICODE-1-1-UTF-7
+ UTF-7
+ CSUNICODE11UTF7
+ UCS-2-INTERNAL
+ UCS-2-SWAPPED
+ UCS-4-INTERNAL
+ UCS-4-SWAPPED
+ C99
+ JAVA
+ CP819
+ IBM819
+ ISO-8859-1
+ ISO-IR-100
+ ISO_8859-1
+ ISO_8859-1:1987
+ L1
+ LATIN1
+ CSISOLATIN1
+ ISO-8859-2
+ ISO-IR-101
+ ISO_8859-2
+ ISO_8859-2:1987
+ L2
+ LATIN2
+ CSISOLATIN2
+ ISO-8859-3
+ ISO-IR-109
+ ISO_8859-3
+ ISO_8859-3:1988
+ L3
+ LATIN3
+ CSISOLATIN3
+ ISO-8859-4
+ ISO-IR-110
+ ISO_8859-4
+ ISO_8859-4:1988
+ L4
+ LATIN4
+ CSISOLATIN4
+ CYRILLIC
+ ISO-8859-5
+ ISO-IR-144
+ ISO_8859-5
+ ISO_8859-5:1988
+ CSISOLATINCYRILLIC
+ ARABIC
+ ASMO-708
+ ECMA-114
+ ISO-8859-6
+ ISO-IR-127
+ ISO_8859-6
+ ISO_8859-6:1987
+ CSISOLATINARABIC
+ ECMA-118
+ ELOT_928
+ GREEK
+ GREEK8
+ ISO-8859-7
+ ISO-IR-126
+ ISO_8859-7
+ ISO_8859-7:1987
+ CSISOLATINGREEK
+ HEBREW
+ ISO-8859-8
+ ISO-IR-138
+ ISO_8859-8
+ ISO_8859-8:1988
+ CSISOLATINHEBREW
+ ISO-8859-9
+ ISO-IR-148
+ ISO_8859-9
+ ISO_8859-9:1989
+ L5
+ LATIN5
+ CSISOLATIN5
+ ISO-8859-10
+ ISO-IR-157
+ ISO_8859-10
+ ISO_8859-10:1992
+ L6
+ LATIN6
+ CSISOLATIN6
+ ISO-8859-13
+ ISO-IR-179
+ ISO_8859-13
+ L7
+ LATIN7
+ ISO-8859-14
+ ISO-CELTIC
+ ISO-IR-199
+ ISO_8859-14
+ ISO_8859-14:1998
+ L8
+ LATIN8
+ ISO-8859-15
+ ISO-IR-203
+ ISO_8859-15
+ ISO_8859-15:1998
+ ISO-8859-16
+ ISO-IR-226
+ ISO_8859-16
+ ISO_8859-16:2000
+ KOI8-R
+ CSKOI8R
+ KOI8-U
+ KOI8-RU
+ CP1250
+ MS-EE
+ WINDOWS-1250
+ CP1251
+ MS-CYRL
+ WINDOWS-1251
+ CP1252
+ MS-ANSI
+ WINDOWS-1252
+ CP1253
+ MS-GREEK
+ WINDOWS-1253
+ CP1254
+ MS-TURK
+ WINDOWS-1254
+ CP1255
+ MS-HEBR
+ WINDOWS-1255
+ CP1256
+ MS-ARAB
+ WINDOWS-1256
+ CP1257
+ WINBALTRIM
+ WINDOWS-1257
+ CP1258
+ WINDOWS-1258
+ 850
+ CP850
+ IBM850
+ CSPC850MULTILINGUAL
+ 862
+ CP862
+ IBM862
+ CSPC862LATINHEBREW
+ 866
+ CP866
+ IBM866
+ CSIBM866
+ MAC
+ MACINTOSH
+ MACROMAN
+ CSMACINTOSH
+ MACCENTRALEUROPE
+ MACICELAND
+ MACCROATIAN
+ MACROMANIA
+ MACCYRILLIC
+ MACUKRAINE
+ MACGREEK
+ MACTURKISH
+ MACHEBREW
+ MACARABIC
+ MACTHAI
+ HP-ROMAN8
+ R8
+ ROMAN8
+ CSHPROMAN8
+ NEXTSTEP
+ ARMSCII-8
+ GEORGIAN-ACADEMY
+ GEORGIAN-PS
+ KOI8-T
+ MULELAO-1
+ CP1133
+ IBM-CP1133
+ ISO-IR-166
+ TIS-620
+ TIS620
+ TIS620-0
+ TIS620.2529-1
+ TIS620.2533-0
+ TIS620.2533-1
+ CP874
+ WINDOWS-874
+ VISCII
+ VISCII1.1-1
+ CSVISCII
+ TCVN
+ TCVN-5712
+ TCVN5712-1
+ TCVN5712-1:1993
+ ISO-IR-14
+ ISO646-JP
+ JIS_C6220-1969-RO
+ JP
+ CSISO14JISC6220RO
+ JISX0201-1976
+ JIS_X0201
+ X0201
+ CSHALFWIDTHKATAKANA
+ ISO-IR-87
+ JIS0208
+ JIS_C6226-1983
+ JIS_X0208
+ JIS_X0208-1983
+ JIS_X0208-1990
+ X0208
+ CSISO87JISX0208
+ ISO-IR-159
+ JIS_X0212
+ JIS_X0212-1990
+ JIS_X0212.1990-0
+ X0212
+ CSISO159JISX02121990
+ CN
+ GB_1988-80
+ ISO-IR-57
+ ISO646-CN
+ CSISO57GB1988
+ CHINESE
+ GB_2312-80
+ ISO-IR-58
+ CSISO58GB231280
+ CN-GB-ISOIR165
+ ISO-IR-165
+ ISO-IR-149
+ KOREAN
+ KSC_5601
+ KS_C_5601-1987
+ KS_C_5601-1989
+ CSKSC56011987
+ EUC-JP
+ EUCJP
+ EXTENDED_UNIX_CODE_PACKED_FORMAT_FOR_JAPANESE
+ CSEUCPKDFMTJAPANESE
+ MS_KANJI
+ SHIFT-JIS
+ SHIFT_JIS
+ SJIS
+ CSSHIFTJIS
+ CP932
+ ISO-2022-JP
+ CSISO2022JP
+ ISO-2022-JP-1
+ ISO-2022-JP-2
+ CSISO2022JP2
+ CN-GB
+ EUC-CN
+ EUCCN
+ GB2312
+ CSGB2312
+ CP936
+ GBK
+ GB18030
+ ISO-2022-CN
+ CSISO2022CN
+ ISO-2022-CN-EXT
+ HZ
+ HZ-GB-2312
+ EUC-TW
+ EUCTW
+ CSEUCTW
+ BIG-5
+ BIG-FIVE
+ BIG5
+ BIGFIVE
+ CN-BIG5
+ CSBIG5
+ CP950
+ BIG5-HKSCS
+ BIG5HKSCS
+ EUC-KR
+ EUCKR
+ CSEUCKR
+ CP949
+ UHC
+ CP1361
+ JOHAB
+ ISO-2022-KR
+ CSISO2022KR
+ 437
+ CP437
+ IBM437
+ CSPC8CODEPAGE437
+ CP737
+ CP775
+ IBM775
+ CSPC775BALTIC
+ 852
+ CP852
+ IBM852
+ CSPCP852
+ CP853
+ 855
+ CP855
+ IBM855
+ CSIBM855
+ 857
+ CP857
+ IBM857
+ CSIBM857
+ CP858
+ 860
+ CP860
+ IBM860
+ CSIBM860
+ 861
+ CP-IS
+ CP861
+ IBM861
+ CSIBM861
+ 863
+ CP863
+ IBM863
+ CSIBM863
+ CP864
+ IBM864
+ CSIBM864
+ 865
+ CP865
+ IBM865
+ CSIBM865
+ 869
+ CP-GR
+ CP869
+ IBM869
+ CSIBM869
+ CP1125
+ JIS
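The list above is what lets a method name such as UTF8_to_ISO88591 be mapped back to real encoding names: lib/sconv.rb builds a table keyed by each name with hyphens and underscores removed. A minimal sketch of that lookup follows. It rests on several assumptions: `SAMPLE_NAMES` is a short excerpt of the list, `squeeze_name` and `NAME_TABLE` are hypothetical names, and two choices differ from the gem, namely upcasing the key (so lookups ignore case) and keeping the first spelling when two names collapse to the same key.

```ruby
# A short excerpt from the list above; the gem reads all 357 names from lib/ceslist.
SAMPLE_NAMES = %w[US-ASCII UTF-8 ISO-8859-1 ISO_8859-1 EUC-JP SHIFT_JIS].freeze

# Hypothetical helper: form the lookup key by dropping hyphens and underscores.
# Upcasing is an addition of this sketch; the gem's lookup is case-sensitive.
def squeeze_name(name)
  name.gsub(/[-_]/, '').upcase
end

# Map each key to a registered spelling. ISO-8859-1 and ISO_8859-1 collapse to
# the same key; this sketch keeps the first spelling seen.
NAME_TABLE = SAMPLE_NAMES.each_with_object({}) do |name, table|
  table[squeeze_name(name)] ||= name
end

puts NAME_TABLE.fetch(squeeze_name('utf8'))
puts NAME_TABLE.fetch('ISO88591')
```

Building the table once and reusing it avoids re-reading the list on every lookup.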
data/lib/sconv.rb ADDED
@@ -0,0 +1,161 @@
+ require 'iconv'
+ require 'kconv'
+
+ # :include: ../README
+ module Sconv
+   # Encoding names that are routed to kconv
+   KCONV_CES = /iso-2022-jp|jis|shift_jis|sjis|euc-jp|utf-8|utf-16/i
+
+   # Map an encoding name matched by KCONV_CES to its Kconv constant
+   # (later tests override earlier ones; nil if no test matches)
+   def self.kconv_code(name)
+     code = nil
+     code = Kconv::JIS   if name =~ /^i|^j/i
+     code = Kconv::SJIS  if name =~ /^s/i
+     code = Kconv::EUC   if name =~ /^e/i
+     code = Kconv::UTF8  if name =~ /8$/
+     code = Kconv::UTF16 if name =~ /6$/
+     code
+   end
+
+   # Convert the CES (Character Encoding Scheme) of a string
+   # from the "from" encoding to the "to" encoding
+   # (selects kconv for Japanese encodings, iconv for others)
+   def sconv(from, to)
+     if from =~ KCONV_CES && to =~ KCONV_CES
+       # both encodings are handled by kconv
+       kconv(Sconv.kconv_code(to), Sconv.kconv_code(from))
+     elsif from =~ KCONV_CES
+       # kconv to UTF-8, then iconv to the target encoding
+       str_utf8 = kconv(Kconv::UTF8, Sconv.kconv_code(from))
+       Iconv.conv(to, 'UTF-8', str_utf8)
+     elsif to =~ KCONV_CES
+       # iconv to UTF-8, then kconv to the target encoding
+       Iconv.conv('UTF-8', from, self).kconv(Sconv.kconv_code(to), Kconv::UTF8)
+     else
+       Iconv.conv(to, from, self)
+     end
+   end
+
+   # Return the registered name of a CES, given that name with
+   # hyphens and underscores removed (e.g. 'UTF8' gives 'UTF-8')
+   def normalize
+     ces = Hash.new
+     # look for the list next to this file, not in the current directory
+     list = File.join(File.dirname(__FILE__), 'ceslist')
+     File.read(list).scan(/\S+/).each do |name|
+       ces[name.delete('\-,_')] = name
+     end
+     ces.fetch(self)
+   end
+
+   # Check whether the string is EUC-JP or not
+   # (more Ruby-like method name)
+   def EUC?
+     iseuc ? true : false
+   end
+
+   # Check whether the string is Shift_JIS or not
+   # (more Ruby-like method name)
+   def SJIS?
+     issjis ? true : false
+   end
+
+   # Check whether the string is UTF-8 or not
+   # (more Ruby-like method name)
+   def UTF8?
+     isutf8 ? true : false
+   end
+
+   # Guess the string's CES; return the result as a string
+   # (nil if Kconv cannot identify the encoding)
+   def guess_ces
+     case Kconv.guess(self)
+     when Kconv::JIS    then 'ISO-2022-JP'
+     when Kconv::EUC    then 'EUC-JP'
+     when Kconv::SJIS   then 'Shift_JIS'
+     when Kconv::BINARY then 'BINARY'
+     when Kconv::ASCII  then 'ASCII'
+     when Kconv::UTF8   then 'UTF-8'
+     when Kconv::UTF16  then 'UTF-16'
+     end
+   end
+ end
+
+ # === Helper Functions for String Class
+ class String
+   include Sconv
+
+   alias sconf_old_method_missing method_missing
+
+   # Provides methods of the form String#inputEncoding_to_outputEncoding
+   # and String#inputEncoding_to_outputEncoding!. Example:
+   #   String#UTF8_to_ISO88591
+   def method_missing(method_name, *args)
+     if method_name.to_s =~ /^(.+)_to_([^!]+)(!)?$/
+       from_name, to_name, destructive = $1, $2, ($3 == '!')
+       begin
+         from_code, to_code = from_name.normalize, to_name.normalize
+         # test whether we can convert these encodings
+         ''.sconv(from_code, to_code)
+         okay = true
+       rescue StandardError
+         okay = false
+       end
+       if okay
+         if destructive
+           self.class.send :define_method, method_name.to_sym do
+             replace self.sconv(from_code, to_code)
+           end
+         else
+           self.class.send :define_method, method_name.to_sym do
+             sconv(from_code, to_code)
+           end
+         end
+         send method_name
+       else
+         sconf_old_method_missing(method_name, *args)
+       end
+     elsif method_name.to_s =~ /^to_(\w+)$/ && respond_to?('to' + $1.downcase)
+       # forward e.g. to_utf8 to Kconv's toutf8
+       method_called = 'to' + $1.downcase
+       self.class.send :define_method, method_name.to_sym do
+         send method_called.to_sym
+       end
+       send method_name
+     else
+       # not ours: fall back to the original NoMethodError behaviour
+       sconf_old_method_missing(method_name, *args)
+     end
+   end
+ end
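The define-on-first-use technique in method_missing above can be tried on a current Ruby, where Iconv is no longer part of the standard library. The sketch below swaps in the built-in String#encode for Iconv and Kconv and uses a String subclass rather than reopening String. `ConvString`, `NAMES`, and `PATTERN` are hypothetical names, and the four-entry table stands in for the full encoding list. Unknown names are passed to `super`, so they still raise NoMethodError.

```ruby
# Hypothetical stand-in for the gem's String extension. It uses String#encode
# (built into Ruby 1.9 and later) in place of Iconv and Kconv.
class ConvString < String
  # Squeezed names, as they appear in method names, mapped to Ruby encoding names.
  NAMES = {
    'UTF8'     => 'UTF-8',
    'ISO88591' => 'ISO-8859-1',
    'EUCJP'    => 'EUC-JP',
    'SHIFTJIS' => 'Shift_JIS'
  }.freeze

  PATTERN = /\A(.+)_to_([^!]+)(!)?\z/

  def method_missing(name, *args, &block)
    match = PATTERN.match(name.to_s)
    from = match && NAMES[match[1].upcase]
    to   = match && NAMES[match[2].upcase]
    return super unless from && to # unknown names end in NoMethodError

    destructive = !match[3].nil?
    # Define the method once; later calls bypass method_missing entirely.
    self.class.send(:define_method, name) do
      converted = dup.force_encoding(from).encode(to)
      destructive ? replace(converted) : converted
    end
    send(name)
  end

  def respond_to_missing?(name, include_private = false)
    match = PATTERN.match(name.to_s)
    (match && NAMES.key?(match[1].upcase) && NAMES.key?(match[2].upcase)) || super
  end
end

text = ConvString.new("caf\u00e9")
latin1 = text.UTF8_to_ISO88591
p latin1.encoding
p latin1.bytes
```

Defining `respond_to_missing?` alongside `method_missing` keeps `respond_to?` consistent with the methods that can actually be called.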
metadata ADDED
@@ -0,0 +1,48 @@
+ --- !ruby/object:Gem::Specification
+ rubygems_version: 0.9.2
+ specification_version: 1
+ name: sconv
+ version: !ruby/object:Gem::Version
+   version: 0.1.0
+ date: 2007-05-28 00:00:00 +09:00
+ summary: A convenience layer for character encoding conversion, providing one-stop shopping for Kconv (for Japanese encodings) and Iconv (for other encodings).
+ require_paths:
+ - lib
+ email: duerst@it.aoyama.ac.jp
+ homepage:
+ rubyforge_project:
+ description:
+ autorequire: sconv
+ default_executable:
+ bindir: bin
+ has_rdoc: true
+ required_ruby_version: !ruby/object:Gem::Version::Requirement
+   requirements:
+   - - ">"
+     - !ruby/object:Gem::Version
+       version: 0.0.0
+   version:
+ platform: ruby
+ signing_key:
+ cert_chain:
+ post_install_message:
+ authors:
+ - Takuya Shimada, Martin J. Dürst
+ files:
+ - lib/ceslist
+ - lib/sconv.rb
+ - README
+ test_files: []
+
+ rdoc_options: []
+
+ extra_rdoc_files:
+ - README
+ executables: []
+
+ extensions: []
+
+ requirements: []
+
+ dependencies: []
+
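For readers more used to the Ruby form of a gem specification, the main fields of the metadata above might be written roughly as follows. This is a sketch and not a file shipped with the gem. It splits the single author entry into two names and shortens the summary, both of which are choices made here, and it omits fields that are empty in the metadata.

```ruby
require 'rubygems'

# Sketch of an equivalent specification; values are taken from the metadata above.
spec = Gem::Specification.new do |s|
  s.name             = 'sconv'
  s.version          = '0.1.0'
  s.summary          = 'A convenience layer for character encoding conversion'
  s.authors          = ['Takuya Shimada', 'Martin J. Dürst'] # one entry in the metadata
  s.email            = 'duerst@it.aoyama.ac.jp'
  s.files            = ['lib/ceslist', 'lib/sconv.rb', 'README']
  s.require_paths    = ['lib']
  s.extra_rdoc_files = ['README']
end

puts "#{spec.name} #{spec.version}"
```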