sconv 0.1.0

Files changed (4)
  1. data/README +31 -0
  2. data/lib/ceslist +357 -0
  3. data/lib/sconv.rb +161 -0
  4. metadata +48 -0
data/README ADDED
@@ -0,0 +1,31 @@
+ == Sconv version 0.1.0
+
+ === Overview
+ The Module sconv provides a convenience layer for character
+ encoding conversion. It is included by class String.
+
+ It automatically selects between kconv (for Japanese encodings)
+ and iconv (for other encodings world-wide). It provides
+ conversion methods directly on Strings, including destructive
+ conversions. It provides methods with names of the form
+ String#inputEncoding_to_outputEncoding automatically
+ for any encoding supported by iconv or kconv.
+
+ For further information, please see:
+ "A Study Concerning Multilingual Processing in Ruby",
+ Takuya SHIMADA, Kazunari ITO, and Martin J. Du"rst,
+ Proceedings of the 69th Information Processing Society
+ of Japan National Convention, March 2007 (in Japanese)
+
+ === Future Work
+ - Write tests and run them
+ - Provide support for error messages
+ - Provide support for other CES conversions
+ - Provide automatic update for list of
+   supported character encodings in ces.rb
+   (depends on locally installed iconv library)
+
+ === Copyright
+ Copyright (c) 2007 Takuya Shimada, Martin J. Du"rst
+ Licensed under the same terms as Ruby. Absolutely no warranty.
+ (see http://www.ruby-lang.org/en/LICENSE.txt)
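The README describes methods of the form String#inputEncoding_to_outputEncoding and their destructive variants. As a rough illustration of what such a call amounts to, the sketch below uses Ruby's built-in String#encode and String#encode!. This is an assumption made for the example only: the package itself relies on Iconv and Kconv, which are not bundled with current Ruby releases, and the variable names are hypothetical.

```ruby
# Assumption: String#encode stands in for the Iconv/Kconv back ends
# that sconv 0.1.0 actually uses.
utf8 = "caf\u00E9" # "café" as UTF-8 (5 bytes)

# Non-destructive: roughly what str.UTF8_to_ISO88591 is described as doing.
latin1 = utf8.encode('ISO-8859-1', 'UTF-8')

# Destructive: roughly what str.UTF8_to_ISO88591! is described as doing.
copy = utf8.dup
copy.encode!('ISO-8859-1', 'UTF-8')

puts latin1.bytesize # "é" is a single byte (0xE9) in ISO-8859-1
puts copy == latin1
```

The non-destructive call returns a new string and leaves the receiver alone, while the destructive call rewrites the receiver in place, which matches the distinction the README draws.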
data/lib/ceslist ADDED
@@ -0,0 +1,357 @@
+ ANSI_X3.4-1968
+ ANSI_X3.4-1986
+ ASCII
+ CP367
+ IBM367
+ ISO-IR-6
+ ISO646-US
+ ISO_646.IRV:1991
+ US
+ US-ASCII
+ CSASCII
+ UTF-8
+ ISO-10646-UCS-2
+ UCS-2
+ CSUNICODE
+ UCS-2BE
+ UNICODE-1-1
+ UNICODEBIG
+ CSUNICODE11
+ UCS-2LE
+ UNICODELITTLE
+ ISO-10646-UCS-4
+ UCS-4
+ CSUCS4
+ UCS-4BE
+ UCS-4LE
+ UTF-16
+ UTF-16BE
+ UTF-16LE
+ UTF-32
+ UTF-32BE
+ UTF-32LE
+ UNICODE-1-1-UTF-7
+ UTF-7
+ CSUNICODE11UTF7
+ UCS-2-INTERNAL
+ UCS-2-SWAPPED
+ UCS-4-INTERNAL
+ UCS-4-SWAPPED
+ C99
+ JAVA
+ CP819
+ IBM819
+ ISO-8859-1
+ ISO-IR-100
+ ISO_8859-1
+ ISO_8859-1:1987
+ L1
+ LATIN1
+ CSISOLATIN1
+ ISO-8859-2
+ ISO-IR-101
+ ISO_8859-2
+ ISO_8859-2:1987
+ L2
+ LATIN2
+ CSISOLATIN2
+ ISO-8859-3
+ ISO-IR-109
+ ISO_8859-3
+ ISO_8859-3:1988
+ L3
+ LATIN3
+ CSISOLATIN3
+ ISO-8859-4
+ ISO-IR-110
+ ISO_8859-4
+ ISO_8859-4:1988
+ L4
+ LATIN4
+ CSISOLATIN4
+ CYRILLIC
+ ISO-8859-5
+ ISO-IR-144
+ ISO_8859-5
+ ISO_8859-5:1988
+ CSISOLATINCYRILLIC
+ ARABIC
+ ASMO-708
+ ECMA-114
+ ISO-8859-6
+ ISO-IR-127
+ ISO_8859-6
+ ISO_8859-6:1987
+ CSISOLATINARABIC
+ ECMA-118
+ ELOT_928
+ GREEK
+ GREEK8
+ ISO-8859-7
+ ISO-IR-126
+ ISO_8859-7
+ ISO_8859-7:1987
+ CSISOLATINGREEK
+ HEBREW
+ ISO-8859-8
+ ISO-IR-138
+ ISO_8859-8
+ ISO_8859-8:1988
+ CSISOLATINHEBREW
+ ISO-8859-9
+ ISO-IR-148
+ ISO_8859-9
+ ISO_8859-9:1989
+ L5
+ LATIN5
+ CSISOLATIN5
+ ISO-8859-10
+ ISO-IR-157
+ ISO_8859-10
+ ISO_8859-10:1992
+ L6
+ LATIN6
+ CSISOLATIN6
+ ISO-8859-13
+ ISO-IR-179
+ ISO_8859-13
+ L7
+ LATIN7
+ ISO-8859-14
+ ISO-CELTIC
+ ISO-IR-199
+ ISO_8859-14
+ ISO_8859-14:1998
+ L8
+ LATIN8
+ ISO-8859-15
+ ISO-IR-203
+ ISO_8859-15
+ ISO_8859-15:1998
+ ISO-8859-16
+ ISO-IR-226
+ ISO_8859-16
+ ISO_8859-16:2000
+ KOI8-R
+ CSKOI8R
+ KOI8-U
+ KOI8-RU
+ CP1250
+ MS-EE
+ WINDOWS-1250
+ CP1251
+ MS-CYRL
+ WINDOWS-1251
+ CP1252
+ MS-ANSI
+ WINDOWS-1252
+ CP1253
+ MS-GREEK
+ WINDOWS-1253
+ CP1254
+ MS-TURK
+ WINDOWS-1254
+ CP1255
+ MS-HEBR
+ WINDOWS-1255
+ CP1256
+ MS-ARAB
+ WINDOWS-1256
+ CP1257
+ WINBALTRIM
+ WINDOWS-1257
+ CP1258
+ WINDOWS-1258
+ 850
+ CP850
+ IBM850
+ CSPC850MULTILINGUAL
+ 862
+ CP862
+ IBM862
+ CSPC862LATINHEBREW
+ 866
+ CP866
+ IBM866
+ CSIBM866
+ MAC
+ MACINTOSH
+ MACROMAN
+ CSMACINTOSH
+ MACCENTRALEUROPE
+ MACICELAND
+ MACCROATIAN
+ MACROMANIA
+ MACCYRILLIC
+ MACUKRAINE
+ MACGREEK
+ MACTURKISH
+ MACHEBREW
+ MACARABIC
+ MACTHAI
+ HP-ROMAN8
+ R8
+ ROMAN8
+ CSHPROMAN8
+ NEXTSTEP
+ ARMSCII-8
+ GEORGIAN-ACADEMY
+ GEORGIAN-PS
+ KOI8-T
+ MULELAO-1
+ CP1133
+ IBM-CP1133
+ ISO-IR-166
+ TIS-620
+ TIS620
+ TIS620-0
+ TIS620.2529-1
+ TIS620.2533-0
+ TIS620.2533-1
+ CP874
+ WINDOWS-874
+ VISCII
+ VISCII1.1-1
+ CSVISCII
+ TCVN
+ TCVN-5712
+ TCVN5712-1
+ TCVN5712-1:1993
+ ISO-IR-14
+ ISO646-JP
+ JIS_C6220-1969-RO
+ JP
+ CSISO14JISC6220RO
+ JISX0201-1976
+ JIS_X0201
+ X0201
+ CSHALFWIDTHKATAKANA
+ ISO-IR-87
+ JIS0208
+ JIS_C6226-1983
+ JIS_X0208
+ JIS_X0208-1983
+ JIS_X0208-1990
+ X0208
+ CSISO87JISX0208
+ ISO-IR-159
+ JIS_X0212
+ JIS_X0212-1990
+ JIS_X0212.1990-0
+ X0212
+ CSISO159JISX02121990
+ CN
+ GB_1988-80
+ ISO-IR-57
+ ISO646-CN
+ CSISO57GB1988
+ CHINESE
+ GB_2312-80
+ ISO-IR-58
+ CSISO58GB231280
+ CN-GB-ISOIR165
+ ISO-IR-165
+ ISO-IR-149
+ KOREAN
+ KSC_5601
+ KS_C_5601-1987
+ KS_C_5601-1989
+ CSKSC56011987
+ EUC-JP
+ EUCJP
+ EXTENDED_UNIX_CODE_PACKED_FORMAT_FOR_JAPANESE
+ CSEUCPKDFMTJAPANESE
+ MS_KANJI
+ SHIFT-JIS
+ SHIFT_JIS
+ SJIS
+ CSSHIFTJIS
+ CP932
+ ISO-2022-JP
+ CSISO2022JP
+ ISO-2022-JP-1
+ ISO-2022-JP-2
+ CSISO2022JP2
+ CN-GB
+ EUC-CN
+ EUCCN
+ GB2312
+ CSGB2312
+ CP936
+ GBK
+ GB18030
+ ISO-2022-CN
+ CSISO2022CN
+ ISO-2022-CN-EXT
+ HZ
+ HZ-GB-2312
+ EUC-TW
+ EUCTW
+ CSEUCTW
+ BIG-5
+ BIG-FIVE
+ BIG5
+ BIGFIVE
+ CN-BIG5
+ CSBIG5
+ CP950
+ BIG5-HKSCS
+ BIG5HKSCS
+ EUC-KR
+ EUCKR
+ CSEUCKR
+ CP949
+ UHC
+ CP1361
+ JOHAB
+ ISO-2022-KR
+ CSISO2022KR
+ 437
+ CP437
+ IBM437
+ CSPC8CODEPAGE437
+ CP737
+ CP775
+ IBM775
+ CSPC775BALTIC
+ 852
+ CP852
+ IBM852
+ CSPCP852
+ CP853
+ 855
+ CP855
+ IBM855
+ CSIBM855
+ 857
+ CP857
+ IBM857
+ CSIBM857
+ CP858
+ 860
+ CP860
+ IBM860
+ CSIBM860
+ 861
+ CP-IS
+ CP861
+ IBM861
+ CSIBM861
+ 863
+ CP863
+ IBM863
+ CSIBM863
+ CP864
+ IBM864
+ CSIBM864
+ 865
+ CP865
+ IBM865
+ CSIBM865
+ 869
+ CP-GR
+ CP869
+ IBM869
+ CSIBM869
+ CP1125
+ JIS
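This list is what lets a method-name fragment such as UTF8 be mapped back to a listed encoding name such as UTF-8: each name is keyed by its spelling with hyphens and underscores removed. A minimal sketch of that lookup follows, using a five-entry excerpt instead of the full file. The constant and method names (`CES_EXCERPT`, `build_ces_table`) are hypothetical and do not appear in the package.

```ruby
# Hypothetical excerpt; the real list has 357 entries.
CES_EXCERPT = %w[UTF-8 ISO-8859-1 ISO_8859-1 EUC-JP SHIFT_JIS].freeze

# Build a table such as { "UTF8" => "UTF-8", ... } by deleting
# hyphens and underscores from each listed name.
def build_ces_table(names)
  names.each_with_object({}) do |name, table|
    # Plain assignment: when two spellings collapse to the same key,
    # the later entry in the list overwrites the earlier one.
    table[name.delete('-_')] = name
  end
end

table = build_ces_table(CES_EXCERPT)
puts table['UTF8']
puts table['ISO88591']
```

Note that ISO-8859-1 and ISO_8859-1 collapse to the same key, so with plain assignment the spelling listed later is the one returned. Both spellings are aliases of the same encoding in the list, so the conversion target is unaffected.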
data/lib/sconv.rb ADDED
@@ -0,0 +1,161 @@
+ require 'iconv'
+ require 'kconv'
+
+ # :include: ../README
+ module Sconv
+   # Convert CES (Character Encoding Scheme) of a string
+   # from "from" code to "to" code
+   # (selects kconv for Japanese encodings, iconv for others)
+   def sconv(from, to)
+     if (from =~ /iso-2022-jp|jis|shift_jis|sjis|euc-jp|utf-8|utf-16/i)&&(to =~ /iso-2022-jp|jis|shift_jis|sjis|euc-jp|utf-8|utf-16/i)
+
+       in_code = Kconv::JIS if from =~ /^i|^j/i
+       in_code = Kconv::SJIS if from =~ /^s/i
+       in_code = Kconv::EUC if from =~ /^e/i
+       in_code = Kconv::UTF8 if from =~ /8$/
+       in_code = Kconv::UTF16 if from =~ /6$/
+
+       out_code = Kconv::JIS if to =~ /^i|^j/i
+       out_code = Kconv::SJIS if to =~ /^s/i
+       out_code = Kconv::EUC if to =~ /^e/i
+       out_code = Kconv::UTF8 if to =~ /8$/
+       out_code = Kconv::UTF16 if to =~ /6$/
+
+       kconv(out_code, in_code)
+
+     elsif from =~ /iso-2022-jp|jis|shift_jis|sjis|euc-jp|utf-8|utf-16/i
+       in_code = Kconv::JIS if from =~ /^i|^j/i
+       in_code = Kconv::SJIS if from =~ /^s/i
+       in_code = Kconv::EUC if from =~ /^e/i
+       in_code = Kconv::UTF8 if from =~ /8$/
+       in_code = Kconv::UTF16 if from =~ /6$/
+
+       out_code = to
+
+       str_utf8 = self.kconv(Kconv::UTF8, in_code)
+       Iconv.conv(out_code, 'UTF-8', str_utf8)
+
+     elsif to =~ /iso-2022-jp|jis|shift_jis|sjis|euc-jp|utf-8|utf-16/i
+       in_code = from
+
+       out_code = Kconv::JIS if to =~ /^i|^j/i
+       out_code = Kconv::SJIS if to =~ /^s/i
+       out_code = Kconv::EUC if to =~ /^e/i
+       out_code = Kconv::UTF8 if to =~ /8$/
+       out_code = Kconv::UTF16 if to =~ /6$/
+
+       Iconv.conv('UTF-8', in_code, self).kconv(out_code, Kconv::UTF8)
+
+     else
+       in_code = from
+       out_code = to
+
+       Iconv.conv(out_code, in_code, self)
+     end
+   end
+
+   # Return the CES name as listed in ceslist for a name
+   # written without hyphens and underscores.
+   def normalize
+     ces = Hash.new
+     File.read(File.join(File.dirname(__FILE__), 'ceslist')).scan(/\S+\n/).each do |i|
+       ces[i.chomp.delete('\-,_')] = i.chomp
+     end
+     ces.fetch(self)
+   end
+
+   # Check whether the string is EUC-JP or not
+   # (more Ruby-like method name)
+   def EUC?
+     if iseuc == nil
+       false
+     else
+       true
+     end
+   end
+
+   # Check whether the string is Shift_JIS or not
+   # (more Ruby-like method name)
+   def SJIS?
+     if issjis == nil
+       false
+     else
+       true
+     end
+   end
+
+   # Check whether the string is UTF-8 or not
+   # (more Ruby-like method name)
+   def UTF8?
+     if isutf8 == nil
+       false
+     else
+       true
+     end
+   end
+
+   # Guess the string's CES; return the result as a string
+   def guess_ces
+     case Kconv.guess(self)
+     when 1
+       'ISO-2022-JP'
+     when 2
+       'EUC-JP'
+     when 3
+       'Shift_JIS'
+     when 4
+       'BINARY'
+     when 5
+       'ASCII'
+     when 6
+       'UTF-8'
+     when 8
+       'UTF-16'
+     end
+   end
+ end
+
+ # === Helper Functions for String Class
+ class String
+   include Sconv
+
+   alias sconf_old_method_missing method_missing
+
+   # Provides methods of the form String#inputEncoding_to_outputEncoding
+   # and String#inputEncoding_to_outputEncoding!. Examples:
+   # String#UTF8_to_ISO88591
+   def method_missing(method_name)
+     okay = false
+     if method_name.to_s =~ /^(.+)_to_([^!]+)(!)?$/
+       from_code, to_code, exclamation = $1.normalize, $2.normalize, $3
+       destructive = exclamation == '!'
+       okay = true
+       # test whether we can convert these encodings
+       begin
+         ''.sconv(from_code, to_code)
+       end
+       if okay
+         if destructive
+           self.class.send :define_method, method_name.to_sym do
+             replace self.sconv(from_code, to_code)
+           end
+         else
+           self.class.send :define_method, method_name.to_sym do
+             sconv(from_code, to_code)
+           end
+         end
+         send method_name
+       else
+         sconf_old_method_missing
+       end
+     end
+
+     if method_name.to_s =~ /^to_(\w+)$/
+       method_called = 'to' + $1.downcase
+       self.class.send :define_method, method_name.to_sym do
+         send method_called.to_sym
+       end
+       send method_name
+     end
+   end
+ end
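The `method_missing` hook above recognises names of the form inputEncoding_to_outputEncoding, with an optional trailing `!`, and turns them into conversions. The sketch below illustrates the same dispatch idea in a reduced form, under several assumptions that are not part of the package: it uses String#encode instead of Iconv/Kconv, a three-entry hypothetical `CES_NAMES` table instead of the ceslist lookup, and a hypothetical `DynamicConv` module that is extended onto a single string instead of reopening String. Unlike the shipped code, it does not define the method on first use; it adds `respond_to_missing?` and passes unrecognised names on to `super`.

```ruby
# Hypothetical stand-in for the ceslist lookup: stripped name => listed name.
CES_NAMES = {
  'UTF8' => 'UTF-8',
  'ISO88591' => 'ISO-8859-1',
  'EUCJP' => 'EUC-JP'
}.freeze

module DynamicConv
  # Same shape as the pattern in sconv.rb: from, "_to_", to, optional "!".
  PATTERN = /\A(.+)_to_([^!]+)(!)?\z/

  def method_missing(name, *args, &block)
    match = PATTERN.match(name.to_s)
    from = match && CES_NAMES[match[1]]
    to = match && CES_NAMES[match[2]]
    # Anything we do not recognise goes to the default handler,
    # which raises NoMethodError.
    return super unless from && to && args.empty?

    if match[3]
      encode!(to, from) # destructive variant rewrites the receiver
    else
      encode(to, from)  # non-destructive variant returns a new string
    end
  end

  def respond_to_missing?(name, include_private = false)
    match = PATTERN.match(name.to_s)
    return true if match && CES_NAMES.key?(match[1]) && CES_NAMES.key?(match[2])

    super
  end
end

text = "caf\u00E9".dup.extend(DynamicConv)
latin1 = text.UTF8_to_ISO88591
puts latin1.bytesize
```

Falling through to `super` means that a misspelled or unsupported name raises NoMethodError instead of silently returning nil, and `respond_to_missing?` keeps `respond_to?` consistent with what the hook accepts.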
metadata ADDED
@@ -0,0 +1,48 @@
+ --- !ruby/object:Gem::Specification
+ rubygems_version: 0.9.2
+ specification_version: 1
+ name: sconv
+ version: !ruby/object:Gem::Version
+   version: 0.1.0
+ date: 2007-05-28 00:00:00 +09:00
+ summary: A convenience layer for character encoding conversion, providing one-stop shopping for Kconv (for Japanese encodings) and Iconv (for other encodings).
+ require_paths:
+ - lib
+ email: duerst@it.aoyama.ac.jp
+ homepage:
+ rubyforge_project:
+ description:
+ autorequire: sconv
+ default_executable:
+ bindir: bin
+ has_rdoc: true
+ required_ruby_version: !ruby/object:Gem::Version::Requirement
+   requirements:
+   - - ">"
+     - !ruby/object:Gem::Version
+       version: 0.0.0
+   version:
+ platform: ruby
+ signing_key:
+ cert_chain:
+ post_install_message:
+ authors:
+ - Takuya Shimada, Martin J. Du"rst
+ files:
+ - lib/ceslist
+ - lib/sconv.rb
+ - README
+ test_files: []
+
+ rdoc_options: []
+
+ extra_rdoc_files:
+ - README
+ executables: []
+
+ extensions: []
+
+ requirements: []
+
+ dependencies: []
+