icu4r 0.1.3.2006.01.26
- data/MIT-LICENSE +20 -0
- data/README +153 -0
- data/calendar.c +576 -0
- data/docs/FORMATTING +131 -0
- data/docs/UNICODE_REGEXPS +204 -0
- data/extconf.rb +15 -0
- data/fmt.cpp +150 -0
- data/icu4r.c +14 -0
- data/icu_common.h +45 -0
- data/samples/demo_each.rb +23 -0
- data/samples/demo_locales.rb +16 -0
- data/samples/demo_regexp.rb +11 -0
- data/samples/resbundle/appmsg/root.res +0 -0
- data/samples/resbundle/appmsg/ru.res +0 -0
- data/samples/resbundle/demo_bundle.rb +4 -0
- data/samples/resbundle/mkres.sh +4 -0
- data/samples/resbundle/root.txt +10 -0
- data/samples/resbundle/ru.txt +4 -0
- data/test/test_calendar.rb +109 -0
- data/test/test_ustring.rb +381 -0
- data/tools/doc.sh +2 -0
- data/tools/km.rb +425 -0
- data/ubundle.c +209 -0
- data/ucore_ext.c +168 -0
- data/uregex.c +673 -0
- data/uregex.h +27 -0
- data/ustring.c +3042 -0
- metadata +81 -0
data/MIT-LICENSE
ADDED
@@ -0,0 +1,20 @@
Copyright (c) 2006 Nikolai Lugovoi

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README
ADDED
@@ -0,0 +1,153 @@
== ICU4R - ICU Unicode bindings for Ruby

ICU4R is an attempt to provide better Unicode support for Ruby,
which has lacked it for a long time.

The current code is mostly a rewrite of string.c from Ruby 1.8.3.

ICU4R is a Ruby C extension that binds the ICU library[1]
and provides the following classes and functionality:

* UString:
  - String-like class with internal UTF-16 storage;
  - UCA rules for UString comparisons (<=>, casecmp);
  - encoding (codepage) conversion;
  - Unicode normalization;
  - transliteration, including rule-based.

  A set of locale-sensitive functions:
  - upcase/downcase;
  - string collation;
  - string search;
  - iterators over text line/word/char/sentence breaks;
  - message formatting (number/currency/string/time);
  - date and number parsing.

* URegexp - Unicode regular expressions.

* UResourceBundle - access to resource bundles, including ICU locale data.

* UCalendar - date manipulation and timezone info.

== Install and usage

   > ruby extconf.rb
   > make
   > ruby test/test_ustring.rb
   > make install

Now, in your scripts, just require 'icu4r'.

To create the RDoc documentation, run

   > sh tools/doc.sh

== Requirements

To build and use ICU4R you will need GCC and the ICU v3.4 libraries[2].

== Differences from Ruby String and Regexp classes

=== UString vs String

1. UString substring/index methods use UTF-16 code unit indexes, not code points.

2. UString supports most methods of the String class. The missing methods are:
     capitalize, capitalize!, swapcase, swapcase!
     %, center, ljust, rjust
     chomp, chomp!, chop, chop!
     count, delete, delete!, squeeze, squeeze!, tr, tr!, tr_s, tr_s!
     crypt, intern, sum, unpack
     dump, each_byte, each_line
     hex, oct, to_i, to_sym
     reverse, reverse!
     succ, succ!, next, next!, upto

3. Instead of the String#% method, UString#format is provided. See FORMATTING for a short reference.

4. UStrings can be created via String.to_u(encoding='utf8') or the global u(str,[encoding='utf8'])
   call. Note that the +encoding+ parameter must be a String.

5. There is a difference between a character (grapheme), a code point and a code unit. See the Unicode
   reports for the gory details, but in short: a locale-dependent notion of a character can be
   represented by more than one code point (a base letter plus one or more combining accents), and
   each code point can require more than one code unit to store (in UTF-8 a code unit is 8 bits, and
   some code points require up to 4 bytes). This is why UString provides normalization and
   locale-dependent break iterators.

6. Currently UString doesn't include the Enumerable module.

7. The UString index/[] methods accept a URegexp and raise an exception if a Regexp is passed.

8. UString#<=> and UString#casecmp use UCA rules.
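
The difference between graphemes, code points and code units (items 1 and 5) can be observed without ICU4R. What follows is only a minimal sketch in plain Ruby, assuming Ruby 2.5 or later and its built-in String methods rather than UString; the helper names utf16_code_units and text_counts are made up for this illustration and are not part of ICU4R.

```ruby
# Plain Ruby (2.5+) sketch; it does not load ICU4R or use UString.

# Number of UTF-16 code units needed to store +str+.
def utf16_code_units(str)
  str.encode("UTF-16LE").bytesize / 2
end

# Sizes of +str+ at the three levels discussed above.
def text_counts(str)
  {
    graphemes:   str.grapheme_clusters.length,
    code_points: str.codepoints.length,
    code_units:  utf16_code_units(str)
  }
end

# "e" followed by COMBINING ACUTE ACCENT:
# 1 grapheme, 2 code points, 2 UTF-16 code units.
accented = "e\u0301"

# MATHEMATICAL DOUBLE-STRUCK DIGIT ONE lies outside the BMP:
# 1 grapheme, 1 code point, 2 UTF-16 code units (a surrogate pair).
digit = [0x1D7D9].pack("U*")

p text_counts(accented)
p text_counts(digit)

# NFC normalization composes the pair into the single code point U+00E9.
p accented.unicode_normalize(:nfc).codepoints.length
```

Since UString indexes count code units, a character outside the BMP such as the one above would occupy two index positions.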

=== URegexp

UString uses the ICU regexp library. Pattern syntax is described in [./docs/UNICODE_REGEXPS] and the ICU docs.

There are some differences between processing in Ruby Regexp and URegexp:

1. When UString#sub or UString#gsub is called with a block, the special variables ($~, $&, $1, ...) aren't
   set, because their values are handled deep inside the Ruby core. Instead, the block receives a UMatch object,
   which is essentially an immutable array of matching groups:
     "test".u.gsub(ure("(e)(.)")) do |match|
       puts match[0] # => 'es' <--> $&
       puts match[1] # => 'e' <--> $1
       puts match[2] # => 's' <--> $2
     end

2. In a URegexp search pattern, backreferences have the form \n (\1, \2, ...);
   in a replacement string they have the form $1, $2, ...

   NOTE: URegexp treats as a digit NOT ONLY the ASCII digits (0x0030-0x0039), but
   any Unicode character with the property Decimal digit number (Nd), e.g.:
     a = [?$, 0x1D7D9].pack("U*").u * 2
     puts a.inspect_names
     <U000024>DOLLAR SIGN
     <U01D7D9>MATHEMATICAL DOUBLE-STRUCK DIGIT ONE
     <U000024>DOLLAR SIGN
     <U01D7D9>MATHEMATICAL DOUBLE-STRUCK DIGIT ONE
     puts "abracadabra".u.gsub(/(b)/.U, a)
     abbracadabbra
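
The digit classification behind the NOTE above can be checked in plain Ruby as well. This is a sketch under the assumption of Ruby 2.4 or later and its built-in Regexp (not URegexp); unicode_digit? and ascii_digit? are hypothetical helper names used only here.

```ruby
# Plain Ruby (2.4+) sketch; ICU4R is not required.

# True when +char+ has the Unicode general category Nd (decimal digit number).
def unicode_digit?(char)
  char.match?(/\A\p{Nd}\z/)
end

# True when +char+ is one of the ASCII digits 0-9.
def ascii_digit?(char)
  char.match?(/\A[0-9]\z/)
end

double_struck_one = [0x1D7D9].pack("U*") # MATHEMATICAL DOUBLE-STRUCK DIGIT ONE

p unicode_digit?("1")               # ASCII digits are Nd too
p unicode_digit?(double_struck_one) # Nd, although not ASCII
p ascii_digit?(double_struck_one)   # not an ASCII digit
```

This is why the replacement string in the example, a dollar sign followed by U+1D7D9, behaves like $1 and doubles every "b".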


3. A URegexp can be created with the global Kernel#ure function, Regexp#U, Regexp#to_u, or
   from a UString using URegexp.new, e.g.:
     /pattern/.U =~ "string".u

4. Regexp and URegexp differ in their multiline matching options:
     t = "text\ntest"
     # ^,$ handling : URegexp multiline <-> Ruby default
     t.u =~ ure('^\w+$', URegexp::MULTILINE)
     => #<UMatch:0xf6f7de04 @ranges=[0..3], @cg=[\u0074\u0065\u0078\u0074]>
     t =~ /^\w+$/
     => 0
     # . matches \n : URegexp DOTALL <-> /m
     t.u =~ ure('.+test', URegexp::DOTALL)
     => #<UMatch:0xf6fa4d88 ...
     t =~ /.+test/m

5. UMatch.range(idx) returns the range for capturing group idx. This range is in code units.
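
The Ruby half of the comparison in item 4 can be reproduced on its own. The sketch below uses only the built-in Regexp, so the URegexp calls are not exercised; match_offset is a hypothetical helper introduced for the example.

```ruby
# Plain Ruby sketch of the Regexp side only; URegexp is not used here.

# Offset of the first match of +pattern+ in +text+, or nil when there is none.
def match_offset(text, pattern)
  text =~ pattern
end

t = "text\ntest"

# ^ and $ are line anchors by default in Ruby.
p match_offset(t, /^\w+$/)    # => 0
p match_offset(t, /^test$/)   # => 5

# Without /m, "." does not match "\n", so ".+test" cannot span both lines.
p match_offset(t, /.+test/)   # => nil

# With /m, "." also matches "\n" (the counterpart of URegexp::DOTALL).
p match_offset(t, /.+test/m)  # => 0
```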

=== References

1. ICU Official Homepage http://ibm.com/software/globalization/icu/
2. ICU downloads http://ibm.com/software/globalization/icu/downloads.jsp
3. ICU Home Page http://icu.sf.net
4. Unicode Home Page http://www.unicode.org

==== BUGS, DOCS, TODO

The code is still slow, inefficient and highly experimental,
so it may contain security holes, memory leaks, bugs, inconsistent
documentation and an incomplete test suite. Use it at your own risk.

Bug reports and feature requests are welcome :)

=== Copying

This extension module is copyrighted free software by Nikolai Lugovoi.

You can redistribute it and/or modify it under the terms of the MIT License.

Nikolai Lugovoi <meadow.nnick@gmail.com>