icu4r 0.1.3.2006.01.26

data/MIT-LICENSE ADDED
@@ -0,0 +1,20 @@
+ Copyright (c) 2006 Nikolai Lugovoi
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README ADDED
@@ -0,0 +1,153 @@
+ == ICU4R - ICU Unicode bindings for Ruby
+
+ ICU4R is an attempt to provide better Unicode support for Ruby,
+ which has lacked it for a long time.
+
+ The current code is mostly a rewrite of string.c from Ruby 1.8.3.
+
+ ICU4R is a Ruby C-extension binding for the ICU library[1]
+ and provides the following classes and functionality:
+
+ * UString:
+ - String-like class with internal UTF-16 storage;
+ - UCA rules for UString comparisons (<=>, casecmp);
+ - encoding (codepage) conversion;
+ - Unicode normalization;
+ - transliteration, including rule-based;
+
+ A set of locale-sensitive functions:
+ - upcase/downcase;
+ - string collation;
+ - string search;
+ - iterators over text line/word/char/sentence breaks;
+ - message formatting (number/currency/string/time);
+ - date and number parsing.
+
+ * URegexp - Unicode regular expressions.
+
+ * UResourceBundle - access to resource bundles, including ICU locale data.
+
+ * UCalendar - date manipulation and timezone info.
+
+ == Install and usage
+
+ > ruby extconf.rb
+ > make
+ > ruby test/test_ustring.rb
+ > make install
+
+ Now, in your scripts, just require 'icu4r'.
+
+ To generate the RDoc documentation, run
+ > sh tools/doc.sh
+
+ == Requirements
+
+ To build and use ICU4R you will need GCC and the ICU v3.4 libraries[2].
+
+ == Differences from Ruby String and Regexp classes
+
+ === UString vs String
+
+ 1. UString substring/index methods use UTF-16 codeunit indexes, not codepoint indexes.
+
+ 2. UString supports most methods of the String class. Missing methods are:
+ capitalize, capitalize!, swapcase, swapcase!
+ %, center, ljust, rjust
+ chomp, chomp!, chop, chop!
+ count, delete, delete!, squeeze, squeeze!, tr, tr!, tr_s, tr_s!
+ crypt, intern, sum, unpack
+ dump, each_byte, each_line
+ hex, oct, to_i, to_sym
+ reverse, reverse!
+ succ, succ!, next, next!, upto
+
+ 3. Instead of the String#% method, UString#format is provided. See FORMATTING for a short reference.
+
+ 4. UStrings can be created via String#to_u(encoding='utf8') or the global u(str,[encoding='utf8'])
+ call. Note that the +encoding+ parameter must be a String.
+
+ 5. There is a difference between a character (grapheme), a codepoint and a codeunit. See the UNICODE
+ reports for the gory details, but in short: the locale-dependent notion of a character can be
+ represented by more than one codepoint - a base letter plus one or more combining marks (accents), and
+ each codepoint can require more than one codeunit to store (for UTF-8 the codeunit size is 8 bits, yet
+ some codepoints require up to 4 bytes; for UTF-16 it is 16 bits, and codepoints above U+FFFF take
+ two codeunits). This is why UString provides normalization and locale-dependent break iterators.
+
+ 6. Currently UString doesn't include the Enumerable module.
+
+ 7. UString index/[] methods that accept a URegexp raise an exception if a Regexp is passed.
+
+ 8. UString#<=> and UString#casecmp use UCA rules.
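
The distinction between codepoints, codeunits and combining sequences described in items 1 and 5 can be seen without ICU4R at all. The sketch below is only an illustration under stated assumptions: it uses the built-in String of a modern Ruby (2.2 or later, for String#unicode_normalize), not the UString API, and the helper names utf16_units and codepoint_count are made up for this example.

```ruby
# Illustration with plain Ruby strings (NOT the ICU4R API).
# The helper names below are hypothetical.

# Number of UTF-16 code units needed to store +str+.
def utf16_units(str)
  str.encode('UTF-16LE').bytesize / 2
end

# Number of Unicode code points in +str+.
def codepoint_count(str)
  str.codepoints.length
end

digit = [0x1D7D9].pack('U*')   # MATHEMATICAL DOUBLE-STRUCK DIGIT ONE
puts codepoint_count(digit)    # 1 code point ...
puts utf16_units(digit)        # ... stored as 2 UTF-16 code units (a surrogate pair)
puts digit.bytesize            # ... or as 4 bytes in UTF-8

composed   = [0x00E9].pack('U*')          # "e with acute" as a single code point
decomposed = [0x0065, 0x0301].pack('U*')  # base letter + combining acute accent

puts composed == decomposed                          # false: different code points
puts composed == decomposed.unicode_normalize(:nfc)  # true: equal after NFC normalization
```

An index-based API that counts UTF-16 codeunits, as UString does, would see the double-struck digit as two positions, which is why break iterators and normalization matter for user-visible characters.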
+
+ === URegexp
+
+ UString uses the ICU regexp library. Pattern syntax is described in [./docs/UNICODE_REGEXPS] and the ICU docs.
+
+ There are some differences between processing in Ruby Regexp and URegexp:
+
+ 1. When UString#sub or UString#gsub is called with a block, the special vars ($~, $&, $1, ...) aren't
+ set, as their values are handled deep inside the Ruby core code. Instead, the block receives a UMatch
+ object, which is essentially an immutable array of matching groups:
+   "test".u.gsub(ure("(e)(.)")) do |match|
+     puts match[0] # => 'es' <--> $&
+     puts match[1] # => 'e' <--> $1
+     puts match[2] # => 's' <--> $2
+   end
+
+ 2. In a URegexp search pattern, backreferences have the form \n (\1, \2, ...);
+ in a replacement string they have the form $1, $2, ...
+
+ NOTE: URegexp treats as a digit not only the ASCII digits (0x0030-0x0039), but
+ any Unicode char that has the property Decimal digit number (Nd), e.g.:
+   a = [?$, 0x1D7D9].pack("U*").u * 2
+   puts a.inspect_names
+   <U000024>DOLLAR SIGN
+   <U01D7D9>MATHEMATICAL DOUBLE-STRUCK DIGIT ONE
+   <U000024>DOLLAR SIGN
+   <U01D7D9>MATHEMATICAL DOUBLE-STRUCK DIGIT ONE
+   puts "abracadabra".u.gsub(/(b)/.U, a)
+   abbracadabbra
+
+
+ 3. A URegexp can be created using the global Kernel#ure function, Regexp#U, Regexp#to_u, or
+ from a UString using URegexp.new, e.g.:
+   /pattern/.U =~ "string".u
+
+ 4. Regexp and URegexp differ in their multiline matching options:
+   t = "text\ntest"
+   # ^,$ handling : URegexp multiline <-> Ruby default
+   t.u =~ ure('^\w+$', URegexp::MULTILINE)
+   => #<UMatch:0xf6f7de04 @ranges=[0..3], @cg=[\u0074\u0065\u0078\u0074]>
+   t =~ /^\w+$/
+   => 0
+   # . matches \n : URegexp DOTALL <-> /m
+   t.u =~ ure('.+test', URegexp::DOTALL)
+   => #<UMatch:0xf6fa4d88 ...
+   t =~ /.+test/m
+
+ 5. UMatch#range(idx) returns the range of capturing group idx. This range is in codeunits.
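
For comparison, the Ruby side of points 2 and 4 can be checked with the built-in Regexp alone. This is a sketch that assumes a modern Ruby (2.2 or later) and does not call ICU4R; it only shows the stock behavior that URegexp is being contrasted with, and the variable names are chosen for the example.

```ruby
# Plain Ruby Regexp behavior (NOT the ICU4R API), for comparison with URegexp.

t = "text\ntest"

# ^ and $ match at line boundaries by default (URegexp needs MULTILINE for this).
p(t =~ /^\w+$/)     # 0

# . does not match "\n" unless /m is given (the counterpart of URegexp::DOTALL).
p(t =~ /.+test/)    # nil
p(t =~ /.+test/m)   # 0

# In a replacement string Ruby writes backreferences as \1, where URegexp uses $1.
puts "abracadabra".gsub(/(b)/, '\1\1')   # abbracadabbra

# Ruby's \d matches only ASCII digits by default; \p{Nd} covers every decimal digit.
double_struck_one = [0x1D7D9].pack('U*')
p(double_struck_one =~ /\d/)        # nil
p(double_struck_one =~ /\p{Nd}/)    # 0
```

The last two lines explain the NOTE in point 2: a pattern or replacement that relies on "digit" meaning 0-9 behaves differently once digits are defined by the Nd property.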
+
+ === References
+
+ 1. ICU Official Homepage http://ibm.com/software/globalization/icu/
+ 2. ICU downloads http://ibm.com/software/globalization/icu/downloads.jsp
+ 3. ICU Home Page http://icu.sf.net
+ 4. Unicode Home Page http://www.unicode.org
+
+ === BUGS, DOCS, TODO
+
+ The code is still slow, inefficient and highly experimental,
+ so it may have security holes, memory leaks, bugs, inconsistent
+ documentation and an incomplete test suite. Use it at your own risk.
+
+ Bug reports and feature requests are welcome :)
+
+ === Copying
+
+ This extension module is copyrighted free software by Nikolai Lugovoi.
+
+ You can redistribute it and/or modify it under the terms of the MIT License.
+
+ Nikolai Lugovoi <meadow.nnick@gmail.com>
+