icu4r 0.1.3.2006.01.26

data/MIT-LICENSE ADDED
@@ -0,0 +1,20 @@
1
+ Copyright (c) 2006 Nikolai Lugovoi
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README ADDED
@@ -0,0 +1,153 @@
1
+ == ICU4R - ICU Unicode bindings for Ruby
2
+
3
+ ICU4R is an attempt to provide better Unicode support for Ruby,
4
+ where it has long been lacking.
5
+
6
+ The current code is mostly a rewrite of string.c from Ruby 1.8.3.
7
+
8
+ ICU4R is a Ruby C-extension binding for the ICU library[1]
9
+ and provides the following classes and functionality:
10
+
11
+ * UString:
12
+ - String-like class with internal UTF16 storage;
13
+ - UCA rules for UString comparisons (<=>, casecmp);
14
+ - encoding(codepage) conversion;
15
+ - Unicode normalization;
16
+ - transliteration, also rule-based;
17
+
18
+ A set of locale-sensitive functions:
19
+ - upcase/downcase;
20
+ - string collation;
21
+ - string search;
22
+ - iterators over text line/word/char/sentence breaks;
23
+ - message formatting (number/currency/string/time);
24
+ - date and number parsing.
25
+
26
+ * URegexp - Unicode regular expressions.
27
+
28
+ * UResourceBundle - access to resource bundles, including ICU locale data.
29
+
30
+ * UCalendar - date manipulation and timezone info.
31
+
32
+ == Install and usage
33
+
34
+ > ruby extconf.rb
35
+ > make
36
+ > ruby test/test_ustring.rb
37
+ > make install
38
+
39
+ Now, in your scripts just require 'icu4r'.
40
+
41
+ To create RDoc, run
42
+ > sh tools/doc.sh
43
+
44
+ == Requirements
45
+
46
+ To build and use ICU4R you will need GCC and ICU v3.4 libraries[2].
47
+
48
+ == Differences from Ruby String and Regexp classes
49
+
50
+ === UString vs String
51
+
52
+ 1. UString substring/index methods use UTF16 codeunit indexes, not code points.
53
+
54
+ 2. UString supports most methods from String class. Missing methods are:
55
+ capitalize, capitalize!, swapcase, swapcase!
56
+ %, center, ljust, rjust
57
+ chomp, chomp!, chop, chop!
58
+ count, delete, delete!, squeeze, squeeze!, tr, tr!, tr_s, tr_s!
59
+ crypt, intern, sum, unpack
60
+ dump, each_byte, each_line
61
+ hex, oct, to_i, to_sym
62
+ reverse, reverse!
63
+ succ, succ!, next, next!, upto
64
+
65
+ 3. Instead of the String#% method, UString#format is provided. See FORMATTING for a short reference.
66
+
67
+ 4. UStrings can be created via String#to_u(encoding='utf8') or the global u(str,[encoding='utf8'])
68
+ call. Note that the +encoding+ parameter must be a String.
69
+
70
+ 5. A character (grapheme), a codepoint and a codeunit are different things. See UNICODE reports for
71
+ the gory details, but in short: a locale-dependent character can be represented by
72
+ more than one codepoint (a base letter plus one or more combining marks such as accents), and
73
+ each codepoint can require more than one codeunit to store (in UTF8 a codeunit is 8 bits, and
74
+ a codepoint requires up to 4 bytes). So, UString has normalization and locale-dependent break
75
+ iterators.
76
+
77
+ 6. Currently UString doesn't include the Enumerable module.
78
+
79
+ 7. UString index/[] methods that accept a URegexp raise an exception if a Regexp is passed.
80
+
81
+ 8. UString#<=>, UString#casecmp use UCA rules.
82
+
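The distinction in items 1 and 5 between codeunits, codepoints and graphemes can be sketched in plain Ruby. This is only an illustration of the terminology and does not use ICU4R: it assumes Ruby 2.5 or later and relies on the core String methods (codepoints, encode, grapheme_clusters, unicode_normalize), not on UString. The sample string is made up for the example.

```ruby
# Plain Ruby (2.5+), no ICU4R required; this only illustrates the terminology.
# Sample text: 'e' + COMBINING ACUTE ACCENT + MATHEMATICAL DOUBLE-STRUCK DIGIT ONE
s = "e\u0301\u{1D7D9}"

code_points     = s.codepoints.length                # Unicode codepoints
utf16_units     = s.encode("UTF-16LE").bytesize / 2  # 16-bit codeunits (U+1D7D9 needs a surrogate pair)
utf8_units      = s.bytesize                         # 8-bit codeunits
graphemes       = s.grapheme_clusters.length         # user-perceived characters
nfc_code_points = s.unicode_normalize(:nfc).codepoints.length  # 'e' + accent compose into U+00E9

puts "codepoints:           #{code_points}"      # 3
puts "UTF-16 codeunits:     #{utf16_units}"      # 4
puts "UTF-8 codeunits:      #{utf8_units}"       # 7
puts "graphemes:            #{graphemes}"        # 2
puts "codepoints after NFC: #{nfc_code_points}"  # 2
```

Because UString indexes by UTF16 codeunit, a string like this one would have four index positions there, even though a reader sees two characters.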
83
+ === URegexp
84
+
85
+ UString uses the ICU regexp library. The pattern syntax is described in [./docs/UNICODE_REGEXPS] and the ICU docs.
86
+
87
+ There are some differences between processing in Ruby Regexp and URegexp:
88
+
89
+ 1. When UString#sub or UString#gsub is called with a block, the special vars ($~, $&, $1, ...) aren't
90
+ set, because their values are handled deep inside Ruby core code. Instead, the block receives a UMatch object,
91
+ which is essentially an immutable array of matching groups:
92
+ "test".u.gsub(ure("(e)(.)")) do |match|
93
+ puts match[0] # => 'es' <--> $&
94
+ puts match[1] # => 'e' <--> $1
95
+ puts match[2] # => 's' <--> $2
96
+ end
97
+
98
+ 2. In a URegexp search pattern, backreferences have the form \n (\1, \2, ...);
99
+ in a replacement string they have the form $1, $2, ...
100
+
101
+ NOTE: URegexp treats as a digit not only the ASCII digits (0x0030-0x0039) but
102
+ any Unicode char that has the property Decimal digit number (Nd), e.g.:
103
+ a = [?$, 0x1D7D9].pack("U*").u * 2
104
+ puts a.inspect_names
105
+ <U000024>DOLLAR SIGN
106
+ <U01D7D9>MATHEMATICAL DOUBLE-STRUCK DIGIT ONE
107
+ <U000024>DOLLAR SIGN
108
+ <U01D7D9>MATHEMATICAL DOUBLE-STRUCK DIGIT ONE
109
+ puts "abracadabra".u.gsub(/(b)/.U, a)
110
+ abbracadabbra
111
+
112
+
113
+ 3. One can create a URegexp using the global Kernel#ure function, Regexp#U, Regexp#to_u, or
114
+ from a UString using URegexp.new, e.g.:
115
+ /pattern/.U =~ "string".u
116
+
117
+ 4. Regexp and URegexp differ in their multiline matching options:
118
+ t = "text\ntest"
119
+ # ^,$ handling : URegexp multiline <-> Ruby default
120
+ t.u =~ ure('^\w+$', URegexp::MULTILINE)
121
+ => #<UMatch:0xf6f7de04 @ranges=[0..3], @cg=[\u0074\u0065\u0078\u0074]>
122
+ t =~ /^\w+$/
123
+ => 0
124
+ # . matches \n : URegexp DOTALL <-> /m
125
+ t.u =~ ure('.+test', URegexp::DOTALL)
126
+ => #<UMatch:0xf6fa4d88 ...
127
+ t =~ /.+test/m
128
+
129
+ 5. UMatch.range(idx) returns the range for capturing group idx. This range is in codeunits.
130
+
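For comparison with items 2 and 4, here is a small sketch of the Ruby side only. It uses the built-in Regexp class, not URegexp, and assumes Ruby 2.4 or later (for String#match?). It shows that a non-ASCII character with the Nd property is matched by the \p{Nd} property class but not by [0-9], and how ^, $ and . behave with and without /m.

```ruby
# Plain Ruby Regexp, not URegexp; shown only for comparison.
double_struck_one = [0x1D7D9].pack("U*")  # MATHEMATICAL DOUBLE-STRUCK DIGIT ONE

nd_match    = double_struck_one.match?(/\p{Nd}/)  # Unicode property Decimal digit number
ascii_match = double_struck_one.match?(/[0-9]/)   # ASCII digits only

t = "text\ntest"
line_anchor_pos = (t =~ /^\w+$/)    # ^ and $ match at line boundaries by default
dot_default_pos = (t =~ /.+test/)   # '.' does not cross "\n" without /m
dot_all_pos     = (t =~ /.+test/m)  # with /m, '.' also matches "\n"

puts nd_match                 # true
puts ascii_match              # false
puts line_anchor_pos.inspect  # 0
puts dot_default_pos.inspect  # nil
puts dot_all_pos.inspect      # 0
```

The last three results correspond to the comments in item 4: Ruby's default ^/$ handling matches what URegexp::MULTILINE enables, and Ruby's /m matches what URegexp::DOTALL enables.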
131
+ === References
132
+
133
+ 1. ICU Official Homepage http://ibm.com/software/globalization/icu/
134
+ 2. ICU downloads http://ibm.com/software/globalization/icu/downloads.jsp
135
+ 3. ICU Home Page http://icu.sf.net
136
+ 4. Unicode Home Page http://www.unicode.org
137
+
138
+ ==== BUGS, DOCS, TODO
139
+
140
+ The code is still slow, inefficient and highly experimental,
141
+ so it may contain security holes, memory leaks and other bugs, and its
142
+ documentation is inconsistent and its test suite incomplete. Use it at your own risk.
143
+
144
+ Bug reports and feature requests are welcome :)
145
+
146
+ === Copying
147
+
148
+ This extension module is copyrighted free software by Nikolai Lugovoi.
149
+
150
+ You can redistribute it and/or modify it under the terms of the MIT License.
151
+
152
+ Nikolai Lugovoi <meadow.nnick@gmail.com>
153
+