alphabets 0.1.0 → 0.1.1
- checksums.yaml +4 -4
- data/{HISTORY.md → CHANGELOG.md} +0 -0
- data/Manifest.txt +1 -1
- data/NOTES.md +150 -0
- data/Rakefile +1 -1
- data/lib/alphabets/alphabets.rb +27 -13
- data/lib/alphabets/reader.rb +62 -62
- data/lib/alphabets/utils.rb +75 -75
- data/lib/alphabets/version.rb +1 -1
- data/test/test_reader.rb +37 -37
- metadata +8 -14
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 7310d705b53b7f04b8a588b831d71940728aa333
+  data.tar.gz: 3306b5393e10208c4f4f97d523b3a7366042d669
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c3ba94979d0141b763f9370a520e60124bcadb58d1358f87054f7c10973cb35eecbbc3f36a8fdf75589a4b38911124bc1b5dcad05a638e6fb4b3ae83bc4f1edd
+  data.tar.gz: aac6b538a571553b4d6aa8af63d1bbae722863feb9fdaa22a2b7a1ce1caa06798d0e38ce8e8be3d6ecaf98441d433d6ad0f0a410af2056e210e9edf943a910fa
data/{HISTORY.md → CHANGELOG.md}
RENAMED
File without changes
|
data/Manifest.txt
CHANGED
data/NOTES.md
CHANGED
@@ -14,8 +14,25 @@ Use Upcase, Downcase AND Titlecase (!)
|
|
14
14
|
|
15
15
|
## Libraries
|
16
16
|
|
17
|
+
**Ruby**
|
18
|
+
|
17
19
|
- <https://github.com/SixArm/sixarm_ruby_unaccent> - Replace a string's accent characters with ASCII characters. Based on Perl Text::Unaccent from CPAN.
|
18
20
|
|
21
|
+
- <https://github.com/fractalsoft/diacritics> - support downcase, upcase and permanent link with diacritical characters
|
22
|
+
|
23
|
+
**Perl**
|
24
|
+
|
25
|
+
- <https://metacpan.org/pod/Unicode::Diacritic::Strip> - strip diacritics from Unicode text
|
26
|
+
|
27
|
+
**JavaScript**
|
28
|
+
|
29
|
+
- <https://github.com/dundalek/latinize> - convert accents (diacritics) from strings to latin characters
|
30
|
+
|
31
|
+
- <https://github.com/tyxla/remove-accents> - removes the accents from a string, converting them to their corresponding non-accented ascii characters
|
32
|
+
|
33
|
+
**PostgreSQL**
|
34
|
+
|
35
|
+
- <https://www.postgresql.org/docs/current/unaccent.html> - unaccent is a text search dictionary that removes accents (diacritic signs) from lexemes
|
19
36
|
|
20
37
|
|
21
38
|
## Links
|
@@ -35,9 +52,142 @@ Proper Unicoding - Ruby's Regexp engine has a powerful feature built in: It can
|
|
35
52
|
Regex with Class - Ruby's regex engine defines a lot of shortcut character classes. Besides the common meta characters (\w, etc.), there are also POSIX-style expressions and the Unicode property syntax. This is an overview of all character classes.
|
36
53
|
|
37
54
|
|
55
|
+
**Unicode**
|
56
|
+
|
57
|
+
- <https://unicode.org/reports/tr15/> - Unicode Standard Annex #15 - UNICODE NORMALIZATION FORMS
|
58
|
+
|
38
59
|
**W3C**
|
39
60
|
|
40
61
|
- <https://www.w3.org/TR/charmod-norm/>
|
41
62
|
- <https://www.w3.org/International/wiki/Case_folding>
|
42
63
|
|
43
64
|
In Western European languages, the letter 'i' (U+0069) upper cases to a dotless 'I' (U+0049). In Turkish, this letter upper cases to a dotted upper case letter 'İ' (U+0130). Similarly, 'I' (U+0049) lower cases to 'ı' (U+0131), which is a dotless lowercase letter i.
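The Turkish case rule above can be tried directly in Ruby. A minimal sketch, assuming Ruby 2.4 or later, where `String#upcase` and `String#downcase` accept the `:turkic` option; the variable names are only for illustration:

```ruby
# Default (language-neutral) case mapping
default_upper = 'i'.upcase            # "I" (U+0049)
default_lower = 'I'.downcase          # "i" (U+0069)

# Turkic case mapping (Turkish, Azerbaijani)
turkic_upper = 'i'.upcase(:turkic)    # dotted capital I (U+0130)
turkic_lower = 'I'.downcase(:turkic)  # dotless small i (U+0131)

puts format('U+%04X', turkic_upper.ord)
puts format('U+%04X', turkic_lower.ord)
```

A fixed lookup table such as a plain downcase mapping cannot express both behaviours at once, so any language-specific rule of this kind would need its own table or an explicit option.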
|
65
|
+
|
66
|
+
**Wikipedia**
|
67
|
+
|
68
|
+
- <https://en.wikipedia.org/wiki/Diacritic>
|
69
|
+
|
70
|
+
**More**
|
71
|
+
|
72
|
+
- [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/)
|
73
|
+
by Joel Spolsky, 2003
|
74
|
+
|
75
|
+
- [Unicode Normalization in Ruby](https://www.honeybadger.io/blog/ruby-unicode-normalization/) by Starr Horne, 2017
|
76
|
+
|
77
|
+
|
78
|
+
## Mappings
|
79
|
+
|
80
|
+
Open questions ...
|
81
|
+
|
82
|
+
```
|
83
|
+
Þ => TH ???
|
84
|
+
þ => th ???
|
85
|
+
```
|
86
|
+
|
87
|
+
|
88
|
+
## Alphabets
|
89
|
+
|
90
|
+
Add more alphabets... why? why not?
|
91
|
+
|
92
|
+
|
93
|
+
- Portuguese [Â, "abcdefghijklmnopqrstuvwxyzáâãàçéêíóôõú", "ABCDEFGHIJKLMNOPQRSTUVWXYZÁÂÃÀÇÉÊÍÓÔÕÚ"]
|
94
|
+
- Russian [Щ, Ъ, Э, "абвгдеёжзийклмнопрстуфхцчшщъыьэюя", "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ"]
|
95
|
+
- Greek [Β, Μ, Χ, Ω, Ή, Ύ, Ώ, ΐ, ΰ, Ϊ, Ϋ]
|
96
|
+
- Slovak ["aáäeéiíoóôuúyýbcčdďfghjklĺľmnňpqrŕsštťvwxzž", "AÁÄEÉIÍOÓÔUÚYÝBCČDĎFGHJKLĹĽMNŇPQRŔSŠTŤVWXZŽ"]
|
97
|
+
- Italian ["aàbcdeèéfghiìíîlmnoòópqrstuùúvz", "AÀBCDEÈÉFGHIÌÍÎLMNOÒÓPQRSTUÙÚVZ"]
|
98
|
+
- Romanian ["aăâbcdefghiîjklmnopqrsștțuvwxyz", "AĂÂBCDEFGHIÎJKLMNOPQRSȘTȚUVWXYZ"]
|
99
|
+
- Danish [å, â, ô, Å, Â, Ô]
|
100
|
+
|
101
|
+
```
|
102
|
+
def de
|
103
|
+
{ # German
|
104
|
+
downcase: %w(ä ö ü ß),
|
105
|
+
upcase: %w(Ä Ö Ü ẞ),
|
106
|
+
permanent: %w(ae oe ue ss)
|
107
|
+
}
|
108
|
+
end
|
109
|
+
|
110
|
+
def pl
|
111
|
+
{ # Polish
|
112
|
+
downcase: %w(ą ć ę ł ń ó ś ż ź),
|
113
|
+
upcase: %w(Ą Ć Ę Ł Ń Ó Ś Ż Ź),
|
114
|
+
permanent: %w(a c e l n o s z z)
|
115
|
+
}
|
116
|
+
end
|
117
|
+
|
118
|
+
def cs
|
119
|
+
{ # Czech uses acute (á é í ó ú ý), caron (č ď ě ň ř š ť ž), ring (ů)
|
120
|
+
# aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž
|
121
|
+
# AÁBCČDĎEÉĚFGHIÍJKLMNŇOÓPQRŘSŠTŤUÚŮVWXYÝZŽ
|
122
|
+
downcase: %w(á é í ó ú ý č ď ě ň ř š ť ů ž),
|
123
|
+
upcase: %w(Á É Í Ó Ú Ý Č Ď Ě Ň Ř Š Ť Ů Ž),
|
124
|
+
permanent: %w(a e i o u y c d e n r s t u z)
|
125
|
+
}
|
126
|
+
end
|
127
|
+
|
128
|
+
def fr
|
129
|
+
{ # French
|
130
|
+
# abcdefghijklmnopqrstuvwxyzàâæçéèêëîïôœùûüÿ
|
131
|
+
# ABCDEFGHIJKLMNOPQRSTUVWXYZÀÂÆÇÉÈÊËÎÏÔŒÙÛÜŸ
|
132
|
+
downcase: %w(à â é è ë ê ï î ô ù û ü ÿ ç œ æ),
|
133
|
+
upcase: %w(À Â É È Ë Ê Ï Î Ô Ù Û Ü Ÿ Ç Œ Æ),
|
134
|
+
permanent: %w(a a e e e e i i o u u ue y c oe ae)
|
135
|
+
}
|
136
|
+
end
|
137
|
+
|
138
|
+
def it
|
139
|
+
{ # Italian
|
140
|
+
downcase: %w(à è é ì î ò ó ù),
|
141
|
+
upcase: %w(À È É Ì Î Ò Ó Ù),
|
142
|
+
permanent: %w(a e e i i o o u)
|
143
|
+
}
|
144
|
+
end
|
145
|
+
|
146
|
+
def eo
|
147
|
+
{ # Esperanto has the symbols ŭ, ĉ, ĝ, ĥ, ĵ and ŝ
|
148
|
+
downcase: %w(ĉ ĝ ĥ ĵ ŝ ŭ),
|
149
|
+
upcase: %w(Ĉ Ĝ Ĥ Ĵ Ŝ Ŭ),
|
150
|
+
permanent: %w(c g h j s u)
|
151
|
+
}
|
152
|
+
end
|
153
|
+
|
154
|
+
def is
|
155
|
+
{ # Icelandic
|
156
|
+
downcase: %w(ð þ),
|
157
|
+
upcase: %w(Ð Þ),
|
158
|
+
permanent: %w(d p)
|
159
|
+
}
|
160
|
+
end
|
161
|
+
|
162
|
+
def pt
|
163
|
+
{ # Portuguese uses á, â, ã, à, ç, é, ê, í, ó, ô, õ and ú
|
164
|
+
downcase: %w(ã ç),
|
165
|
+
upcase: %w(Ã Ç),
|
166
|
+
permanent: %w(a c)
|
167
|
+
}
|
168
|
+
end
|
169
|
+
|
170
|
+
def sp
|
171
|
+
{ # Spanish
|
172
|
+
downcase: ['ñ', 'õ', '¿', '¡'],
|
173
|
+
upcase: ['Ñ', 'Õ', '¿', '¡'],
|
174
|
+
permanent: ['n', 'o', '', '']
|
175
|
+
}
|
176
|
+
end
|
177
|
+
|
178
|
+
def hu
|
179
|
+
{ # Hungarian
|
180
|
+
downcase: %w(ő),
|
181
|
+
upcase: %w(Ő),
|
182
|
+
permanent: %w(oe)
|
183
|
+
}
|
184
|
+
end
|
185
|
+
|
186
|
+
def nn
|
187
|
+
{ # Norwegian
|
188
|
+
downcase: %w(æ å),
|
189
|
+
upcase: %w(Æ Å),
|
190
|
+
permanent: %w(ae a)
|
191
|
+
}
|
192
|
+
end
|
193
|
+
```
|
data/Rakefile
CHANGED
data/lib/alphabets/alphabets.rb
CHANGED
@@ -12,9 +12,9 @@ UNACCENT = Reader.parse( <<TXT )
|
|
12
12
|
Æ AE æ ae # ae ligature
|
13
13
|
ā a
|
14
14
|
ă a
|
15
|
-
ą a
|
15
|
+
ą a # ą - U+0105 (261) - LATIN SMALL LETTER A WITH OGONEK
|
16
16
|
|
17
|
-
Ç C ç c
|
17
|
+
Ç C ç c # ç - U+00E7 (231) - LATIN SMALL LETTER C WITH CEDILLA
|
18
18
|
ć c
|
19
19
|
Č C č c
|
20
20
|
|
@@ -31,7 +31,7 @@ UNACCENT = Reader.parse( <<TXT )
|
|
31
31
|
Í I í i
|
32
32
|
î i
|
33
33
|
ī i
|
34
|
-
ı i #
|
34
|
+
ı i # ı - U+0131 (305) - LATIN SMALL LETTER DOTLESS I
|
35
35
|
|
36
36
|
Ł L ł l
|
37
37
|
|
@@ -41,6 +41,7 @@ UNACCENT = Reader.parse( <<TXT )
|
|
41
41
|
|
42
42
|
Ö O ö o
|
43
43
|
ó o
|
44
|
+
ò o
|
44
45
|
õ o
|
45
46
|
ô o
|
46
47
|
ø o
|
@@ -50,15 +51,16 @@ UNACCENT = Reader.parse( <<TXT )
|
|
50
51
|
ř r
|
51
52
|
|
52
53
|
Ś S ś s
|
53
|
-
Ş S ş s
|
54
|
+
Ş S ş s # ş - U+015F (351) - LATIN SMALL LETTER S WITH CEDILLA
|
55
|
+
Ș S ș s # ș - U+0219 (537) - LATIN SMALL LETTER S WITH COMMA BELOW
|
54
56
|
Š S š s
|
55
|
-
|
56
|
-
ß ss
|
57
|
+
ß ss # ß - U+00DF (223) - LATIN SMALL LETTER SHARP S
|
57
58
|
|
58
|
-
|
59
|
-
|
59
|
+
Ţ t ţ t # ţ - U+0163 (355) - LATIN SMALL LETTER T WITH CEDILLA
|
60
|
+
Ț t ț t # ț - U+021B (539) - LATIN SMALL LETTER T WITH COMMA BELOW
|
60
61
|
|
61
|
-
þ p
|
62
|
+
þ p # þ - U+00FE (254) - LATIN SMALL LETTER THORN
|
63
|
+
#### fix/check!!!! Icelandic - map þ to p or to th - why? why not?
|
62
64
|
|
63
65
|
Ü U ü u
|
64
66
|
Ú U ú u
|
@@ -71,6 +73,14 @@ UNACCENT = Reader.parse( <<TXT )
|
|
71
73
|
Ž Z ž z
|
72
74
|
TXT
|
73
75
|
|
76
|
+
##
|
77
|
+
# Notes:
|
78
|
+
# Romanian did NOT initially get its Ș/ș and Ț/ț (with comma) letters,
|
79
|
+
# because these letters were initially unified with Ş/ş and Ţ/ţ (with cedilla)
|
80
|
+
# by the Unicode Consortium, considering the shapes with comma beneath
|
81
|
+
# to be glyph variants of the shapes with cedilla.
|
82
|
+
# However, the letters with explicit comma below were later added to the Unicode standard and are also in ISO 8859-16.
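The note above is why the table lists both the cedilla and the comma-below forms. A small sketch, using only core Ruby and written with escapes so the two look-alike letters cannot be confused, that shows they are distinct code points with different decompositions:

```ruby
s_cedilla = "\u015F"   # LATIN SMALL LETTER S WITH CEDILLA
s_comma   = "\u0219"   # LATIN SMALL LETTER S WITH COMMA BELOW

# Distinct code points, so a lookup table needs one entry for each.
distinct = s_cedilla != s_comma

# NFD splits each letter into a base letter plus a combining mark.
cedilla_parts = s_cedilla.unicode_normalize(:nfd).codepoints.map { |cp| format('U+%04X', cp) }
comma_parts   = s_comma.unicode_normalize(:nfd).codepoints.map { |cp| format('U+%04X', cp) }

p distinct        # true
p cedilla_parts   # ["U+0073", "U+0327"]
p comma_parts     # ["U+0073", "U+0326"]
```

Because normalization does not unify the two marks, text that mixes both spellings only unaccents consistently if both keys are present in the mapping.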
|
83
|
+
|
74
84
|
|
75
85
|
## de,at,ch translation for umlauts
|
76
86
|
UNACCENT_DE = Reader.parse( <<TXT )
|
@@ -90,9 +100,9 @@ DOWNCASE = %w[A B C D E F G H I J K L M N O P Q R S T U V W X Y Z].reduce({}) do
|
|
90
100
|
Ä ä
|
91
101
|
Á á
|
92
102
|
Å å
|
93
|
-
Æ æ # ae ligature
|
103
|
+
Æ æ # LATIN LETTER AE - ae ligature
|
94
104
|
|
95
|
-
Ç ç
|
105
|
+
Ç ç # LATIN LETTER C WITH CEDILLA
|
96
106
|
Č č
|
97
107
|
|
98
108
|
É é
|
@@ -103,12 +113,16 @@ DOWNCASE = %w[A B C D E F G H I J K L M N O P Q R S T U V W X Y Z].reduce({}) do
|
|
103
113
|
Ł ł
|
104
114
|
|
105
115
|
Ö ö
|
106
|
-
Œ œ #
|
116
|
+
Œ œ # LATIN LIGATURE OE
|
107
117
|
|
108
118
|
Ś ś
|
109
|
-
Ş ş
|
119
|
+
Ş ş # LATIN LETTER S WITH CEDILLA
|
120
|
+
Ș ș # LATIN LETTER S WITH COMMA BELOW
|
110
121
|
Š š
|
111
122
|
|
123
|
+
Ţ ţ # LATIN LETTER T WITH CEDILLA
|
124
|
+
Ț ț # LATIN LETTER T WITH COMMA BELOW
|
125
|
+
|
112
126
|
Ü ü
|
113
127
|
Ú ú
|
114
128
|
|
data/lib/alphabets/reader.rb
CHANGED
@@ -1,62 +1,62 @@
|
|
1
|
-
|
2
|
-
class Alphabet
|
3
|
-
class Reader ## todo/check: rename to CharReader or something - why? why not?
|
4
|
-
|
5
|
-
def self.read( path ) ## use - rename to read_file or from_file etc. - why? why not?
|
6
|
-
txt = File.open( path, 'r:utf-8' ).read
|
7
|
-
parse( txt )
|
8
|
-
end
|
9
|
-
|
10
|
-
def self.parse( txt )
|
11
|
-
h = {} ## char(acter) table mappings
|
12
|
-
|
13
|
-
txt.each_line do |line|
|
14
|
-
line = line.strip
|
15
|
-
|
16
|
-
next if line.empty?
|
17
|
-
next if line.start_with?( '#' ) ## skip comments too
|
18
|
-
|
19
|
-
## strip inline (until end-of-line) comments too
|
20
|
-
## e.g ţ t ## U+0163
|
21
|
-
## => ţ t
|
22
|
-
line = line.sub( /#.*/, '' ).strip
|
23
|
-
## pp line
|
24
|
-
|
25
|
-
values = line.split( /[ \t]+/ )
|
26
|
-
## pp values
|
27
|
-
|
28
|
-
## check - must be a even - a multiple of two
|
29
|
-
if values.size % 2 != 0
|
30
|
-
puts "** !!! ERROR !!! - missing mapping pair - mappings must be even (a multiple of two):"
|
31
|
-
pp values
|
32
|
-
exit 1
|
33
|
-
end
|
34
|
-
|
35
|
-
# add mappings in pairs
|
36
|
-
values.each_slice(2) do |slice|
|
37
|
-
## pp slice
|
38
|
-
key = slice[0]
|
39
|
-
value = slice[1]
|
40
|
-
|
41
|
-
## check - key must be a single-character/letter in unicode
|
42
|
-
if key.size != 1
|
43
|
-
puts "** !!! ERROR !!! - mapping character must be a single-character, size is #{key.size}"
|
44
|
-
pp slice
|
45
|
-
exit 1
|
46
|
-
end
|
47
|
-
|
48
|
-
## check - check for duplicates
|
49
|
-
if h[ key ]
|
50
|
-
puts "** !!! ERROR !!! - duplicate mapping character; key already present"
|
51
|
-
pp slice
|
52
|
-
exit 1
|
53
|
-
else
|
54
|
-
h[ key ] = value
|
55
|
-
end
|
56
|
-
end
|
57
|
-
end
|
58
|
-
h
|
59
|
-
end # method parse
|
60
|
-
|
61
|
-
end # class Reader
|
62
|
-
end # class Alphabet
|
1
|
+
|
2
|
+
class Alphabet
|
3
|
+
class Reader ## todo/check: rename to CharReader or something - why? why not?
|
4
|
+
|
5
|
+
def self.read( path ) ## use - rename to read_file or from_file etc. - why? why not?
|
6
|
+
txt = File.open( path, 'r:utf-8' ).read
|
7
|
+
parse( txt )
|
8
|
+
end
|
9
|
+
|
10
|
+
def self.parse( txt )
|
11
|
+
h = {} ## char(acter) table mappings
|
12
|
+
|
13
|
+
txt.each_line do |line|
|
14
|
+
line = line.strip
|
15
|
+
|
16
|
+
next if line.empty?
|
17
|
+
next if line.start_with?( '#' ) ## skip comments too
|
18
|
+
|
19
|
+
## strip inline (until end-of-line) comments too
|
20
|
+
## e.g ţ t ## U+0163
|
21
|
+
## => ţ t
|
22
|
+
line = line.sub( /#.*/, '' ).strip
|
23
|
+
## pp line
|
24
|
+
|
25
|
+
values = line.split( /[ \t]+/ )
|
26
|
+
## pp values
|
27
|
+
|
28
|
+
## check - must be a even - a multiple of two
|
29
|
+
if values.size % 2 != 0
|
30
|
+
puts "** !!! ERROR !!! - missing mapping pair - mappings must be even (a multiple of two):"
|
31
|
+
pp values
|
32
|
+
exit 1
|
33
|
+
end
|
34
|
+
|
35
|
+
# add mappings in pairs
|
36
|
+
values.each_slice(2) do |slice|
|
37
|
+
## pp slice
|
38
|
+
key = slice[0]
|
39
|
+
value = slice[1]
|
40
|
+
|
41
|
+
## check - key must be a single-character/letter in unicode
|
42
|
+
if key.size != 1
|
43
|
+
puts "** !!! ERROR !!! - mapping character must be a single-character, size is #{key.size}"
|
44
|
+
pp slice
|
45
|
+
exit 1
|
46
|
+
end
|
47
|
+
|
48
|
+
## check - check for duplicates
|
49
|
+
if h[ key ]
|
50
|
+
puts "** !!! ERROR !!! - duplicate mapping character; key already present"
|
51
|
+
pp slice
|
52
|
+
exit 1
|
53
|
+
else
|
54
|
+
h[ key ] = value
|
55
|
+
end
|
56
|
+
end
|
57
|
+
end
|
58
|
+
h
|
59
|
+
end # method parse
|
60
|
+
|
61
|
+
end # class Reader
|
62
|
+
end # class Alphabet
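The reader above accepts whitespace-separated key/value pairs, skips blank lines and `#` comments, and stops the process on malformed input. A minimal standalone sketch of the same line format; `parse_pairs` is a hypothetical helper, not part of the gem, and it raises `ArgumentError` instead of calling `exit` (an assumption about what a library caller might prefer):

```ruby
# Hypothetical helper (not part of the gem): same line format as Reader.parse,
# but raises ArgumentError instead of printing and exiting.
def parse_pairs(txt)
  table = {}
  txt.each_line do |line|
    line = line.sub(/#.*/, '').strip      # drop full-line and inline comments
    next if line.empty?

    values = line.split(/[ \t]+/)
    raise ArgumentError, "odd number of values: #{values.inspect}" if values.size.odd?

    values.each_slice(2) do |key, value|
      raise ArgumentError, "key must be a single character: #{key.inspect}" if key.size != 1
      raise ArgumentError, "duplicate key: #{key.inspect}" if table.key?(key)

      table[key] = value
    end
  end
  table
end

table = parse_pairs(<<~TXT)
  ## comment line
  Ä A  ä a   ## inline comment
  æ ae
  ß ss
TXT
p table
```

Raising keeps the decision about how to react to a bad table with the caller, which matters once the parser is used outside a command-line script.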
|
data/lib/alphabets/utils.rb
CHANGED
@@ -1,75 +1,75 @@
|
|
1
|
-
|
2
|
-
class Alphabet
|
3
|
-
|
4
|
-
def self.frequency_table( name ) ## todo/check: use/rename to char_frequency_table
|
5
|
-
## calculate the frequency table of letters, digits, etc.
|
6
|
-
freq = Hash.new(0)
|
7
|
-
name.each_char do |ch|
|
8
|
-
freq[ch] += 1
|
9
|
-
end
|
10
|
-
freq
|
11
|
-
end
|
12
|
-
|
13
|
-
|
14
|
-
def self.count( freq, mapping_or_chars )
|
15
|
-
chars = if mapping_or_chars.is_a?( Hash )
|
16
|
-
mapping_or_chars.keys
|
17
|
-
else ## todo/fix: check for is_a? Array and if is String split into Array (on char at a time?) - why? why not?
|
18
|
-
mapping_or_chars ## assume it's an array/list of characters
|
19
|
-
end
|
20
|
-
|
21
|
-
chars.reduce(0) do |count,ch|
|
22
|
-
count += freq[ch]
|
23
|
-
count
|
24
|
-
end
|
25
|
-
end
|
26
|
-
|
27
|
-
|
28
|
-
def self.sub( name, mapping ) ## todo/check: use a different/better name - gsub/map/replace/fold/... - why? why not?
|
29
|
-
buf = String.new
|
30
|
-
name.each_char do |ch|
|
31
|
-
buf << if mapping[ch]
|
32
|
-
mapping[ch]
|
33
|
-
else
|
34
|
-
ch
|
35
|
-
end
|
36
|
-
end
|
37
|
-
buf
|
38
|
-
end
|
39
|
-
|
40
|
-
|
41
|
-
class Unaccenter #Worker ## todo/change - find a better name - why? why not?
|
42
|
-
def initialize( mapping )
|
43
|
-
@mapping = mapping
|
44
|
-
end
|
45
|
-
|
46
|
-
def count( freq ) Alphabet.count( freq, @mapping ); end
|
47
|
-
def unaccent( name ) Alphabet.sub( name, @mapping ); end
|
48
|
-
end # class Unaccent Worker
|
49
|
-
|
50
|
-
|
51
|
-
def self.find_unaccenter( key )
|
52
|
-
if key == :de
|
53
|
-
@de ||= Unaccenter.new( UNACCENT_DE )
|
54
|
-
@de
|
55
|
-
else
|
56
|
-
## use uni(versal) or unicode or something - why? why not?
|
57
|
-
## use all or int'l (international) - why? why not?
|
58
|
-
## use en (english) - why? why not?
|
59
|
-
@default ||= Unaccenter.new( UNACCENT )
|
60
|
-
@default
|
61
|
-
end
|
62
|
-
end
|
63
|
-
|
64
|
-
def self.unaccent( name )
|
65
|
-
@default ||= Unaccenter.new( UNACCENT )
|
66
|
-
@default.unaccent( name )
|
67
|
-
end
|
68
|
-
|
69
|
-
|
70
|
-
def self.downcase_i18n( name ) ## our very own downcase for int'l characters / letters
|
71
|
-
sub( name, DOWNCASE )
|
72
|
-
end
|
73
|
-
## add downcase_uni - univeral/unicode - why? why not?
|
74
|
-
|
75
|
-
end # class Alphabet
|
1
|
+
|
2
|
+
class Alphabet
|
3
|
+
|
4
|
+
def self.frequency_table( name ) ## todo/check: use/rename to char_frequency_table
|
5
|
+
## calculate the frequency table of letters, digits, etc.
|
6
|
+
freq = Hash.new(0)
|
7
|
+
name.each_char do |ch|
|
8
|
+
freq[ch] += 1
|
9
|
+
end
|
10
|
+
freq
|
11
|
+
end
|
12
|
+
|
13
|
+
|
14
|
+
def self.count( freq, mapping_or_chars )
|
15
|
+
chars = if mapping_or_chars.is_a?( Hash )
|
16
|
+
mapping_or_chars.keys
|
17
|
+
else ## todo/fix: check for is_a? Array and if is String split into Array (on char at a time?) - why? why not?
|
18
|
+
mapping_or_chars ## assume it's an array/list of characters
|
19
|
+
end
|
20
|
+
|
21
|
+
chars.reduce(0) do |count,ch|
|
22
|
+
count += freq[ch]
|
23
|
+
count
|
24
|
+
end
|
25
|
+
end
|
26
|
+
|
27
|
+
|
28
|
+
def self.sub( name, mapping ) ## todo/check: use a different/better name - gsub/map/replace/fold/... - why? why not?
|
29
|
+
buf = String.new
|
30
|
+
name.each_char do |ch|
|
31
|
+
buf << if mapping[ch]
|
32
|
+
mapping[ch]
|
33
|
+
else
|
34
|
+
ch
|
35
|
+
end
|
36
|
+
end
|
37
|
+
buf
|
38
|
+
end
|
39
|
+
|
40
|
+
|
41
|
+
class Unaccenter #Worker ## todo/change - find a better name - why? why not?
|
42
|
+
def initialize( mapping )
|
43
|
+
@mapping = mapping
|
44
|
+
end
|
45
|
+
|
46
|
+
def count( freq ) Alphabet.count( freq, @mapping ); end
|
47
|
+
def unaccent( name ) Alphabet.sub( name, @mapping ); end
|
48
|
+
end # class Unaccent Worker
|
49
|
+
|
50
|
+
|
51
|
+
def self.find_unaccenter( key )
|
52
|
+
if key == :de
|
53
|
+
@de ||= Unaccenter.new( UNACCENT_DE )
|
54
|
+
@de
|
55
|
+
else
|
56
|
+
## use uni(versal) or unicode or something - why? why not?
|
57
|
+
## use all or int'l (international) - why? why not?
|
58
|
+
## use en (english) - why? why not?
|
59
|
+
@default ||= Unaccenter.new( UNACCENT )
|
60
|
+
@default
|
61
|
+
end
|
62
|
+
end
|
63
|
+
|
64
|
+
def self.unaccent( name )
|
65
|
+
@default ||= Unaccenter.new( UNACCENT )
|
66
|
+
@default.unaccent( name )
|
67
|
+
end
|
68
|
+
|
69
|
+
|
70
|
+
def self.downcase_i18n( name ) ## our very own downcase for int'l characters / letters
|
71
|
+
sub( name, DOWNCASE )
|
72
|
+
end
|
73
|
+
## add downcase_uni - univeral/unicode - why? why not?
|
74
|
+
|
75
|
+
end # class Alphabet
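The helpers in this file amount to a per-character frequency count, a sum over selected characters, and a per-character substitution. A minimal self-contained sketch of the same idea using only core Ruby; the `MAPPING` hash is a hypothetical three-entry stand-in for a parsed table, and nothing here calls the gem itself:

```ruby
# Hypothetical stand-in for a parsed mapping table (not the gem's UNACCENT).
MAPPING = { 'ä' => 'a', 'ö' => 'o', 'ß' => 'ss' }.freeze

name = 'Spaß mit Käse'

# Frequency table: character => number of occurrences.
freq = Hash.new(0)
name.each_char { |ch| freq[ch] += 1 }

# Count how many characters of the name have a mapping.
accented = MAPPING.keys.sum { |ch| freq[ch] }

# Substitute mapped characters and keep everything else unchanged.
plain = name.each_char.map { |ch| MAPPING.fetch(ch, ch) }.join

puts accented   # 2
puts plain      # Spass mit Kase
```

Counting against several candidate tables first and then substituting with the best match is one way such a frequency table could be used to pick a language-specific mapping.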
|
data/lib/alphabets/version.rb
CHANGED
data/test/test_reader.rb
CHANGED
@@ -1,37 +1,37 @@
|
|
1
|
-
###
|
2
|
-
# to run use
|
3
|
-
# ruby -I ./lib -I ./test test/test_reader.rb
|
4
|
-
|
5
|
-
|
6
|
-
require 'helper'
|
7
|
-
|
8
|
-
class TestReader < MiniTest::Test
|
9
|
-
|
10
|
-
def test_parse
|
11
|
-
h = Alphabet::Reader.parse( <<TXT )
|
12
|
-
## hello
|
13
|
-
|
14
|
-
Ä A ä a ## hello
|
15
|
-
Á A á a
|
16
|
-
à a
|
17
|
-
ã a
|
18
|
-
â a ### yada yada
|
19
|
-
Å A å a
|
20
|
-
æ ae
|
21
|
-
|
22
|
-
Ç C ç c
|
23
|
-
ć c
|
24
|
-
|
25
|
-
ß ss
|
26
|
-
TXT
|
27
|
-
|
28
|
-
pp h
|
29
|
-
|
30
|
-
assert_equal 'A', h['Ä']
|
31
|
-
assert_equal 'a', h['ä']
|
32
|
-
assert_equal 'ae', h['æ']
|
33
|
-
|
34
|
-
assert_equal 'ss', h['ß']
|
35
|
-
end
|
36
|
-
|
37
|
-
end # class TestReader
|
1
|
+
###
|
2
|
+
# to run use
|
3
|
+
# ruby -I ./lib -I ./test test/test_reader.rb
|
4
|
+
|
5
|
+
|
6
|
+
require 'helper'
|
7
|
+
|
8
|
+
class TestReader < MiniTest::Test
|
9
|
+
|
10
|
+
def test_parse
|
11
|
+
h = Alphabet::Reader.parse( <<TXT )
|
12
|
+
## hello
|
13
|
+
|
14
|
+
Ä A ä a ## hello
|
15
|
+
Á A á a
|
16
|
+
à a
|
17
|
+
ã a
|
18
|
+
â a ### yada yada
|
19
|
+
Å A å a
|
20
|
+
æ ae
|
21
|
+
|
22
|
+
Ç C ç c
|
23
|
+
ć c
|
24
|
+
|
25
|
+
ß ss
|
26
|
+
TXT
|
27
|
+
|
28
|
+
pp h
|
29
|
+
|
30
|
+
assert_equal 'A', h['Ä']
|
31
|
+
assert_equal 'a', h['ä']
|
32
|
+
assert_equal 'ae', h['æ']
|
33
|
+
|
34
|
+
assert_equal 'ss', h['ß']
|
35
|
+
end
|
36
|
+
|
37
|
+
end # class TestReader
|
metadata
CHANGED
@@ -1,60 +1,54 @@
 --- !ruby/object:Gem::Specification
 name: alphabets
 version: !ruby/object:Gem::Version
-  version: 0.1.
+  version: 0.1.1
 platform: ruby
 authors:
 - Gerald Bauer
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2020-01-07 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rdoc
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - "
+    - - "~>"
       - !ruby/object:Gem::Version
         version: '4.0'
-    - - "<"
-      - !ruby/object:Gem::Version
-        version: '7'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - "
+    - - "~>"
       - !ruby/object:Gem::Version
         version: '4.0'
-    - - "<"
-      - !ruby/object:Gem::Version
-        version: '7'
 - !ruby/object:Gem::Dependency
   name: hoe
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '3.
+        version: '3.16'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '3.
+        version: '3.16'
 description: 'alphabets - '
 email: opensport@googlegroups.com
 executables: []
 extensions: []
 extra_rdoc_files:
--
+- CHANGELOG.md
 - Manifest.txt
 - NOTES.md
 - README.md
 files:
--
+- CHANGELOG.md
 - Manifest.txt
 - NOTES.md
 - README.md