byk 0.6.0 → 1.0.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +5 -0
- data/README.md +123 -51
- data/exe/byk +51 -0
- data/ext/byk/byk.c +261 -182
- data/lib/byk/version.rb +1 -1
- data/spec/byk_spec.rb +97 -40
- metadata +25 -8
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: cc996c9d9dc81f884e02cc1dd760eeb57b6545fc
|
4
|
+
data.tar.gz: de07860c2cb41bcb39b299fee4500fd2bf01db73
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 16e97855924c380b205e2e651fdcde391785fe051c2971d948f801ff4260eb691dc4c3304ac17b3083fc0a2469f26d134c9622f74b058f6950d5fd8dfaf62383
|
7
|
+
data.tar.gz: c85659aaaccbc5e1db30305b52e2f4955de160dcb7a617ae564877619fe5f36d852ea7c17228f301e639aae3d4793133baa5d644faeffc83942d6b179bef53e9
|
data/CHANGELOG.md
CHANGED
data/README.md
CHANGED
@@ -4,39 +4,85 @@ Byk
|
|
4
4
|
[](https://rubygems.org/gems/byk)
|
5
5
|
[](https://travis-ci.org/topalovic/byk)
|
6
6
|
|
7
|
-
Ruby gem for fast transliteration of Serbian Cyrillic
|
8
|
-
<br />
|
9
|
-
<sub>Inspired by @dejan's
|
10
|
-
[nice little gem](https://github.com/dejan/srbovanje),
|
11
|
-
this one comes with a C-optimized twist</sub>
|
7
|
+
Ruby gem for fast transliteration of Serbian Cyrillic ↔ Latin
|
12
8
|
|
13
9
|

|
14
10
|
|
15
11
|
|
16
12
|
## Installation
|
17
13
|
|
18
|
-
|
14
|
+
Byk can be used as a standalone console utility or as a `String`
|
15
|
+
extension in your Ruby programs. It has zero dependencies beyond
|
16
|
+
vanilla Ruby and the toolchain for building native gems <sup>1</sup>.
|
17
|
+
|
18
|
+
You can install it directly:
|
19
|
+
|
20
|
+
```ruby
|
21
|
+
$ gem install byk
|
22
|
+
```
|
23
|
+
|
24
|
+
or add it as a dependency in your application's Gemfile:
|
19
25
|
|
20
26
|
```ruby
|
21
27
|
gem "byk"
|
22
28
|
```
|
23
29
|
|
24
|
-
|
30
|
+
<sub><sup>1</sup> For Windows, you might want to check out
|
31
|
+
[DevKit](https://github.com/oneclick/rubyinstaller/wiki/Development-Kit)</sub>
|
32
|
+
|
33
|
+
|
34
|
+
## Usage
|
35
|
+
|
36
|
+
### As a standalone utility
|
37
|
+
|
38
|
+
Here's the help banner with all the available options:
|
25
39
|
|
26
40
|
```
|
27
|
-
|
41
|
+
usage: byk [options] [files]
|
42
|
+
|
43
|
+
options:
|
44
|
+
-c, --cyrillic convert input to Cyrillic (default)
|
45
|
+
-l, --latin convert input to Latin
|
46
|
+
-a, --ascii convert input to "ASCII Latin"
|
47
|
+
-v, --version show version
|
28
48
|
```
|
29
49
|
|
30
|
-
|
50
|
+
Translation goes to stdout so you can redirect it or pipe it as you
|
51
|
+
see fit. Let's take a look at some common scenarios.
|
31
52
|
|
53
|
+
To translate files to Cyrillic:
|
54
|
+
```sh
|
55
|
+
$ byk in1.txt in2.txt > out.txt
|
32
56
|
```
|
33
|
-
|
57
|
+
|
58
|
+
To translate files to Latin and search for a phrase:
|
59
|
+
```sh
|
60
|
+
$ byk -l file.txt | grep stvar
|
34
61
|
```
|
35
62
|
|
63
|
+
Ad hoc conversion:
|
64
|
+
```sh
|
65
|
+
$ echo "Вук Стефановић Караџић" | byk -a
|
66
|
+
Vuk Stefanovic Karadzic
|
67
|
+
```
|
36
68
|
|
37
|
-
|
69
|
+
or simply omit args and type away:
|
70
|
+
```sh
|
71
|
+
$ byk
|
72
|
+
a u ruke Mandušića Vuka
|
73
|
+
biće svaka puška ubojita!
|
74
|
+
^D
|
75
|
+
а у руке Мандушића Вука
|
76
|
+
биће свака пушка убојита!
|
77
|
+
```
|
38
78
|
|
39
|
-
|
79
|
+
`^D` being <kbd>ctrl</kbd> <kbd>d</kbd>.
|
80
|
+
|
81
|
+
|
82
|
+
### As a `String` extension
|
83
|
+
|
84
|
+
Unless you're using Bundler, make sure to require the gem in your
|
85
|
+
initializer:
|
40
86
|
|
41
87
|
```ruby
|
42
88
|
require "byk"
|
@@ -45,22 +91,23 @@ require "byk"
|
|
45
91
|
This will extend `String` with a couple of simple methods:
|
46
92
|
|
47
93
|
```ruby
|
48
|
-
"
|
49
|
-
"Шеширџија".
|
50
|
-
"
|
94
|
+
"Šeširdžija".to_cyrillic # => "Шеширџија"
|
95
|
+
"Шеширџија".to_latin # => "Šeširdžija"
|
96
|
+
"Шеширџија".to_ascii_latin # => "Sesirdzija"
|
51
97
|
```
|
52
98
|
|
53
|
-
|
99
|
+
These do not modify the receiver. For that, there's a destructive
|
100
|
+
variant of each:
|
54
101
|
|
55
102
|
```ruby
|
56
|
-
text = "
|
57
|
-
text.
|
58
|
-
text
|
59
|
-
text.to_ascii_latin! # => "
|
60
|
-
text # => "
|
103
|
+
text = "Šeširdžija"
|
104
|
+
text.to_cyrillic! # => "Шеширџија"
|
105
|
+
text.to_latin! # => "Šeširdžija"
|
106
|
+
text.to_ascii_latin! # => "Sesirdzija"
|
107
|
+
text # => "Sesirdzija"
|
61
108
|
```
|
62
109
|
|
63
|
-
Note that
|
110
|
+
Note that both latinization methods observe
|
64
111
|
[digraph capitalization rules](http://sr.wikipedia.org/wiki/Гајица#.D0.94.D0.B8.D0.B3.D1.80.D0.B0.D1.84.D0.B8):
|
65
112
|
|
66
113
|
```ruby
|
@@ -68,63 +115,88 @@ Note that these methods take into account the
|
|
68
115
|
"ĐORĐE Đorđević".to_ascii_latin # => "DJORDJE Djordjevic"
|
69
116
|
```
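The digraph capitalization rule can be illustrated with a small pure Ruby sketch. It is not the gem's code (Byk does this in C); `latinize_capital_digraphs`, `capital?` and `CAPITAL_DIGRAPHS` are hypothetical names, and only the three capital digraph letters are converted. The second Latin letter is upcased when a neighbouring character is a capital, which appears to mirror the `seen_cap`/`is_cap` checks in `byk.c`. Ruby 2.4+ is assumed for Unicode-aware `downcase`/`upcase`.

```ruby
# Hypothetical illustration only; not part of Byk's API.
CAPITAL_DIGRAPHS = {
  "Љ" => ["L", "j"],
  "Њ" => ["N", "j"],
  "Џ" => ["D", "ž"]
}.freeze

# True when char is an uppercase letter (nil-safe).
def capital?(char)
  !char.nil? && char != char.downcase
end

# Converts only Љ, Њ and Џ; every other character is passed through.
def latinize_capital_digraphs(text)
  chars = text.chars
  chars.each_with_index.map do |char, i|
    first, second = CAPITAL_DIGRAPHS[char]
    next char unless first

    prev_char = i.zero? ? nil : chars[i - 1]
    # Upcase the second letter if either neighbour is a capital.
    all_caps = capital?(prev_char) || capital?(chars[i + 1])
    first + (all_caps ? second.upcase : second)
  end.join
end
```

A lone capital digraph letter therefore becomes title case, while one inside an all-caps word becomes fully uppercase.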
|
70
117
|
|
71
|
-
|
72
|
-
require
|
118
|
+
|
119
|
+
### Safe require
|
120
|
+
|
121
|
+
If you prefer not to monkey patch `String`, you can do a "safe"
|
122
|
+
require in your Gemfile:
|
123
|
+
|
73
124
|
|
74
125
|
```ruby
|
75
|
-
require "byk/safe"
|
126
|
+
gem "byk", :require => "byk/safe"
|
76
127
|
```
|
77
128
|
|
78
|
-
|
129
|
+
or initializer:
|
79
130
|
|
80
131
|
```ruby
|
81
|
-
|
82
|
-
Byk.to_latin(text) # => "Vuk"
|
83
|
-
text # => "Byk"
|
84
|
-
Byk.to_latin!(text) # => "Vuk"
|
85
|
-
text # => "Vuk"
|
132
|
+
require "byk/safe"
|
86
133
|
```
|
87
134
|
|
135
|
+
Then, you should rely on module methods:
|
88
136
|
|
89
|
-
|
137
|
+
```ruby
|
138
|
+
text = "Жвазбука"
|
90
139
|
|
91
|
-
|
140
|
+
Byk.to_latin(text) # => "Žvazbuka"
|
141
|
+
text # => "Жвазбука"
|
142
|
+
|
143
|
+
Byk.to_latin!(text) # => "Žvazbuka"
|
144
|
+
text # => "Žvazbuka"
|
92
145
|
|
146
|
+
# etc.
|
93
147
|
```
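The difference between the module methods and an opt-in `String` extension can be sketched as follows. This is a guess at the general pattern, not Byk's actual implementation: `SketchTranslit`, its three-letter `TABLE` and `StringExt` are hypothetical stand-ins. Ruby 2.1+ is assumed for the public `String.include`.

```ruby
# Hypothetical stand-in for a transliteration module; not Byk itself.
module SketchTranslit
  TABLE = { "В" => "V", "у" => "u", "к" => "k" }.freeze

  # Non-destructive: returns a new string, leaves the argument alone.
  def self.to_latin(str)
    str.gsub(/./) { |c| TABLE.fetch(c, c) }
  end

  # Destructive: rewrites the argument in place and returns it.
  def self.to_latin!(str)
    str.replace(to_latin(str))
  end

  # Opt-in extension; only affects String if explicitly included.
  module StringExt
    def to_latin
      SketchTranslit.to_latin(self)
    end

    def to_latin!
      SketchTranslit.to_latin!(self)
    end
  end
end

# A "safe" require would stop here; a full require would also run:
# String.include(SketchTranslit::StringExt)
```

Keeping the logic in module methods means the monkey patch is a thin, optional layer on top.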
|
94
|
-
|
95
|
-
|
148
|
+
|
149
|
+
|
150
|
+
## How fast is "fast" transliteration?
|
151
|
+
|
152
|
+
Here's a quick test:
|
153
|
+
|
154
|
+
```sh
|
155
|
+
$ wget https://sr.wikipedia.org/ -O sample
|
156
|
+
$ du -h sample
|
157
|
+
128K
|
158
|
+
|
159
|
+
$ time byk -l sample > /dev/null
|
160
|
+
0.08s user 0.04s system 96% cpu 0.126 total
|
96
161
|
```
|
97
162
|
|
163
|
+
Let's up the ante:
|
164
|
+
|
165
|
+
```sh
|
166
|
+
$ for i in {1..800}; do cat sample; done > big
|
167
|
+
$ du -h big
|
168
|
+
97M
|
169
|
+
|
170
|
+
$ time byk -l big > /dev/null
|
171
|
+
1.71s user 0.13s system 99% cpu 1.846 total
|
172
|
+
```
|
98
173
|
|
99
|
-
|
174
|
+
So, ~100MB in under 2s. Fast enough, I suppose. You can expect it to
|
175
|
+
scale linearly.
|
100
176
|
|
101
|
-
|
102
|
-
|
103
|
-
|
177
|
+
Compared to the pure Ruby implementation, it is about
|
178
|
+
[10-30x faster](benchmark), depending on the input composition and the
|
179
|
+
transliteration method applied.
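As a rough check of the figures above, the throughput can be worked out directly from the reported sizes and wall-clock times. The numbers are the ones quoted in the README; the variable names are only illustrative, and sizes are taken as `du -h` printed them.

```ruby
# Figures quoted in the README benchmark.
big_size_mb   = 97.0    # du -h big
big_seconds   = 1.846   # wall-clock time for the big file
small_size_mb = 128.0 / 1024 # du -h sample (128K)
small_seconds = 0.126   # wall-clock time for the small file

big_throughput   = big_size_mb / big_seconds
small_throughput = small_size_mb / small_seconds

puts format("big:   %.1f MB/s", big_throughput)
puts format("small: %.1f MB/s", small_throughput)
```

The large file works out to roughly 52.5 MB/s. The small sample comes in at about 1 MB/s, which suggests its run is dominated by process startup rather than transliteration.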
|
104
180
|
|
105
181
|
|
106
|
-
##
|
182
|
+
## Testing
|
107
183
|
|
108
|
-
|
109
|
-
projects, e.g. sites supporting dual script content. Remember,
|
110
|
-
`Benchmark` is your friend.
|
184
|
+
To test the gem, clone the repo and run:
|
111
185
|
|
112
|
-
|
113
|
-
|
114
|
-
|
186
|
+
```
|
187
|
+
$ bundle && bundle exec rake
|
188
|
+
```
|
115
189
|
|
116
190
|
|
117
191
|
## Compatibility
|
118
192
|
|
119
|
-
Byk is supported under MRI
|
193
|
+
Byk is supported under MRI 1.9.2+. I might try my hand in writing a
|
194
|
+
JRuby extension in a future release.
|
120
195
|
|
121
|
-
I don't plan to support 1.8.7 or older due to substantial C API
|
122
|
-
changes between 1.8 and 1.9. It doesn't build under Rubinius
|
123
|
-
currently, but I intend to support it in future releases.
|
124
196
|
|
125
197
|
|
126
198
|
## License
|
127
199
|
|
128
|
-
This gem is released under the [MIT License](
|
200
|
+
This gem is released under the [MIT License](LICENSE).
|
129
201
|
|
130
202
|
Уздравље!
|
data/exe/byk
ADDED
@@ -0,0 +1,51 @@
|
|
1
|
+
#!/usr/bin/env ruby
|
2
|
+
|
3
|
+
require "byk/safe"
|
4
|
+
require "optparse"
|
5
|
+
|
6
|
+
trap "SIGINT" do
|
7
|
+
exit 130
|
8
|
+
end
|
9
|
+
|
10
|
+
method_name = :to_cyrillic
|
11
|
+
|
12
|
+
opts = OptionParser.new do |opt|
|
13
|
+
opt.banner = "usage: byk [options] [files]"
|
14
|
+
opt.summary_width = 20
|
15
|
+
|
16
|
+
opt.separator ""
|
17
|
+
opt.separator "options:"
|
18
|
+
|
19
|
+
opt.on("-c", "--cyrillic", "convert input to Cyrillic (default)") do
|
20
|
+
method_name = :to_cyrillic
|
21
|
+
end
|
22
|
+
|
23
|
+
opt.on("-l", "--latin", "convert input to Latin") do
|
24
|
+
method_name = :to_latin
|
25
|
+
end
|
26
|
+
|
27
|
+
opt.on("-a", "--ascii", 'convert input to "ASCII Latin"') do
|
28
|
+
method_name = :to_ascii_latin
|
29
|
+
end
|
30
|
+
|
31
|
+
opt.on_tail("-v", "--version", "show version") do
|
32
|
+
puts Byk::VERSION
|
33
|
+
exit
|
34
|
+
end
|
35
|
+
end
|
36
|
+
|
37
|
+
begin
|
38
|
+
opts.parse!
|
39
|
+
rescue OptionParser::InvalidOption => e
|
40
|
+
puts e
|
41
|
+
puts
|
42
|
+
puts opts
|
43
|
+
exit 1
|
44
|
+
end
|
45
|
+
|
46
|
+
begin
|
47
|
+
puts Byk.send(method_name, ARGF.read)
|
48
|
+
rescue => e
|
49
|
+
puts e
|
50
|
+
exit 1
|
51
|
+
end
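The option handling in `exe/byk` can be tried in isolation. The sketch below keeps only the three conversion flags and returns the chosen method name plus the remaining file arguments instead of calling `Byk`; `pick_method` is a hypothetical helper, not something the gem defines.

```ruby
require "optparse"

# Returns [method_name, remaining_args] for a given argument list.
def pick_method(argv)
  method_name = :to_cyrillic # default, as in exe/byk

  parser = OptionParser.new do |opt|
    opt.on("-c", "--cyrillic") { method_name = :to_cyrillic }
    opt.on("-l", "--latin")    { method_name = :to_latin }
    opt.on("-a", "--ascii")    { method_name = :to_ascii_latin }
  end

  # parse (without the bang) leaves argv untouched and returns
  # whatever is left after the options are consumed.
  files = parser.parse(argv)
  [method_name, files]
end
```

Because each flag simply overwrites `method_name`, the last conversion flag on the command line wins.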
|
data/ext/byk/byk.c
CHANGED
@@ -3,103 +3,225 @@
|
|
3
3
|
|
4
4
|
#define STR_ENC_GET(str) rb_enc_from_index(ENCODING_GET(str))
|
5
5
|
|
6
|
-
|
7
|
-
|
8
|
-
|
6
|
+
static inline void
|
7
|
+
_str_cat_char(VALUE str, unsigned c, rb_encoding *enc)
|
8
|
+
{
|
9
|
+
char s[16];
|
10
|
+
int n = rb_enc_codelen(c, enc);
|
11
|
+
rb_enc_mbcput(c, s, enc);
|
12
|
+
rb_str_buf_cat(str, s, n);
|
13
|
+
}
|
9
14
|
|
10
15
|
enum {
|
11
|
-
LAT_CAP_TJ =
|
12
|
-
|
13
|
-
|
14
|
-
|
15
|
-
|
16
|
-
|
17
|
-
|
18
|
-
|
19
|
-
|
20
|
-
|
21
|
-
|
22
|
-
|
23
|
-
|
24
|
-
|
25
|
-
|
26
|
-
|
27
|
-
CYR_CAP_A,
|
28
|
-
CYR_CAP_ZH = 0x416,
|
29
|
-
CYR_CAP_C = 0x426,
|
30
|
-
CYR_CAP_CH,
|
31
|
-
CYR_CAP_SH,
|
32
|
-
CYR_A = 0x430,
|
33
|
-
CYR_ZH = 0x436,
|
34
|
-
CYR_C = 0x446,
|
35
|
-
CYR_CH,
|
36
|
-
CYR_SH,
|
37
|
-
CYR_DJ = 0x452,
|
38
|
-
CYR_J = 0x458,
|
39
|
-
CYR_LJ,
|
40
|
-
CYR_NJ,
|
41
|
-
CYR_TJ,
|
42
|
-
CYR_DZ = 0x45f
|
16
|
+
LAT_CAP_TJ=262, LAT_TJ, LAT_CAP_CH=268, LAT_CH,
|
17
|
+
LAT_CAP_DJ=272, LAT_DJ, LAT_CAP_SH=352, LAT_SH,
|
18
|
+
LAT_CAP_ZH=381, LAT_ZH, CYR_CAP_DJ=1026, CYR_CAP_J=1032,
|
19
|
+
CYR_CAP_LJ, CYR_CAP_NJ, CYR_CAP_TJ, CYR_CAP_DZ=1039,
|
20
|
+
CYR_CAP_A, CYR_CAP_B, CYR_CAP_V, CYR_CAP_G,
|
21
|
+
CYR_CAP_D, CYR_CAP_E, CYR_CAP_ZH, CYR_CAP_Z,
|
22
|
+
CYR_CAP_I, CYR_CAP_K=1050, CYR_CAP_L, CYR_CAP_M,
|
23
|
+
CYR_CAP_N, CYR_CAP_O, CYR_CAP_P, CYR_CAP_R,
|
24
|
+
CYR_CAP_S, CYR_CAP_T, CYR_CAP_U, CYR_CAP_F,
|
25
|
+
CYR_CAP_H, CYR_CAP_C, CYR_CAP_CH, CYR_CAP_SH,
|
26
|
+
CYR_A=1072, CYR_B, CYR_V, CYR_G, CYR_D,
|
27
|
+
CYR_E, CYR_ZH, CYR_Z, CYR_I, CYR_K=1082,
|
28
|
+
CYR_L, CYR_M, CYR_N, CYR_O, CYR_P,
|
29
|
+
CYR_R, CYR_S, CYR_T, CYR_U, CYR_F,
|
30
|
+
CYR_H, CYR_C, CYR_CH, CYR_SH, CYR_DJ=1106,
|
31
|
+
CYR_J=1112, CYR_LJ, CYR_NJ, CYR_TJ, CYR_DZ=1119
|
43
32
|
};
|
44
33
|
|
45
|
-
static inline unsigned
|
46
|
-
|
34
|
+
static inline unsigned
|
35
|
+
is_cap(unsigned codepoint)
|
47
36
|
{
|
48
|
-
|
37
|
+
if (codepoint >= 65 && codepoint <= 90) return 1;
|
38
|
+
if (codepoint >= CYR_CAP_DJ && codepoint <= CYR_CAP_SH) return 1;
|
39
|
+
|
40
|
+
switch(codepoint) {
|
41
|
+
case LAT_CAP_TJ:
|
42
|
+
case LAT_CAP_CH:
|
43
|
+
case LAT_CAP_DJ:
|
44
|
+
case LAT_CAP_SH:
|
45
|
+
case LAT_CAP_ZH:
|
46
|
+
return 1;
|
47
|
+
default:
|
48
|
+
return 0;
|
49
|
+
}
|
49
50
|
}
|
50
51
|
|
51
|
-
static inline unsigned
|
52
|
-
|
52
|
+
static inline unsigned
|
53
|
+
is_digraph(unsigned codepoint)
|
53
54
|
{
|
54
|
-
|
55
|
-
|
56
|
-
|
57
|
-
|
58
|
-
|
59
|
-
|
60
|
-
|
55
|
+
switch(codepoint) {
|
56
|
+
case CYR_LJ:
|
57
|
+
case CYR_NJ:
|
58
|
+
case CYR_DZ:
|
59
|
+
case CYR_CAP_LJ:
|
60
|
+
case CYR_CAP_NJ:
|
61
|
+
case CYR_CAP_DZ:
|
62
|
+
return 1;
|
63
|
+
default:
|
64
|
+
return 0;
|
65
|
+
}
|
61
66
|
}
|
62
67
|
|
63
|
-
static
|
64
|
-
|
68
|
+
static unsigned
|
69
|
+
digraph_to_cyr(unsigned codepoint, unsigned codepoint2, unsigned capitalize, unsigned *next_out)
|
65
70
|
{
|
66
|
-
|
67
|
-
|
68
|
-
|
71
|
+
static unsigned CYR_MAP[] = {
|
72
|
+
CYR_A, CYR_B, CYR_C, CYR_D, CYR_E, CYR_F,
|
73
|
+
CYR_G, CYR_H, CYR_I, CYR_J, CYR_K, CYR_L,
|
74
|
+
CYR_M, CYR_N, CYR_O, CYR_P, 0, CYR_R,
|
75
|
+
CYR_S, CYR_T, CYR_U, CYR_V, 0, 0, 0, CYR_Z
|
76
|
+
};
|
77
|
+
|
78
|
+
static unsigned CYR_CAPS_MAP[] = {
|
79
|
+
CYR_CAP_A, CYR_CAP_B, CYR_CAP_C, CYR_CAP_D, CYR_CAP_E, CYR_CAP_F,
|
80
|
+
CYR_CAP_G, CYR_CAP_H, CYR_CAP_I, CYR_CAP_J, CYR_CAP_K, CYR_CAP_L,
|
81
|
+
CYR_CAP_M, CYR_CAP_N, CYR_CAP_O, CYR_CAP_P, 0, CYR_CAP_R,
|
82
|
+
CYR_CAP_S, CYR_CAP_T, CYR_CAP_U, CYR_CAP_V, 0, 0, 0, CYR_CAP_Z
|
83
|
+
};
|
84
|
+
|
85
|
+
if (codepoint2 == LAT_CAP_ZH || codepoint2 == LAT_ZH) {
|
86
|
+
switch (codepoint) {
|
87
|
+
case 'd': return CYR_DZ;
|
88
|
+
case 'D': return CYR_CAP_DZ;
|
89
|
+
}
|
90
|
+
}
|
91
|
+
|
92
|
+
if (codepoint2 == 'j' || codepoint2 == 'J') {
|
93
|
+
switch (codepoint) {
|
94
|
+
case 'l': return CYR_LJ;
|
95
|
+
case 'n': return CYR_NJ;
|
96
|
+
case 'L': return CYR_CAP_LJ;
|
97
|
+
case 'N': return CYR_CAP_NJ;
|
98
|
+
}
|
99
|
+
}
|
100
|
+
|
101
|
+
if (codepoint >= 'a' && codepoint <= 'z') return CYR_MAP[codepoint - 'a'];
|
102
|
+
if (codepoint >= 'A' && codepoint <= 'Z') return CYR_CAPS_MAP[codepoint - 'A'];
|
103
|
+
|
104
|
+
switch (codepoint) {
|
105
|
+
case LAT_CH: return CYR_CH;
|
106
|
+
case LAT_DJ: return CYR_DJ;
|
107
|
+
case LAT_SH: return CYR_SH;
|
108
|
+
case LAT_TJ: return CYR_TJ;
|
109
|
+
case LAT_ZH: return CYR_ZH;
|
110
|
+
case LAT_CAP_CH: return CYR_CAP_CH;
|
111
|
+
case LAT_CAP_DJ: return CYR_CAP_DJ;
|
112
|
+
case LAT_CAP_SH: return CYR_CAP_SH;
|
113
|
+
case LAT_CAP_TJ: return CYR_CAP_TJ;
|
114
|
+
case LAT_CAP_ZH: return CYR_CAP_ZH;
|
115
|
+
}
|
116
|
+
|
117
|
+
return 0;
|
69
118
|
}
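`digraph_to_cyr` above looks at the following code point before falling back to the single-letter tables. The same lookahead idea can be sketched in pure Ruby, with the caveat that this is an illustration and not a port: `sketch_to_cyrillic`, `SINGLE` and `DIGRAPH` are hypothetical names, the digraph skip is done inline rather than by the caller, and Ruby 2.4+ is assumed for Unicode-aware case mapping.

```ruby
# Hypothetical pure Ruby illustration of Latin -> Cyrillic with lookahead.
SINGLE = {
  "a" => "а", "b" => "б", "c" => "ц", "č" => "ч", "ć" => "ћ", "d" => "д",
  "đ" => "ђ", "e" => "е", "f" => "ф", "g" => "г", "h" => "х", "i" => "и",
  "j" => "ј", "k" => "к", "l" => "л", "m" => "м", "n" => "н", "o" => "о",
  "p" => "п", "r" => "р", "s" => "с", "š" => "ш", "t" => "т", "u" => "у",
  "v" => "в", "z" => "з", "ž" => "ж"
}.freeze

DIGRAPH = { "lj" => "љ", "nj" => "њ", "dž" => "џ" }.freeze

def sketch_to_cyrillic(text)
  chars = text.chars
  out = []
  i = 0
  while i < chars.length
    char = chars[i]
    # Look one character ahead to detect lj, nj and dž in any letter case.
    pair = char + chars[i + 1].to_s
    cyr = DIGRAPH[pair.downcase]
    if cyr
      i += 1 # the second half of the digraph is consumed here
    else
      cyr = SINGLE[char.downcase]
    end
    # The case of the output follows the case of the first input letter.
    out << (cyr ? (char == char.downcase ? cyr : cyr.upcase) : char)
    i += 1
  end
  out.join
end
```

Letting the first letter decide the output case is what makes inputs such as `lJ` come out lowercase, in line with the edge cases in the spec.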
|
70
119
|
|
71
|
-
static
|
72
|
-
|
120
|
+
static unsigned
|
121
|
+
digraph_to_latin(unsigned codepoint, unsigned codepoint2, unsigned capitalize, unsigned *next_out)
|
73
122
|
{
|
74
|
-
char
|
75
|
-
|
76
|
-
|
77
|
-
|
123
|
+
static char LAT_MAP[] = {
|
124
|
+
'a', 'b', 'v', 'g', 'd', 'e', 0, 'z', 'i', 0, 'k', 'l',
|
125
|
+
'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'f', 'h', 'c'
|
126
|
+
};
|
127
|
+
|
128
|
+
static char LAT_CAPS_MAP[] = {
|
129
|
+
'A', 'B', 'V', 'G', 'D', 'E', 0, 'Z', 'I', 0, 'K', 'L',
|
130
|
+
'M', 'N', 'O', 'P', 'R', 'S', 'T', 'U', 'F', 'H', 'C'
|
131
|
+
};
|
132
|
+
|
133
|
+
if (codepoint < CYR_CAP_DJ || codepoint > CYR_DZ) return 0;
|
134
|
+
|
135
|
+
switch (codepoint) {
|
136
|
+
case CYR_ZH: return LAT_ZH;
|
137
|
+
case CYR_CAP_ZH: return LAT_CAP_ZH;
|
138
|
+
}
|
139
|
+
|
140
|
+
if (codepoint >= CYR_A && codepoint <= CYR_C)
|
141
|
+
return LAT_MAP[codepoint - CYR_A];
|
142
|
+
|
143
|
+
if (codepoint >= CYR_CAP_A && codepoint <= CYR_CAP_C)
|
144
|
+
return LAT_CAPS_MAP[codepoint - CYR_CAP_A];
|
145
|
+
|
146
|
+
if (codepoint >= CYR_A) {
|
147
|
+
switch (codepoint) {
|
148
|
+
case CYR_J: return 'j';
|
149
|
+
case CYR_TJ: return LAT_TJ;
|
150
|
+
case CYR_CH: return LAT_CH;
|
151
|
+
case CYR_SH: return LAT_SH;
|
152
|
+
case CYR_DJ: return LAT_DJ;
|
153
|
+
case CYR_LJ: *next_out = 'j'; return 'l';
|
154
|
+
case CYR_NJ: *next_out = 'j'; return 'n';
|
155
|
+
case CYR_DZ: *next_out = LAT_ZH; return 'd';
|
156
|
+
}
|
157
|
+
}
|
158
|
+
else {
|
159
|
+
switch (codepoint) {
|
160
|
+
case CYR_CAP_J: return 'J';
|
161
|
+
case CYR_CAP_TJ: return LAT_CAP_TJ;
|
162
|
+
case CYR_CAP_CH: return LAT_CAP_CH;
|
163
|
+
case CYR_CAP_SH: return LAT_CAP_SH;
|
164
|
+
case CYR_CAP_DJ: return LAT_CAP_DJ;
|
165
|
+
case CYR_CAP_LJ: *next_out = (capitalize || is_cap(codepoint2)) ? 'J' : 'j'; return 'L';
|
166
|
+
case CYR_CAP_NJ: *next_out = (capitalize || is_cap(codepoint2)) ? 'J' : 'j'; return 'N';
|
167
|
+
case CYR_CAP_DZ: *next_out = (capitalize || is_cap(codepoint2)) ? LAT_CAP_ZH : LAT_ZH; return 'D';
|
168
|
+
}
|
169
|
+
}
|
170
|
+
|
171
|
+
return 0;
|
172
|
+
}
|
173
|
+
|
174
|
+
static unsigned
|
175
|
+
digraph_to_ascii(unsigned codepoint, unsigned codepoint2, unsigned capitalize, unsigned *next_out)
|
176
|
+
{
|
177
|
+
switch (codepoint) {
|
178
|
+
case LAT_TJ:
|
179
|
+
case LAT_CH:
|
180
|
+
case CYR_TJ:
|
181
|
+
case CYR_CH: return 'c';
|
182
|
+
case LAT_SH:
|
183
|
+
case CYR_SH: return 's';
|
184
|
+
case LAT_ZH:
|
185
|
+
case CYR_ZH: return 'z';
|
186
|
+
case LAT_DJ:
|
187
|
+
case CYR_DJ: *next_out = 'j'; return 'd';
|
188
|
+
case LAT_CAP_TJ:
|
189
|
+
case LAT_CAP_CH:
|
190
|
+
case CYR_CAP_TJ:
|
191
|
+
case CYR_CAP_CH: return 'C';
|
192
|
+
case LAT_CAP_SH:
|
193
|
+
case CYR_CAP_SH: return 'S';
|
194
|
+
case LAT_CAP_ZH:
|
195
|
+
case CYR_CAP_ZH: return 'Z';
|
196
|
+
case LAT_CAP_DJ:
|
197
|
+
case CYR_CAP_DJ:
|
198
|
+
*next_out = (capitalize || is_cap(codepoint2)) ? 'J' : 'j'; return 'D';
|
199
|
+
case CYR_DZ:
|
200
|
+
*next_out = (capitalize || is_cap(codepoint2)) ? 'Z' : 'z'; return 'd';
|
201
|
+
case CYR_CAP_DZ:
|
202
|
+
*next_out = (capitalize || is_cap(codepoint2)) ? 'Z' : 'z'; return 'D';
|
203
|
+
default:
|
204
|
+
return digraph_to_latin(codepoint, codepoint2, capitalize, next_out);
|
205
|
+
}
|
78
206
|
}
|
79
207
|
|
80
208
|
static VALUE
|
81
|
-
|
209
|
+
str_to_srb(VALUE str, int strategy, int bang)
|
82
210
|
{
|
83
211
|
VALUE dest;
|
84
|
-
|
212
|
+
rb_encoding *enc;
|
213
|
+
|
85
214
|
int len, next_len;
|
86
|
-
|
87
|
-
int force_upper = 0;
|
215
|
+
unsigned in, in2, out, out2, seen_cap = 0;
|
88
216
|
char *pos, *end, *seq_start = 0;
|
89
|
-
char cyr;
|
90
|
-
unsigned int codepoint = 0;
|
91
|
-
unsigned int next_codepoint = 0;
|
92
|
-
rb_encoding *enc;
|
93
217
|
|
94
|
-
|
95
|
-
'a', 'b', 'v', 'g', 'd', 'e', '\0', 'z', 'i', '\0', 'k',
|
96
|
-
'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'f', 'h', 'c'
|
97
|
-
};
|
218
|
+
unsigned (*method)(unsigned, unsigned, unsigned, unsigned*);
|
98
219
|
|
99
|
-
|
100
|
-
|
101
|
-
|
102
|
-
|
220
|
+
switch(strategy) {
|
221
|
+
case 0: method = &digraph_to_cyr; break;
|
222
|
+
case 1: method = &digraph_to_latin; break;
|
223
|
+
default: method = &digraph_to_ascii;
|
224
|
+
}
|
103
225
|
|
104
226
|
StringValue(str);
|
105
227
|
pos = RSTRING_PTR(str);
|
@@ -107,123 +229,50 @@ str_to_latin(VALUE str, int ascii, int bang)
|
|
107
229
|
|
108
230
|
end = RSTRING_END(str);
|
109
231
|
enc = STR_ENC_GET(str);
|
110
|
-
|
111
|
-
dest = rb_str_buf_new(dest_len);
|
232
|
+
dest = rb_str_buf_new(RSTRING_LEN(str) + 30);
|
112
233
|
rb_enc_associate(dest, enc);
|
113
234
|
|
114
|
-
|
235
|
+
in = rb_enc_codepoint_len(pos, end, &len, enc);
|
115
236
|
|
116
237
|
while (pos < end) {
|
117
|
-
|
118
|
-
next_codepoint = rb_enc_codepoint_len(pos + len, end, &next_len, enc);
|
119
|
-
}
|
238
|
+
in2 = out2 = 0;
|
120
239
|
|
121
|
-
|
122
|
-
|
123
|
-
if (seq_start) {
|
124
|
-
rb_str_buf_cat(dest, seq_start, pos - seq_start);
|
125
|
-
seq_start = 0;
|
126
|
-
}
|
240
|
+
if (pos + len < end)
|
241
|
+
in2 = rb_enc_codepoint_len(pos + len, end, &next_len, enc);
|
127
242
|
|
128
|
-
|
129
|
-
case LAT_TJ:
|
130
|
-
case LAT_CH: rb_str_buf_cat(dest, "c", 1); break;
|
131
|
-
case LAT_DJ: rb_str_buf_cat(dest, "dj", 2); break;
|
132
|
-
case LAT_SH: rb_str_buf_cat(dest, "s", 1); break;
|
133
|
-
case LAT_ZH: rb_str_buf_cat(dest, "z", 1); break;
|
134
|
-
case LAT_CAP_TJ:
|
135
|
-
case LAT_CAP_CH: rb_str_buf_cat(dest, "C", 1); break;
|
136
|
-
case LAT_CAP_SH: rb_str_buf_cat(dest, "S", 1); break;
|
137
|
-
case LAT_CAP_ZH: rb_str_buf_cat(dest, "Z", 1); break;
|
138
|
-
case LAT_CAP_DJ:
|
139
|
-
(seen_upper || is_upper(next_codepoint))
|
140
|
-
? rb_str_buf_cat(dest, "DJ", 2)
|
141
|
-
: rb_str_buf_cat(dest, "Dj", 2);
|
142
|
-
break;
|
143
|
-
default:
|
144
|
-
rb_str_buf_cat(dest, pos, len);
|
145
|
-
}
|
146
|
-
}
|
243
|
+
out = (*method)(in, in2, seen_cap, &out2);
|
147
244
|
|
148
|
-
|
149
|
-
|
245
|
+
if (out) {
|
246
|
+
/* flush previous untranslatable sequence */
|
150
247
|
if (seq_start) {
|
151
248
|
rb_str_buf_cat(dest, seq_start, pos - seq_start);
|
152
249
|
seq_start = 0;
|
153
250
|
}
|
154
251
|
|
155
|
-
|
156
|
-
|
157
|
-
cyr = CYR_MAP[codepoint - CYR_A];
|
158
|
-
cyr ? rb_str_buf_cat(dest, &cyr, 1)
|
159
|
-
: rb_str_buf_cat(dest, pos, len);
|
160
|
-
}
|
161
|
-
else {
|
162
|
-
switch (codepoint) {
|
163
|
-
case CYR_J: rb_str_buf_cat(dest, "j", 1); break;
|
164
|
-
case CYR_LJ: rb_str_buf_cat(dest, "lj", 2); break;
|
165
|
-
case CYR_NJ: rb_str_buf_cat(dest, "nj", 2); break;
|
166
|
-
case CYR_DJ: STR_CAT_COND_ASCII(ascii, dest, "dj", LAT_DJ, 2, enc); break;
|
167
|
-
case CYR_TJ: STR_CAT_COND_ASCII(ascii, dest, "c", LAT_TJ, 1, enc); break;
|
168
|
-
case CYR_CH: STR_CAT_COND_ASCII(ascii, dest, "c", LAT_CH, 1, enc); break;
|
169
|
-
case CYR_SH: STR_CAT_COND_ASCII(ascii, dest, "s", LAT_SH, 1, enc); break;
|
170
|
-
case CYR_ZH: STR_CAT_COND_ASCII(ascii, dest, "z", LAT_ZH, 1, enc); break;
|
171
|
-
case CYR_DZ:
|
172
|
-
rb_str_buf_cat(dest, "d", 1);
|
173
|
-
STR_CAT_COND_ASCII(ascii, dest, "z", LAT_ZH, 1, enc);
|
174
|
-
break;
|
175
|
-
default:
|
176
|
-
rb_str_buf_cat(dest, pos, len);
|
177
|
-
}
|
178
|
-
}
|
179
|
-
}
|
180
|
-
else {
|
181
|
-
if (maps_directly(codepoint)) {
|
182
|
-
cyr = CYR_CAPS_MAP[codepoint - CYR_CAP_A];
|
183
|
-
cyr ? rb_str_buf_cat(dest, &cyr, 1)
|
184
|
-
: rb_str_buf_cat(dest, pos, len);
|
185
|
-
}
|
186
|
-
else {
|
187
|
-
force_upper = seen_upper || is_upper(next_codepoint);
|
188
|
-
|
189
|
-
switch (codepoint) {
|
190
|
-
case CYR_CAP_J: rb_str_buf_cat(dest, "J", 1); break;
|
191
|
-
case CYR_CAP_LJ: rb_str_buf_cat(dest, (force_upper ? "LJ" : "Lj"), 2); break;
|
192
|
-
case CYR_CAP_NJ: rb_str_buf_cat(dest, (force_upper ? "NJ" : "Nj"), 2); break;
|
193
|
-
case CYR_CAP_TJ: STR_CAT_COND_ASCII(ascii, dest, "C", LAT_CAP_TJ, 1, enc); break;
|
194
|
-
case CYR_CAP_CH: STR_CAT_COND_ASCII(ascii, dest, "C", LAT_CAP_CH, 1, enc); break;
|
195
|
-
case CYR_CAP_SH: STR_CAT_COND_ASCII(ascii, dest, "S", LAT_CAP_SH, 1, enc); break;
|
196
|
-
case CYR_CAP_ZH: STR_CAT_COND_ASCII(ascii, dest, "Z", LAT_CAP_ZH, 1, enc); break;
|
197
|
-
case CYR_CAP_DJ: STR_CAT_COND_ASCII(ascii, dest, (force_upper ? "DJ" : "Dj"), LAT_CAP_DJ, 2, enc); break;
|
198
|
-
case CYR_CAP_DZ:
|
199
|
-
rb_str_buf_cat(dest, "D", 1);
|
200
|
-
force_upper ? STR_CAT_COND_ASCII(ascii, dest, "Z", LAT_CAP_ZH, 1, enc)
|
201
|
-
: STR_CAT_COND_ASCII(ascii, dest, "z", LAT_ZH, 1, enc);
|
202
|
-
break;
|
203
|
-
default:
|
204
|
-
rb_str_buf_cat(dest, pos, len);
|
205
|
-
}
|
206
|
-
}
|
207
|
-
}
|
252
|
+
_str_cat_char(dest, out, enc);
|
253
|
+
if (out2) _str_cat_char(dest, out2, enc);
|
208
254
|
}
|
209
|
-
else {
|
210
|
-
/*
|
211
|
-
|
255
|
+
else if (!seq_start) {
|
256
|
+
/* mark the beginning of an untranslatable sequence */
|
257
|
+
seq_start = pos;
|
258
|
+
}
|
259
|
+
|
260
|
+
/* for cyrillic output, skip the second half of an input digraph */
|
261
|
+
if (strategy == 0 && is_digraph(out)) {
|
262
|
+
pos += next_len;
|
263
|
+
if (pos + len < end)
|
264
|
+
in2 = rb_enc_codepoint_len(pos + len, end, &next_len, enc);
|
212
265
|
}
|
213
266
|
|
214
|
-
|
267
|
+
seen_cap = is_cap(in);
|
215
268
|
|
216
269
|
pos += len;
|
217
270
|
len = next_len;
|
218
|
-
|
219
|
-
codepoint = next_codepoint;
|
220
|
-
next_codepoint = 0;
|
271
|
+
in = in2;
|
221
272
|
}
|
222
273
|
|
223
|
-
/*
|
224
|
-
if (seq_start)
|
225
|
-
rb_str_buf_cat(dest, seq_start, pos - seq_start);
|
226
|
-
}
|
274
|
+
/* flush final sequence */
|
275
|
+
if (seq_start) rb_str_buf_cat(dest, seq_start, pos - seq_start);
|
227
276
|
|
228
277
|
if (bang) {
|
229
278
|
rb_str_shared_replace(str, dest);
|
@@ -237,7 +286,35 @@ str_to_latin(VALUE str, int ascii, int bang)
|
|
237
286
|
}
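The loop in `str_to_srb` avoids copying untranslatable text one character at a time: it remembers where such a run began (`seq_start`) and flushes the whole run when the next translatable code point, or the end of input, arrives. A minimal Ruby sketch of that bookkeeping, using a hypothetical `translate_with_runs` helper and a caller-supplied lookup table, might look like this:

```ruby
# Hypothetical illustration of run flushing; table maps char => replacement.
def translate_with_runs(text, table)
  chars = text.chars
  out = []
  run_start = nil # index where the current untranslatable run began

  chars.each_with_index do |char, i|
    mapped = table[char]
    if mapped
      # Flush the pending untranslatable run in one piece.
      if run_start
        out << chars[run_start...i].join
        run_start = nil
      end
      out << mapped
    elsif run_start.nil?
      # Mark the beginning of an untranslatable run.
      run_start = i
    end
  end

  # Flush the final run, if any.
  out << chars[run_start..-1].join if run_start
  out.join
end
```

For input that is mostly outside the translatable range, this does one bulk copy per run instead of one append per character.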
|
238
287
|
|
239
288
|
/**
|
240
|
-
* Returns a copy of <i>str</i> with
|
289
|
+
* Returns a copy of <i>str</i> with Latin characters transliterated
|
290
|
+
* into Serbian Cyrillic.
|
291
|
+
*
|
292
|
+
* @overload to_cyrillic(str)
|
293
|
+
* @param [String] str text to be transliterated
|
294
|
+
* @return [String] transliterated text
|
295
|
+
*/
|
296
|
+
static VALUE
|
297
|
+
rb_str_to_cyrillic(VALUE self, VALUE str)
|
298
|
+
{
|
299
|
+
return str_to_srb(str, 0, 0);
|
300
|
+
}
|
301
|
+
|
302
|
+
/**
|
303
|
+
* Performs transliteration of <code>Byk.to_cyrillic</code> in place,
|
304
|
+
* returning <i>str</i>, whether any changes were made or not.
|
305
|
+
*
|
306
|
+
* @overload to_cyrillic!(str)
|
307
|
+
* @param [String] str text to be transliterated
|
308
|
+
* @return [String] transliterated text
|
309
|
+
*/
|
310
|
+
static VALUE
|
311
|
+
rb_str_to_cyrillic_bang(VALUE self, VALUE str)
|
312
|
+
{
|
313
|
+
return str_to_srb(str, 0, 1);
|
314
|
+
}
|
315
|
+
|
316
|
+
/**
|
317
|
+
* Returns a copy of <i>str</i> with Serbian Cyrillic characters
|
241
318
|
* transliterated into Latin.
|
242
319
|
*
|
243
320
|
* @overload to_latin(str)
|
@@ -247,12 +324,12 @@ str_to_latin(VALUE str, int ascii, int bang)
|
|
247
324
|
static VALUE
|
248
325
|
rb_str_to_latin(VALUE self, VALUE str)
|
249
326
|
{
|
250
|
-
return
|
327
|
+
return str_to_srb(str, 1, 0);
|
251
328
|
}
|
252
329
|
|
253
330
|
/**
|
254
|
-
* Performs
|
255
|
-
* returning <i>str</i>, whether changes were made or not.
|
331
|
+
* Performs transliteration of <code>Byk.to_latin</code> in place,
|
332
|
+
* returning <i>str</i>, whether any changes were made or not.
|
256
333
|
*
|
257
334
|
* @overload to_latin!(str)
|
258
335
|
* @param [String] str text to be transliterated
|
@@ -261,12 +338,12 @@ rb_str_to_latin(VALUE self, VALUE str)
|
|
261
338
|
static VALUE
|
262
339
|
rb_str_to_latin_bang(VALUE self, VALUE str)
|
263
340
|
{
|
264
|
-
return
|
341
|
+
return str_to_srb(str, 1, 1);
|
265
342
|
}
|
266
343
|
|
267
344
|
/**
|
268
|
-
* Returns a copy of <i>str</i> with
|
269
|
-
*
|
345
|
+
* Returns a copy of <i>str</i> with Serbian characters transliterated
|
346
|
+
* into ASCII Latin.
|
270
347
|
*
|
271
348
|
* @overload to_ascii_latin(str)
|
272
349
|
* @param [String] str text to be transliterated
|
@@ -275,12 +352,12 @@ rb_str_to_latin_bang(VALUE self, VALUE str)
|
|
275
352
|
static VALUE
|
276
353
|
rb_str_to_ascii_latin(VALUE self, VALUE str)
|
277
354
|
{
|
278
|
-
return
|
355
|
+
return str_to_srb(str, 2, 0);
|
279
356
|
}
|
280
357
|
|
281
358
|
/**
|
282
|
-
* Performs
|
283
|
-
* place, returning <i>str</i>, whether changes were made or not.
|
359
|
+
* Performs transliteration of <code>Byk.to_ascii_latin</code> in
|
360
|
+
* place, returning <i>str</i>, whether any changes were made or not.
|
284
361
|
*
|
285
362
|
* @overload to_ascii_latin!(str)
|
286
363
|
* @param [String] str text to be transliterated
|
@@ -289,12 +366,14 @@ rb_str_to_ascii_latin(VALUE self, VALUE str)
|
|
289
366
|
static VALUE
|
290
367
|
rb_str_to_ascii_latin_bang(VALUE self, VALUE str)
|
291
368
|
{
|
292
|
-
return
|
369
|
+
return str_to_srb(str, 2, 1);
|
293
370
|
}
|
294
371
|
|
295
372
|
void Init_byk_native(void)
|
296
373
|
{
|
297
374
|
VALUE Byk = rb_define_module("Byk");
|
375
|
+
rb_define_singleton_method(Byk, "to_cyrillic", rb_str_to_cyrillic, 1);
|
376
|
+
rb_define_singleton_method(Byk, "to_cyrillic!", rb_str_to_cyrillic_bang, 1);
|
298
377
|
rb_define_singleton_method(Byk, "to_latin", rb_str_to_latin, 1);
|
299
378
|
rb_define_singleton_method(Byk, "to_latin!", rb_str_to_latin_bang, 1);
|
300
379
|
rb_define_singleton_method(Byk, "to_ascii_latin", rb_str_to_ascii_latin, 1);
|
data/lib/byk/version.rb
CHANGED
data/spec/byk_spec.rb
CHANGED
@@ -1,5 +1,4 @@
|
|
1
1
|
# coding: utf-8
|
2
|
-
|
3
2
|
require "spec_helper"
|
4
3
|
|
5
4
|
describe Byk do
|
@@ -24,70 +23,114 @@ describe Byk do
|
|
24
23
|
let(:non_serbian_cyrillic) { non_serbian_cyrillic_coderange.join }
|
25
24
|
|
26
25
|
let(:ascii) { "The quick brown fox jumps over the lazy dog." }
|
27
|
-
let(:other) { "संस्कृतम्
|
26
|
+
let(:other) { "संस्कृतम्" }
|
28
27
|
|
29
|
-
let(:mixed) { "संस्कृतम्
|
30
|
-
let(:
|
31
|
-
let(:
|
28
|
+
let(:mixed) { "संस्कृतम् илити Sanskrit, obrati ПАЖЊУ." }
|
29
|
+
let(:mixed_cyrillic) { "संस्कृतम् илити Санскрит, обрати ПАЖЊУ." }
|
30
|
+
let(:mixed_latin) { "संस्कृतम् iliti Sanskrit, obrati PAŽNJU." }
|
31
|
+
let(:mixed_ascii_latin) { "संस्कृतम् iliti Sanskrit, obrati PAZNJU." }
|
32
32
|
|
33
|
-
it "doesn't
|
33
|
+
it "doesn't translate an empty string" do
|
34
34
|
expect(Byk.send(method, "")).to eq ""
|
35
35
|
end
|
36
36
|
|
37
|
-
it "doesn't
|
38
|
-
expect(Byk.send(method,
|
37
|
+
it "doesn't translate foreign coderanges" do
|
38
|
+
expect(Byk.send(method, other)).to eq other
|
39
39
|
end
|
40
|
+
end
|
40
41
|
|
41
|
-
|
42
|
+
shared_examples :cyrillization_method do |method|
|
43
|
+
include_examples :base, method
|
44
|
+
|
45
|
+
let(:edge_cases) do
|
46
|
+
[
|
47
|
+
["lJ", "љ"],
|
48
|
+
["nJ", "њ"],
|
49
|
+
["dŽ", "џ"]
|
50
|
+
]
|
51
|
+
end
|
52
|
+
|
53
|
+
it "doesn't translate Cyrillic" do
|
54
|
+
expect(Byk.send(method, pangram)).to eq pangram
|
55
|
+
end
|
56
|
+
|
57
|
+
it "doesn't translate non-Serbian Cyrillic" do
|
42
58
|
expect(Byk.send(method, non_serbian_cyrillic)).to eq non_serbian_cyrillic
|
43
59
|
end
|
44
60
|
|
45
|
-
it "
|
46
|
-
expect(Byk.send(method,
|
61
|
+
it "translates Latin to Cyrillic" do
|
62
|
+
expect(Byk.send(method, pangram_latin)).to eq pangram
|
63
|
+
end
|
64
|
+
|
65
|
+
it "translates Latin caps to Cyrillic caps" do
|
66
|
+
expect(Byk.send(method, pangram_latin_caps)).to eq pangram_caps
|
67
|
+
end
|
68
|
+
|
69
|
+
it "translates mixed text properly" do
|
70
|
+
expect(Byk.send(method, mixed)).to eq mixed_cyrillic
|
71
|
+
end
|
72
|
+
|
73
|
+
it "translates edge cases properly" do
|
74
|
+
edge_cases.each do |input, output|
|
75
|
+
expect(Byk.send(method, input)).to eq output
|
76
|
+
end
|
77
|
+
end
|
78
|
+
|
79
|
+
it "translates ABECEDA to AZBUKA" do
|
80
|
+
expect(Byk::ABECEDA.map { |l| l.dup.send(:to_cyrillic) }).to match_array(Byk::AZBUKA)
|
81
|
+
end
|
82
|
+
|
83
|
+
it "translates ABECEDA_CAPS to AZBUKA_CAPS" do
|
84
|
+
expect(Byk::ABECEDA_CAPS.map { |l| l.dup.send(:to_cyrillic) }).to match_array(Byk::AZBUKA_CAPS)
|
47
85
|
end
|
48
86
|
end
|
49
87
|
   shared_examples :latinization_method do |method|
     include_examples :base, method
 
-    let(:edge_cases)
+    let(:edge_cases) do
       [
-        ["Њ", "Nj"],
-        ["Љ", "Lj"],
-        ["Џ", "Dž"],
-        ["ЊЊ", "NJNJ"],
         ["ЉЉ", "LJLJ"],
+        ["ЊЊ", "NJNJ"],
         ["ЏЏ", "DŽDŽ"]
       ]
-
+    end
 
-    it "doesn't
+    it "doesn't translate ASCII" do
+      expect(Byk.send(method, ascii)).to eq ascii
+    end
+
+    it "doesn't translate Latin" do
       expect(Byk.send(method, pangram_latin)).to eq pangram_latin
     end
 
-    it "
+    it "doesn't translate non-Serbian Cyrillic" do
+      expect(Byk.send(method, non_serbian_cyrillic)).to eq non_serbian_cyrillic
+    end
+
+    it "translates Cyrillic to Latin" do
       expect(Byk.send(method, pangram)).to eq pangram_latin
     end
 
-    it "
+    it "translates Cyrillic caps to Latin caps" do
       expect(Byk.send(method, pangram_caps)).to eq pangram_latin_caps
     end
 
-    it "
+    it "translates mixed text properly" do
       expect(Byk.send(method, mixed)).to eq mixed_latin
     end
 
-    it "
+    it "translates edge cases properly" do
       edge_cases.each do |input, output|
         expect(Byk.send(method, input)).to eq output
       end
     end
 
-    it "
+    it "translates AZBUKA to ABECEDA" do
       expect(Byk::AZBUKA.map { |l| l.dup.send(method) }).to match_array(Byk::ABECEDA)
     end
 
-    it "
+    it "translates AZBUKA_CAPS to ABECEDA_CAPS" do
       expect(Byk::AZBUKA_CAPS.map { |l| l.dup.send(method) }).to match_array(Byk::ABECEDA_CAPS)
     end
   end
@@ -95,7 +138,7 @@ describe Byk do
   shared_examples :ascii_latinization_method do |method|
     include_examples :base, method
 
-    let(:edge_cases)
+    let(:edge_cases) do
       [
         ["Њ", "Nj"],
         ["Љ", "Lj"],
@@ -107,32 +150,36 @@ describe Byk do
         ["ЏЏ", "DZDZ"],
         ["ЂЂ", "DJDJ"],
         ["ĐĐ", "DJDJ"],
-        ["ЂУРАЂ
-        ["ĐURAĐ
+        ["ЂУРАЂ Ђурђевић", "DJURADJ Djurdjevic"],
+        ["ĐURAĐ Đurđević", "DJURADJ Djurdjevic"]
       ]
-    }
-
-    it "converts Cyrillic to ASCII Latin" do
-      expect(Byk.send(method, pangram)).to eq pangram_ascii_latin
     end
 
-    it "
-      expect(Byk.send(method,
+    it "doesn't translate ASCII" do
+      expect(Byk.send(method, ascii)).to eq ascii
     end
 
-    it "
+    it "translates Latin to ASCII Latin" do
       expect(Byk.send(method, pangram_latin)).to eq pangram_ascii_latin
     end
 
-    it "
+    it "translates Latin caps to ASCII Latin caps" do
       expect(Byk.send(method, pangram_latin_caps)).to eq pangram_ascii_latin_caps
     end
 
-    it "
+    it "translates Cyrillic to ASCII Latin" do
+      expect(Byk.send(method, pangram)).to eq pangram_ascii_latin
+    end
+
+    it "translates Cyrillic caps to ASCII Latin caps" do
+      expect(Byk.send(method, pangram_caps)).to eq pangram_ascii_latin_caps
+    end
+
+    it "translates mixed text properly" do
       expect(Byk.send(method, mixed)).to eq mixed_ascii_latin
     end
 
-    it "
+    it "translates edge cases properly" do
       edge_cases.each do |input, output|
         expect(Byk.send(method, input)).to eq output
       end
@@ -141,18 +188,28 @@ describe Byk do
 
   shared_examples :non_destructive_method do |method|
     it "doesn't modify the arg" do
-      str = "
+      str = "ЖŽ"
       expect { Byk.send(method, str) }.to_not change { str }
     end
   end
 
   shared_examples :destructive_method do |method|
     it "modifies the arg" do
-      str = "
+      str = "ЖŽ"
       expect { Byk.send(method, str) }.to change { str }
     end
   end
 
+  describe ".to_cyrillic" do
+    it_behaves_like :cyrillization_method, :to_cyrillic
+    it_behaves_like :non_destructive_method, :to_cyrillic
+  end
+
+  describe ".to_cyrillic!" do
+    it_behaves_like :cyrillization_method, :to_cyrillic!
+    it_behaves_like :destructive_method, :to_cyrillic!
+  end
+
   describe ".to_latin" do
     it_behaves_like :latinization_method, :to_latin
     it_behaves_like :non_destructive_method, :to_latin
@@ -176,7 +233,7 @@ end
 
 describe String do
   it "responds to Byk methods" do
-    Byk.
+    Byk.singleton_methods.each do |method|
       expect("").to respond_to(method)
     end
   end
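The latinization edge cases exercised in the spec diff above pin down how digraph letters are cased: a lone "Њ" becomes "Nj", while "ЊЊ" becomes "NJNJ". The sketch below shows one casing rule that is consistent with those cases. It is a pure-Ruby illustration, not the gem's C implementation, and `DIGRAPHS`, `upper?`, `lower?` and `sketch_digraphs` are hypothetical names introduced here. It only handles the three digraph letters and assumes Ruby 2.4 or later for Unicode-aware `upcase`/`downcase`.

```ruby
# Hypothetical sketch of digraph casing; NOT taken from ext/byk/byk.c.
# Only Љ, Њ and Џ are handled; every other character passes through unchanged.
DIGRAPHS = { "Љ" => "LJ", "Њ" => "NJ", "Џ" => "DŽ" }.freeze

# True when the character has a distinct lowercase form, i.e. it is uppercase.
def upper?(ch)
  !ch.nil? && ch != ch.downcase
end

# True when the character has a distinct uppercase form, i.e. it is lowercase.
def lower?(ch)
  !ch.nil? && ch != ch.upcase
end

def sketch_digraphs(text)
  chars = text.chars
  chars.each_with_index.map do |ch, i|
    latin = DIGRAPHS[ch]
    next ch unless latin

    prev_ch = i.zero? ? nil : chars[i - 1]
    next_ch = chars[i + 1]

    # Assumed rule: emit both Latin letters in caps when the neighbours
    # suggest an all-caps word, otherwise capitalise only the first letter.
    all_caps = upper?(next_ch) || (upper?(prev_ch) && !lower?(next_ch))
    all_caps ? latin : latin[0] + latin[1].downcase
  end.join
end
```

Called on the spec inputs, `sketch_digraphs("Њ")` gives `"Nj"` and `sketch_digraphs("ЊЊ")` gives `"NJNJ"`, matching the expected values listed in the edge cases.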
metadata
CHANGED
@@ -1,15 +1,29 @@
 --- !ruby/object:Gem::Specification
 name: byk
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 1.0.0
 platform: ruby
 authors:
 - Nikola Topalović
 autorequire:
-bindir:
+bindir: exe
 cert_chain: []
-date:
+date: 2016-04-09 00:00:00.000000000 Z
 dependencies:
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '10.5'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '10.5'
 - !ruby/object:Gem::Dependency
   name: rake-compiler
   requirement: !ruby/object:Gem::Requirement
@@ -38,10 +52,11 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '3.2'
-description:
-
+description: Fast transliteration of Serbian Cyrillic to Latin and back. Brzo preslovljavanje
+  ćirilice u latinicu i obratno.
 email: nikola.topalovic@gmail.com
-executables:
+executables:
+- byk
 extensions:
 - ext/byk/extconf.rb
 extra_rdoc_files: []
@@ -49,6 +64,7 @@ files:
 - CHANGELOG.md
 - LICENSE
 - README.md
+- exe/byk
 - ext/byk/byk.c
 - ext/byk/extconf.rb
 - lib/byk.rb
@@ -76,9 +92,10 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 2.
+rubygems_version: 2.5.1
 signing_key:
 specification_version: 4
-summary: Fast transliteration of Serbian Cyrillic
+summary: Fast transliteration of Serbian Cyrillic to Latin and back. Brzo preslovljavanje
+  ćirilice u latinicu i obratno.
 test_files:
 - spec/byk_spec.rb