byk 0.6.0 → 1.0.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +5 -0
- data/README.md +123 -51
- data/exe/byk +51 -0
- data/ext/byk/byk.c +261 -182
- data/lib/byk/version.rb +1 -1
- data/spec/byk_spec.rb +97 -40
- metadata +25 -8
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: cc996c9d9dc81f884e02cc1dd760eeb57b6545fc
+  data.tar.gz: de07860c2cb41bcb39b299fee4500fd2bf01db73
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 16e97855924c380b205e2e651fdcde391785fe051c2971d948f801ff4260eb691dc4c3304ac17b3083fc0a2469f26d134c9622f74b058f6950d5fd8dfaf62383
+  data.tar.gz: c85659aaaccbc5e1db30305b52e2f4955de160dcb7a617ae564877619fe5f36d852ea7c17228f301e639aae3d4793133baa5d644faeffc83942d6b179bef53e9
data/CHANGELOG.md
CHANGED
data/README.md
CHANGED
@@ -4,39 +4,85 @@ Byk
 [![Gem Version](https://badge.fury.io/rb/byk.svg)](https://rubygems.org/gems/byk)
 [![Build Status](https://travis-ci.org/topalovic/byk.svg?branch=master)](https://travis-ci.org/topalovic/byk)

-Ruby gem for fast transliteration of Serbian Cyrillic
-<br />
-<sub>Inspired by @dejan's
-[nice little gem](https://github.com/dejan/srbovanje),
-this one comes with a C-optimized twist</sub>
+Ruby gem for fast transliteration of Serbian Cyrillic ↔ Latin

 ![byk](https://cloud.githubusercontent.com/assets/626128/7155207/07545960-e35d-11e4-804e-5fdee70a3e30.png)


 ## Installation

-
+Byk can be used as a standalone console utility or as a `String`
+extension in your Ruby programs. It has zero dependencies beyond
+vanilla Ruby and the toolchain for building native gems <sup>1</sup>.
+
+You can install it directly:
+
+```ruby
+$ gem install byk
+```
+
+or add it as a dependency in your application's Gemfile:

 ```ruby
 gem "byk"
 ```

-
+<sub><sup>1</sup> For Windows, you might want to check out
+[DevKit](https://github.com/oneclick/rubyinstaller/wiki/Development-Kit)</sub>
+
+
+## Usage
+
+### As a standalone utility
+
+Here's the help banner with all the available options:

 ```
-
+usage: byk [options] [files]
+
+options:
+-c, --cyrillic convert input to Cyrillic (default)
+-l, --latin convert input to Latin
+-a, --ascii convert input to "ASCII Latin"
+-v, --version show version
 ```

-
+Translation goes to stdout so you can redirect it or pipe it as you
+see fit. Let's take a look at some common scenarios.

+To translate files to Cyrillic:
+```sh
+$ byk in1.txt in2.txt > out.txt
 ```
-
+
+To translate files to Latin and search for a phrase:
+```sh
+$ byk -l file.txt | grep stvar
 ```

+Ad hoc conversion:
+```sh
+$ echo "Вук Стефановић Караџић" | byk -a
+Vuk Stefanovic Karadzic
+```

-
+or simply omit args and type away:
+```sh
+$ byk
+a u ruke Mandušića Vuka
+biće svaka puška ubojita!
+^D
+а у руке Мандушића Вука
+биће свака пушка убојита!
+```

-
+`^D` being <kbd>ctrl</kbd> <kbd>d</kbd>.
+
+
+### As a `String` extension
+
+Unless you're using Bundler, make sure to require the gem in your
+initializer:

 ```ruby
 require "byk"
@@ -45,22 +91,23 @@ require "byk"
 This will extend `String` with a couple of simple methods:

 ```ruby
-"
-"Шеширџија".
-"
+"Šeširdžija".to_cyrillic # => "Шеширџија"
+"Шеширџија".to_latin # => "Šeširdžija"
+"Шеширџија".to_ascii_latin # => "Sesirdzija"
 ```

-
+These do not modify the receiver. For that, there's a destructive
+variant of each:

 ```ruby
-text = "
-text.
-text
-text.to_ascii_latin! # => "
-text # => "
+text = "Šeširdžija"
+text.to_cyrillic! # => "Шеширџија"
+text.to_latin! # => "Šeširdžija"
+text.to_ascii_latin! # => "Sesirdzija"
+text # => "Sesirdzija"
 ```

-Note that
+Note that both latinization methods observe
 [digraph capitalization rules](http://sr.wikipedia.org/wiki/Гајица#.D0.94.D0.B8.D0.B3.D1.80.D0.B0.D1.84.D0.B8):

 ```ruby
@@ -68,63 +115,88 @@ Note that these methods take into account the
 "ĐORĐE Đorđević".to_ascii_latin # => "DJORDJE Djordjevic"
 ```

-
-require
+
+### Safe require
+
+If you prefer not to monkey patch `String`, you can do a "safe"
+require in your Gemfile:
+

 ```ruby
-require "byk/safe"
+gem "byk", :require => "byk/safe"
 ```

-
+or initializer:

 ```ruby
-
-Byk.to_latin(text) # => "Vuk"
-text # => "Byk"
-Byk.to_latin!(text) # => "Vuk"
-text # => "Vuk"
+require "byk/safe"
 ```

+Then, you should rely on module methods:

-
+```ruby
+text = "Жвазбука"

-
+Byk.to_latin(text) # => "Žvazbuka"
+text # => "Жвазбука"
+
+Byk.to_latin!(text) # => "Žvazbuka"
+text # => "Žvazbuka"

+# etc.
 ```
-
-
+
+
+## How fast is "fast" transliteration?
+
+Here's a quick test:
+
+```sh
+$ wget https://sr.wikipedia.org/ -O sample
+$ du -h sample
+128K
+
+$ time byk -l sample > /dev/null
+0.08s user 0.04s system 96% cpu 0.126 total
 ```

+Let's up the ante:
+
+```sh
+$ for i in {1..800}; do cat sample; done > big
+$ du -h big
+97M
+
+$ time byk -l big > /dev/null
+1.71s user 0.13s system 99% cpu 1.846 total
+```

-
+So, ~100MB in under 2s. Fast enough, I suppose. You can expect it to
+scale linearly.

-
-
-
+Compared to the pure Ruby implementation, it is about
+[10-30x faster](benchmark), depending on the input composition and the
+transliteration method applied.


-##
+## Testing

-
-projects, e.g. sites supporting dual script content. Remember,
-`Benchmark` is your friend.
+To test the gem, clone the repo and run:

-
-
-
+```
+$ bundle && bundle exec rake
+```


 ## Compatibility

-Byk is supported under MRI
+Byk is supported under MRI 1.9.2+. I might try my hand in writing a
+JRuby extension in a future release.

-I don't plan to support 1.8.7 or older due to substantial C API
-changes between 1.8 and 1.9. It doesn't build under Rubinius
-currently, but I intend to support it in future releases.


 ## License

-This gem is released under the [MIT License](
+This gem is released under the [MIT License](LICENSE).

 Уздравље!
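The README's note on digraph capitalization can be illustrated with a small pure-Ruby sketch. It is not the gem's implementation (that lives in `ext/byk/byk.c`); the names `DIGRAPHS`, `upper?` and `latinize_capital_digraphs` are hypothetical, only the three capital digraph letters are handled, and Ruby 2.4+ is assumed for Unicode-aware `upcase`/`downcase`. The rule sketched here: the second half of a capital digraph is upper-cased when the neighbouring letter on either side is a capital.

```ruby
# Illustrative sketch only: a pure-Ruby rendition of the digraph
# capitalization rule, not the gem's C implementation.
DIGRAPHS = {
  "Љ" => ["L", "j"],
  "Њ" => ["N", "j"],
  "Џ" => ["D", "ž"]
}.freeze

# Assumed helper: true when the character has a distinct lowercase form.
def upper?(char)
  !char.nil? && char != char.downcase
end

# Hypothetical name; converts only capital digraph letters and leaves
# every other character untouched.
def latinize_capital_digraphs(text)
  chars = text.chars
  chars.each_with_index.map do |char, i|
    first, second = DIGRAPHS[char]
    next char unless first

    prev_cap = i > 0 && upper?(chars[i - 1])
    next_cap = upper?(chars[i + 1])
    first + ((prev_cap || next_cap) ? second.upcase : second)
  end.join
end

puts latinize_capital_digraphs("Љ")   # => Lj
puts latinize_capital_digraphs("ЉЉ")  # => LJLJ
```

A lone capital keeps title case (`Lj`), while a capital inside an all-caps run becomes `LJ`, which matches the `"ЉЉ"` → `"LJLJ"` edge case exercised in the spec.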
data/exe/byk
ADDED
@@ -0,0 +1,51 @@
+#!/usr/bin/env ruby
+
+require "byk/safe"
+require "optparse"
+
+trap "SIGINT" do
+  exit 130
+end
+
+method_name = :to_cyrillic
+
+opts = OptionParser.new do |opt|
+  opt.banner = "usage: byk [options] [files]"
+  opt.summary_width = 20
+
+  opt.separator ""
+  opt.separator "options:"
+
+  opt.on("-c", "--cyrillic", "convert input to Cyrillic (default)") do
+    method_name = :to_cyrillic
+  end
+
+  opt.on("-l", "--latin", "convert input to Latin") do
+    method_name = :to_latin
+  end
+
+  opt.on("-a", "--ascii", 'convert input to "ASCII Latin"') do
+    method_name = :to_ascii_latin
+  end
+
+  opt.on_tail("-v", "--version", "show version") do
+    puts Byk::VERSION
+    exit
+  end
+end
+
+begin
+  opts.parse!
+rescue OptionParser::InvalidOption => e
+  puts e
+  puts
+  puts opts
+  exit 1
+end
+
+begin
+  puts Byk.send(method_name, ARGF.read)
+rescue => e
+  puts e
+  exit 1
+end
data/ext/byk/byk.c
CHANGED
@@ -3,103 +3,225 @@

 #define STR_ENC_GET(str) rb_enc_from_index(ENCODING_GET(str))

-
-
-
+static inline void
+_str_cat_char(VALUE str, unsigned c, rb_encoding *enc)
+{
+  char s[16];
+  int n = rb_enc_codelen(c, enc);
+  rb_enc_mbcput(c, s, enc);
+  rb_str_buf_cat(str, s, n);
+}

 enum {
-  LAT_CAP_TJ =
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-  CYR_CAP_A,
-  CYR_CAP_ZH = 0x416,
-  CYR_CAP_C = 0x426,
-  CYR_CAP_CH,
-  CYR_CAP_SH,
-  CYR_A = 0x430,
-  CYR_ZH = 0x436,
-  CYR_C = 0x446,
-  CYR_CH,
-  CYR_SH,
-  CYR_DJ = 0x452,
-  CYR_J = 0x458,
-  CYR_LJ,
-  CYR_NJ,
-  CYR_TJ,
-  CYR_DZ = 0x45f
+  LAT_CAP_TJ=262, LAT_TJ, LAT_CAP_CH=268, LAT_CH,
+  LAT_CAP_DJ=272, LAT_DJ, LAT_CAP_SH=352, LAT_SH,
+  LAT_CAP_ZH=381, LAT_ZH, CYR_CAP_DJ=1026, CYR_CAP_J=1032,
+  CYR_CAP_LJ, CYR_CAP_NJ, CYR_CAP_TJ, CYR_CAP_DZ=1039,
+  CYR_CAP_A, CYR_CAP_B, CYR_CAP_V, CYR_CAP_G,
+  CYR_CAP_D, CYR_CAP_E, CYR_CAP_ZH, CYR_CAP_Z,
+  CYR_CAP_I, CYR_CAP_K=1050, CYR_CAP_L, CYR_CAP_M,
+  CYR_CAP_N, CYR_CAP_O, CYR_CAP_P, CYR_CAP_R,
+  CYR_CAP_S, CYR_CAP_T, CYR_CAP_U, CYR_CAP_F,
+  CYR_CAP_H, CYR_CAP_C, CYR_CAP_CH, CYR_CAP_SH,
+  CYR_A=1072, CYR_B, CYR_V, CYR_G, CYR_D,
+  CYR_E, CYR_ZH, CYR_Z, CYR_I, CYR_K=1082,
+  CYR_L, CYR_M, CYR_N, CYR_O, CYR_P,
+  CYR_R, CYR_S, CYR_T, CYR_U, CYR_F,
+  CYR_H, CYR_C, CYR_CH, CYR_SH, CYR_DJ=1106,
+  CYR_J=1112, CYR_LJ, CYR_NJ, CYR_TJ, CYR_DZ=1119
 };

-static inline unsigned
-
+static inline unsigned
+is_cap(unsigned codepoint)
 {
-
+  if (codepoint >= 65 && codepoint <= 90) return 1;
+  if (codepoint >= CYR_CAP_DJ && codepoint <= CYR_CAP_SH) return 1;
+
+  switch(codepoint) {
+    case LAT_CAP_TJ:
+    case LAT_CAP_CH:
+    case LAT_CAP_DJ:
+    case LAT_CAP_SH:
+    case LAT_CAP_ZH:
+      return 1;
+    default:
+      return 0;
+  }
 }

-static inline unsigned
-
+static inline unsigned
+is_digraph(unsigned codepoint)
 {
-
-
-
-
-
-
-
+  switch(codepoint) {
+    case CYR_LJ:
+    case CYR_NJ:
+    case CYR_DZ:
+    case CYR_CAP_LJ:
+    case CYR_CAP_NJ:
+    case CYR_CAP_DZ:
+      return 1;
+    default:
+      return 0;
+  }
 }

-static
-
+static unsigned
+digraph_to_cyr(unsigned codepoint, unsigned codepoint2, unsigned capitalize, unsigned *next_out)
 {
-
-
-
+  static unsigned CYR_MAP[] = {
+    CYR_A, CYR_B, CYR_C, CYR_D, CYR_E, CYR_F,
+    CYR_G, CYR_H, CYR_I, CYR_J, CYR_K, CYR_L,
+    CYR_M, CYR_N, CYR_O, CYR_P, 0, CYR_R,
+    CYR_S, CYR_T, CYR_U, CYR_V, 0, 0, 0, CYR_Z
+  };
+
+  static unsigned CYR_CAPS_MAP[] = {
+    CYR_CAP_A, CYR_CAP_B, CYR_CAP_C, CYR_CAP_D, CYR_CAP_E, CYR_CAP_F,
+    CYR_CAP_G, CYR_CAP_H, CYR_CAP_I, CYR_CAP_J, CYR_CAP_K, CYR_CAP_L,
+    CYR_CAP_M, CYR_CAP_N, CYR_CAP_O, CYR_CAP_P, 0, CYR_CAP_R,
+    CYR_CAP_S, CYR_CAP_T, CYR_CAP_U, CYR_CAP_V, 0, 0, 0, CYR_CAP_Z
+  };
+
+  if (codepoint2 == LAT_CAP_ZH || codepoint2 == LAT_ZH) {
+    switch (codepoint) {
+      case 'd': return CYR_DZ;
+      case 'D': return CYR_CAP_DZ;
+    }
+  }
+
+  if (codepoint2 == 'j' || codepoint2 == 'J') {
+    switch (codepoint) {
+      case 'l': return CYR_LJ;
+      case 'n': return CYR_NJ;
+      case 'L': return CYR_CAP_LJ;
+      case 'N': return CYR_CAP_NJ;
+    }
+  }
+
+  if (codepoint >= 'a' && codepoint <= 'z') return CYR_MAP[codepoint - 'a'];
+  if (codepoint >= 'A' && codepoint <= 'Z') return CYR_CAPS_MAP[codepoint - 'A'];
+
+  switch (codepoint) {
+    case LAT_CH: return CYR_CH;
+    case LAT_DJ: return CYR_DJ;
+    case LAT_SH: return CYR_SH;
+    case LAT_TJ: return CYR_TJ;
+    case LAT_ZH: return CYR_ZH;
+    case LAT_CAP_CH: return CYR_CAP_CH;
+    case LAT_CAP_DJ: return CYR_CAP_DJ;
+    case LAT_CAP_SH: return CYR_CAP_SH;
+    case LAT_CAP_TJ: return CYR_CAP_TJ;
+    case LAT_CAP_ZH: return CYR_CAP_ZH;
+  }
+
+  return 0;
 }

-static
-
+static unsigned
+digraph_to_latin(unsigned codepoint, unsigned codepoint2, unsigned capitalize, unsigned *next_out)
 {
-  char
-
-
-
+  static char LAT_MAP[] = {
+    'a', 'b', 'v', 'g', 'd', 'e', 0, 'z', 'i', 0, 'k', 'l',
+    'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'f', 'h', 'c'
+  };
+
+  static char LAT_CAPS_MAP[] = {
+    'A', 'B', 'V', 'G', 'D', 'E', 0, 'Z', 'I', 0, 'K', 'L',
+    'M', 'N', 'O', 'P', 'R', 'S', 'T', 'U', 'F', 'H', 'C'
+  };
+
+  if (codepoint < CYR_CAP_DJ || codepoint > CYR_DZ) return 0;
+
+  switch (codepoint) {
+    case CYR_ZH: return LAT_ZH;
+    case CYR_CAP_ZH: return LAT_CAP_ZH;
+  }
+
+  if (codepoint >= CYR_A && codepoint <= CYR_C)
+    return LAT_MAP[codepoint - CYR_A];
+
+  if (codepoint >= CYR_CAP_A && codepoint <= CYR_CAP_C)
+    return LAT_CAPS_MAP[codepoint - CYR_CAP_A];
+
+  if (codepoint >= CYR_A) {
+    switch (codepoint) {
+      case CYR_J: return 'j';
+      case CYR_TJ: return LAT_TJ;
+      case CYR_CH: return LAT_CH;
+      case CYR_SH: return LAT_SH;
+      case CYR_DJ: return LAT_DJ;
+      case CYR_LJ: *next_out = 'j'; return 'l';
+      case CYR_NJ: *next_out = 'j'; return 'n';
+      case CYR_DZ: *next_out = LAT_ZH; return 'd';
+    }
+  }
+  else {
+    switch (codepoint) {
+      case CYR_CAP_J: return 'J';
+      case CYR_CAP_TJ: return LAT_CAP_TJ;
+      case CYR_CAP_CH: return LAT_CAP_CH;
+      case CYR_CAP_SH: return LAT_CAP_SH;
+      case CYR_CAP_DJ: return LAT_CAP_DJ;
+      case CYR_CAP_LJ: *next_out = (capitalize || is_cap(codepoint2)) ? 'J' : 'j'; return 'L';
+      case CYR_CAP_NJ: *next_out = (capitalize || is_cap(codepoint2)) ? 'J' : 'j'; return 'N';
+      case CYR_CAP_DZ: *next_out = (capitalize || is_cap(codepoint2)) ? LAT_CAP_ZH : LAT_ZH; return 'D';
+    }
+  }
+
+  return 0;
+}
+
+static unsigned
+digraph_to_ascii(unsigned codepoint, unsigned codepoint2, unsigned capitalize, unsigned *next_out)
+{
+  switch (codepoint) {
+    case LAT_TJ:
+    case LAT_CH:
+    case CYR_TJ:
+    case CYR_CH: return 'c';
+    case LAT_SH:
+    case CYR_SH: return 's';
+    case LAT_ZH:
+    case CYR_ZH: return 'z';
+    case LAT_DJ:
+    case CYR_DJ: *next_out = 'j'; return 'd';
+    case LAT_CAP_TJ:
+    case LAT_CAP_CH:
+    case CYR_CAP_TJ:
+    case CYR_CAP_CH: return 'C';
+    case LAT_CAP_SH:
+    case CYR_CAP_SH: return 'S';
+    case LAT_CAP_ZH:
+    case CYR_CAP_ZH: return 'Z';
+    case LAT_CAP_DJ:
+    case CYR_CAP_DJ:
+      *next_out = (capitalize || is_cap(codepoint2)) ? 'J' : 'j'; return 'D';
+    case CYR_DZ:
+      *next_out = (capitalize || is_cap(codepoint2)) ? 'Z' : 'z'; return 'd';
+    case CYR_CAP_DZ:
+      *next_out = (capitalize || is_cap(codepoint2)) ? 'Z' : 'z'; return 'D';
+    default:
+      return digraph_to_latin(codepoint, codepoint2, capitalize, next_out);
+  }
 }

 static VALUE
-
+str_to_srb(VALUE str, int strategy, int bang)
 {
   VALUE dest;
-
+  rb_encoding *enc;
+
   int len, next_len;
-
-  int force_upper = 0;
+  unsigned in, in2, out, out2, seen_cap = 0;
   char *pos, *end, *seq_start = 0;
-  char cyr;
-  unsigned int codepoint = 0;
-  unsigned int next_codepoint = 0;
-  rb_encoding *enc;

-
-    'a', 'b', 'v', 'g', 'd', 'e', '\0', 'z', 'i', '\0', 'k',
-    'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'f', 'h', 'c'
-  };
+  unsigned (*method)(unsigned, unsigned, unsigned, unsigned*);

-
-
-
-
+  switch(strategy) {
+    case 0: method = &digraph_to_cyr; break;
+    case 1: method = &digraph_to_latin; break;
+    default: method = &digraph_to_ascii;
+  }

   StringValue(str);
   pos = RSTRING_PTR(str);
@@ -107,123 +229,50 @@ str_to_latin(VALUE str, int ascii, int bang)

   end = RSTRING_END(str);
   enc = STR_ENC_GET(str);
-
-  dest = rb_str_buf_new(dest_len);
+  dest = rb_str_buf_new(RSTRING_LEN(str) + 30);
   rb_enc_associate(dest, enc);

-
+  in = rb_enc_codepoint_len(pos, end, &len, enc);

   while (pos < end) {
-
-      next_codepoint = rb_enc_codepoint_len(pos + len, end, &next_len, enc);
-    }
+    in2 = out2 = 0;

-
-
-      if (seq_start) {
-        rb_str_buf_cat(dest, seq_start, pos - seq_start);
-        seq_start = 0;
-      }
+    if (pos + len < end)
+      in2 = rb_enc_codepoint_len(pos + len, end, &next_len, enc);

-
-        case LAT_TJ:
-        case LAT_CH: rb_str_buf_cat(dest, "c", 1); break;
-        case LAT_DJ: rb_str_buf_cat(dest, "dj", 2); break;
-        case LAT_SH: rb_str_buf_cat(dest, "s", 1); break;
-        case LAT_ZH: rb_str_buf_cat(dest, "z", 1); break;
-        case LAT_CAP_TJ:
-        case LAT_CAP_CH: rb_str_buf_cat(dest, "C", 1); break;
-        case LAT_CAP_SH: rb_str_buf_cat(dest, "S", 1); break;
-        case LAT_CAP_ZH: rb_str_buf_cat(dest, "Z", 1); break;
-        case LAT_CAP_DJ:
-          (seen_upper || is_upper(next_codepoint))
-            ? rb_str_buf_cat(dest, "DJ", 2)
-            : rb_str_buf_cat(dest, "Dj", 2);
-          break;
-        default:
-          rb_str_buf_cat(dest, pos, len);
-      }
-    }
+    out = (*method)(in, in2, seen_cap, &out2);

-
-
+    if (out) {
+      /* flush previous untranslatable sequence */
       if (seq_start) {
         rb_str_buf_cat(dest, seq_start, pos - seq_start);
         seq_start = 0;
       }

-
-
-          cyr = CYR_MAP[codepoint - CYR_A];
-          cyr ? rb_str_buf_cat(dest, &cyr, 1)
-              : rb_str_buf_cat(dest, pos, len);
-        }
-        else {
-          switch (codepoint) {
-            case CYR_J: rb_str_buf_cat(dest, "j", 1); break;
-            case CYR_LJ: rb_str_buf_cat(dest, "lj", 2); break;
-            case CYR_NJ: rb_str_buf_cat(dest, "nj", 2); break;
-            case CYR_DJ: STR_CAT_COND_ASCII(ascii, dest, "dj", LAT_DJ, 2, enc); break;
-            case CYR_TJ: STR_CAT_COND_ASCII(ascii, dest, "c", LAT_TJ, 1, enc); break;
-            case CYR_CH: STR_CAT_COND_ASCII(ascii, dest, "c", LAT_CH, 1, enc); break;
-            case CYR_SH: STR_CAT_COND_ASCII(ascii, dest, "s", LAT_SH, 1, enc); break;
-            case CYR_ZH: STR_CAT_COND_ASCII(ascii, dest, "z", LAT_ZH, 1, enc); break;
-            case CYR_DZ:
-              rb_str_buf_cat(dest, "d", 1);
-              STR_CAT_COND_ASCII(ascii, dest, "z", LAT_ZH, 1, enc);
-              break;
-            default:
-              rb_str_buf_cat(dest, pos, len);
-          }
-        }
-      }
-      else {
-        if (maps_directly(codepoint)) {
-          cyr = CYR_CAPS_MAP[codepoint - CYR_CAP_A];
-          cyr ? rb_str_buf_cat(dest, &cyr, 1)
-              : rb_str_buf_cat(dest, pos, len);
-        }
-        else {
-          force_upper = seen_upper || is_upper(next_codepoint);
-
-          switch (codepoint) {
-            case CYR_CAP_J: rb_str_buf_cat(dest, "J", 1); break;
-            case CYR_CAP_LJ: rb_str_buf_cat(dest, (force_upper ? "LJ" : "Lj"), 2); break;
-            case CYR_CAP_NJ: rb_str_buf_cat(dest, (force_upper ? "NJ" : "Nj"), 2); break;
-            case CYR_CAP_TJ: STR_CAT_COND_ASCII(ascii, dest, "C", LAT_CAP_TJ, 1, enc); break;
-            case CYR_CAP_CH: STR_CAT_COND_ASCII(ascii, dest, "C", LAT_CAP_CH, 1, enc); break;
-            case CYR_CAP_SH: STR_CAT_COND_ASCII(ascii, dest, "S", LAT_CAP_SH, 1, enc); break;
-            case CYR_CAP_ZH: STR_CAT_COND_ASCII(ascii, dest, "Z", LAT_CAP_ZH, 1, enc); break;
-            case CYR_CAP_DJ: STR_CAT_COND_ASCII(ascii, dest, (force_upper ? "DJ" : "Dj"), LAT_CAP_DJ, 2, enc); break;
-            case CYR_CAP_DZ:
-              rb_str_buf_cat(dest, "D", 1);
-              force_upper ? STR_CAT_COND_ASCII(ascii, dest, "Z", LAT_CAP_ZH, 1, enc)
-                          : STR_CAT_COND_ASCII(ascii, dest, "z", LAT_ZH, 1, enc);
-              break;
-            default:
-              rb_str_buf_cat(dest, pos, len);
-          }
-        }
-      }
+      _str_cat_char(dest, out, enc);
+      if (out2) _str_cat_char(dest, out2, enc);
     }
-    else {
-      /*
-
+    else if (!seq_start) {
+      /* mark the beginning of an untranslatable sequence */
+      seq_start = pos;
+    }
+
+    /* for cyrillic output, skip the second half of an input digraph */
+    if (strategy == 0 && is_digraph(out)) {
+      pos += next_len;
+      if (pos + len < end)
+        in2 = rb_enc_codepoint_len(pos + len, end, &next_len, enc);
     }

-
+    seen_cap = is_cap(in);

     pos += len;
     len = next_len;
-
-    codepoint = next_codepoint;
-    next_codepoint = 0;
+    in = in2;
   }

-  /*
-  if (seq_start)
-    rb_str_buf_cat(dest, seq_start, pos - seq_start);
-  }
+  /* flush final sequence */
+  if (seq_start) rb_str_buf_cat(dest, seq_start, pos - seq_start);

   if (bang) {
     rb_str_shared_replace(str, dest);
@@ -237,7 +286,35 @@ str_to_latin(VALUE str, int ascii, int bang)
 }

 /**
- * Returns a copy of <i>str</i> with
+ * Returns a copy of <i>str</i> with Latin characters transliterated
+ * into Serbian Cyrillic.
+ *
+ * @overload to_cyrillic(str)
+ * @param [String] str text to be transliterated
+ * @return [String] transliterated text
+ */
+static VALUE
+rb_str_to_cyrillic(VALUE self, VALUE str)
+{
+  return str_to_srb(str, 0, 0);
+}
+
+/**
+ * Performs transliteration of <code>Byk.to_cyrillic</code> in place,
+ * returning <i>str</i>, whether any changes were made or not.
+ *
+ * @overload to_cyrillic!(str)
+ * @param [String] str text to be transliterated
+ * @return [String] transliterated text
+ */
+static VALUE
+rb_str_to_cyrillic_bang(VALUE self, VALUE str)
+{
+  return str_to_srb(str, 0, 1);
+}
+
+/**
+ * Returns a copy of <i>str</i> with Serbian Cyrillic characters
  * transliterated into Latin.
  *
  * @overload to_latin(str)
@@ -247,12 +324,12 @@ str_to_latin(VALUE str, int ascii, int bang)
 static VALUE
 rb_str_to_latin(VALUE self, VALUE str)
 {
-  return
+  return str_to_srb(str, 1, 0);
 }

 /**
- * Performs
- * returning <i>str</i>, whether changes were made or not.
+ * Performs transliteration of <code>Byk.to_latin</code> in place,
+ * returning <i>str</i>, whether any changes were made or not.
  *
  * @overload to_latin!(str)
  * @param [String] str text to be transliterated
@@ -261,12 +338,12 @@ rb_str_to_latin(VALUE self, VALUE str)
 static VALUE
 rb_str_to_latin_bang(VALUE self, VALUE str)
 {
-  return
+  return str_to_srb(str, 1, 1);
 }

 /**
- * Returns a copy of <i>str</i> with
- *
+ * Returns a copy of <i>str</i> with Serbian characters transliterated
+ * into ASCII Latin.
  *
  * @overload to_ascii_latin(str)
  * @param [String] str text to be transliterated
@@ -275,12 +352,12 @@ rb_str_to_latin_bang(VALUE self, VALUE str)
 static VALUE
 rb_str_to_ascii_latin(VALUE self, VALUE str)
 {
-  return
+  return str_to_srb(str, 2, 0);
 }

 /**
- * Performs
- * place, returning <i>str</i>, whether changes were made or not.
+ * Performs transliteration of <code>Byk.to_ascii_latin</code> in
+ * place, returning <i>str</i>, whether any changes were made or not.
  *
  * @overload to_ascii_latin!(str)
  * @param [String] str text to be transliterated
@@ -289,12 +366,14 @@ rb_str_to_ascii_latin(VALUE self, VALUE str)
 static VALUE
 rb_str_to_ascii_latin_bang(VALUE self, VALUE str)
 {
-  return
+  return str_to_srb(str, 2, 1);
 }

 void Init_byk_native(void)
 {
   VALUE Byk = rb_define_module("Byk");
+  rb_define_singleton_method(Byk, "to_cyrillic", rb_str_to_cyrillic, 1);
+  rb_define_singleton_method(Byk, "to_cyrillic!", rb_str_to_cyrillic_bang, 1);
   rb_define_singleton_method(Byk, "to_latin", rb_str_to_latin, 1);
   rb_define_singleton_method(Byk, "to_latin!", rb_str_to_latin_bang, 1);
   rb_define_singleton_method(Byk, "to_ascii_latin", rb_str_to_ascii_latin, 1);
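The one-character lookahead that `digraph_to_cyr` and the "skip the second half of an input digraph" step rely on can be sketched in plain Ruby. This is an illustration under assumptions, not a port of the C code: `LATIN_DIGRAPHS` and `collapse_latin_digraphs` are hypothetical names, only the three digraphs are converted, and Ruby 2.4+ is assumed for Unicode-aware `downcase`. As in the C function, the case of the first letter decides the case of the resulting Cyrillic letter, whatever the case of the second.

```ruby
# Illustrative sketch only, not the gem's C code: collapse Latin
# digraphs into single Cyrillic letters using one character of lookahead.
LATIN_DIGRAPHS = {
  "lj" => "љ", "nj" => "њ", "dž" => "џ",
  "Lj" => "Љ", "Nj" => "Њ", "Dž" => "Џ"
}.freeze

# Hypothetical name; characters outside the three digraphs pass through.
def collapse_latin_digraphs(text)
  chars = text.chars
  out = []
  i = 0
  while i < chars.length
    first = chars[i]
    second = chars[i + 1]
    # Normalize only the second half, so "lJ" and "lj" share one key.
    key = second ? first + second.downcase : nil
    if key && LATIN_DIGRAPHS.key?(key)
      out << LATIN_DIGRAPHS[key]
      i += 2 # consume both halves of the digraph
    else
      out << first
      i += 1
    end
  end
  out.join
end

puts collapse_latin_digraphs("lJ")  # => љ
puts collapse_latin_digraphs("LJ")  # => Љ
```

The `"lJ"`, `"nJ"` and `"dŽ"` inputs mirror the cyrillization edge cases added to the spec.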
data/lib/byk/version.rb
CHANGED
data/spec/byk_spec.rb
CHANGED
@@ -1,5 +1,4 @@
 # coding: utf-8
-
 require "spec_helper"

 describe Byk do
@@ -24,70 +23,114 @@ describe Byk do
|
|
24
23
|
let(:non_serbian_cyrillic) { non_serbian_cyrillic_coderange.join }
|
25
24
|
|
26
25
|
let(:ascii) { "The quick brown fox jumps over the lazy dog." }
|
27
|
-
let(:other) { "संस्कृतम्
|
26
|
+
let(:other) { "संस्कृतम्" }
|
28
27
|
|
29
|
-
let(:mixed) { "संस्कृतम्
|
30
|
-
let(:
|
31
|
-
let(:
|
28
|
+
let(:mixed) { "संस्कृतम् илити Sanskrit, obrati ПАЖЊУ." }
|
29
|
+
let(:mixed_cyrillic) { "संस्कृतम् илити Санскрит, обрати ПАЖЊУ." }
|
30
|
+
let(:mixed_latin) { "संस्कृतम् iliti Sanskrit, obrati PAŽNJU." }
|
31
|
+
let(:mixed_ascii_latin) { "संस्कृतम् iliti Sanskrit, obrati PAZNJU." }
|
32
32
|
|
33
|
-
it "doesn't
|
33
|
+
it "doesn't translate an empty string" do
|
34
34
|
expect(Byk.send(method, "")).to eq ""
|
35
35
|
end
|
36
36
|
|
37
|
-
it "doesn't
|
38
|
-
expect(Byk.send(method,
|
37
|
+
it "doesn't translate foreign coderanges" do
|
38
|
+
expect(Byk.send(method, other)).to eq other
|
39
39
|
end
|
40
|
+
end
|
40
41
|
|
41
|
-
|
42
|
+
shared_examples :cyrillization_method do |method|
|
43
|
+
include_examples :base, method
|
44
|
+
|
45
|
+
let(:edge_cases) do
|
46
|
+
[
|
47
|
+
["lJ", "љ"],
|
48
|
+
["nJ", "њ"],
|
49
|
+
["dŽ", "џ"]
|
50
|
+
]
|
51
|
+
end
|
52
|
+
|
53
|
+
it "doesn't translate Cyrillic" do
|
54
|
+
expect(Byk.send(method, pangram)).to eq pangram
|
55
|
+
end
|
56
|
+
|
57
|
+
it "doesn't translate non-Serbian Cyrillic" do
|
42
58
|
expect(Byk.send(method, non_serbian_cyrillic)).to eq non_serbian_cyrillic
|
43
59
|
end
|
44
60
|
|
45
|
-
it "
|
46
|
-
expect(Byk.send(method,
|
61
|
+
it "translates Latin to Cyrillic" do
|
62
|
+
expect(Byk.send(method, pangram_latin)).to eq pangram
|
63
|
+
end
|
64
|
+
|
65
|
+
it "translates Latin caps to Cyrillic caps" do
|
66
|
+
expect(Byk.send(method, pangram_latin_caps)).to eq pangram_caps
|
67
|
+
end
|
68
|
+
|
69
|
+
it "translates mixed text properly" do
|
70
|
+
expect(Byk.send(method, mixed)).to eq mixed_cyrillic
|
71
|
+
end
|
72
|
+
|
73
|
+
it "translates edge cases properly" do
|
74
|
+
edge_cases.each do |input, output|
+        expect(Byk.send(method, input)).to eq output
+      end
+    end
+
+    it "translates ABECEDA to AZBUKA" do
+      expect(Byk::ABECEDA.map { |l| l.dup.send(:to_cyrillic) }).to match_array(Byk::AZBUKA)
+    end
+
+    it "translates ABECEDA_CAPS to AZBUKA_CAPS" do
+      expect(Byk::ABECEDA_CAPS.map { |l| l.dup.send(:to_cyrillic) }).to match_array(Byk::AZBUKA_CAPS)
     end
   end
 
   shared_examples :latinization_method do |method|
     include_examples :base, method
 
-    let(:edge_cases)
+    let(:edge_cases) do
       [
-        ["Њ", "Nj"],
-        ["Љ", "Lj"],
-        ["Џ", "Dž"],
-        ["ЊЊ", "NJNJ"],
         ["ЉЉ", "LJLJ"],
+        ["ЊЊ", "NJNJ"],
         ["ЏЏ", "DŽDŽ"]
       ]
-
+    end
 
-    it "doesn't
+    it "doesn't translate ASCII" do
+      expect(Byk.send(method, ascii)).to eq ascii
+    end
+
+    it "doesn't translate Latin" do
       expect(Byk.send(method, pangram_latin)).to eq pangram_latin
     end
 
-    it "
+    it "doesn't translate non-Serbian Cyrillic" do
+      expect(Byk.send(method, non_serbian_cyrillic)).to eq non_serbian_cyrillic
+    end
+
+    it "translates Cyrillic to Latin" do
       expect(Byk.send(method, pangram)).to eq pangram_latin
     end
 
-    it "
+    it "translates Cyrillic caps to Latin caps" do
       expect(Byk.send(method, pangram_caps)).to eq pangram_latin_caps
     end
 
-    it "
+    it "translates mixed text properly" do
       expect(Byk.send(method, mixed)).to eq mixed_latin
     end
 
-    it "
+    it "translates edge cases properly" do
       edge_cases.each do |input, output|
         expect(Byk.send(method, input)).to eq output
       end
     end
 
-    it "
+    it "translates AZBUKA to ABECEDA" do
       expect(Byk::AZBUKA.map { |l| l.dup.send(method) }).to match_array(Byk::ABECEDA)
     end
 
-    it "
+    it "translates AZBUKA_CAPS to ABECEDA_CAPS" do
       expect(Byk::AZBUKA_CAPS.map { |l| l.dup.send(method) }).to match_array(Byk::ABECEDA_CAPS)
     end
   end
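The `edge_cases` pairs in these specs suggest a casing rule for the letters that latinize to digraphs: a lone capital becomes title case (`"Љ"` to `"Lj"`), while a capital next to another capital becomes fully upper case (`"ЉЉ"` to `"LJLJ"`). What follows is a minimal pure-Ruby sketch of one rule that is consistent with those pairs. It is not the gem's implementation, which lives in `ext/byk/byk.c`; the names `DIGRAPHS`, `upper?` and `latinize_digraphs` are hypothetical, and the sketch assumes Ruby 2.4 or later for Unicode-aware `upcase` and `downcase`.

```ruby
# Hypothetical sketch, not the gem's C code: only the three capital
# digraph letters are handled, everything else passes through unchanged.
DIGRAPHS = { "Љ" => "Lj", "Њ" => "Nj", "Џ" => "Dž" }.freeze

# True when +ch+ is an uppercase letter, i.e. it has a distinct lowercase form.
def upper?(ch)
  !ch.nil? && ch != ch.downcase
end

def latinize_digraphs(text)
  chars = text.chars
  chars.each_with_index.map do |ch, i|
    latin = DIGRAPHS[ch]
    next ch unless latin

    # Assumed rule: use the all-caps digraph when either neighbour is uppercase.
    neighbour_upper = upper?(chars[i + 1]) || (i > 0 && upper?(chars[i - 1]))
    neighbour_upper ? latin.upcase : latin
  end.join
end
```

Under this assumed rule, `latinize_digraphs("Љ")` yields `"Lj"` and `latinize_digraphs("ЉЉ")` yields `"LJLJ"`, matching the pairs listed in the specs.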
@@ -95,7 +138,7 @@ describe Byk do
   shared_examples :ascii_latinization_method do |method|
     include_examples :base, method
 
-    let(:edge_cases)
+    let(:edge_cases) do
       [
         ["Њ", "Nj"],
         ["Љ", "Lj"],
@@ -107,32 +150,36 @@ describe Byk do
         ["ЏЏ", "DZDZ"],
         ["ЂЂ", "DJDJ"],
         ["ĐĐ", "DJDJ"],
-        ["ЂУРАЂ
-        ["ĐURAĐ
+        ["ЂУРАЂ Ђурђевић", "DJURADJ Djurdjevic"],
+        ["ĐURAĐ Đurđević", "DJURADJ Djurdjevic"]
       ]
-    }
-
-    it "converts Cyrillic to ASCII Latin" do
-      expect(Byk.send(method, pangram)).to eq pangram_ascii_latin
     end
 
-    it "
-      expect(Byk.send(method,
+    it "doesn't translate ASCII" do
+      expect(Byk.send(method, ascii)).to eq ascii
     end
 
-    it "
+    it "translates Latin to ASCII Latin" do
       expect(Byk.send(method, pangram_latin)).to eq pangram_ascii_latin
     end
 
-    it "
+    it "translates Latin caps to ASCII Latin caps" do
       expect(Byk.send(method, pangram_latin_caps)).to eq pangram_ascii_latin_caps
     end
 
-    it "
+    it "translates Cyrillic to ASCII Latin" do
+      expect(Byk.send(method, pangram)).to eq pangram_ascii_latin
+    end
+
+    it "translates Cyrillic caps to ASCII Latin caps" do
+      expect(Byk.send(method, pangram_caps)).to eq pangram_ascii_latin_caps
+    end
+
+    it "translates mixed text properly" do
       expect(Byk.send(method, mixed)).to eq mixed_ascii_latin
     end
 
-    it "
+    it "translates edge cases properly" do
       edge_cases.each do |input, output|
         expect(Byk.send(method, input)).to eq output
       end
@@ -141,18 +188,28 @@ describe Byk do
 
   shared_examples :non_destructive_method do |method|
     it "doesn't modify the arg" do
-      str = "
+      str = "ЖŽ"
       expect { Byk.send(method, str) }.to_not change { str }
     end
   end
 
   shared_examples :destructive_method do |method|
     it "modifies the arg" do
-      str = "
+      str = "ЖŽ"
       expect { Byk.send(method, str) }.to change { str }
     end
   end
 
+  describe ".to_cyrillic" do
+    it_behaves_like :cyrillization_method, :to_cyrillic
+    it_behaves_like :non_destructive_method, :to_cyrillic
+  end
+
+  describe ".to_cyrillic!" do
+    it_behaves_like :cyrillization_method, :to_cyrillic!
+    it_behaves_like :destructive_method, :to_cyrillic!
+  end
+
   describe ".to_latin" do
     it_behaves_like :latinization_method, :to_latin
     it_behaves_like :non_destructive_method, :to_latin
@@ -176,7 +233,7 @@ end
 
 describe String do
   it "responds to Byk methods" do
-    Byk.
+    Byk.singleton_methods.each do |method|
       expect("").to respond_to(method)
     end
   end
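The specs above exercise two conventions: each method comes in a non-destructive form and a destructive `!` form, and every singleton method on the module is also callable on `String`. A minimal sketch of that pattern follows, assuming a hypothetical `Translit` module with a one-letter mapping table. It illustrates the convention only and says nothing about how Byk itself is wired up.

```ruby
# Hypothetical module; the name and the tiny MAP are illustrative only.
module Translit
  MAP = { "Ж" => "Ž", "ж" => "ž" }.freeze

  # Non-destructive: returns a new string and leaves the argument untouched.
  def self.to_latin(str)
    str.gsub(/[Жж]/, MAP)
  end

  # Destructive: overwrites the argument in place and returns it.
  def self.to_latin!(str)
    str.replace(to_latin(str))
  end
end

# Mirror every module-level method onto String, delegating with self as
# the argument, so the check "".respond_to?(name) holds for each of them.
class String
  Translit.singleton_methods.each do |name|
    define_method(name) { Translit.public_send(name, self) }
  end
end
```

With this in place, `"ЖŽ".to_latin` returns a new string, while calling `to_latin!` on a mutable string (for example `"ЖŽ".dup`) changes the receiver itself, which is the distinction the `:non_destructive_method` and `:destructive_method` shared examples check.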
metadata
CHANGED
@@ -1,15 +1,29 @@
 --- !ruby/object:Gem::Specification
 name: byk
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 1.0.0
 platform: ruby
 authors:
 - Nikola Topalović
 autorequire:
-bindir:
+bindir: exe
 cert_chain: []
-date:
+date: 2016-04-09 00:00:00.000000000 Z
 dependencies:
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '10.5'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '10.5'
 - !ruby/object:Gem::Dependency
   name: rake-compiler
   requirement: !ruby/object:Gem::Requirement
@@ -38,10 +52,11 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '3.2'
-description:
-
+description: Fast transliteration of Serbian Cyrillic to Latin and back. Brzo preslovljavanje
+  ćirilice u latinicu i obratno.
 email: nikola.topalovic@gmail.com
-executables:
+executables:
+- byk
 extensions:
 - ext/byk/extconf.rb
 extra_rdoc_files: []
@@ -49,6 +64,7 @@ files:
 - CHANGELOG.md
 - LICENSE
 - README.md
+- exe/byk
 - ext/byk/byk.c
 - ext/byk/extconf.rb
 - lib/byk.rb
@@ -76,9 +92,10 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 2.
+rubygems_version: 2.5.1
 signing_key:
 specification_version: 4
-summary: Fast transliteration of Serbian Cyrillic
+summary: Fast transliteration of Serbian Cyrillic to Latin and back. Brzo preslovljavanje
+  ćirilice u latinicu i obratno.
 test_files:
 - spec/byk_spec.rb