byk 0.6.0 → 1.0.0 (RubyGems)

Files changed (8)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 34203e0b4291cde495d17da65522df586de7e712
-  data.tar.gz: 290d743dab23c58241520252bd81d4ae4115ce98
+  metadata.gz: cc996c9d9dc81f884e02cc1dd760eeb57b6545fc
+  data.tar.gz: de07860c2cb41bcb39b299fee4500fd2bf01db73
 SHA512:
-  metadata.gz: f11d00e9ac1057a5596804e03c6c4a6c41841bedc21030a9ed776cfbaaabba85a341a62de71c990c8deadc8f8384bf263b41d477b33b299af26b55acef47fe0c
-  data.tar.gz: 335ddfeca9f6793f2887c1cc93cfc916e011f7dd01fd97073162871148d0fe61395bdb1115c5ed4f7583ff207f6b6d27462c8920a7a70b016b9427420796bc28
+  metadata.gz: 16e97855924c380b205e2e651fdcde391785fe051c2971d948f801ff4260eb691dc4c3304ac17b3083fc0a2469f26d134c9622f74b058f6950d5fd8dfaf62383
+  data.tar.gz: c85659aaaccbc5e1db30305b52e2f4955de160dcb7a617ae564877619fe5f36d852ea7c17228f301e639aae3d4793133baa5d644faeffc83942d6b179bef53e9

data/CHANGELOG.md CHANGED

@@ -1,5 +1,10 @@
 # Changelog
+### Byk 1.0.0 (2016-04-09)
+* Introduced `#to_cyrillic` and `#to_cyrillic!`
+* Introduced console utility
 ### Byk 0.6.0 (2015-04-25)
 * Introduced module methods and the optional safe require

data/README.md CHANGED

@@ -4,39 +4,85 @@ Byk
 [![Gem Version](https://badge.fury.io/rb/byk.svg)](https://rubygems.org/gems/byk)
 [![Build Status](https://travis-ci.org/topalovic/byk.svg?branch=master)](https://travis-ci.org/topalovic/byk)
-Ruby gem for fast transliteration of Serbian Cyrillic into Latin
-<br />
-<sub>Inspired by @dejan's
-[nice little gem](https://github.com/dejan/srbovanje),
-this one comes with a C-optimized twist</sub>
+Ruby gem for fast transliteration of Serbian Cyrillic ↔ Latin
 ![byk](https://cloud.githubusercontent.com/assets/626128/7155207/07545960-e35d-11e4-804e-5fdee70a3e30.png)
 ## Installation
-Add this line to your application's Gemfile:
+Byk can be used as a standalone console utility or as a `String`
+extension in your Ruby programs. It has zero dependencies beyond
+vanilla Ruby and the toolchain for building native gems <sup>1</sup>.
+You can install it directly:
+```ruby
+$ gem install byk
+```
+or add it as a dependency in your application's Gemfile:
 ```ruby
 gem "byk"
 ```
-And then execute:
+<sub><sup>1</sup> For Windows, you might want to check out
+[DevKit](https://github.com/oneclick/rubyinstaller/wiki/Development-Kit)</sub>
+## Usage
+### As a standalone utility
+Here's the help banner with all the available options:
 ```
-$ bundle
+usage: byk [options] [files]
+options:
+  -c, --cyrillic       convert input to Cyrillic (default)
+  -l, --latin          convert input to Latin
+  -a, --ascii          convert input to "ASCII Latin"
+  -v, --version        show version
 ```
-Or install it yourself as:
+Translation goes to stdout so you can redirect it or pipe it as you
+see fit. Let's take a look at some common scenarios.
+To translate files to Cyrillic:
+```sh
+$ byk in1.txt in2.txt > out.txt
 ```
-$ gem install byk
+To translate files to Latin and search for a phrase:
+```sh
+$ byk -l file.txt | grep stvar
 ```
+Ad hoc conversion:
+```sh
+$ echo "Вук Стефановић Караџић" | byk -a
+Vuk Stefanovic Karadzic
+```
-## Usage
+or simply omit args and type away:
+```sh
+$ byk
+a u ruke Mandušića Vuka
+biće svaka puška ubojita!
+^D
+а у руке Мандушића Вука
+биће свака пушка убојита!
+```
-First, make sure to require the gem in your initializer:
+`^D` being <kbd>ctrl</kbd> <kbd>d</kbd>.
+### As a `String` extension
+Unless you're using Bundler, make sure to require the gem in your
+initializer:
 ```ruby
 require "byk"
@@ -45,22 +91,23 @@ require "byk"
 This will extend `String` with a couple of simple methods:
 ```ruby
-"Шеширџија".to_latin         # => "Šeširdžija"
-"Шеширџија".to_ascii_latin   # => "Sesirdzija"
-"Šeširdžija".to_ascii_latin  # => "Sesirdzija"
+"Šeširdžija".to_cyrillic    # => "Шеширџија"
+"Шеширџија".to_latin        # => "Šeširdžija"
+"Шеширџија".to_ascii_latin  # => "Sesirdzija"
 ```
-There's also a destructive variant of each:
+These do not modify the receiver. For that, there's a destructive
+variant of each:
 ```ruby
-text = "Жвазбука"
-text.to_latin!        # => "Žvazbuka"
-text                  # => "Žvazbuka"
-text.to_ascii_latin!  # => "Zvazbuka"
-text                  # => "Zvazbuka"
+text = "Šeširdžija"
+text.to_cyrillic!     # => "Шеширџија"
+text.to_latin!        # => "Šeširdžija"
+text.to_ascii_latin!  # => "Sesirdzija"
+text                  # => "Sesirdzija"
 ```
-Note that these methods take into account the
+Note that both latinization methods observe
 [digraph capitalization rules](http://sr.wikipedia.org/wiki/Гајица#.D0.94.D0.B8.D0.B3.D1.80.D0.B0.D1.84.D0.B8):
 ```ruby
@@ -68,63 +115,88 @@ Note that these methods take into account the
 "ĐORĐE Đorđević".to_ascii_latin  # => "DJORDJE Djordjevic"
 ```
-If you prefer not to monkey patch your strings, you can use the "safe"
-require:
+### Safe require
+If you prefer not to monkey patch `String`, you can do a "safe"
+require in your Gemfile:
 ```ruby
-require "byk/safe"
+gem "byk", :require => "byk/safe"
 ```
-and then call the module methods:
+or initializer:
 ```ruby
-text = "Вук"
-Byk.to_latin(text)   # => "Vuk"
-text                 # => "Byk"
-Byk.to_latin!(text)  # => "Vuk"
-text                 # => "Vuk"
+require "byk/safe"
 ```
+Then, you should rely on module methods:
-## Testing
+```ruby
+text = "Жвазбука"
-To test the gem, clone the repo and run:
+Byk.to_latin(text)   # => "Žvazbuka"
+text                 # => "Жвазбука"
+Byk.to_latin!(text)  # => "Žvazbuka"
+text                 # => "Žvazbuka"
+# etc.
 ```
-$ bundle
-$ bundle exec rake
+## How fast is "fast" transliteration?
+Here's a quick test:
+```sh
+$ wget https://sr.wikipedia.org/ -O sample
+$ du -h sample
+128K
+$ time byk -l sample > /dev/null
+0.08s user 0.04s system 96% cpu 0.126 total
 ```
+Let's up the ante:
+```sh
+$ for i in {1..800}; do cat sample; done > big
+$ du -h big
+97M
+$ time byk -l big > /dev/null
+1.71s user 0.13s system 99% cpu 1.846 total
+```
-## How fast is fast?
+So, ~100MB in under 2s. Fast enough, I suppose. You can expect it to
+scale linearly.
-About [10-40x faster](benchmark) than the baseline Ruby implementation
-on my hardware, depending on the string's Cyrillic content ratio. YMMV
-of course.
+Compared to the pure Ruby implementation, it is about
+[10-30x faster](benchmark), depending on the input composition and the
+transliteration method applied.
-## Raison d'être
+## Testing
-This kind of speed-up might be worthwhile for massive localization
-projects, e.g. sites supporting dual script content. Remember,
-`Benchmark` is your friend.
+To test the gem, clone the repo and run:
-I found transliteration to be a straightforward little problem that
-lends itself well to optimization. It also gave me an excuse to play
-with Ruby extensions, so there :smirk_cat:
+```
+$ bundle && bundle exec rake
+```
 ## Compatibility
-Byk is supported under MRI Ruby >= 1.9.2.
+Byk is supported under MRI 1.9.2+. I might try my hand in writing a
+JRuby extension in a future release.
-I don't plan to support 1.8.7 or older due to substantial C API
-changes between 1.8 and 1.9. It doesn't build under Rubinius
-currently, but I intend to support it in future releases.
 ## License
-This gem is released under the [MIT License](http://www.opensource.org/licenses/MIT).
+This gem is released under the [MIT License](LICENSE).
 Уздравље!
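
Note on the README hunk above: the new `#to_cyrillic` has to fold the Latin digraphs `lj`, `nj` and `dž` into single Cyrillic letters. The snippet below is only a rough pure-Ruby sketch of that idea, not the gem's code (the gem does this in C). `SKETCH_LAT`, `SKETCH_CYR`, `LAT_TO_CYR` and `sketch_to_cyrillic` are hypothetical names introduced here, and it assumes Ruby 2.4+ for Unicode-aware `upcase`/`downcase`.

```ruby
# Hypothetical illustration only; none of these names exist in the gem.
SKETCH_CYR = "абвгдђежзијклљмнњопрстћуфхцчџш".chars
SKETCH_LAT = %w[a b v g d đ e ž z i j k l lj m n nj o p r s t ć u f h c č dž š]
LAT_TO_CYR = SKETCH_LAT.zip(SKETCH_CYR).to_h

def sketch_to_cyrillic(text)
  chars = text.chars
  out = []
  i = 0
  while i < chars.length
    first = chars[i]
    # Look one character ahead so that "lj", "nj" and "dž" win over "l", "n", "d".
    pair = chars[i + 1] && LAT_TO_CYR[(first + chars[i + 1]).downcase]
    cyr = pair || LAT_TO_CYR[first.downcase]
    if cyr
      # The case of the output letter follows the first input letter only.
      out << (first == first.downcase ? cyr : cyr.upcase)
    else
      out << first # anything outside the Serbian Latin alphabet is copied as-is
    end
    i += pair ? 2 : 1 # skip the second half of a consumed digraph
  end
  out.join
end

puts sketch_to_cyrillic("Šeširdžija")       # Шеширџија
puts sketch_to_cyrillic("Sanskrit, obrati") # Санскрит, обрати
```

Like the spec's edge cases (`"lJ"` becoming `"љ"`), this sketch always merges a digraph greedily, so it cannot tell apart words where `n` and `j` are separate letters.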

data/exe/byk ADDED

@@ -0,0 +1,51 @@
+#!/usr/bin/env ruby
+require "byk/safe"
+require "optparse"
+trap "SIGINT" do
+  exit 130
+end
+method_name = :to_cyrillic
+opts = OptionParser.new do |opt|
+  opt.banner = "usage: byk [options] [files]"
+  opt.summary_width = 20
+  opt.separator ""
+  opt.separator "options:"
+  opt.on("-c", "--cyrillic", "convert input to Cyrillic (default)") do
+    method_name = :to_cyrillic
+  end
+  opt.on("-l", "--latin", "convert input to Latin") do
+    method_name = :to_latin
+  end
+  opt.on("-a", "--ascii", 'convert input to "ASCII Latin"') do
+    method_name = :to_ascii_latin
+  end
+  opt.on_tail("-v", "--version", "show version") do
+    puts Byk::VERSION
+    exit
+  end
+end
+begin
+  opts.parse!
+rescue OptionParser::InvalidOption => e
+  puts e
+  puts
+  puts opts
+  exit 1
+end
+begin
+  puts Byk.send(method_name, ARGF.read)
+rescue => e
+  puts e
+  exit 1
+end
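
Note on the script above: the option handling boils down to "the last of `-c`, `-l`, `-a` wins, and Cyrillic is the default". A minimal sketch of just that selection step follows, using only the standard `optparse` library. `pick_method` is a hypothetical helper written for this note; it is not part of the gem.

```ruby
require "optparse"

# Hypothetical helper: returns the Byk method name selected by the given
# arguments, plus the remaining (file) arguments.
def pick_method(argv)
  method_name = :to_cyrillic # same default as in the script above
  parser = OptionParser.new do |opt|
    opt.on("-c", "--cyrillic") { method_name = :to_cyrillic }
    opt.on("-l", "--latin")    { method_name = :to_latin }
    opt.on("-a", "--ascii")    { method_name = :to_ascii_latin }
  end
  files = parser.parse(argv) # non-destructive; returns the non-option args
  [method_name, files]
end

p pick_method(%w[-l file.txt]) # => [:to_latin, ["file.txt"]]
p pick_method([])              # => [:to_cyrillic, []]
```

With no file arguments left over, the real script falls back to standard input because it reads through `ARGF`.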

data/ext/byk/byk.c CHANGED

@@ -3,103 +3,225 @@
 #define STR_ENC_GET(str) rb_enc_from_index(ENCODING_GET(str))
-#define STR_CAT_COND_ASCII(ascii, dest, chr, ascii_chr, len, enc)       \
-    ascii ? rb_str_buf_cat(dest, chr, len)                              \
-          : str_cat_char(dest, ascii_chr, enc)
+static inline void
+_str_cat_char(VALUE str, unsigned c, rb_encoding *enc)
+{
+    char s[16];
+    int n = rb_enc_codelen(c, enc);
+    rb_enc_mbcput(c, s, enc);
+    rb_str_buf_cat(str, s, n);
+}
 enum {
-    LAT_CAP_TJ = 0x106,
-    LAT_TJ,
-    LAT_CAP_CH = 0x10c,
-    LAT_CH,
-    LAT_CAP_DJ = 0x110,
-    LAT_DJ,
-    LAT_CAP_SH = 0x160,
-    LAT_SH,
-    LAT_CAP_ZH = 0x17d,
-    LAT_ZH,
-    CYR_CAP_DJ = 0x402,
-    CYR_CAP_J  = 0x408,
-    CYR_CAP_LJ,
-    CYR_CAP_NJ,
-    CYR_CAP_TJ,
-    CYR_CAP_DZ = 0x40f,
-    CYR_CAP_A,
-    CYR_CAP_ZH = 0x416,
-    CYR_CAP_C  = 0x426,
-    CYR_CAP_CH,
-    CYR_CAP_SH,
-    CYR_A  = 0x430,
-    CYR_ZH = 0x436,
-    CYR_C  = 0x446,
-    CYR_CH,
-    CYR_SH,
-    CYR_DJ = 0x452,
-    CYR_J  = 0x458,
-    CYR_LJ,
-    CYR_NJ,
-    CYR_TJ,
-    CYR_DZ = 0x45f
+    LAT_CAP_TJ=262, LAT_TJ, LAT_CAP_CH=268, LAT_CH,
+    LAT_CAP_DJ=272, LAT_DJ, LAT_CAP_SH=352, LAT_SH,
+    LAT_CAP_ZH=381, LAT_ZH, CYR_CAP_DJ=1026, CYR_CAP_J=1032,
+    CYR_CAP_LJ, CYR_CAP_NJ, CYR_CAP_TJ, CYR_CAP_DZ=1039,
+    CYR_CAP_A, CYR_CAP_B, CYR_CAP_V, CYR_CAP_G,
+    CYR_CAP_D, CYR_CAP_E, CYR_CAP_ZH, CYR_CAP_Z,
+    CYR_CAP_I, CYR_CAP_K=1050, CYR_CAP_L, CYR_CAP_M,
+    CYR_CAP_N, CYR_CAP_O, CYR_CAP_P, CYR_CAP_R,
+    CYR_CAP_S, CYR_CAP_T, CYR_CAP_U, CYR_CAP_F,
+    CYR_CAP_H, CYR_CAP_C, CYR_CAP_CH, CYR_CAP_SH,
+    CYR_A=1072, CYR_B, CYR_V, CYR_G, CYR_D,
+    CYR_E, CYR_ZH, CYR_Z, CYR_I, CYR_K=1082,
+    CYR_L, CYR_M, CYR_N, CYR_O, CYR_P,
+    CYR_R, CYR_S, CYR_T, CYR_U, CYR_F,
+    CYR_H, CYR_C, CYR_CH, CYR_SH, CYR_DJ=1106,
+    CYR_J=1112, CYR_LJ, CYR_NJ, CYR_TJ, CYR_DZ=1119
 };
-static inline unsigned int
-is_cyrillic(unsigned int c)
+static inline unsigned
+is_cap(unsigned codepoint)
 {
-    return c >= CYR_CAP_DJ && c <= CYR_DZ;
+    if (codepoint >= 65 && codepoint <= 90) return 1;
+    if (codepoint >= CYR_CAP_DJ && codepoint <= CYR_CAP_SH) return 1;
+    switch(codepoint) {
+    case LAT_CAP_TJ:
+    case LAT_CAP_CH:
+    case LAT_CAP_DJ:
+    case LAT_CAP_SH:
+    case LAT_CAP_ZH:
+        return 1;
+    default:
+        return 0;
+    }
 }
-static inline unsigned int
-is_upper(unsigned int c)
+static inline unsigned
+is_digraph(unsigned codepoint)
 {
-    return (c >= 65 && c <= 90)
-        || (c >= CYR_CAP_DJ && c <= CYR_CAP_SH)
-        || c == LAT_CAP_TJ
-        || c == LAT_CAP_CH
-        || c == LAT_CAP_DJ
-        || c == LAT_CAP_SH
-        || c == LAT_CAP_ZH;
+    switch(codepoint) {
+    case CYR_LJ:
+    case CYR_NJ:
+    case CYR_DZ:
+    case CYR_CAP_LJ:
+    case CYR_CAP_NJ:
+    case CYR_CAP_DZ:
+        return 1;
+    default:
+        return 0;
+    }
 }
-static inline unsigned int
-maps_directly(unsigned int c)
+static unsigned
+digraph_to_cyr(unsigned codepoint, unsigned codepoint2, unsigned capitalize, unsigned *next_out)
 {
-    return c != CYR_ZH
-        && c != CYR_CAP_ZH
-        && ((c >= CYR_A && c <= CYR_C) || (c >= CYR_CAP_A && c <= CYR_CAP_C));
+    static unsigned CYR_MAP[] = {
+        CYR_A, CYR_B, CYR_C, CYR_D, CYR_E, CYR_F,
+        CYR_G, CYR_H, CYR_I, CYR_J, CYR_K, CYR_L,
+        CYR_M, CYR_N, CYR_O, CYR_P, 0, CYR_R,
+        CYR_S, CYR_T, CYR_U, CYR_V, 0, 0, 0, CYR_Z
+    };
+    static unsigned CYR_CAPS_MAP[] = {
+        CYR_CAP_A, CYR_CAP_B, CYR_CAP_C, CYR_CAP_D, CYR_CAP_E, CYR_CAP_F,
+        CYR_CAP_G, CYR_CAP_H, CYR_CAP_I, CYR_CAP_J, CYR_CAP_K, CYR_CAP_L,
+        CYR_CAP_M, CYR_CAP_N, CYR_CAP_O, CYR_CAP_P, 0, CYR_CAP_R,
+        CYR_CAP_S, CYR_CAP_T, CYR_CAP_U, CYR_CAP_V, 0, 0, 0, CYR_CAP_Z
+    };
+    if (codepoint2 == LAT_CAP_ZH || codepoint2 == LAT_ZH) {
+        switch (codepoint) {
+        case 'd': return CYR_DZ;
+        case 'D': return CYR_CAP_DZ;
+        }
+    }
+    if (codepoint2 == 'j' || codepoint2 == 'J') {
+        switch (codepoint) {
+        case 'l': return CYR_LJ;
+        case 'n': return CYR_NJ;
+        case 'L': return CYR_CAP_LJ;
+        case 'N': return CYR_CAP_NJ;
+        }
+    }
+    if (codepoint >= 'a' && codepoint <= 'z') return CYR_MAP[codepoint - 'a'];
+    if (codepoint >= 'A' && codepoint <= 'Z') return CYR_CAPS_MAP[codepoint - 'A'];
+    switch (codepoint) {
+    case LAT_CH: return CYR_CH;
+    case LAT_DJ: return CYR_DJ;
+    case LAT_SH: return CYR_SH;
+    case LAT_TJ: return CYR_TJ;
+    case LAT_ZH: return CYR_ZH;
+    case LAT_CAP_CH: return CYR_CAP_CH;
+    case LAT_CAP_DJ: return CYR_CAP_DJ;
+    case LAT_CAP_SH: return CYR_CAP_SH;
+    case LAT_CAP_TJ: return CYR_CAP_TJ;
+    case LAT_CAP_ZH: return CYR_CAP_ZH;
+    }
+    return 0;
 }
-static void
-str_cat_char(VALUE str, unsigned int c, rb_encoding *enc)
+static unsigned
+digraph_to_latin(unsigned codepoint, unsigned codepoint2, unsigned capitalize, unsigned *next_out)
 {
-    char s[16];
-    int n = rb_enc_codelen(c, enc);
-    rb_enc_mbcput(c, s, enc);
-    rb_str_buf_cat(str, s, n);
+    static char LAT_MAP[] = {
+        'a', 'b', 'v', 'g', 'd', 'e', 0, 'z', 'i', 0, 'k', 'l',
+        'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'f', 'h', 'c'
+    };
+    static char LAT_CAPS_MAP[] = {
+        'A', 'B', 'V', 'G', 'D', 'E', 0, 'Z', 'I', 0, 'K', 'L',
+        'M', 'N', 'O', 'P', 'R', 'S', 'T', 'U', 'F', 'H', 'C'
+    };
+    if (codepoint < CYR_CAP_DJ || codepoint > CYR_DZ) return 0;
+    switch (codepoint) {
+    case CYR_ZH: return LAT_ZH;
+    case CYR_CAP_ZH: return LAT_CAP_ZH;
+    }
+    if (codepoint >= CYR_A && codepoint <= CYR_C)
+        return LAT_MAP[codepoint - CYR_A];
+    if (codepoint >= CYR_CAP_A && codepoint <= CYR_CAP_C)
+        return LAT_CAPS_MAP[codepoint - CYR_CAP_A];
+    if (codepoint >= CYR_A) {
+        switch (codepoint) {
+        case CYR_J:  return 'j';
+        case CYR_TJ: return LAT_TJ;
+        case CYR_CH: return LAT_CH;
+        case CYR_SH: return LAT_SH;
+        case CYR_DJ: return LAT_DJ;
+        case CYR_LJ: *next_out = 'j'; return 'l';
+        case CYR_NJ: *next_out = 'j'; return 'n';
+        case CYR_DZ: *next_out = LAT_ZH; return 'd';
+        }
+    }
+    else {
+        switch (codepoint) {
+        case CYR_CAP_J:  return 'J';
+        case CYR_CAP_TJ: return LAT_CAP_TJ;
+        case CYR_CAP_CH: return LAT_CAP_CH;
+        case CYR_CAP_SH: return LAT_CAP_SH;
+        case CYR_CAP_DJ: return LAT_CAP_DJ;
+        case CYR_CAP_LJ: *next_out = (capitalize || is_cap(codepoint2)) ? 'J' : 'j'; return 'L';
+        case CYR_CAP_NJ: *next_out = (capitalize || is_cap(codepoint2)) ? 'J' : 'j'; return 'N';
+        case CYR_CAP_DZ: *next_out = (capitalize || is_cap(codepoint2)) ? LAT_CAP_ZH : LAT_ZH; return 'D';
+        }
+    }
+    return 0;
+}
+static unsigned
+digraph_to_ascii(unsigned codepoint, unsigned codepoint2, unsigned capitalize, unsigned *next_out)
+{
+    switch (codepoint) {
+    case LAT_TJ:
+    case LAT_CH:
+    case CYR_TJ:
+    case CYR_CH: return 'c';
+    case LAT_SH:
+    case CYR_SH: return 's';
+    case LAT_ZH:
+    case CYR_ZH: return 'z';
+    case LAT_DJ:
+    case CYR_DJ: *next_out = 'j'; return 'd';
+    case LAT_CAP_TJ:
+    case LAT_CAP_CH:
+    case CYR_CAP_TJ:
+    case CYR_CAP_CH: return 'C';
+    case LAT_CAP_SH:
+    case CYR_CAP_SH: return 'S';
+    case LAT_CAP_ZH:
+    case CYR_CAP_ZH: return 'Z';
+    case LAT_CAP_DJ:
+    case CYR_CAP_DJ:
+        *next_out = (capitalize || is_cap(codepoint2)) ? 'J' : 'j'; return 'D';
+    case CYR_DZ:
+        *next_out = (capitalize || is_cap(codepoint2)) ? 'Z' : 'z'; return 'd';
+    case CYR_CAP_DZ:
+        *next_out = (capitalize || is_cap(codepoint2)) ? 'Z' : 'z'; return 'D';
+    default:
+        return digraph_to_latin(codepoint, codepoint2, capitalize, next_out);
+    }
 }
 static VALUE
-str_to_latin(VALUE str, int ascii, int bang)
+str_to_srb(VALUE str, int strategy, int bang)
 {
     VALUE dest;
-    long dest_len;
+    rb_encoding *enc;
     int len, next_len;
-    int seen_upper = 0;
-    int force_upper = 0;
+    unsigned in, in2, out, out2, seen_cap = 0;
     char *pos, *end, *seq_start = 0;
-    char cyr;
-    unsigned int codepoint = 0;
-    unsigned int next_codepoint = 0;
-    rb_encoding *enc;
-    char CYR_MAP[] = {
-        'a', 'b', 'v', 'g', 'd', 'e', '\0', 'z', 'i', '\0', 'k',
-        'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'f', 'h', 'c'
-    };
+    unsigned (*method)(unsigned, unsigned, unsigned, unsigned*);
-    char CYR_CAPS_MAP[] = {
-        'A', 'B', 'V', 'G', 'D', 'E', '\0', 'Z', 'I', '\0', 'K',
-        'L', 'M', 'N', 'O', 'P', 'R', 'S', 'T', 'U', 'F', 'H', 'C'
-    };
+    switch(strategy) {
+    case 0:  method = &digraph_to_cyr;   break;
+    case 1:  method = &digraph_to_latin; break;
+    default: method = &digraph_to_ascii;
+    }
     StringValue(str);
     pos = RSTRING_PTR(str);
@@ -107,123 +229,50 @@ str_to_latin(VALUE str, int ascii, int bang)
     end = RSTRING_END(str);
     enc = STR_ENC_GET(str);
-    dest_len = RSTRING_LEN(str) + 30;
-    dest = rb_str_buf_new(dest_len);
+    dest = rb_str_buf_new(RSTRING_LEN(str) + 30);
     rb_enc_associate(dest, enc);
-    codepoint = rb_enc_codepoint_len(pos, end, &len, enc);
+    in = rb_enc_codepoint_len(pos, end, &len, enc);
     while (pos < end) {
-        if (pos + len < end) {
-            next_codepoint = rb_enc_codepoint_len(pos + len, end, &next_len, enc);
-        }
+        in2 = out2 = 0;
-        /* Latin -> "ASCII Latin" conversion */
-        if (ascii && codepoint >= LAT_CAP_TJ && codepoint <= LAT_ZH) {
-            if (seq_start) {
-                rb_str_buf_cat(dest, seq_start, pos - seq_start);
-                seq_start = 0;
-            }
+        if (pos + len < end)
+            in2 = rb_enc_codepoint_len(pos + len, end, &next_len, enc);
-            switch (codepoint) {
-            case LAT_TJ:
-            case LAT_CH:     rb_str_buf_cat(dest, "c",  1); break;
-            case LAT_DJ:     rb_str_buf_cat(dest, "dj", 2); break;
-            case LAT_SH:     rb_str_buf_cat(dest, "s",  1); break;
-            case LAT_ZH:     rb_str_buf_cat(dest, "z",  1); break;
-            case LAT_CAP_TJ:
-            case LAT_CAP_CH: rb_str_buf_cat(dest, "C",  1); break;
-            case LAT_CAP_SH: rb_str_buf_cat(dest, "S",  1); break;
-            case LAT_CAP_ZH: rb_str_buf_cat(dest, "Z",  1); break;
-            case LAT_CAP_DJ:
-                (seen_upper || is_upper(next_codepoint))
-                    ? rb_str_buf_cat(dest, "DJ", 2)
-                    : rb_str_buf_cat(dest, "Dj", 2);
-                break;
-            default:
-                rb_str_buf_cat(dest, pos, len);
-            }
-        }
+        out = (*method)(in, in2, seen_cap, &out2);
-        /* Cyrillic coderange */
-        else if (is_cyrillic(codepoint)) {
+        if (out) {
+            /* flush previous untranslatable sequence */
             if (seq_start) {
                 rb_str_buf_cat(dest, seq_start, pos - seq_start);
                 seq_start = 0;
             }
-            if (codepoint >= CYR_A) {
-                if (maps_directly(codepoint)) {
-                    cyr = CYR_MAP[codepoint - CYR_A];
-                    cyr ? rb_str_buf_cat(dest, &cyr, 1)
-                        : rb_str_buf_cat(dest, pos, len);
-                }
-                else {
-                    switch (codepoint) {
-                    case CYR_J:  rb_str_buf_cat(dest, "j",  1); break;
-                    case CYR_LJ: rb_str_buf_cat(dest, "lj", 2); break;
-                    case CYR_NJ: rb_str_buf_cat(dest, "nj", 2); break;
-                    case CYR_DJ: STR_CAT_COND_ASCII(ascii, dest, "dj", LAT_DJ, 2, enc); break;
-                    case CYR_TJ: STR_CAT_COND_ASCII(ascii, dest, "c",  LAT_TJ, 1, enc); break;
-                    case CYR_CH: STR_CAT_COND_ASCII(ascii, dest, "c",  LAT_CH, 1, enc); break;
-                    case CYR_SH: STR_CAT_COND_ASCII(ascii, dest, "s",  LAT_SH, 1, enc); break;
-                    case CYR_ZH: STR_CAT_COND_ASCII(ascii, dest, "z",  LAT_ZH, 1, enc); break;
-                    case CYR_DZ:
-                        rb_str_buf_cat(dest, "d", 1);
-                        STR_CAT_COND_ASCII(ascii, dest, "z", LAT_ZH, 1, enc);
-                        break;
-                    default:
-                        rb_str_buf_cat(dest, pos, len);
-                    }
-                }
-            }
-            else {
-                if (maps_directly(codepoint)) {
-                    cyr = CYR_CAPS_MAP[codepoint - CYR_CAP_A];
-                    cyr ? rb_str_buf_cat(dest, &cyr, 1)
-                        : rb_str_buf_cat(dest, pos, len);
-                }
-                else {
-                    force_upper = seen_upper || is_upper(next_codepoint);
-                    switch (codepoint) {
-                    case CYR_CAP_J:  rb_str_buf_cat(dest, "J", 1); break;
-                    case CYR_CAP_LJ: rb_str_buf_cat(dest, (force_upper ? "LJ" : "Lj"), 2); break;
-                    case CYR_CAP_NJ: rb_str_buf_cat(dest, (force_upper ? "NJ" : "Nj"), 2); break;
-                    case CYR_CAP_TJ: STR_CAT_COND_ASCII(ascii, dest, "C", LAT_CAP_TJ, 1, enc); break;
-                    case CYR_CAP_CH: STR_CAT_COND_ASCII(ascii, dest, "C", LAT_CAP_CH, 1, enc); break;
-                    case CYR_CAP_SH: STR_CAT_COND_ASCII(ascii, dest, "S", LAT_CAP_SH, 1, enc); break;
-                    case CYR_CAP_ZH: STR_CAT_COND_ASCII(ascii, dest, "Z", LAT_CAP_ZH, 1, enc); break;
-                    case CYR_CAP_DJ: STR_CAT_COND_ASCII(ascii, dest, (force_upper ? "DJ" : "Dj"), LAT_CAP_DJ, 2, enc); break;
-                    case CYR_CAP_DZ:
-                        rb_str_buf_cat(dest, "D", 1);
-                        force_upper ? STR_CAT_COND_ASCII(ascii, dest, "Z", LAT_CAP_ZH, 1, enc)
-                                    : STR_CAT_COND_ASCII(ascii, dest, "z", LAT_ZH, 1, enc);
-                        break;
-                    default:
-                        rb_str_buf_cat(dest, pos, len);
-                    }
-                }
-            }
+            _str_cat_char(dest, out, enc);
+            if (out2) _str_cat_char(dest, out2, enc);
         }
-        else {
-            /* Mark the start of a copyable sequence */
-            if (!seq_start) seq_start = pos;
+        else if (!seq_start) {
+            /* mark the beginning of an untranslatable sequence */
+            seq_start = pos;
+        }
+        /* for cyrillic output, skip the second half of an input digraph */
+        if (strategy == 0 && is_digraph(out)) {
+            pos += next_len;
+            if (pos + len < end)
+                in2 = rb_enc_codepoint_len(pos + len, end, &next_len, enc);
         }
-        seen_upper = is_upper(codepoint);
+        seen_cap = is_cap(in);
         pos += len;
         len = next_len;
-        codepoint = next_codepoint;
-        next_codepoint = 0;
+        in = in2;
     }
-    /* Flush the last sequence, if any */
-    if (seq_start) {
-        rb_str_buf_cat(dest, seq_start, pos - seq_start);
-    }
+    /* flush final sequence */
+    if (seq_start) rb_str_buf_cat(dest, seq_start, pos - seq_start);
     if (bang) {
         rb_str_shared_replace(str, dest);
@@ -237,7 +286,35 @@ str_to_latin(VALUE str, int ascii, int bang)
 }
 /**
- * Returns a copy of <i>str</i> with the Serbian Cyrillic characters
+ * Returns a copy of <i>str</i> with Latin characters transliterated
+ * into Serbian Cyrillic.
+ *
+ * @overload to_cyrillic(str)
+ *   @param  [String] str text to be transliterated
+ *   @return [String] transliterated text
+ */
+static VALUE
+rb_str_to_cyrillic(VALUE self, VALUE str)
+{
+    return str_to_srb(str, 0, 0);
+}
+/**
+ * Performs transliteration of <code>Byk.to_cyrillic</code> in place,
+ * returning <i>str</i>, whether any changes were made or not.
+ *
+ * @overload to_cyrillic!(str)
+ *   @param  [String] str text to be transliterated
+ *   @return [String] transliterated text
+ */
+static VALUE
+rb_str_to_cyrillic_bang(VALUE self, VALUE str)
+{
+    return str_to_srb(str, 0, 1);
+}
+/**
+ * Returns a copy of <i>str</i> with Serbian Cyrillic characters
  * transliterated into Latin.
  *
  * @overload to_latin(str)
@@ -247,12 +324,12 @@ str_to_latin(VALUE str, int ascii, int bang)
 static VALUE
 rb_str_to_latin(VALUE self, VALUE str)
 {
-    return str_to_latin(str, 0, 0);
+    return str_to_srb(str, 1, 0);
 }
 /**
- * Performs the transliteration of <code>Byk.to_latin</code> in place,
- * returning <i>str</i>, whether changes were made or not.
+ * Performs transliteration of <code>Byk.to_latin</code> in place,
+ * returning <i>str</i>, whether any changes were made or not.
  *
  * @overload to_latin!(str)
  *   @param  [String] str text to be transliterated
@@ -261,12 +338,12 @@ rb_str_to_latin(VALUE self, VALUE str)
 static VALUE
 rb_str_to_latin_bang(VALUE self, VALUE str)
 {
-    return str_to_latin(str, 0, 1);
+    return str_to_srb(str, 1, 1);
 }
 /**
- * Returns a copy of <i>str</i> with the Serbian Cyrillic
- * characters transliterated into ASCII Latin.
+ * Returns a copy of <i>str</i> with Serbian characters transliterated
+ * into ASCII Latin.
  *
  * @overload to_ascii_latin(str)
  *   @param  [String] str text to be transliterated
@@ -275,12 +352,12 @@ rb_str_to_latin_bang(VALUE self, VALUE str)
 static VALUE
 rb_str_to_ascii_latin(VALUE self, VALUE str)
 {
-    return str_to_latin(str, 1, 0);
+    return str_to_srb(str, 2, 0);
 }
 /**
- * Performs the transliteration of <code>Byk.to_ascii_latin</code> in
- * place, returning <i>str</i>, whether changes were made or not.
+ * Performs transliteration of <code>Byk.to_ascii_latin</code> in
+ * place, returning <i>str</i>, whether any changes were made or not.
  *
  * @overload to_ascii_latin!(str)
  *   @param  [String] str text to be transliterated
@@ -289,12 +366,14 @@ rb_str_to_ascii_latin(VALUE self, VALUE str)
 static VALUE
 rb_str_to_ascii_latin_bang(VALUE self, VALUE str)
 {
-    return str_to_latin(str, 1, 1);
+    return str_to_srb(str, 2, 1);
 }
 void Init_byk_native(void)
 {
     VALUE Byk = rb_define_module("Byk");
+    rb_define_singleton_method(Byk, "to_cyrillic", rb_str_to_cyrillic, 1);
+    rb_define_singleton_method(Byk, "to_cyrillic!", rb_str_to_cyrillic_bang, 1);
     rb_define_singleton_method(Byk, "to_latin", rb_str_to_latin, 1);
     rb_define_singleton_method(Byk, "to_latin!", rb_str_to_latin_bang, 1);
     rb_define_singleton_method(Byk, "to_ascii_latin", rb_str_to_ascii_latin, 1);
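
Note on the C hunk above: in `digraph_to_latin`, a capital `Љ`, `Њ` or `Џ` gets a capital second letter only when `capitalize || is_cap(codepoint2)` holds, that is, when the previous or the next input character is itself a capital. The following is a rough pure-Ruby sketch of that rule, offered as an illustration rather than a port. `CYR_LETTERS`, `LAT_LETTERS`, `CYR_TO_LAT`, `cap?` and `sketch_to_latin` are hypothetical names, and Ruby 2.4+ is assumed for Unicode-aware case methods.

```ruby
# Hypothetical illustration only; the gem implements this rule in C.
CYR_LETTERS = "абвгдђежзијклљмнњопрстћуфхцчџш".chars
LAT_LETTERS = %w[a b v g d đ e ž z i j k l lj m n nj o p r s t ć u f h c č dž š]
CYR_TO_LAT  = CYR_LETTERS.zip(LAT_LETTERS).to_h

# True for any character that changes when downcased. This is broader than
# the C is_cap, which only checks specific code point ranges.
def cap?(ch)
  !ch.nil? && ch != ch.downcase
end

def sketch_to_latin(text)
  chars = text.chars
  chars.each_with_index.map do |ch, i|
    latin = CYR_TO_LAT[ch.downcase]
    next ch if latin.nil?      # not Serbian Cyrillic: copy through
    next latin unless cap?(ch) # lowercase input maps directly
    neighbour_cap = (i > 0 && cap?(chars[i - 1])) || cap?(chars[i + 1])
    # A capital digraph is fully upcased only next to another capital.
    (latin.length == 2 && !neighbour_cap) ? latin.capitalize : latin.upcase
  end.join
end

puts sketch_to_latin("Љ")     # Lj
puts sketch_to_latin("ЉЉ")    # LJLJ
puts sketch_to_latin("ПАЖЊУ") # PAŽNJU
```

These three inputs mirror the edge cases and the mixed-text fixture in the spec diff below.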

data/lib/byk/version.rb CHANGED

@@ -1,3 +1,3 @@
 module Byk
-  VERSION = "0.6.0"
+  VERSION = "1.0.0"
 end

data/spec/byk_spec.rb CHANGED

@@ -1,5 +1,4 @@
 # coding: utf-8
 require "spec_helper"
 describe Byk do
@@ -24,70 +23,114 @@ describe Byk do
     let(:non_serbian_cyrillic) { non_serbian_cyrillic_coderange.join }
     let(:ascii) { "The quick brown fox jumps over the lazy dog." }
-    let(:other) { "संस्कृतम् saṃskṛtam" }
+    let(:other) { "संस्कृतम्" }
-    let(:mixed) { "संस्कृतम् saṃskṛtam илити Sanskrit, obrati ПАЖЊУ." }
-    let(:mixed_latin) { "संस्कृतम् saṃskṛtam iliti Sanskrit, obrati PAŽNJU." }
-    let(:mixed_ascii_latin) { "संस्कृतम् saṃskṛtam iliti Sanskrit, obrati PAZNJU." }
+    let(:mixed) { "संस्कृतम् илити Sanskrit, obrati ПАЖЊУ." }
+    let(:mixed_cyrillic) { "संस्कृतम् илити Санскрит, обрати ПАЖЊУ." }
+    let(:mixed_latin) { "संस्कृतम् iliti Sanskrit, obrati PAŽNJU." }
+    let(:mixed_ascii_latin) { "संस्कृतम् iliti Sanskrit, obrati PAZNJU." }
-    it "doesn't convert an empty string" do
+    it "doesn't translate an empty string" do
       expect(Byk.send(method, "")).to eq ""
     end
-    it "doesn't convert ASCII text" do
-      expect(Byk.send(method, ascii)).to eq ascii
+    it "doesn't translate foreign coderanges" do
+      expect(Byk.send(method, other)).to eq other
     end
+  end
-    it "doesn't convert non-Serbian Cyrillic" do
+  shared_examples :cyrillization_method do |method|
+    include_examples :base, method
+    let(:edge_cases) do
+      [
+        ["lJ", "љ"],
+        ["nJ", "њ"],
+        ["dŽ", "џ"]
+      ]
+    end
+    it "doesn't translate Cyrillic" do
+      expect(Byk.send(method, pangram)).to eq pangram
+    end
+    it "doesn't translate non-Serbian Cyrillic" do
       expect(Byk.send(method, non_serbian_cyrillic)).to eq non_serbian_cyrillic
     end
-    it "doesn't convert other coderanges" do
-      expect(Byk.send(method, other)).to eq other
+    it "translates Latin to Cyrillic" do
+      expect(Byk.send(method, pangram_latin)).to eq pangram
+    end
+    it "translates Latin caps to Cyrillic caps" do
+      expect(Byk.send(method, pangram_latin_caps)).to eq pangram_caps
+    end
+    it "translates mixed text properly" do
+      expect(Byk.send(method, mixed)).to eq mixed_cyrillic
+    end
+    it "translates edge cases properly" do
+      edge_cases.each do |input, output|
+        expect(Byk.send(method, input)).to eq output
+      end
+    end
+    it "translates ABECEDA to AZBUKA" do
+      expect(Byk::ABECEDA.map { |l| l.dup.send(:to_cyrillic) }).to match_array(Byk::AZBUKA)
+    end
+    it "translates ABECEDA_CAPS to AZBUKA_CAPS" do
+      expect(Byk::ABECEDA_CAPS.map { |l| l.dup.send(:to_cyrillic) }).to match_array(Byk::AZBUKA_CAPS)
     end
   end
   shared_examples :latinization_method do |method|
     include_examples :base, method
-    let(:edge_cases) {
+    let(:edge_cases) do
       [
-        ["Њ", "Nj"],
-        ["Љ", "Lj"],
-        ["Џ", "Dž"],
-        ["ЊЊ", "NJNJ"],
         ["ЉЉ", "LJLJ"],
+        ["ЊЊ", "NJNJ"],
         ["ЏЏ", "DŽDŽ"]
       ]
-    }
+    end
-    it "doesn't convert Latin" do
+    it "doesn't translate ASCII" do
+      expect(Byk.send(method, ascii)).to eq ascii
+    end
+    it "doesn't translate Latin" do
       expect(Byk.send(method, pangram_latin)).to eq pangram_latin
     end
-    it "converts Cyrillic to Latin" do
+    it "doesn't translate non-Serbian Cyrillic" do
+      expect(Byk.send(method, non_serbian_cyrillic)).to eq non_serbian_cyrillic
+    end
+    it "translates Cyrillic to Latin" do
       expect(Byk.send(method, pangram)).to eq pangram_latin
     end
-    it "converts Cyrillic caps to Latin caps" do
+    it "translates Cyrillic caps to Latin caps" do
       expect(Byk.send(method, pangram_caps)).to eq pangram_latin_caps
     end
-    it "converts mixed text properly" do
+    it "translates mixed text properly" do
       expect(Byk.send(method, mixed)).to eq mixed_latin
     end
-    it "converts edge cases properly" do
+    it "translates edge cases properly" do
       edge_cases.each do |input, output|
         expect(Byk.send(method, input)).to eq output
       end
     end
-    it "converts AZBUKA to ABECEDA" do
+    it "translates AZBUKA to ABECEDA" do
       expect(Byk::AZBUKA.map { |l| l.dup.send(method) }).to match_array(Byk::ABECEDA)
     end
-    it "converts AZBUKA_CAPS to ABECEDA_CAPS" do
+    it "translates AZBUKA_CAPS to ABECEDA_CAPS" do
       expect(Byk::AZBUKA_CAPS.map { |l| l.dup.send(method) }).to match_array(Byk::ABECEDA_CAPS)
     end
   end
@@ -95,7 +138,7 @@ describe Byk do
   shared_examples :ascii_latinization_method do |method|
     include_examples :base, method
-    let(:edge_cases) {
+    let(:edge_cases) do
       [
         ["Њ", "Nj"],
         ["Љ", "Lj"],
@@ -107,32 +150,36 @@ describe Byk do
         ["ЏЏ", "DZDZ"],
         ["ЂЂ", "DJDJ"],
         ["ĐĐ", "DJDJ"],
-        ["ЂУРАЂ Ђорђевић", "DJURADJ Djordjevic"],
-        ["ĐURAĐ Đorđević", "DJURADJ Djordjevic"]
+        ["ЂУРАЂ Ђурђевић", "DJURADJ Djurdjevic"],
+        ["ĐURAĐ Đurđević", "DJURADJ Djurdjevic"]
       ]
-    }
-    it "converts Cyrillic to ASCII Latin" do
-      expect(Byk.send(method, pangram)).to eq pangram_ascii_latin
     end
-    it "converts Cyrillic caps to ASCII Latin caps" do
-      expect(Byk.send(method, pangram_caps)).to eq pangram_ascii_latin_caps
+    it "doesn't translate ASCII" do
+      expect(Byk.send(method, ascii)).to eq ascii
     end
-    it "converts Latin to ASCII Latin" do
+    it "translates Latin to ASCII Latin" do
       expect(Byk.send(method, pangram_latin)).to eq pangram_ascii_latin
     end
-    it "converts Latin caps to ASCII Latin caps" do
+    it "translates Latin caps to ASCII Latin caps" do
       expect(Byk.send(method, pangram_latin_caps)).to eq pangram_ascii_latin_caps
     end
-    it "converts mixed text properly" do
+    it "translates Cyrillic to ASCII Latin" do
+      expect(Byk.send(method, pangram)).to eq pangram_ascii_latin
+    end
+    it "translates Cyrillic caps to ASCII Latin caps" do
+      expect(Byk.send(method, pangram_caps)).to eq pangram_ascii_latin_caps
+    end
+    it "translates mixed text properly" do
       expect(Byk.send(method, mixed)).to eq mixed_ascii_latin
     end
-    it "converts edge cases properly" do
+    it "translates edge cases properly" do
       edge_cases.each do |input, output|
         expect(Byk.send(method, input)).to eq output
       end
@@ -141,18 +188,28 @@ describe Byk do
   shared_examples :non_destructive_method do |method|
     it "doesn't modify the arg" do
-      str = "Ж"
+      str = "ЖŽ"
       expect { Byk.send(method, str) }.to_not change { str }
     end
   end
   shared_examples :destructive_method do |method|
     it "modifies the arg" do
-      str = "Ж"
+      str = "ЖŽ"
       expect { Byk.send(method, str) }.to change { str }
     end
   end
+  describe ".to_cyrillic" do
+    it_behaves_like :cyrillization_method, :to_cyrillic
+    it_behaves_like :non_destructive_method, :to_cyrillic
+  end
+  describe ".to_cyrillic!" do
+    it_behaves_like :cyrillization_method, :to_cyrillic!
+    it_behaves_like :destructive_method, :to_cyrillic!
+  end
   describe ".to_latin" do
     it_behaves_like :latinization_method, :to_latin
     it_behaves_like :non_destructive_method, :to_latin
@@ -176,7 +233,7 @@ end
 describe String do
   it "responds to Byk methods" do
-    Byk.instance_methods.each do |method|
+    Byk.singleton_methods.each do |method|
       expect("").to respond_to(method)
     end
   end

metadata CHANGED

@@ -1,15 +1,29 @@
 --- !ruby/object:Gem::Specification
 name: byk
 version: !ruby/object:Gem::Version
-  version: 0.6.0
+  version: 1.0.0
 platform: ruby
 authors:
 - Nikola Topalović
 autorequire:
-bindir: bin
+bindir: exe
 cert_chain: []
-date: 2015-04-25 00:00:00.000000000 Z
+date: 2016-04-09 00:00:00.000000000 Z
 dependencies:
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '10.5'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '10.5'
 - !ruby/object:Gem::Dependency
   name: rake-compiler
   requirement: !ruby/object:Gem::Requirement
@@ -38,10 +52,11 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '3.2'
-description: Provides C-optimized methods for transliteration of Serbian Cyrillic
-  into Latin.
+description: Fast transliteration of Serbian Cyrillic to Latin and back. Brzo preslovljavanje
+  ćirilice u latinicu i obratno.
 email: nikola.topalovic@gmail.com
-executables: []
+executables:
+- byk
 extensions:
 - ext/byk/extconf.rb
 extra_rdoc_files: []
@@ -49,6 +64,7 @@ files:
 - CHANGELOG.md
 - LICENSE
 - README.md
+- exe/byk
 - ext/byk/byk.c
 - ext/byk/extconf.rb
 - lib/byk.rb
@@ -76,9 +92,10 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 2.4.5
+rubygems_version: 2.5.1
 signing_key:
 specification_version: 4
-summary: Fast transliteration of Serbian Cyrillic into Latin.
+summary: Fast transliteration of Serbian Cyrillic to Latin and back. Brzo preslovljavanje
+  ćirilice u latinicu i obratno.
 test_files:
 - spec/byk_spec.rb