RubyGems - babosa - Versions diffs - 0.2.2 → 0.3.0 - Mend

babosa 0.2.2 → 0.3.0

Files changed (31)

data/.gemtest +0 -0
data/README.md +59 -28
data/Rakefile +14 -8
data/lib/babosa.rb +11 -1
data/lib/babosa/identifier.rb +26 -16
data/lib/babosa/transliterator/base.rb +89 -0
data/lib/babosa/transliterator/bulgarian.rb +27 -0
data/lib/babosa/transliterator/cyrillic.rb +111 -0
data/lib/babosa/transliterator/danish.rb +15 -0
data/lib/babosa/transliterator/german.rb +15 -0
data/lib/babosa/transliterator/latin.rb +199 -0
data/lib/babosa/transliterator/russian.rb +22 -0
data/lib/babosa/transliterator/serbian.rb +34 -0
data/lib/babosa/transliterator/spanish.rb +9 -0
data/lib/babosa/transliterator/ukranian.rb +11 -0
data/lib/babosa/utf8/dumb_proxy.rb +1 -0
data/lib/babosa/version.rb +1 -1
data/spec/babosa_spec.rb +131 -0
data/spec/spec_helper.rb +33 -0
data/spec/transliterators/base_spec.rb +16 -0
data/spec/transliterators/bulgarian_spec.rb +20 -0
data/spec/transliterators/danish_spec.rb +17 -0
data/spec/transliterators/german_spec.rb +17 -0
data/spec/transliterators/russian_spec.rb +9 -0
data/spec/transliterators/serbian_spec.rb +25 -0
data/spec/transliterators/spanish_spec.rb +13 -0
data/spec/transliterators/ukranian_spec.rb +9 -0
data/spec/utf8_proxy_spec.rb +48 -0
metadata +63 -19
data/lib/babosa/characters.rb +0 -80
data/test/babosa_test.rb +0 -198

data/.gemtest ADDED

File without changes

data/README.md CHANGED

@@ -13,12 +13,12 @@ FriendlyId.
 ### ASCII transliteration
-    "Gölcük, Turkey".to_slug.approximate_ascii.to_s #=> "Golcuk, Turkey"
+    "Gölcük, Turkey".to_slug.transliterate.to_s #=> "Golcuk, Turkey"
 ### Per-locale transliteration
-    "Jürgen Müller".to_slug.approximate_ascii.to_s           #=> "Jurgen Muller"
-    "Jürgen Müller".to_slug.approximate_ascii(:german).to_s  #=> "Juergen Mueller"
+    "Jürgen Müller".to_slug.transliterate.to_s           #=> "Jurgen Muller"
+    "Jürgen Müller".to_slug.transliterate(:german).to_s  #=> "Juergen Mueller"
 Supported language currently include Danish, German, Serbian and Spanish. I'll
 gladly accept contributions and support more languages.
@@ -59,16 +59,45 @@ in method names, but you may not want to):
     "über cool stuff!".to_slug.to_ruby_method(false) #=> uber_cool_stuff
-You can add not only transliterations, but expansions for some characters if you want:
+You can easily add custom transliterators for your language with very little code,
+for example here's the transliterator for German:
+    # encoding: utf-8
+    module Babosa
+      module Transliterator
+        class German < Latin
+          APPROXIMATIONS = {
+            "ä" => "ae",
+            "ö" => "oe",
+            "ü" => "ue",
+            "Ä" => "Ae",
+            "Ö" => "Oe",
+            "Ü" => "Ue"
+          }
+        end
+      end
+    end
+And a spec (you can use this as a template):
+    # encoding: utf-8
+    require File.expand_path("../../spec_helper", __FILE__)
+    describe Babosa::Transliterator::German do
+      let(:t) { described_class.instance }
+      it_behaves_like "a latin transliterator"
+      it "should transliterate Eszett" do
+        t.transliterate("ß").should eql("ss")
+      end
+      it "should transliterate vowels with umlauts" do
+        t.transliterate("üöä").should eql("ueoeae")
+      end
+    end
-    Babosa::Characters.add_approximations(:user, {
-      "0" => "oh",
-      "1" => "one",
-      "2" => "two",
-      "3" => "three",
-      "." => " dot "
-    })
-    "Web 2.0".to_slug.normalize!(:transliterations => :user) #=> "web-two-dot-oh"
 ### UTF-8 support
@@ -77,30 +106,29 @@ ActiveSupport gems installed and required prior to requiring "babosa", these
 will be used to perform upcasing and downcasing on UTF-8 strings. On JRuby 1.5
 and above, Java's native Unicode support will be used instead. Unless you're on
 JRuby, which already has excellent support for Unicode via Java's Standard
-Library, I recommend using the Unicode gem because it's the fastest Ruby
-Unicode library available.
+Library, I recommend using the Unicode gem because it's the fastest Ruby Unicode
+library available.
 If none of these libraries are available, Babosa falls back to a simple module
-which only supports Latin characters.
+which **only** supports Latin characters.
 This default module is fast and can do very naive Unicode composition to ensure
-that, for example, "é" will always be composed to a single codepoint rather
-than an "e" and a "´" - making it safe to use as a hash key. But seriously -
-save yourself the headache and install a real Unicode library.
+that, for example, "é" will always be composed to a single codepoint rather than
+an "e" and a "´" - making it safe to use as a hash key. But seriously - save
+yourself the headache and install a real Unicode library.
+If you are using Babosa with a language that uses the Cyrillic alphabet, Babosa
+requires either Unicode, Active Support or Java.
 ### Rails 3
-Most of Babosa's functionality is already present in Active Support/Rails 3.
-Babosa exists primarily to support non-Rails applications, and Rails apps prior
-to 3.0. Most of the code here was originally written for FriendlyId. Several
-things, like `tidy_bytes` and ASCII transliteration, were later added to Rails
-and I18N.
+Some of Babosa's functionality is already present in Active Support/Rails 3.
 Babosa differs from ActiveSupport primarily in that it supports non-Latin
-strings by default, and has per-locale ASCII transliterations already baked-in. If
-you are considering using Babosa with Rails 3, you should first take a look at
-Active Support's
+strings by default, and has per-locale ASCII transliterations already baked-in.
+If you are considering using Babosa with Rails 3, you should first take a look
+at Active Support's
 [transliterate](http://edgeapi.rubyonrails.org/classes/ActiveSupport/Inflector.html#M000565)
 and
 [parameterize](http://edgeapi.rubyonrails.org/classes/ActiveSupport/Inflector.html#M000566)
@@ -136,12 +164,15 @@ Please use Babosa's [Github issue tracker](http://github.com/norman/babosa/issue
 ## Contributors
-* [Molte Emil Strange Andersen](http://github.com/molte) - Danish support
-* [Milan Dobrota](http://github.com/milandobrota) - Serbian support
+* [Alexey Shkolnikov](https://github.com/grlm) - Russian support
+* [Martin Petrov](https://github.com/martin-petrov) - Bulgarian support
+* [Molte Emil Strange Andersen](https://github.com/molte) - Danish support
+* [Milan Dobrota](https://github.com/milandobrota) - Serbian support
 ## Changelog
+* 0.3.0 - Cyrillic support. Improve support for various Unicode spaces and dashes.
 * 0.2.2 - Fix for "smart" quote handling.
 * 0.2.1 - Implement #empty? for compatiblity with Active Support's #blank?.
 * 0.2.0 - Added support for Danish. Added method to generate Ruby identifiers. Improved performance.

data/Rakefile CHANGED

@@ -3,11 +3,10 @@ require "rake/testtask"
 require "rake/clean"
 require "rake/gempackagetask"
-task :default => :test
+task :default => :spec
+task :test    => :spec
 CLEAN << "pkg" << "doc" << "coverage" << ".yardoc"
-Rake::GemPackageTask.new(eval(File.read("babosa.gemspec"))) { |pkg| }
-Rake::TestTask.new(:test) { |t| t.pattern = "test/**/*_test.rb" }
 begin
   require "yard"
@@ -18,11 +17,18 @@ rescue LoadError
 end
 begin
-  require "rcov/rcovtask"
-  Rcov::RcovTask.new do |r|
-    r.test_files = FileList["test/**/*_test.rb"]
-    r.verbose = true
-    r.rcov_opts << "--exclude gems/*"
+  desc "Run SimpleCov"
+  task :coverage do
+    ENV["COV"] = "true"
+    Rake::Task["spec"].execute
   end
 rescue LoadError
 end
+gemspec = File.expand_path("../babosa.gemspec", __FILE__)
+if File.exist? gemspec
+  Rake::GemPackageTask.new(eval(File.read(gemspec))) { |pkg| }
+end
+require 'rspec/core/rake_task'
+RSpec::Core::RakeTask.new(:spec)

data/lib/babosa.rb CHANGED

@@ -16,8 +16,18 @@ class String
       unpack("C*").length
     end
   end
+  # Define unless Active Support has already added this method.
+  if !public_method_defined? :classify
+    # Convert from underscores to class name. E.g.:
+    #     hello_world => HelloWorld
+    def classify
+      split("_").map {|a| a.gsub(/\b('?[a-z])/) { $1.upcase }}.join
+    end
+  end
 end
-require "babosa/characters"
+require "babosa/transliterator/base"
 require "babosa/utf8/proxy"
 require "babosa/identifier"

data/lib/babosa/identifier.rb CHANGED

@@ -1,6 +1,16 @@
 # encoding: utf-8
 module Babosa
+  # Codepoints for characters that will be deleted by +#word_chars!+.
+  STRIPPABLE = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 14, 15, 16, 17, 18, 19,
+    20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 33, 34, 35, 36, 37, 38, 39,
+    40, 41, 42, 43, 44, 45, 46, 47, 58, 59, 60, 61, 62, 63, 64, 91, 92, 93, 94,
+    95, 96, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136,
+    137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151,
+    152, 153, 154, 155, 156, 157, 158, 159, 161, 162, 163, 164, 165, 166, 167,
+    168, 169, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 182, 183, 184,
+    185, 187, 188, 189, 190, 191, 215, 247, 8203, 8204, 8205, 8239, 65279]
   # This class provides some string-manipulation methods specific to slugs.
   #
   # Note that this class includes many "bang methods" such as {#clean!} and
@@ -56,6 +66,14 @@ module Babosa
       normalize_utf8!
     end
+    def ==(value)
+      @wrapped_string.to_s == value.to_s
+    end
+    def eql?(value)
+      @wrapped_string == value
+    end
     def empty?
       # included to make this class :respond_to? :empty for compatibility with Active Support's
       # #blank?
@@ -91,16 +109,12 @@ module Babosa
     # to remove non-ASCII characters such as "¡" and "¿", use {#to_ascii!}:
     #
     #   string.transliterate!(:spanish)       # => "¡Feliz anio!"
-    #   string.transliterate!                 # => "Feliz anio!"
+    #   string.transliterate!                 # => "¡Feliz anio!"
     # @param *args <Symbol>
     # @return String
-    def transliterate!(transliterations = {})
-      if transliterations.kind_of? Symbol
-        transliterations = Characters.approximations[transliterations]
-      else
-        transliterations ||= {}
-      end
-      @wrapped_string = unpack("U*").map { |char| approx_char(char, transliterations) }.flatten.pack("U*")
+    def transliterate!(kind = nil)
+      transliterator = Transliterator.get(kind || :latin).instance
+      @wrapped_string = transliterator.transliterate(@wrapped_string)
     end
     # Converts dashes to spaces, removes leading and trailing spaces, and
@@ -114,7 +128,7 @@ module Babosa
     # anything other than letters, numbers, spaces, newlines and linefeeds.
     # @return String
     def word_chars!
-      @wrapped_string = (unpack("U*") - Characters.strippable).pack("U*")
+      @wrapped_string = (unpack("U*") - Babosa::STRIPPABLE).pack("U*")
     end
     # Normalize the string for use as a URL slug. Note that in this context,
@@ -228,8 +242,9 @@ module Babosa
     end
     %w[transliterate clean downcase word_chars normalize normalize_utf8
-      tidy_bytes to_ascii truncate truncate_bytes upcase with_separators].each do |method|
-      class_eval(<<-EOM, __FILE__, __LINE__ +1)
+      tidy_bytes to_ascii to_ruby_method truncate truncate_bytes upcase
+      with_separators].each do |method|
+      class_eval(<<-EOM, __FILE__, __LINE__ + 1)
         def #{method}(*args)
           send_to_new_instance(:#{method}!, *args)
         end
@@ -253,11 +268,6 @@ module Babosa
     private
-    # Look up the character's approximation in the configured maps.
-    def approx_char(char, transliterations = {})
-      transliterations[char] or Characters.approximations[:latin][char] or char
-    end
     # Used as the basis of the bangless methods.
     def send_to_new_instance(*args)
       id = Identifier.allocate

data/lib/babosa/transliterator/base.rb ADDED

@@ -0,0 +1,89 @@
+# encoding: utf-8
+require 'singleton'
+module Babosa
+  module Transliterator
+    autoload :Latin,     "babosa/transliterator/latin"
+    autoload :Spanish,   "babosa/transliterator/spanish"
+    autoload :German,    "babosa/transliterator/german"
+    autoload :Danish,    "babosa/transliterator/danish"
+    autoload :Serbian,   "babosa/transliterator/serbian"
+    autoload :Cyrillic,  "babosa/transliterator/cyrillic"
+    autoload :Russian,   "babosa/transliterator/russian"
+    autoload :Ukranian,  "babosa/transliterator/ukranian"
+    autoload :Bulgarian, "babosa/transliterator/bulgarian"
+    def self.get(symbol)
+      const_get(symbol.to_s.classify)
+    end
+    class Base
+      include Singleton
+      APPROXIMATIONS = {
+        "×" => "x",
+        "÷" => "/",
+        "‐" => "-",
+        "‑" => "-",
+        "‒" => "-",
+        "–" => "-",
+        "—" => "-",
+        "―" => "-",
+        "―" => "-",
+        "‘" => "'",
+        "‛" => "'",
+        "“" => '"',
+        "”" => '"',
+        "„" => '"',
+        "‟" => '"',
+        '’' => "'",
+        # various kinds of space characters
+        "\xc2\xa0"     => " ",
+        "\xe2\x80\x80" => " ",
+        "\xe2\x80\x81" => " ",
+        "\xe2\x80\x82" => " ",
+        "\xe2\x80\x83" => " ",
+        "\xe2\x80\x84" => " ",
+        "\xe2\x80\x85" => " ",
+        "\xe2\x80\x86" => " ",
+        "\xe2\x80\x87" => " ",
+        "\xe2\x80\x88" => " ",
+        "\xe2\x80\x89" => " ",
+        "\xe2\x80\x8a" => " ",
+        "\xe2\x81\x9f" => " ",
+        "\xe3\x80\x80" => " ",
+      }.freeze
+      attr_reader :approximations
+      def initialize
+        if self.class < Base
+          @approximations = self.class.superclass.instance.approximations.dup
+        else
+          @approximations = {}
+        end
+        self.class.const_get(:APPROXIMATIONS).inject(@approximations) do |memo, object|
+          index       = object[0].unpack("U").shift
+          value       = object[1].unpack("C*")
+          memo[index] = value.length == 1 ? value[0] : value
+          memo
+        end
+        @approximations.freeze
+      end
+      # Accepts a single UTF-8 codepoint and returns the ASCII character code
+      # used as the transliteration value.
+      def [](codepoint)
+        @approximations[codepoint]
+      end
+      def transliterate(string)
+        string.unpack("U*").map {|char| self[char] || char}.flatten.pack("U*")
+      end
+    end
+  end
+end
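
The inheritance scheme in `Base#initialize` is easier to follow with a reduced example. This sketch follows the same idea (each singleton copies its parent's table, then layers its own `APPROXIMATIONS` on top), but the module name `SketchTransliterator` and the tiny tables are made up for illustration and are not the gem's data.

```ruby
require "singleton"

module SketchTransliterator
  class Base
    include Singleton

    APPROXIMATIONS = { "\u2013" => "-" }.freeze # en dash

    attr_reader :approximations

    def initialize
      # Subclasses start from a copy of the parent's table; Base starts empty.
      @approximations =
        self.class < Base ? self.class.superclass.instance.approximations.dup : {}
      # Keys become codepoints; values become one byte or an array of bytes.
      self.class.const_get(:APPROXIMATIONS).each do |char, replacement|
        bytes = replacement.unpack("C*")
        @approximations[char.unpack("U").first] = bytes.length == 1 ? bytes[0] : bytes
      end
      @approximations.freeze
    end

    def transliterate(string)
      # Replace each codepoint that has an entry, then flatten multi-byte values.
      string.unpack("U*").map { |cp| @approximations[cp] || cp }.flatten.pack("U*")
    end
  end

  class Latin < Base
    APPROXIMATIONS = { "\u00fc" => "u", "\u00f6" => "o" }.freeze # ü, ö
  end

  class German < Latin
    APPROXIMATIONS = { "\u00fc" => "ue" }.freeze # overrides Latin's entry for ü
  end
end

input = "M\u00fcller\u2013G\u00f6rg"
puts SketchTransliterator::Latin.instance.transliterate(input)  # Muller-Gorg
puts SketchTransliterator::German.instance.transliterate(input) # Mueller-Gorg
```

In the sketch, `German` overrides only the entry for "ü". The "ö" mapping comes from `Latin` and the dash mapping from `Base`, which is why a per-language class can stay small.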

data/lib/babosa/transliterator/bulgarian.rb ADDED

@@ -0,0 +1,27 @@
+# encoding: utf-8
+module Babosa
+  module Transliterator
+    class Bulgarian < Cyrillic
+      APPROXIMATIONS = {
+        "Ж" => "J",
+        "Й" => "I",
+        "Х" => "H",
+        "Ц" => "C",
+        "Щ" => "Sht",
+        "Ъ" => "U",
+        "Ь" => "I",
+        "Ю" => "Iu",
+        "Я" => "Ia",
+        "ж" => "j",
+        "й" => "i",
+        "х" => "h",
+        "ц" => "c",
+        "щ" => "sht",
+        "ъ" => "u",
+        "ь" => "i",
+        "ю" => "iu",
+        "я" => "ia"
+      }
+    end
+  end
+end

data/lib/babosa/transliterator/cyrillic.rb ADDED

@@ -0,0 +1,111 @@
+# encoding: utf-8
+module Babosa
+  module Transliterator
+    # Approximations are based on GOST 7.79, System B:
+    # http://en.wikipedia.org/wiki/ISO_9#GOST_7.79
+    class Cyrillic < Base
+      APPROXIMATIONS = {
+        "S" => "Z",
+        "j" => "j",
+        "s" => "z",
+        "Ё" => "Yo",
+        "Ѓ" => "G",
+        "Є" => "Ye",
+        "Ї" => "Yi",
+        "Љ" => "L",
+        "Њ" => "N",
+        "Ќ" => "K",
+        "Ў" => "U",
+        "Џ" => "Dh",
+        "А" => "A",
+        "Б" => "B",
+        "В" => "V",
+        "Г" => "G",
+        "Д" => "D",
+        "Е" => "E",
+        "Ж" => "Zh",
+        "З" => "Z",
+        "И" => "I",
+        "Й" => "J",
+        "К" => "K",
+        "Л" => "L",
+        "М" => "M",
+        "Н" => "N",
+        "О" => "O",
+        "П" => "P",
+        "Р" => "R",
+        "С" => "S",
+        "Т" => "T",
+        "У" => "U",
+        "Ф" => "F",
+        "Х" => "X",
+        "Ц" => "Cz",
+        "Ч" => "Ch",
+        "Ш" => "Sh",
+        "Щ" => "Shh",
+        "Ъ" => "",
+        "Ы" => "Y",
+        "Ь" => "",
+        "Э" => "E",
+        "Ю" => "Yu",
+        "Я" => "Ya",
+        "а" => "a",
+        "б" => "b",
+        "в" => "v",
+        "г" => "g",
+        "д" => "d",
+        "е" => "e",
+        "ж" => "zh",
+        "з" => "z",
+        "и" => "i",
+        "й" => "j",
+        "к" => "k",
+        "л" => "l",
+        "м" => "m",
+        "н" => "n",
+        "о" => "o",
+        "п" => "p",
+        "р" => "r",
+        "с" => "s",
+        "т" => "t",
+        "у" => "u",
+        "ф" => "f",
+        "х" => "x",
+        "ц" => "cz",
+        "ч" => "ch",
+        "ш" => "sh",
+        "щ" => "shh",
+        "ъ" => "",
+        "ы" => "y",
+        "ь" => "",
+        "э" => "e",
+        "ю" => "yu",
+        "я" => "ya",
+        "ё" => "yo",
+        "ѓ" => "g",
+        "є" => "ye",
+        "ї" => "yi",
+        "љ" => "l",
+        "њ" => "n",
+        "ќ" => "k",
+        "ў" => "u",
+        "џ" => "dh",
+        "Ѣ" => "Ye",
+        "ѣ" => "ye",
+        "Ѫ" => "O",
+        "ѫ" => "o",
+        "Ѳ" => "Fh",
+        "ѳ" => "fh",
+        "Ѵ" => "Yh",
+        "ѵ" => "yh",
+        "Ґ" => "G",
+        "ґ" => "g",
+      }
+      def transliterate(string)
+        super.gsub(/(c)z([ieyj])/) { "#{$1}#{$2}" }
+      end
+    end
+  end
+end
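
The `gsub` in `Cyrillic#transliterate` post-processes the table output: "cz" is shortened to "c" when it is followed by "i", "e", "y" or "j". The sketch below applies the same pattern to already-romanized strings. The helper name `simplify_cz` is hypothetical, and the inputs are what the table above would produce for the words named in the comments.

```ruby
# Hypothetical helper applying the same pattern as Cyrillic#transliterate.
def simplify_cz(romanized)
  # Keep the "c" and the following letter, drop the "z" between them.
  romanized.gsub(/(c)z([ieyj])/) { "#{$1}#{$2}" }
end

puts simplify_cz("czirk") # cirk  (table output for "цирк")
puts simplify_cz("czar")  # czar  (table output for "царь"; "a" is not in the set)
puts simplify_cz("Czirk") # Czirk (the pattern only matches a lowercase "c")
```

The last line shows that, as written, the pattern leaves a capitalized "Cz" untouched.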