babosa 0.2.2 → 0.3.0

File without changes
data/README.md CHANGED
@@ -13,12 +13,12 @@ FriendlyId.
  ### ASCII transliteration

-     "Gölcük, Turkey".to_slug.approximate_ascii.to_s #=> "Golcuk, Turkey"
+     "Gölcük, Turkey".to_slug.transliterate.to_s #=> "Golcuk, Turkey"

  ### Per-locale transliteration

-     "Jürgen Müller".to_slug.approximate_ascii.to_s #=> "Jurgen Muller"
-     "Jürgen Müller".to_slug.approximate_ascii(:german).to_s #=> "Juergen Mueller"
+     "Jürgen Müller".to_slug.transliterate.to_s #=> "Jurgen Muller"
+     "Jürgen Müller".to_slug.transliterate(:german).to_s #=> "Juergen Mueller"

  Supported languages currently include Danish, German, Serbian and Spanish. I'll
  gladly accept contributions and support more languages.
@@ -59,16 +59,45 @@ in method names, but you may not want to):
      "über cool stuff!".to_slug.to_ruby_method(false) #=> uber_cool_stuff


- You can add not only transliterations, but expansions for some characters if you want:
+ You can easily add custom transliterators for your language with very little code,
+ for example here's the transliterator for German:
+
+     # encoding: utf-8
+     module Babosa
+       module Transliterator
+         class German < Latin
+           APPROXIMATIONS = {
+             "ä" => "ae",
+             "ö" => "oe",
+             "ü" => "ue",
+             "Ä" => "Ae",
+             "Ö" => "Oe",
+             "Ü" => "Ue"
+           }
+         end
+       end
+     end
+
+ And a spec (you can use this as a template):
+
+     # encoding: utf-8
+     require File.expand_path("../../spec_helper", __FILE__)
+
+     describe Babosa::Transliterator::German do
+
+       let(:t) { described_class.instance }
+       it_behaves_like "a latin transliterator"
+
+       it "should transliterate Eszett" do
+         t.transliterate("ß").should eql("ss")
+       end
+
+       it "should transliterate vowels with umlauts" do
+         t.transliterate("üöä").should eql("ueoeae")
+       end
+
+     end

-     Babosa::Characters.add_approximations(:user, {
-       "0" => "oh",
-       "1" => "one",
-       "2" => "two",
-       "3" => "three",
-       "." => " dot "
-     })
-     "Web 2.0".to_slug.normalize!(:transliterations => :user) #=> "web-two-dot-oh"

  ### UTF-8 support

@@ -77,30 +106,29 @@ ActiveSupport gems installed and required prior to requiring "babosa", these
  will be used to perform upcasing and downcasing on UTF-8 strings. On JRuby 1.5
  and above, Java's native Unicode support will be used instead. Unless you're on
  JRuby, which already has excellent support for Unicode via Java's Standard
- Library, I recommend using the Unicode gem because it's the fastest Ruby
- Unicode library available.
+ Library, I recommend using the Unicode gem because it's the fastest Ruby Unicode
+ library available.

  If none of these libraries are available, Babosa falls back to a simple module
- which only supports Latin characters.
+ which **only** supports Latin characters.

  This default module is fast and can do very naive Unicode composition to ensure
- that, for example, "é" will always be composed to a single codepoint rather
- than an "e" and a "´" - making it safe to use as a hash key. But seriously -
- save yourself the headache and install a real Unicode library.
+ that, for example, "é" will always be composed to a single codepoint rather than
+ an "e" and a "´" - making it safe to use as a hash key. But seriously - save
+ yourself the headache and install a real Unicode library.
+
+ If you are using Babosa with a language that uses the Cyrillic alphabet, Babosa
+ requires either Unicode, Active Support or Java.


  ### Rails 3

- Most of Babosa's functionality is already present in Active Support/Rails 3.
- Babosa exists primarily to support non-Rails applications, and Rails apps prior
- to 3.0. Most of the code here was originally written for FriendlyId. Several
- things, like `tidy_bytes` and ASCII transliteration, were later added to Rails
- and I18N.
+ Some of Babosa's functionality is already present in Active Support/Rails 3.

  Babosa differs from ActiveSupport primarily in that it supports non-Latin
- strings by default, and has per-locale ASCII transliterations already baked-in. If
- you are considering using Babosa with Rails 3, you should first take a look at
- Active Support's
+ strings by default, and has per-locale ASCII transliterations already baked-in.
+ If you are considering using Babosa with Rails 3, you should first take a look
+ at Active Support's
  [transliterate](http://edgeapi.rubyonrails.org/classes/ActiveSupport/Inflector.html#M000565)
  and
  [parameterize](http://edgeapi.rubyonrails.org/classes/ActiveSupport/Inflector.html#M000566)
@@ -136,12 +164,15 @@ Please use Babosa's [Github issue tracker](http://github.com/norman/babosa/issue

  ## Contributors

- * [Molte Emil Strange Andersen](http://github.com/molte) - Danish support
- * [Milan Dobrota](http://github.com/milandobrota) - Serbian support
+ * [Alexey Shkolnikov](https://github.com/grlm) - Russian support
+ * [Martin Petrov](https://github.com/martin-petrov) - Bulgarian support
+ * [Molte Emil Strange Andersen](https://github.com/molte) - Danish support
+ * [Milan Dobrota](https://github.com/milandobrota) - Serbian support


  ## Changelog

+ * 0.3.0 - Cyrillic support. Improve support for various Unicode spaces and dashes.
  * 0.2.2 - Fix for "smart" quote handling.
  * 0.2.1 - Implement #empty? for compatibility with Active Support's #blank?.
  * 0.2.0 - Added support for Danish. Added method to generate Ruby identifiers. Improved performance.
data/Rakefile CHANGED
@@ -3,11 +3,10 @@ require "rake/testtask"
  require "rake/clean"
  require "rake/gempackagetask"

- task :default => :test
+ task :default => :spec
+ task :test => :spec

  CLEAN << "pkg" << "doc" << "coverage" << ".yardoc"
- Rake::GemPackageTask.new(eval(File.read("babosa.gemspec"))) { |pkg| }
- Rake::TestTask.new(:test) { |t| t.pattern = "test/**/*_test.rb" }

  begin
    require "yard"
@@ -18,11 +17,18 @@ rescue LoadError
  end

  begin
-   require "rcov/rcovtask"
-   Rcov::RcovTask.new do |r|
-     r.test_files = FileList["test/**/*_test.rb"]
-     r.verbose = true
-     r.rcov_opts << "--exclude gems/*"
+   desc "Run SimpleCov"
+   task :coverage do
+     ENV["COV"] = "true"
+     Rake::Task["spec"].execute
    end
  rescue LoadError
  end
+
+ gemspec = File.expand_path("../babosa.gemspec", __FILE__)
+ if File.exist? gemspec
+   Rake::GemPackageTask.new(eval(File.read(gemspec))) { |pkg| }
+ end
+
+ require 'rspec/core/rake_task'
+ RSpec::Core::RakeTask.new(:spec)
@@ -16,8 +16,18 @@ class String
        unpack("C*").length
      end
    end
+
+   # Define unless Active Support has already added this method.
+   if !public_method_defined? :classify
+     # Convert from underscores to class name. E.g.:
+     # hello_world => HelloWorld
+     def classify
+       split("_").map {|a| a.gsub(/\b('?[a-z])/) { $1.upcase }}.join
+     end
+   end
+
  end

- require "babosa/characters"
+ require "babosa/transliterator/base"
  require "babosa/utf8/proxy"
  require "babosa/identifier"
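The new `String#classify` helper appears to exist so that `Transliterator.get` (added elsewhere in this diff) can turn a symbol such as `:german` into a constant name. The following is a minimal, self-contained sketch of that lookup; the module `SketchTransliterators` and its empty classes are hypothetical stand-ins, not Babosa's own code, and `classify` is written as a module function here rather than as a `String` patch.

```ruby
# Hypothetical stand-in for Babosa::Transliterator, for illustration only.
module SketchTransliterators
  class Latin; end
  class German < Latin; end

  # Same expression as the classify method in the diff:
  # split on underscores, upcase the first letter of each part, rejoin.
  def self.classify(name)
    name.split("_").map { |part| part.gsub(/\b('?[a-z])/) { $1.upcase } }.join
  end

  # Resolve a symbol to a class nested in this module.
  # An unknown name raises NameError, as const_get normally does.
  def self.get(symbol)
    const_get(classify(symbol.to_s))
  end
end

puts SketchTransliterators.classify("hello_world") # prints "HelloWorld"
puts SketchTransliterators.get(:german)            # prints "SketchTransliterators::German"
```

One consequence of this design, if the sketch matches the real behavior, is that adding a language only requires defining a class whose name matches the symbol; no registry needs updating.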
@@ -1,6 +1,16 @@
  # encoding: utf-8
  module Babosa

+   # Codepoints for characters that will be deleted by +#word_chars!+.
+   STRIPPABLE = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 14, 15, 16, 17, 18, 19,
+     20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 33, 34, 35, 36, 37, 38, 39,
+     40, 41, 42, 43, 44, 45, 46, 47, 58, 59, 60, 61, 62, 63, 64, 91, 92, 93, 94,
+     95, 96, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136,
+     137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151,
+     152, 153, 154, 155, 156, 157, 158, 159, 161, 162, 163, 164, 165, 166, 167,
+     168, 169, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 182, 183, 184,
+     185, 187, 188, 189, 190, 191, 215, 247, 8203, 8204, 8205, 8239, 65279]
+
    # This class provides some string-manipulation methods specific to slugs.
    #
    # Note that this class includes many "bang methods" such as {#clean!} and
56
66
  normalize_utf8!
57
67
  end
58
68
 
69
+ def ==(value)
70
+ @wrapped_string.to_s == value.to_s
71
+ end
72
+
73
+ def eql?(value)
74
+ @wrapped_string == value
75
+ end
76
+
59
77
  def empty?
60
78
  # included to make this class :respond_to? :empty for compatibility with Active Support's
61
79
  # #blank?
@@ -91,16 +109,12 @@ module Babosa
      # to remove non-ASCII characters such as "¡" and "¿", use {#to_ascii!}:
      #
      # string.transliterate!(:spanish) # => "¡Feliz anio!"
-     # string.transliterate! # => "Feliz anio!"
+     # string.transliterate! # => "¡Feliz anio!"
      # @param *args <Symbol>
      # @return String
-     def transliterate!(transliterations = {})
-       if transliterations.kind_of? Symbol
-         transliterations = Characters.approximations[transliterations]
-       else
-         transliterations ||= {}
-       end
-       @wrapped_string = unpack("U*").map { |char| approx_char(char, transliterations) }.flatten.pack("U*")
+     def transliterate!(kind = nil)
+       transliterator = Transliterator.get(kind || :latin).instance
+       @wrapped_string = transliterator.transliterate(@wrapped_string)
      end

      # Converts dashes to spaces, removes leading and trailing spaces, and
@@ -114,7 +128,7 @@ module Babosa
      # anything other than letters, numbers, spaces, newlines and linefeeds.
      # @return String
      def word_chars!
-       @wrapped_string = (unpack("U*") - Characters.strippable).pack("U*")
+       @wrapped_string = (unpack("U*") - Babosa::STRIPPABLE).pack("U*")
      end
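The `word_chars!` change keeps the same technique and only moves the codepoint list into `Babosa::STRIPPABLE`: the string is unpacked into codepoints, the unwanted ones are removed with array difference, and the rest is packed back into a string. A minimal standalone sketch of that idea follows; `STRIPPABLE_SKETCH` and `strip_codepoints` are hypothetical names, and the list is a small subset chosen for the example.

```ruby
# encoding: utf-8

# Hypothetical subset of the real list: "!" (33), "," (44), "?" (63), "¡" (161).
STRIPPABLE_SKETCH = [33, 44, 63, 161].freeze

# Remove every occurrence of the listed codepoints from a UTF-8 string.
def strip_codepoints(string, strippable = STRIPPABLE_SKETCH)
  (string.unpack("U*") - strippable).pack("U*")
end

puts strip_codepoints("¡Hola, mundo!") # prints "Hola mundo"
```

Array difference removes all occurrences of each listed codepoint, so no explicit loop or regular expression is needed.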

      # Normalize the string for use as a URL slug. Note that in this context,
@@ -228,8 +242,9 @@ module Babosa
      end

      %w[transliterate clean downcase word_chars normalize normalize_utf8
-       tidy_bytes to_ascii truncate truncate_bytes upcase with_separators].each do |method|
-       class_eval(<<-EOM, __FILE__, __LINE__ +1)
+       tidy_bytes to_ascii to_ruby_method truncate truncate_bytes upcase
+       with_separators].each do |method|
+       class_eval(<<-EOM, __FILE__, __LINE__ + 1)
          def #{method}(*args)
            send_to_new_instance(:#{method}!, *args)
          end
@@ -253,11 +268,6 @@ module Babosa

      private

-     # Look up the character's approximation in the configured maps.
-     def approx_char(char, transliterations = {})
-       transliterations[char] or Characters.approximations[:latin][char] or char
-     end
-
      # Used as the basis of the bangless methods.
      def send_to_new_instance(*args)
        id = Identifier.allocate
@@ -0,0 +1,89 @@
+ # encoding: utf-8
+
+ require 'singleton'
+
+ module Babosa
+
+   module Transliterator
+
+     autoload :Latin, "babosa/transliterator/latin"
+     autoload :Spanish, "babosa/transliterator/spanish"
+     autoload :German, "babosa/transliterator/german"
+     autoload :Danish, "babosa/transliterator/danish"
+     autoload :Serbian, "babosa/transliterator/serbian"
+     autoload :Cyrillic, "babosa/transliterator/cyrillic"
+     autoload :Russian, "babosa/transliterator/russian"
+     autoload :Ukranian, "babosa/transliterator/ukranian"
+     autoload :Bulgarian, "babosa/transliterator/bulgarian"
+
+     def self.get(symbol)
+       const_get(symbol.to_s.classify)
+     end
+
+     class Base
+
+       include Singleton
+
+       APPROXIMATIONS = {
+         "×" => "x",
+         "÷" => "/",
+         "‐" => "-",
+         "‑" => "-",
+         "‒" => "-",
+         "–" => "-",
+         "—" => "-",
+         "―" => "-",
+         "―" => "-",
+         "‘" => "'",
+         "‛" => "'",
+         "“" => '"',
+         "”" => '"',
+         "„" => '"',
+         "‟" => '"',
+         '’' => "'",
+         # various kinds of space characters
+         "\xc2\xa0" => " ",
+         "\xe2\x80\x80" => " ",
+         "\xe2\x80\x81" => " ",
+         "\xe2\x80\x82" => " ",
+         "\xe2\x80\x83" => " ",
+         "\xe2\x80\x84" => " ",
+         "\xe2\x80\x85" => " ",
+         "\xe2\x80\x86" => " ",
+         "\xe2\x80\x87" => " ",
+         "\xe2\x80\x88" => " ",
+         "\xe2\x80\x89" => " ",
+         "\xe2\x80\x8a" => " ",
+         "\xe2\x81\x9f" => " ",
+         "\xe3\x80\x80" => " ",
+       }.freeze
+
+       attr_reader :approximations
+
+       def initialize
+         if self.class < Base
+           @approximations = self.class.superclass.instance.approximations.dup
+         else
+           @approximations = {}
+         end
+         self.class.const_get(:APPROXIMATIONS).inject(@approximations) do |memo, object|
+           index = object[0].unpack("U").shift
+           value = object[1].unpack("C*")
+           memo[index] = value.length == 1 ? value[0] : value
+           memo
+         end
+         @approximations.freeze
+       end
+
+       # Accepts a single UTF-8 codepoint and returns the ASCII character code
+       # used as the transliteration value.
+       def [](codepoint)
+         @approximations[codepoint]
+       end
+
+       def transliterate(string)
+         string.unpack("U*").map {|char| self[char] || char}.flatten.pack("U*")
+       end
+     end
+   end
+ end
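The `Base` class above builds a codepoint-to-ASCII table once per singleton, starting from a copy of the parent class's table so that each subclass only declares its own overrides. The sketch below reproduces that pattern in a self-contained form; `SketchBase` and `SketchGerman` are hypothetical names, the tables are tiny subsets, and the parent check uses `include?(Singleton)` instead of the `self.class < Base` comparison in the diff.

```ruby
# encoding: utf-8
require "singleton"

# Hypothetical stand-in for Babosa::Transliterator::Base, for illustration only.
class SketchBase
  include Singleton

  APPROXIMATIONS = { "–" => "-", "“" => '"' }.freeze

  attr_reader :approximations

  def initialize
    # Inherit the parent's table (if the parent is also a transliterator),
    # then layer this class's own APPROXIMATIONS on top.
    parent = self.class.superclass
    @approximations = parent.include?(Singleton) ? parent.instance.approximations.dup : {}
    self.class.const_get(:APPROXIMATIONS).each do |char, replacement|
      codepoint = char.unpack("U").first
      bytes = replacement.unpack("C*")
      # Single-character replacements are stored as an integer, longer ones as an array.
      @approximations[codepoint] = bytes.length == 1 ? bytes[0] : bytes
    end
    @approximations.freeze
  end

  # Map each codepoint through the table; unknown codepoints pass through unchanged.
  def transliterate(string)
    string.unpack("U*").map { |cp| @approximations[cp] || cp }.flatten.pack("U*")
  end
end

# A subclass only lists what differs from its parent.
class SketchGerman < SketchBase
  APPROXIMATIONS = { "ü" => "ue", "ö" => "oe" }.freeze
end

puts SketchGerman.instance.transliterate("Müller – Köln") # prints "Mueller - Koeln"
```

Because multi-character replacements are stored as arrays, the `flatten` call before `pack("U*")` is what expands "ü" into the two codepoints for "ue"; an empty replacement becomes an empty array and the character is dropped.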
@@ -0,0 +1,27 @@
+ # encoding: utf-8
+ module Babosa
+   module Transliterator
+     class Bulgarian < Cyrillic
+       APPROXIMATIONS = {
+         "Ж" => "J",
+         "Й" => "I",
+         "Х" => "H",
+         "Ц" => "C",
+         "Щ" => "Sht",
+         "Ъ" => "U",
+         "Ь" => "I",
+         "Ю" => "Iu",
+         "Я" => "Ia",
+         "ж" => "j",
+         "й" => "i",
+         "х" => "h",
+         "ц" => "c",
+         "щ" => "sht",
+         "ъ" => "u",
+         "ь" => "i",
+         "ю" => "iu",
+         "я" => "ia"
+       }
+     end
+   end
+ end
@@ -0,0 +1,111 @@
+ # encoding: utf-8
+ module Babosa
+   module Transliterator
+
+     # Approximations are based on GOST 7.79, System B:
+     # http://en.wikipedia.org/wiki/ISO_9#GOST_7.79
+     class Cyrillic < Base
+       APPROXIMATIONS = {
+         "Ѕ" => "Z",
+         "ј" => "j",
+         "ѕ" => "z",
+         "Ё" => "Yo",
+         "Ѓ" => "G",
+         "Є" => "Ye",
+         "Ї" => "Yi",
+         "Љ" => "L",
+         "Њ" => "N",
+         "Ќ" => "K",
+         "Ў" => "U",
+         "Џ" => "Dh",
+         "А" => "A",
+         "Б" => "B",
+         "В" => "V",
+         "Г" => "G",
+         "Д" => "D",
+         "Е" => "E",
+         "Ж" => "Zh",
+         "З" => "Z",
+         "И" => "I",
+         "Й" => "J",
+         "К" => "K",
+         "Л" => "L",
+         "М" => "M",
+         "Н" => "N",
+         "О" => "O",
+         "П" => "P",
+         "Р" => "R",
+         "С" => "S",
+         "Т" => "T",
+         "У" => "U",
+         "Ф" => "F",
+         "Х" => "X",
+         "Ц" => "Cz",
+         "Ч" => "Ch",
+         "Ш" => "Sh",
+         "Щ" => "Shh",
+         "Ъ" => "",
+         "Ы" => "Y",
+         "Ь" => "",
+         "Э" => "E",
+         "Ю" => "Yu",
+         "Я" => "Ya",
+         "а" => "a",
+         "б" => "b",
+         "в" => "v",
+         "г" => "g",
+         "д" => "d",
+         "е" => "e",
+         "ж" => "zh",
+         "з" => "z",
+         "и" => "i",
+         "й" => "j",
+         "к" => "k",
+         "л" => "l",
+         "м" => "m",
+         "н" => "n",
+         "о" => "o",
+         "п" => "p",
+         "р" => "r",
+         "с" => "s",
+         "т" => "t",
+         "у" => "u",
+         "ф" => "f",
+         "х" => "x",
+         "ц" => "cz",
+         "ч" => "ch",
+         "ш" => "sh",
+         "щ" => "shh",
+         "ъ" => "",
+         "ы" => "y",
+         "ь" => "",
+         "э" => "e",
+         "ю" => "yu",
+         "я" => "ya",
+         "ё" => "yo",
+         "ѓ" => "g",
+         "є" => "ye",
+         "ї" => "yi",
+         "љ" => "l",
+         "њ" => "n",
+         "ќ" => "k",
+         "ў" => "u",
+         "џ" => "dh",
+         "Ѣ" => "Ye",
+         "ѣ" => "ye",
+         "Ѫ" => "O",
+         "ѫ" => "o",
+         "Ѳ" => "Fh",
+         "ѳ" => "fh",
+         "Ѵ" => "Yh",
+         "ѵ" => "yh",
+         "Ґ" => "G",
+         "ґ" => "g",
+       }
+
+       def transliterate(string)
+         super.gsub(/(c)z([ieyj])/) { "#{$1}#{$2}" }
+       end
+     end
+   end
+ end
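The `transliterate` override in `Cyrillic` post-processes the table output: "ц" is first mapped to "cz", and the `gsub` then drops the "z" when the next letter is i, e, y or j, which appears to follow the GOST 7.79 System B recommendation cited in the file's comment. A minimal sketch of just that post-processing step, using the same regular expression on already-transliterated text, is shown here; `apply_cz_rule` is a hypothetical helper name.

```ruby
# Collapse "cz" to "c" when it is followed by i, e, y or j.
# Same regular expression as the Cyrillic#transliterate override in the diff.
def apply_cz_rule(text)
  text.gsub(/(c)z([ieyj])/) { "#{$1}#{$2}" }
end

puts apply_cz_rule("czirk") # prints "cirk"
puts apply_cz_rule("czar")  # prints "czar"
puts apply_cz_rule("Czirk") # prints "Czirk"
```

As the last line shows, the pattern as written is case-sensitive, so a capitalized "Cz" is left unchanged. It also runs over the whole output string, so any Latin "cz" already present in the input would be affected in the same way.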