twitter_cldr 1.3.6 → 1.4.0

Files changed (29)

data/README.md +47 -2
data/lib/twitter_cldr/core_ext/calendars/datetime.rb +2 -2
data/lib/twitter_cldr/core_ext/calendars/timespan.rb +11 -13
data/lib/twitter_cldr/normalizers.rb +3 -0
data/lib/twitter_cldr/normalizers/base.rb +34 -0
data/lib/twitter_cldr/normalizers/nfc.rb +24 -0
data/lib/twitter_cldr/normalizers/nfd.rb +1 -1
data/lib/twitter_cldr/normalizers/nfkc.rb +126 -0
data/lib/twitter_cldr/normalizers/nfkd.rb +9 -17
data/lib/twitter_cldr/shared.rb +1 -1
data/lib/twitter_cldr/shared/code_point.rb +116 -0
data/lib/twitter_cldr/tokenizers/base.rb +2 -2
data/lib/twitter_cldr/utils.rb +8 -0
data/lib/twitter_cldr/version.rb +1 -1
data/resources/unicode_data/blocks_hangul.yml +46 -0
data/resources/unicode_data/composition_exclusions.yml +293 -0
data/resources/unicode_data/decomposition_map.yml +4565 -0
data/spec/normalizers/NormalizationTestShort.txt +66 -66
data/spec/normalizers/base_spec.rb +17 -0
data/spec/normalizers/normalization_spec.rb +10 -0
data/spec/readme_spec.rb +26 -1
data/spec/shared/code_point_spec.rb +152 -0
data/spec/tokenizers/base_spec.rb +0 -10
data/spec/utils/{code_point_spec.rb → code_points_spec.rb} +0 -0
data/spec/utils_spec.rb +10 -0
metadata +16 -10
data/lib/twitter_cldr/shared/unicode_data.rb +0 -64
data/spec/normalizers/nfd_spec.rb +0 -21
data/spec/shared/unicode_data_spec.rb +0 -51

data/README.md CHANGED

@@ -109,6 +109,51 @@ dt = TwitterCldr::LocalizedDateTime.new(DateTime.now, :es)
 dt.to_short_s  # ...etc
 ```
+#### Relative Dates and Times
+In addition to formatting full dates and times, TwitterCLDR supports relative time spans via several convenience methods and the `LocalizedTimespan` class.  TwitterCLDR tries to guess the best time unit (eg. days, hours, minutes, etc) based on the length of the time span.  Unless otherwise specified, TwitterCLDR will use the current date and time as the reference point for the calculation.
+```ruby
+(DateTime.now - 1).localize.ago        # 1 day ago
+(DateTime.now - 0.5).localize.ago      # 12 hours ago  (i.e. half a day)
+(DateTime.now + 1).localize.until      # In 1 day
+(DateTime.now + 0.5).localize.until    # In 12 hours
+```
+Specify other locales:
+```ruby
+(DateTime.now - 1).localize(:de).ago        # Vor 1 Tag
+(DateTime.now + 1).localize(:de).until      # In 1 Tag
+```
+Force TwitterCLDR to use a specific time unit by including the `:unit` option:
+```ruby
+(DateTime.now - 1).localize(:de).ago(:unit => :hour)        # Vor 24 Stunden
+(DateTime.now + 1).localize(:de).until(:unit => :hour)      # In 24 Stunden
+```
+Specify a different reference point for the time span calculation:
+```ruby
+# 86400 = 1 day in seconds, 259200 = 3 days in seconds
+(Time.now + 86400).localize(:de).ago(:unit => :hour, :base_time => (Time.now + 259200))  # Vor 48 Stunden
+```
+Behind the scenes, these convenience methods are creating instances of `LocalizedTimespan`, whose constructor accepts a number of seconds as the first argument.  You can do the same thing if you're feeling adventurous:
+```ruby
+ts = TwitterCldr::LocalizedTimespan.new(86400, :de)
+ts.to_s                    # In 1 Tag
+ts.to_s(:hour)             # In 24 Stunden
+ts = TwitterCldr::LocalizedTimespan.new(-86400, :de)
+ts.to_s                    # Vor 1 Tag
+ts.to_s(:hour)             # Vor 24 Stunden
+```
 ### Plural Rules
 Some languages, like English, have "countable" nouns.  You probably know this concept better as "plural" and "singular", i.e. the difference between "strawberry" and "strawberries".  Other languages, like Russian, have three plural forms: one (numbers ending in 1), few (numbers ending in 2, 3, or 4), and many (everything else).  Still other languages like Japanese don't use countable nouns at all.
@@ -233,7 +278,7 @@ TwitterCLDR provides ways to retrieve individual code points as well as normaliz
 Retrieve data for code points:
 ```ruby
-code_point = TwitterCldr::Shared::UnicodeData.for_code_point("1F3E9")
+code_point = TwitterCldr::Shared::CodePoint.for_hex("1F3E9")
 code_point.name             # "LOVE HOTEL"
 code_point.bidi_mirrored    # "N"
 code_point.category         # "So"
@@ -252,7 +297,7 @@ Convert code points to characters:
 TwitterCldr::Utils::CodePoints.to_string(["00BF"])  # "¿"
 ```
-Normalize/decompose a Unicode string (NFD, NFKD implementations available).  Note that the normalized string will almost always look the same as the original string because most character display systems automatically combine decomposed characters.
+Normalize/decompose a Unicode string (NFD, NFKD, NFC, and NFKC implementations available).  Note that the normalized string will almost always look the same as the original string because most character display systems automatically combine decomposed characters.
 ```ruby
 TwitterCldr::Normalizers::NFD.normalize("français")  # "français"

data/lib/twitter_cldr/core_ext/calendars/datetime.rb CHANGED

@@ -28,14 +28,14 @@ module TwitterCldr
       base_time = options[:base_time] || Time.now
       seconds = self.to_time.base_obj.to_i - base_time.to_i
       raise ArgumentError.new('Start date is after end date. Consider using "until" function.') if seconds > 0
-      TwitterCldr::Shared::LocalizedTimespan.new(seconds, @locale).to_s(options[:unit])
+      TwitterCldr::LocalizedTimespan.new(seconds, @locale).to_s(options[:unit])
     end
     def until(options = {})
       base_time = options[:base_time] || Time.now
       seconds = self.to_time.base_obj.to_i - base_time.to_i
       raise ArgumentError.new('End date is before start date. Consider using "ago" function.') if seconds < 0
-      TwitterCldr::Shared::LocalizedTimespan.new(seconds, @locale).to_s(options[:unit])
+      TwitterCldr::LocalizedTimespan.new(seconds, @locale).to_s(options[:unit])
     end
     def to_s
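
The sign convention shared by `ago` and `until` in this hunk can be illustrated with a small standalone sketch. The helper names (`signed_seconds`, `check_ago!`, `check_until!`) are hypothetical and not part of twitter_cldr; the sketch only mirrors the subtraction and guard clauses shown above, assuming plain `Time` objects.

```ruby
# Hypothetical helpers mirroring the guards in ago/until; not twitter_cldr API.

# Negative result: target lies before the reference point. Positive: after it.
def signed_seconds(target, base_time = Time.now)
  target.to_i - base_time.to_i
end

# `ago` only makes sense for a target at or before the reference point.
def check_ago!(seconds)
  raise ArgumentError, 'Start date is after end date.' if seconds > 0
  seconds
end

# `until` only makes sense for a target at or after the reference point.
def check_until!(seconds)
  raise ArgumentError, 'End date is before start date.' if seconds < 0
  seconds
end

base   = Time.at(259_200) # reference point: 3 days after the epoch
target = Time.at(86_400)  # target: 1 day after the epoch
diff   = check_ago!(signed_seconds(target, base))
puts diff        # -172800
puts diff / 3600 # -48
```

This matches the README example in this release, where a target one day ahead measured against a base time three days ahead is reported as 48 hours ago.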

data/lib/twitter_cldr/core_ext/calendars/timespan.rb CHANGED

@@ -4,23 +4,21 @@
 # http://www.apache.org/licenses/LICENSE-2.0
 module TwitterCldr
-  module Shared
-    class LocalizedTimespan < LocalizedObject
+  class LocalizedTimespan < LocalizedObject
-      def initialize(seconds, locale)
-        @formatter = TwitterCldr::Formatters::TimespanFormatter.new(:locale => locale)
-        @seconds = seconds
-      end
+    def initialize(seconds, locale)
+      @formatter = TwitterCldr::Formatters::TimespanFormatter.new(:locale => locale)
+      @seconds = seconds
+    end
-      def to_s(unit = :default)
-        @formatter.format(@seconds, unit)
-      end
+    def to_s(unit = :default)
+      @formatter.format(@seconds, unit)
+    end
-      protected
+    protected
-      def formatter_const
-        TwitterCldr::Formatters::TimespanFormatter
-      end
+    def formatter_const
+      TwitterCldr::Formatters::TimespanFormatter
     end
   end
 end

data/lib/twitter_cldr/normalizers.rb CHANGED

@@ -5,7 +5,10 @@
 module TwitterCldr
   module Normalizers
+    autoload :Base, 'twitter_cldr/normalizers/base'
     autoload :NFD,  'twitter_cldr/normalizers/nfd'
     autoload :NFKD, 'twitter_cldr/normalizers/nfkd'
+    autoload :NFC,  'twitter_cldr/normalizers/nfc'
+    autoload :NFKC, 'twitter_cldr/normalizers/nfkc'
   end
 end

data/lib/twitter_cldr/normalizers/base.rb ADDED

@@ -0,0 +1,34 @@
+# encoding: UTF-8
+# Copyright 2012 Twitter, Inc
+# http://www.apache.org/licenses/LICENSE-2.0
+module TwitterCldr
+  module Normalizers
+    class Base
+      class << self
+        HANGUL_DECOMPOSITION_CONSTANTS = {
+            :SBase  => 0xAC00,
+            :LBase  => 0x1100,
+            :VBase  => 0x1161,
+            :TBase  => 0x11A7,
+            :LCount => 19,
+            :VCount => 21,
+            :TCount => 28,
+            :NCount => 588,  # VCount * TCount
+            :SCount => 11172 # LCount * NCount
+        }
+        def combining_class_for(code_point)
+          TwitterCldr::Shared::CodePoint.for_hex(code_point).combining_class.to_i
+        rescue NoMethodError
+          0
+        end
+      end
+    end
+  end
+end
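
As a rough illustration of how the Hangul constants above relate to each other, the following standalone sketch composes and decomposes a syllable arithmetically, following Section 3.12 of the Unicode Standard. The constant values are taken from the diff; the method names (`compose_hangul_syllable`, `decompose_hangul_syllable`) and the use of integers instead of hex strings are assumptions made for the example, not the gem's API.

```ruby
# Standalone sketch of Hangul syllable arithmetic (Unicode Standard, Section 3.12).
# Method names are hypothetical; twitter_cldr itself works on hex strings.
S_BASE  = 0xAC00
L_BASE  = 0x1100
V_BASE  = 0x1161
T_BASE  = 0x11A7
V_COUNT = 21
T_COUNT = 28
N_COUNT = V_COUNT * T_COUNT # 588

# Combine a leading consonant, a vowel, and an optional trailing consonant.
def compose_hangul_syllable(l, v, t = nil)
  l_index = l - L_BASE
  v_index = v - V_BASE
  t_index = t ? t - T_BASE : 0 # the trailing consonant may be absent
  S_BASE + (l_index * N_COUNT) + (v_index * T_COUNT) + t_index
end

# Reverse the computation: split a precomposed syllable into its jamo.
def decompose_hangul_syllable(s)
  s_index = s - S_BASE
  l = L_BASE + s_index / N_COUNT
  v = V_BASE + (s_index % N_COUNT) / T_COUNT
  t_index = s_index % T_COUNT
  t_index.zero? ? [l, v] : [l, v, T_BASE + t_index]
end

syllable = compose_hangul_syllable(0x1112, 0x1161, 0x11AB)
puts syllable.to_s(16).upcase # D55C
puts decompose_hangul_syllable(0xD55C).map { |cp| cp.to_s(16).upcase }.join(' ') # 1112 1161 11AB
```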

data/lib/twitter_cldr/normalizers/nfc.rb ADDED

@@ -0,0 +1,24 @@
+# encoding: UTF-8
+# Copyright 2012 Twitter, Inc
+# http://www.apache.org/licenses/LICENSE-2.0
+module TwitterCldr
+  module Normalizers
+    # Implements normalization of a Unicode string to Normalization Form C (NFC).
+    # This normalization includes canonical decomposition followed by canonical composition.
+    #
+    class NFC < NFKC
+      class << self
+        def normalize_code_points(code_points)
+          compose(TwitterCldr::Normalizers::NFD.normalize_code_points(code_points))
+        end
+      end
+    end
+  end
+end

data/lib/twitter_cldr/normalizers/nfd.rb CHANGED

@@ -7,7 +7,7 @@ module TwitterCldr
   module Normalizers
     # Implements normalization of a Unicode string to Normalization Form D (NFD).
-    # This normalization includes only Canonical Decomposition.
+    # This normalization includes only canonical decomposition.
     #
     class NFD < NFKD

data/lib/twitter_cldr/normalizers/nfkc.rb ADDED

@@ -0,0 +1,126 @@
+# encoding: UTF-8
+# Copyright 2012 Twitter, Inc
+# http://www.apache.org/licenses/LICENSE-2.0
+module TwitterCldr
+  module Normalizers
+    # Implements normalization of a Unicode string to Normalization Form KC (NFKC).
+    # This normalization form includes compatibility decomposition followed by compatibility composition.
+    #
+    class NFKC < Base
+      class << self
+        def normalize(string)
+          code_points = TwitterCldr::Utils::CodePoints.from_string(string)
+          normalized_code_points = normalize_code_points(code_points)
+          TwitterCldr::Utils::CodePoints.to_string(normalized_code_points)
+        end
+        def normalize_code_points(code_points)
+          compose(TwitterCldr::Normalizers::NFKD.normalize_code_points(code_points))
+        end
+        protected
+        def compose(code_points)
+          final = []
+          hangul_code_points = []
+          code_points.each_with_index do |code_point, index|
+            final << code_point
+            hangul_type = TwitterCldr::Shared::CodePoint.hangul_type(code_point)
+            next_hangul_type = TwitterCldr::Shared::CodePoint.hangul_type(code_points[index + 1])
+            if valid_hangul_sequence?(hangul_code_points.size, hangul_type)
+              hangul_code_points << code_point
+              unless valid_hangul_sequence?(hangul_code_points.size, next_hangul_type)
+                next_hangul_type = nil
+              end
+            else
+              hangul_code_points.clear
+            end
+            if hangul_code_points.size > 1 && !next_hangul_type
+              hangul_code_points.size.times { final.pop }
+              final << compose_hangul(hangul_code_points)
+              hangul_code_points.clear
+            end
+          end
+          compose_normal(final)
+          final
+        end
+        def valid_hangul_sequence?(buffer_size, hangul_type)
+          case [buffer_size, hangul_type]
+            when [0, :lparts], [1, :vparts], [2, :tparts]
+              true
+            else
+              false
+          end
+        end
+        # Special composition for Hangul syllables. Documented in Section 3.12 at
+        # http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
+        #
+        def compose_hangul(code_points)
+          l_index = code_points.first.hex - HANGUL_DECOMPOSITION_CONSTANTS[:LBase]
+          v_index = code_points[1].hex - HANGUL_DECOMPOSITION_CONSTANTS[:VBase]
+          t_index = code_points[2] ? code_points[2].hex - HANGUL_DECOMPOSITION_CONSTANTS[:TBase] : 0  # tpart may be missing, that's ok
+          lv_index = (l_index * HANGUL_DECOMPOSITION_CONSTANTS[:NCount]) + (v_index * HANGUL_DECOMPOSITION_CONSTANTS[:TCount])
+          (HANGUL_DECOMPOSITION_CONSTANTS[:SBase] + lv_index + t_index).to_s(16).upcase.rjust(4, "0")
+        end
+        # Implements composition of Unicode code points following the guidelines here:
+        # http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf - Section 3.12
+        # Combining code points are combined with their base characters.  For example, "ñ"
+        # can be decomposed into 006E 0303, one code point for the "n" and the "˜" respectively.
+        # Composition reverses this process, turning 006E 0303 into a single 00F1 code point.
+        #
+        def compose_normal(code_points)
+          index = 1
+          while index < code_points.size
+            code_point = code_points[index]
+            combining_class = combining_class_for(code_point)
+            starter_index = find_starter_index(index, code_points)
+            # is this character blocked from combining with the last starter?
+            if starter_index < index - 1
+              previous_combining_class = combining_class_for(code_points[index - 1])
+              blocked = (previous_combining_class == 0) || (previous_combining_class >= combining_class)
+            else
+              blocked = false
+            end
+            unless blocked
+              # do a reverse-lookup for the decomposed code points
+              decomp_data = TwitterCldr::Shared::CodePoint.for_decomposition([code_points[starter_index], code_point])
+              # check if two code points are canonically equivalent
+              if decomp_data && !decomp_data.excluded_from_composition?
+                # combine the characters
+                code_points[starter_index] = decomp_data.code_point
+                code_points.delete_at(index)
+                index -= 1
+              end
+            end
+            index += 1
+          end
+        end
+        def find_starter_index(start_pos, code_points)
+          start_pos.times do |i|
+            return start_pos - i - 1 if combining_class_for(code_points[start_pos - i - 1]) == 0
+          end
+        end
+      end
+    end
+  end
+end
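
For readers who want to sanity-check what the four normalization forms are expected to produce, the sketch below uses Ruby's built-in `String#unicode_normalize` (available in Ruby 2.2 and later) rather than twitter_cldr. It is only a reference point under that assumption; it does not exercise the `NFC` or `NFKC` classes added in this release.

```ruby
# Reference behaviour of the four normalization forms, using Ruby's standard
# library instead of twitter_cldr (an assumption made for this illustration).

decomposed_n = "n\u0303" # "n" followed by COMBINING TILDE
composed_n   = "\u00F1"  # precomposed LATIN SMALL LETTER N WITH TILDE
fi_ligature  = "\uFB01"  # LATIN SMALL LIGATURE FI (compatibility character)
hangul_jamo  = "\u1112\u1161\u11AB" # three conjoining jamo

# Canonical composition turns 006E 0303 into 00F1, as in the comment above.
puts decomposed_n.unicode_normalize(:nfc) == composed_n # true

# Canonical decomposition reverses it.
puts composed_n.unicode_normalize(:nfd) == decomposed_n # true

# The compatibility forms also replace the ligature; the canonical forms do not.
puts fi_ligature.unicode_normalize(:nfkc) == "fi"        # true
puts fi_ligature.unicode_normalize(:nfc) == fi_ligature # true

# Hangul jamo compose into a single precomposed syllable.
puts hangul_jamo.unicode_normalize(:nfc) == "\uD55C"     # true
```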

data/lib/twitter_cldr/normalizers/nfkd.rb CHANGED

@@ -10,7 +10,11 @@ module TwitterCldr
   # latest version at the moment (for Unicode 6.1) is available at http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf.
   #
   module Normalizers
-    class NFKD
+    # Implements normalization of a Unicode string to Normalization Form KD (NFKD).
+    # This normalization form includes only compatibility decomposition.
+    #
+    class NFKD < Base
       class << self
@@ -27,16 +31,16 @@ module TwitterCldr
         protected
         def decomposition(code_points)
-          code_points.map{ |code_point| decompose_recursively(code_point) }.flatten
+          code_points.map { |code_point| decompose_recursively(code_point) }.flatten
         end
         # Recursively decomposes a given code point with the values in its Decomposition Mapping property.
         #
         def decompose_recursively(code_point)
-          unicode_data = TwitterCldr::Shared::UnicodeData.for_code_point(code_point)
+          unicode_data = TwitterCldr::Shared::CodePoint.for_hex(code_point)
           return code_point unless unicode_data
-          if unicode_data.name.include?('Hangul')
+          if unicode_data.hangul_type == :compositions
             decompose_hangul(code_point)
           else
             decompose_regular(code_point, decomposition_mapping(unicode_data))
@@ -139,7 +143,7 @@ module TwitterCldr
         end
         def combining_class_for(code_point)
-          TwitterCldr::Shared::UnicodeData.for_code_point(code_point).combining_class.to_i
+          TwitterCldr::Shared::CodePoint.for_hex(code_point).combining_class.to_i
         rescue NoMethodError
           0
         end
@@ -148,18 +152,6 @@ module TwitterCldr
       COMPATIBILITY_FORMATTING_TAG_REGEXP = /^<.*>$/
-      HANGUL_DECOMPOSITION_CONSTANTS = {
-          :SBase  => 0xAC00,
-          :LBase  => 0x1100,
-          :VBase  => 0x1161,
-          :TBase  => 0x11A7,
-          :LCount => 19,
-          :VCount => 21,
-          :TCount => 28,
-          :NCount => 588,  # VCount * TCount
-          :Scount => 11172 # LCount * NCount
-      }
     end
   end
 end

data/lib/twitter_cldr/shared.rb CHANGED

@@ -10,6 +10,6 @@ module TwitterCldr
     autoload :Languages,   'twitter_cldr/shared/languages'
     autoload :Numbers,     'twitter_cldr/shared/numbers'
     autoload :Resources,   'twitter_cldr/shared/resources'
-    autoload :UnicodeData, 'twitter_cldr/shared/unicode_data'
+    autoload :CodePoint,   'twitter_cldr/shared/code_point'
   end
 end

data/lib/twitter_cldr/shared/code_point.rb ADDED

@@ -0,0 +1,116 @@
+# encoding: UTF-8
+# Copyright 2012 Twitter, Inc
+# http://www.apache.org/licenses/LICENSE-2.0
+module TwitterCldr
+  module Shared
+    CODE_POINT_FIELDS = [
+      :code_point,
+      :name,
+      :category,
+      :combining_class,
+      :bidi_class,
+      :decomposition,
+      :digit_value,
+      :non_decimal_digit_value,
+      :numeric_value,
+      :bidi_mirrored,
+      :unicode1_name,
+      :iso_comment,
+      :simple_uppercase_map,
+      :simple_lowercase_map,
+      :simple_titlecase_map
+    ]
+    CodePoint = Struct.new(*CODE_POINT_FIELDS) do
+      DECOMPOSITION_DATA_INDEX = 5
+      def hangul_type
+        CodePoint.hangul_type(code_point)
+      end
+      def excluded_from_composition?
+        CodePoint.excluded_from_composition?(code_point)
+      end
+      class << self
+        def for_hex(code_point)
+          target = get_block(code_point.rjust(4, "0").upcase)
+          if target && target.first
+            block_data = TwitterCldr.get_resource(:unicode_data, target.first)
+            code_point_data = block_data.fetch(code_point.to_sym) { |code_point_sym| get_range_start(code_point_sym, block_data) }
+            CodePoint.new(*code_point_data) if code_point_data
+          else
+            nil
+          end
+        end
+        def for_decomposition(code_points)
+          @decomposition_map ||= TwitterCldr.get_resource(:unicode_data, :decomposition_map)
+          key = code_points.join(" ").to_sym
+          if @decomposition_map.include?(key)
+            for_hex(@decomposition_map[key])
+          else
+            nil
+          end
+        end
+        def hangul_type(code_point)
+          if code_point
+            code_point_int = code_point.hex
+            [:lparts, :vparts, :tparts, :compositions, :decompositions].each do |type|
+              hangul_blocks[type].each do |range|
+                return type if range.include?(code_point_int)
+              end
+            end
+          end
+          nil
+        end
+        def excluded_from_composition?(code_point)
+          code_point_int = code_point.hex
+          composition_exclusions.any? { |excl| excl.include?(code_point_int) }
+        end
+        protected
+        def hangul_blocks
+          @hangul_blocks ||= TwitterCldr.get_resource(:unicode_data, :blocks_hangul)
+        end
+        def composition_exclusions
+          @composition_exclusions ||= TwitterCldr.get_resource(:unicode_data, :composition_exclusions)
+        end
+        def get_block(code_point)
+          blocks = TwitterCldr.get_resource(:unicode_data, :blocks)
+          code_point_int = code_point.hex
+          # Find the target block
+          blocks.find do |block_name, range|
+            range.include?(code_point_int)
+          end
+        end
+        # Check if block constitutes a range. The code point beginning a range will have a name enclosed in <>, ending with 'First'
+        # eg: <CJK Ideograph Extension A, First>
+        # http://unicode.org/reports/tr44/#Code_Point_Ranges
+        def get_range_start(code_point, block_data)
+          start_code_point = block_data.keys.sort_by { |key| key.to_s.hex }.first
+          start_data = block_data[start_code_point].clone
+          if start_data[1] =~ /<.*, First>/
+            start_data[0] = code_point.to_s
+            start_data[1] = start_data[1].sub(', First', '')
+            start_data
+          end
+        end
+      end
+    end
+  end
+end
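
The range-based block lookup in `get_block` can be sketched in isolation as follows. The `BLOCKS` table here is a tiny hand-written subset of real Unicode block ranges and the `block_for` helper is hypothetical; the gem loads its block data through `TwitterCldr.get_resource`, whose resource format is not shown in this diff.

```ruby
# Minimal sketch of a range-based block lookup. BLOCKS and block_for are
# illustrative assumptions, not twitter_cldr's data or API.
BLOCKS = {
  :basic_latin        => 0x0000..0x007F,
  :latin_1_supplement => 0x0080..0x00FF,
  :hangul_syllables   => 0xAC00..0xD7AF
}

# Takes a hex string such as "00F1" and returns the [name, range] pair of the
# first block whose range contains it, or nil when no block matches.
def block_for(hex_code_point)
  code_point_int = hex_code_point.hex
  BLOCKS.find { |_name, range| range.include?(code_point_int) }
end

name, range = block_for("00F1")
puts name                    # latin_1_supplement
puts range.first.to_s(16)    # 80
puts block_for("D55C").first # hangul_syllables
puts block_for("0400").inspect # nil
```

A linear scan over ranges is simple and adequate for a small table; a sorted array with binary search would be a natural alternative if the lookup became a bottleneck.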