RubyGems - alphabets - Versions diffs - 0.1.0 → 0.1.1 - Mend

alphabets 0.1.0 → 0.1.1

Files changed (11)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: b5d826c435c38e5c8faf7963d7de6e6dcf0e2fb7
-  data.tar.gz: 5187392b8e6fbb12e249709526edf1bb2def2513
+  metadata.gz: 7310d705b53b7f04b8a588b831d71940728aa333
+  data.tar.gz: 3306b5393e10208c4f4f97d523b3a7366042d669
 SHA512:
-  metadata.gz: 0e8aac5a5a65d137c710a9623d444d9f996171caee9869cb32af78cd09b26ac0404fa88a40499279527ba6d9276c029396a34241f507adf289551e9327a6ce05
-  data.tar.gz: 7cb3dda8f5804fc39f67c866c319b24af8ee6119609052a7eb7ff6c0927f1ede8b50cbcf89d98709d83ba72bc900d56efafb4ce038d4f80554c916f057b6b5ec
+  metadata.gz: c3ba94979d0141b763f9370a520e60124bcadb58d1358f87054f7c10973cb35eecbbc3f36a8fdf75589a4b38911124bc1b5dcad05a638e6fb4b3ae83bc4f1edd
+  data.tar.gz: aac6b538a571553b4d6aa8af63d1bbae722863feb9fdaa22a2b7a1ce1caa06798d0e38ce8e8be3d6ecaf98441d433d6ad0f0a410af2056e210e9edf943a910fa

data/{HISTORY.md → CHANGELOG.md} RENAMED

File without changes

data/Manifest.txt CHANGED

@@ -1,4 +1,4 @@
-HISTORY.md
+CHANGELOG.md
 Manifest.txt
 NOTES.md
 README.md

data/NOTES.md CHANGED

@@ -14,8 +14,25 @@ Use Upcase, Downcase AND Titlecase (!)
 ## Libraries
+**Ruby**
 - <https://github.com/SixArm/sixarm_ruby_unaccent> - Replace a string's accent characters with ASCII characters. Based on Perl Text::Unaccent from CPAN.
+- <https://github.com/fractalsoft/diacritics> - support downcase, upcase and permanent link with diacritical characters
+**Perl**
+- <https://metacpan.org/pod/Unicode::Diacritic::Strip> - strip diacritics from Unicode text
+**JavaScript**
+- <https://github.com/dundalek/latinize> -  convert accents (diacritics) from strings to latin characters
+- <https://github.com/tyxla/remove-accents> - removes the accents from a string, converting them to their corresponding non-accented ascii characters
+**PostgreSQL**
+- <https://www.postgresql.org/docs/current/unaccent.html> - unaccent is a text search dictionary that removes accents (diacritic signs) from lexemes
 ## Links
@@ -35,9 +52,142 @@ Proper Unicoding - Ruby's Regexp engine has a powerful feature built in: It can
 Regex with Class - Ruby's regex engine defines a lot of shortcut character classes. Besides the common meta characters (\w, etc.), there is also the POSIX style expressions and the unicode property syntax. This is an overview of all character classes
+**Unicode**
+- <https://unicode.org/reports/tr15/> - Unicode Standard Annex #15 - UNICODE NORMALIZATION FORMS
 **W3C**
 - <https://www.w3.org/TR/charmod-norm/>
 - <https://www.w3.org/International/wiki/Case_folding>
 In Western European languages, the letter 'i' (U+0069) upper cases to a dotless 'I' (U+0049). In Turkish, this letter upper cases to a dotted upper case letter 'İ' (U+0130). Similarly, 'I' (U+0049) lower cases to 'ı' (U+0131), which is a dotless lowercase letter i.
+**Wikipedia**
+- <https://en.wikipedia.org/wiki/Diacritic>
+**More**
+- [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/)
+by Joel Spolsky, 2003
+- [Unicode Normalization in Ruby](https://www.honeybadger.io/blog/ruby-unicode-normalization/) by Starr Horne, 2017
+## Mappings
+Open questions ...
+```
+ Þ  =>  TH    ???
+ þ  =>  th    ???
+```
+## Alphabets
+Add more alphabets... why? why not?
+- Portuguese [Â, "abcdefghijklmnopqrstuvwxyzáâãàçéêíóôõú", "ABCDEFGHIJKLMNOPQRSTUVWXYZÁÂÃÀÇÉÊÍÓÔÕÚ"]
+- Russian [Щ, Ъ, Э, "абвгдеёжзийклмнопрстуфхцчшщъыьэюя", "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ"]
+- Greek [Β, Μ, Χ, Ω, Ή, Ύ, Ώ, ΐ, ΰ, Ϊ, Ϋ]
+- Slovak ["aáäeéiíoóôuúyýbcčdďfghjklĺľmnňpqrŕsštťvwxzž", "AÁÄEÉIÍOÓÔUÚYÝBCČDĎFGHJKLĹĽMNŇPQRŔSŠTŤVWXZŽ"]
+- Italian ["aàbcdeèéfghiìíîlmnoòópqrstuùúvz", "AÀBCDEÈÉFGHIÌÍÎLMNOÒÓPQRSTUÙÚVZ"]
+- Romanian ["aăâbcdefghiîjklmnopqrsștțuvwxyz", "AĂÂBCDEFGHIÎJKLMNOPQRSȘTȚUVWXYZ"]
+- Danish [å, â, ô, Å, Â, Ô]
+```
+    def de
+      { # German
+        downcase:  %w(ä ö ü ß),
+        upcase:    %w(Ä Ö Ü ẞ),
+        permanent: %w(ae oe ue ss)
+      }
+    end
+    def pl
+      { # Polish
+        downcase:  %w(ą ć ę ł ń ó ś ż ź),
+        upcase:    %w(Ą Ć Ę Ł Ń Ó Ś Ż Ź),
+        permanent: %w(a c e l n o s z z)
+      }
+    end
+    def cs
+      { # Czech uses acute (á é í ó ú ý), caron (č ď ě ň ř š ť ž), ring (ů)
+        # aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž
+        # AÁBCČDĎEÉĚFGHIÍJKLMNŇOÓPQRŘSŠTŤUÚŮVWXYÝZŽ
+        downcase:  %w(á é í ó ú ý č ď ě ň ř š ť ů ž),
+        upcase:    %w(Á É Í Ó Ú Ý Č Ď Ě Ň Ř Š Ť Ů Ž),
+        permanent: %w(a e i o u y c d e n r s t u z)
+      }
+    end
+    def fr
+      { # French
+        # abcdefghijklmnopqrstuvwxyzàâæçéèêëîïôœùûüÿ
+        # ABCDEFGHIJKLMNOPQRSTUVWXYZÀÂÆÇÉÈÊËÎÏÔŒÙÛÜŸ
+        downcase:  %w(à â é è ë ê ï î ô ù û ü ÿ ç œ æ),
+        upcase:    %w(À Â É È Ë Ê Ï Î Ô Ù Û Ü Ÿ Ç Œ Æ),
+        permanent: %w(a a e e e e i i o u u ue y c oe ae)
+      }
+    end
+    def it
+      { # Italian
+        downcase:  %w(à è é ì î ò ó ù),
+        upcase:    %w(À È É Ì Î Ò Ó Ù),
+        permanent: %w(a e e i i o o u)
+      }
+    end
+    def eo
+      { # Esperantohas the symbols ŭ, ĉ, ĝ, ĥ, ĵ and ŝ
+        downcase:  %w(ĉ ĝ ĥ ĵ ŝ ŭ),
+        upcase:    %w(Ĉ Ĝ Ĥ Ĵ Ŝ Ŭ),
+        permanent: %w(c g h j s u)
+      }
+    end
+    def is
+      { # Iceland
+        downcase:  %w(ð þ),
+        upcase:    %w(Ð Þ),
+        permanent: %w(d p)
+      }
+    end
+    def pt
+      { # Portugal uses á, â, ã, à, ç, é, ê, í, ó, ô, õ and ú
+        downcase:  %w(ã ç),
+        upcase:    %w(Ã Ç),
+        permanent: %w(a c)
+      }
+    end
+    def sp
+      { # Spanish
+        downcase:  ['ñ', 'õ', '¿', '¡'],
+        upcase:    ['Ñ', 'Õ', '¿', '¡'],
+        permanent: ['n', 'o', '', '']
+      }
+    end
+    def hu
+      { # Hungarian
+        downcase:  %w(ő),
+        upcase:    %w(Ő),
+        permanent: %w(oe)
+      }
+    end
+    def nn
+      { # Norwegian
+        downcase:  %w(æ å),
+        upcase:    %w(Æ Å),
+        permanent: %w(ae a)
+      }
+    end
+```
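
The case-folding note quoted in NOTES.md (Turkish dotted and dotless i) can be tried directly in Ruby. The following is a minimal sketch, assuming Ruby 2.4 or later, where `String#upcase` and `String#downcase` accept a `:turkic` option. It is not part of the alphabets gem, and the variable names are only illustrative.

```ruby
# Default (language-neutral) case mapping
default_upper = 'i'.upcase             # "I" (U+0049)
default_lower = 'I'.downcase           # "i" (U+0069)

# Turkic case mapping (Turkish, Azerbaijani, ...)
turkic_upper = 'i'.upcase(:turkic)     # "İ" (U+0130), dotted capital I
turkic_lower = 'I'.downcase(:turkic)   # "ı" (U+0131), dotless small i

[default_upper, default_lower, turkic_upper, turkic_lower].each do |ch|
  puts format('%s  U+%04X', ch, ch.ord)
end
```

A table-driven approach such as the gem's DOWNCASE mapping has no notion of locale, so a per-language table (or an explicit option like `:turkic`) would be needed to get the Turkish behaviour.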

data/Rakefile CHANGED

@@ -15,7 +15,7 @@ Hoe.spec 'alphabets' do
   # switch extension to .markdown for gihub formatting
   self.readme_file = 'README.md'
-  self.history_file = 'HISTORY.md'
+  self.history_file = 'CHANGELOG.md'
   self.licenses = ['Public Domain']

data/lib/alphabets/alphabets.rb CHANGED

@@ -12,9 +12,9 @@ UNACCENT = Reader.parse( <<TXT )
     Æ AE  æ ae   # ae ligature
           ā a
           ă a
-          ą a
+          ą a    # ą - U+0105 (261) - LATIN SMALL LETTER A WITH OGONEK
-    Ç C   ç c
+    Ç C   ç c    # ç - U+00E7 (231) - LATIN SMALL LETTER C WITH CEDILLA
           ć c
     Č C   č c
@@ -31,7 +31,7 @@ UNACCENT = Reader.parse( <<TXT )
     Í I   í i
           î i
           ī i
-          ı i    # small dotless i
+          ı i    # ı - U+0131 (305) - LATIN SMALL LETTER DOTLESS I
     Ł L   ł l
@@ -41,6 +41,7 @@ UNACCENT = Reader.parse( <<TXT )
     Ö O   ö o
           ó o
+          ò o
           õ o
           ô o
           ø o
@@ -50,15 +51,16 @@ UNACCENT = Reader.parse( <<TXT )
           ř r
     Ś S   ś s
-    Ş S   ş s
+    Ş S   ş s   # ş - U+015F (351) - LATIN SMALL LETTER S WITH CEDILLA
+    Ș S   ș s   # ș - U+0219 (537) - LATIN SMALL LETTER S WITH COMMA BELOW
     Š S   š s
-          ș s   # U+0219
-          ß ss
+          ß ss  # ß - U+00DF (223) - LATIN SMALL LETTER SHARP S
-          ţ t   # U+0163
-          ț t   # U+021B
+    Ţ t   ţ t   # ţ - U+0163 (355) - LATIN SMALL LETTER T WITH CEDILLA
+    Ț t   ț t   # ț - U+021B (539) - LATIN SMALL LETTER T WITH COMMA BELOW
-          þ p    #### fix/check!!!! icelandic - use p is p or th - why? why not?
+          þ p   # þ - U+00FE (254) - LATIN SMALL LETTER THORN
+                #### fix/check!!!! icelandic - use p is p or th - why? why not?
     Ü U   ü u
     Ú U   ú u
@@ -71,6 +73,14 @@ UNACCENT = Reader.parse( <<TXT )
     Ž Z   ž z
 TXT
+##
+# Notes:
+#  Romanian did NOT initially get its Ș/ș and Ț/ț (with comma) letters,
+#  because these letters were initially unified with Ş/ş and Ţ/ţ (with cedilla)
+#  by the Unicode Consortium, considering the shapes with comma beneath
+#  to be glyph variants of the shapes with cedilla.
+#  However, the letters with explicit comma below were later added to the Unicode standard and are also in ISO 8859-16.
 ##  de,at,ch translation for umlauts
 UNACCENT_DE = Reader.parse( <<TXT )
@@ -90,9 +100,9 @@ DOWNCASE = %w[A B C D E F G H I J K L M N O P Q R S T U V W X Y Z].reduce({}) do
     Ä ä
     Á á
     Å å
-    Æ æ   # ae ligature
+    Æ æ   # LATIN LETTER AE  - ae ligature
-    Ç ç
+    Ç ç   # LATIN LETTER C WITH CEDILLA
     Č č
     É é
@@ -103,12 +113,16 @@ DOWNCASE = %w[A B C D E F G H I J K L M N O P Q R S T U V W X Y Z].reduce({}) do
     Ł ł
     Ö ö
-    Œ œ   # oe ligature
+    Œ œ   # LATIN LIGATURE OE
     Ś ś
-    Ş ş
+    Ş ş   # LATIN LETTER S WITH CEDILLA
+    Ș ș   # LATIN LETTER S WITH COMMA BELOW
     Š š
+    Ţ ţ   # LATIN LETTER T WITH CEDILLA
+    Ț ț   # LATIN LETTER T WITH COMMA BELOW
     Ü ü
     Ú ú
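
The note on Romanian letters added in this file (comma below versus cedilla) can be checked by looking at the canonical decomposition of each character. A small sketch, assuming only Ruby's built-in `String#unicode_normalize`; the helper names `nfd_codepoints` and `format_codepoints` are made up for this illustration and do not exist in the gem.

```ruby
# Return the code points of a character after canonical decomposition (NFD)
def nfd_codepoints(char)
  char.unicode_normalize(:nfd).codepoints
end

# Format a list of code points as space-separated U+XXXX strings
def format_codepoints(codepoints)
  codepoints.map { |cp| format('U+%04X', cp) }.join(' ')
end

# s/t with cedilla and s/t with comma below, written as escapes to avoid ambiguity
["\u015F", "\u0219", "\u0163", "\u021B"].each do |char|
  puts "#{char}  #{format_codepoints(char.codepoints)}  =>  #{format_codepoints(nfd_codepoints(char))}"
end
# ş  U+015F  =>  U+0073 U+0327   (s + combining cedilla)
# ș  U+0219  =>  U+0073 U+0326   (s + combining comma below)
# ţ  U+0163  =>  U+0074 U+0327   (t + combining cedilla)
# ț  U+021B  =>  U+0074 U+0326   (t + combining comma below)
```

Because the two variants are distinct code points with distinct decompositions, a lookup table needs an entry for each, which is what the added `Ș`/`ș` and `Ț`/`ț` rows provide.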

data/lib/alphabets/reader.rb CHANGED

@@ -1,62 +1,62 @@
-class Alphabet
-class Reader   ## todo/check: rename to CharReader or something - why? why not?
-  def self.read( path )   ## use - rename to read_file or from_file etc. - why? why not?
-    txt = File.open( path, 'r:utf-8' ).read
-    parse( txt )
-  end
-  def self.parse( txt )
-    h = {}  ## char(acter) table mappings
-    txt.each_line do |line|
-      line = line.strip
-      next if line.empty?
-      next if line.start_with?( '#' )   ## skip comments too
-      ## strip inline (until end-of-line) comments too
-      ##  e.g  ţ  t  ## U+0163
-      ##   =>  ţ  t
-      line = line.sub( /#.*/, '' ).strip
-      ## pp line
-      values = line.split( /[ \t]+/ )
-      ## pp values
-      ## check - must be a even - a multiple of two
-      if values.size % 2 != 0
-        puts "** !!! ERROR !!! - missing mapping pair - mappings must be even (a multiple of two):"
-        pp values
-        exit 1
-      end
-      # add mappings in pairs
-      values.each_slice(2) do |slice|
-        ## pp slice
-        key   = slice[0]
-        value = slice[1]
-        ## check - key must be a single-character/letter in unicode
-        if key.size != 1
-          puts "** !!! ERROR !!! - mapping character must be a single-character, size is #{key.size}"
-          pp slice
-          exit 1
-        end
-        ## check - check for duplicates
-        if h[ key ]
-          puts "** !!! ERROR !!! - duplicate mapping character; key already present"
-          pp slice
-          exit 1
-        else
-          h[ key ] = value
-        end
-      end
-    end
-    h
-  end # method parse
-end # class Reader
-end # class Alphabet
+class Alphabet
+class Reader   ## todo/check: rename to CharReader or something - why? why not?
+  def self.read( path )   ## use - rename to read_file or from_file etc. - why? why not?
+    txt = File.open( path, 'r:utf-8' ).read
+    parse( txt )
+  end
+  def self.parse( txt )
+    h = {}  ## char(acter) table mappings
+    txt.each_line do |line|
+      line = line.strip
+      next if line.empty?
+      next if line.start_with?( '#' )   ## skip comments too
+      ## strip inline (until end-of-line) comments too
+      ##  e.g  ţ  t  ## U+0163
+      ##   =>  ţ  t
+      line = line.sub( /#.*/, '' ).strip
+      ## pp line
+      values = line.split( /[ \t]+/ )
+      ## pp values
+      ## check - must be a even - a multiple of two
+      if values.size % 2 != 0
+        puts "** !!! ERROR !!! - missing mapping pair - mappings must be even (a multiple of two):"
+        pp values
+        exit 1
+      end
+      # add mappings in pairs
+      values.each_slice(2) do |slice|
+        ## pp slice
+        key   = slice[0]
+        value = slice[1]
+        ## check - key must be a single-character/letter in unicode
+        if key.size != 1
+          puts "** !!! ERROR !!! - mapping character must be a single-character, size is #{key.size}"
+          pp slice
+          exit 1
+        end
+        ## check - check for duplicates
+        if h[ key ]
+          puts "** !!! ERROR !!! - duplicate mapping character; key already present"
+          pp slice
+          exit 1
+        else
+          h[ key ] = value
+        end
+      end
+    end
+    h
+  end # method parse
+end # class Reader
+end # class Alphabet

data/lib/alphabets/utils.rb CHANGED

@@ -1,75 +1,75 @@
-class Alphabet
-  def self.frequency_table( name )   ## todo/check: use/rename to char_frequency_table
-    ## calculate the frequency table of letters, digits, etc.
-    freq = Hash.new(0)
-    name.each_char do |ch|
-       freq[ch] += 1
-    end
-    freq
-  end
-  def self.count( freq, mapping_or_chars )
-    chars = if mapping_or_chars.is_a?( Hash )
-              mapping_or_chars.keys
-            else   ## todo/fix: check for is_a? Array and if is String split into Array (on char at a time?) - why? why not?
-              mapping_or_chars  ## assume it's an array/list of characters
-            end
-    chars.reduce(0) do |count,ch|
-      count += freq[ch]
-      count
-    end
-  end
-  def self.sub( name, mapping )   ## todo/check: use a different/better name - gsub/map/replace/fold/... - why? why not?
-    buf = String.new
-    name.each_char do |ch|
-      buf << if mapping[ch]
-                mapping[ch]
-              else
-                ch
-              end
-    end
-    buf
-  end
-  class Unaccenter #Worker    ## todo/change - find a better name - why? why not?
-    def initialize( mapping )
-      @mapping = mapping
-    end
-    def count( freq )      Alphabet.count( freq, @mapping ); end
-    def unaccent( name )   Alphabet.sub( name, @mapping );   end
-  end  # class Unaccent Worker
-  def self.find_unaccenter( key )
-    if key == :de
-      @de ||= Unaccenter.new( UNACCENT_DE )
-      @de
-    else
-      ## use uni(versal) or unicode or something - why? why not?
-      ##  use all or int'l (international) - why? why not?
-      ##  use en  (english) - why? why not?
-      @default ||= Unaccenter.new( UNACCENT )
-      @default
-    end
-  end
-  def self.unaccent( name )
-    @default ||= Unaccenter.new( UNACCENT )
-    @default.unaccent( name )
-  end
-  def self.downcase_i18n( name )    ## our very own downcase for int'l characters / letters
-    sub( name, DOWNCASE )
-  end
-  ## add downcase_uni  - univeral/unicode - why? why not?
-end  # class Alphabet
+class Alphabet
+  def self.frequency_table( name )   ## todo/check: use/rename to char_frequency_table
+    ## calculate the frequency table of letters, digits, etc.
+    freq = Hash.new(0)
+    name.each_char do |ch|
+       freq[ch] += 1
+    end
+    freq
+  end
+  def self.count( freq, mapping_or_chars )
+    chars = if mapping_or_chars.is_a?( Hash )
+              mapping_or_chars.keys
+            else   ## todo/fix: check for is_a? Array and if is String split into Array (on char at a time?) - why? why not?
+              mapping_or_chars  ## assume it's an array/list of characters
+            end
+    chars.reduce(0) do |count,ch|
+      count += freq[ch]
+      count
+    end
+  end
+  def self.sub( name, mapping )   ## todo/check: use a different/better name - gsub/map/replace/fold/... - why? why not?
+    buf = String.new
+    name.each_char do |ch|
+      buf << if mapping[ch]
+                mapping[ch]
+              else
+                ch
+              end
+    end
+    buf
+  end
+  class Unaccenter #Worker    ## todo/change - find a better name - why? why not?
+    def initialize( mapping )
+      @mapping = mapping
+    end
+    def count( freq )      Alphabet.count( freq, @mapping ); end
+    def unaccent( name )   Alphabet.sub( name, @mapping );   end
+  end  # class Unaccent Worker
+  def self.find_unaccenter( key )
+    if key == :de
+      @de ||= Unaccenter.new( UNACCENT_DE )
+      @de
+    else
+      ## use uni(versal) or unicode or something - why? why not?
+      ##  use all or int'l (international) - why? why not?
+      ##  use en  (english) - why? why not?
+      @default ||= Unaccenter.new( UNACCENT )
+      @default
+    end
+  end
+  def self.unaccent( name )
+    @default ||= Unaccenter.new( UNACCENT )
+    @default.unaccent( name )
+  end
+  def self.downcase_i18n( name )    ## our very own downcase for int'l characters / letters
+    sub( name, DOWNCASE )
+  end
+  ## add downcase_uni  - univeral/unicode - why? why not?
+end  # class Alphabet
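
To make the character-by-character substitution in `Alphabet.sub` and the counting in `Alphabet.count` easier to follow, here is a self-contained sketch of the same idea. The two small hashes are hypothetical stand-ins for the gem's `UNACCENT` and `UNACCENT_DE` tables (only four entries each), and the method names `substitute` and `count_mapped` are invented for this example.

```ruby
# Hypothetical stand-ins for the gem's much larger mapping tables
default_map = { 'ä' => 'a',  'ö' => 'o',  'ü' => 'u',  'ß' => 'ss' }
german_map  = { 'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss' }

# Replace each character found in the mapping; keep all other characters as-is
def substitute(text, mapping)
  text.each_char.map { |ch| mapping.fetch(ch, ch) }.join
end

# Count how many characters of the text appear as keys in the mapping
def count_mapped(text, mapping)
  text.each_char.count { |ch| mapping.key?(ch) }
end

name = 'Müller Straße'
puts substitute(name, default_map)    # Muller Strasse
puts substitute(name, german_map)     # Mueller Strasse
puts count_mapped(name, default_map)  # 2
```

The mapping is applied one character at a time, so a replacement value may be longer than one character (`ß` to `ss`), but a key cannot be; this matches the single-character key check in the reader.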

data/lib/alphabets/version.rb CHANGED

@@ -6,7 +6,7 @@
 class Alphabet
   MAJOR = 0    ## todo: namespace inside version or something - why? why not??
   MINOR = 1
-  PATCH = 0
+  PATCH = 1
   VERSION = [MAJOR,MINOR,PATCH].join('.')
   def self.version

data/test/test_reader.rb CHANGED

@@ -1,37 +1,37 @@
-###
-#  to run use
-#     ruby -I ./lib -I ./test test/test_reader.rb
-require 'helper'
-class TestReader < MiniTest::Test
-  def test_parse
-    h = Alphabet::Reader.parse( <<TXT )
-      ## hello
-      Ä  A   ä  a   ## hello
-      Á  A   á  a
-             à  a
-             ã  a
-             â  a   ### yada yada
-      Å  A   å  a
-             æ   ae
-    Ç C ç c
-        ć c
-         ß ss
-TXT
-    pp h
-    assert_equal 'A',   h['Ä']
-    assert_equal 'a',   h['ä']
-    assert_equal 'ae',  h['æ']
-    assert_equal 'ss',  h['ß']
-  end
-end # class TestReader
+###
+#  to run use
+#     ruby -I ./lib -I ./test test/test_reader.rb
+require 'helper'
+class TestReader < MiniTest::Test
+  def test_parse
+    h = Alphabet::Reader.parse( <<TXT )
+      ## hello
+      Ä  A   ä  a   ## hello
+      Á  A   á  a
+             à  a
+             ã  a
+             â  a   ### yada yada
+      Å  A   å  a
+             æ   ae
+    Ç C ç c
+        ć c
+         ß ss
+TXT
+    pp h
+    assert_equal 'A',   h['Ä']
+    assert_equal 'a',   h['ä']
+    assert_equal 'ae',  h['æ']
+    assert_equal 'ss',  h['ß']
+  end
+end # class TestReader

metadata CHANGED

@@ -1,60 +1,54 @@
 --- !ruby/object:Gem::Specification
 name: alphabets
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 0.1.1
 platform: ruby
 authors:
 - Gerald Bauer
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2019-08-14 00:00:00.000000000 Z
+date: 2020-01-07 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rdoc
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - ">="
+    - - "~>"
       - !ruby/object:Gem::Version
         version: '4.0'
-    - - "<"
-      - !ruby/object:Gem::Version
-        version: '7'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - ">="
+    - - "~>"
       - !ruby/object:Gem::Version
         version: '4.0'
-    - - "<"
-      - !ruby/object:Gem::Version
-        version: '7'
 - !ruby/object:Gem::Dependency
   name: hoe
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '3.18'
+        version: '3.16'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '3.18'
+        version: '3.16'
 description: 'alphabets - '
 email: opensport@googlegroups.com
 executables: []
 extensions: []
 extra_rdoc_files:
-- HISTORY.md
+- CHANGELOG.md
 - Manifest.txt
 - NOTES.md
 - README.md
 files:
-- HISTORY.md
+- CHANGELOG.md
 - Manifest.txt
 - NOTES.md
 - README.md
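
The change of the rdoc development dependency from `>= 4.0, < 7` to `~> 4.0` narrows the accepted range, since the pessimistic operator `~> 4.0` means `>= 4.0, < 5.0`. A short sketch using RubyGems' own `Gem::Requirement` and `Gem::Version` classes to compare the two constraints; the sample version numbers are chosen only for illustration.

```ruby
require 'rubygems'

old_req = Gem::Requirement.new('>= 4.0', '< 7')  # constraint in 0.1.0
new_req = Gem::Requirement.new('~> 4.0')         # constraint in 0.1.1

%w[4.0 4.3 5.0 6.1].each do |v|
  version = Gem::Version.new(v)
  puts "rdoc #{v}: old=#{old_req.satisfied_by?(version)} new=#{new_req.satisfied_by?(version)}"
end
# rdoc 4.0: old=true new=true
# rdoc 4.3: old=true new=true
# rdoc 5.0: old=true new=false
# rdoc 6.1: old=true new=false
```

By the same rule, the hoe constraint moving from `~> 3.18` to `~> 3.16` widens the range slightly (from `>= 3.18, < 4` to `>= 3.16, < 4`).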