RubyGems - debilinguifier - Versions diffs - 0.1.0 → 1.0.0 - Mend

debilinguifier 0.1.0 → 1.0.0

Files changed (9)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 4b1d6cc0a60c5fce3deb06f4aa276333dafb0856
-  data.tar.gz: 38b2458b5f352ec40a857c963d37408319325339
+  metadata.gz: f5b8ba510a26476e167291f6eaa489f4e7afe617
+  data.tar.gz: 0c979ecf4e2fa145179d17b011597d1ef8c3995d
 SHA512:
-  metadata.gz: 42933e09d17df8926f2bd567a94758f735b08574922eaf5bd2228af9962fb40ddfaf86126283af23d79019c534568a86c0c6fbc2860749b53afb9ae873078254
-  data.tar.gz: 2d7512d15a5770cc59dab2e354347bf2388ecc07ae5956c4798705f29ca1c656e76f1f15be4290d6691866b71bf4c8ddc0eeea3419236533f32b2f9272f37f21
+  metadata.gz: 7ceff85e93f06d3f4281aad75abe644bcdad56f286c334652d9bbb8b4ab560f76c24ddca210bbde4a7c012ea795d928a5d8fb42958ddbc7ae8ccbbab25a5e53a
+  data.tar.gz: 06c304360041ea89407fbba9c5f43556484c31ffefdb83020a6301f5604c42e328e4c7a14102d15610841b7ff80c57c20225bc8a016a05b957f33f30d54815d1
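
Note: checksums.yaml records SHA1 and SHA512 digests for the two archive members of the .gem file (metadata.gz and data.tar.gz), so both change whenever the packaged content changes. A minimal sketch of how such digests could be recomputed for comparison, assuming the members have already been extracted from the .gem tar archive; the helper name `gem_member_digests` is hypothetical and not part of RubyGems:

```ruby
require 'digest/sha1'
require 'digest/sha2'

# Hypothetical helper: returns the two digests that checksums.yaml lists
# for one archive member, given that member's raw bytes.
def gem_member_digests(bytes)
  {
    'SHA1'   => Digest::SHA1.hexdigest(bytes),
    'SHA512' => Digest::SHA512.hexdigest(bytes)
  }
end
```

With an extracted member on disk, something like `gem_member_digests(File.binread('data.tar.gz'))` should yield values that can be compared against the `+` lines above.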

data/LICENSE.txt CHANGED

@@ -18,3 +18,6 @@ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
 LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
 OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
 WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+IF YOU ARE READING TO THIS LINE YOU ARE AS CRAZY AS I AM. USE THE
+CODE IN HERE AS YOU PLEASE. IT IS NOT LIKE I REINVENTED THE WHEEL.

data/README.rdoc CHANGED

@@ -1,25 +1,53 @@
 = debilinguifier
+<em>Keep in mind that this</em> <b>only</b> <em>works with uppercase characters. For clarity and better readbility, due to the nature of what we are trying to accomplish, most of the examples are in lowercase. Always consider the uppercase equivalent of any such string (in ruby terms: </em> <tt>string.upcase</tt> <em>)</em>
+=== Explanation (The short version):
 This is a module to help me sanitize an already populated db. The db contains company names and product descriptions in capital letters. Users populating the db had been careless enough to allow both [greek, latin] characters to be used in a single word or phrase, making it difficult to alphabetically sort them or search for them.
 The purpose of this gem is to help me import the data into a new db (for a completely different app) in a more deterministic way.
+=== Explanation (The long version):
-* Ruby version 2.4.0 is the minimum required due to [String supports Unicode case mappings] https://www.ruby-lang.org/en/news/2016/12/25/ruby-2-4-0-released/
+debilinguifier (in greek: αποδιγλωσσοποιητής): A word that does not exist in neither english or greek and attempts to describe the following behaviour:
-* To use this gem, just add it (require 'debilinguifier') and call:  DeBiLinguifier.dbl(your_string_to_de-bi-linguified)
+Latin and greek capital letters have confused the users of an existing system, by their identical looks. The users have -over the years- manually populated a database with entries, which are lacking consistency due to those looks. By convention, only capital letters are used in the system, and as a result, one can find entries as the following (please consider the following in uppercase!): αlpha, vιτα, γama and so on. In every one of these cases we have come across, the solution is relatively simple:
+1. The phrase (or word) is already written in greek-only or latin-only characters. In this case return it as-is.
+2. The phrase (or word) is written using characters from both charsets, but can be written using only one (or even both) charsets. In this case, replace the similar looking characters and return the result (for example if it contains characters like 'Φ' XOR 'C', which have no equivalent in the other charset but the rest of the characters are common looking in both charsets, return it in the 'correct' charset. If it can be returned in both charsets (eg. αυto), return it in greek charset (in this example αυτο). <em>This is actually cases 2 and 3 in our code.</em> And here comes the tricky part (there was no such example in our case, but "what if...?"!):
+3. The word (or phrase) contains characters from both charsets, that have no equivalent in the other charset (eg. contains both 'Ψ' and 'C'). If it a single phrase like 'cv φωta' there is no real problem: you simply split the phrase, apply the above rules to each word and end up with 'cv φωτα'. And everybody is happy.
+4. But what if a word is like 'c3ψima'? What kind of query would end up finding this entry in the db? (By the way, the whole idea is that queries will be executed in an AJAX way). The solution was to make our dbl(abbr. for de-bi-linguifier) return it using a bias, which by default is 'greek': it will return 'c3ψιμα'. (You can also choose to use a 'latin' bias, or set it to anything other than that to return the initial word as-is). <em> 3 and 4 correspond to case 4 in our code.</em>
-* Please note that this only works with capital and detoned characters.
+<em>As you may have figured out, the 4th case above attempts to solve a problem that currently does not exist and although it (probably) does not fail miserably, it is outside the scope of our specifications. As such, it only attempts to provide a possible solution for a problem that will probably never appear: it is nearly impossible to find a reason to write a word using simultaneously characters from both charsets on purpose. In any case, if such a problem comes up and the implemented solution is not satisfying, we will revise the code. </em>
+The above 4 cases do not correspond to the 4 cases in the module
+What we are accomplishing: we can now search in our new db for something that looked like 'ATIMA', but was written as 'aτιμa'.upcase. To be able to succefully query the db we have to check a couple of things before each query:
+* If the phrase can be written in both greek and latin charsets (in other words, can_write_only_greek?(input) AND can_write_only_latin?(input) returns true); then the query must be a union of the results of two subqueries: one with the result of return_in_greek(input) and one with the result of return_in_latin(input).
+* If the phrase can be written in only one of the two charsets (in other words, can_write_only_greek?(input) XOR can_write_only_latin?(input) returns true); just run through dbl the phrase (just in case the user is using mixed charset) and you are good to go with your query.
+* If the phrase cannot be written with a single charset (in other words can_write_only_greek?(input) OR can_write_only_latin?(input) returns false): Run the phrase through dbl with the same bias as the one used when importing the data!
+<em>(Note to myself: I will implement the above functionality in a method later on, when it will be needed. For the time being, I have only documented how this will be done)</em>
+---
+=== Notes:
+* Ruby version 2.4.0 is the minimum required due to: {String supports Unicode case mappings}[https://bugs.ruby-lang.org/issues/10085] found in {Ruby 2.4.0 release announcement}[https://www.ruby-lang.org/en/news/2016/12/25/ruby-2-4-0-released/]
+* To use this gem, just add it (require 'debilinguifier') and call:  DeBiLinguifier.dbl(your_string_to_de-bi-linguified)
+* Please note that this only works with capital and detoned characters.
+* This is my very first gem and I am very proud of it!
+* I used {Juwelier}[https://github.com/flajann2/juwelier] to create it.
-This is my very first gem and I used Juwelier https://github.com/flajann2/juwelier to create it.
+---
+To install it, just run: <tt>gem install debilinguifier</tt>
+To add it to your project: <tt>require 'debilinguifier'</tt>
+To use it: <tt>DeBiLinguifier.dbl(<em>your_string</em>)</tt> <em>It will return you the dbl'ed string</em>
-== Contributing to debilinguifier
+=== Contributing to debilinguifier
 * Check out the latest master to make sure the feature hasn't been implemented or the bug hasn't been fixed yet.
 * Check out the issue tracker to make sure someone already hasn't requested it and/or contributed it.
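
Note: the new README text describes, but does not implement, how a search phrase should be prepared before querying the migrated db. The sketch below is one possible reading of those three bullets, not the gem's code. The look-alike table, the `dbl_sketch` stand-in and `search_terms` are all assumptions made for illustration; only the method names `can_write_only_greek?`, `can_write_only_latin?`, `return_in_greek` and `return_in_latin` come from the diff, and their bodies here are guesses. Splitting on a single space is also an assumption, since the gem's SYMBOLS constant is not shown.

```ruby
# Hypothetical look-alike table (uppercase, detoned letters only).
# The gem's real mapping is not part of this diff.
GREEK_TO_LATIN = {
  "\u0391" => 'A', "\u0392" => 'B', "\u0395" => 'E', "\u0396" => 'Z',
  "\u0397" => 'H', "\u0399" => 'I', "\u039A" => 'K', "\u039C" => 'M',
  "\u039D" => 'N', "\u039F" => 'O', "\u03A1" => 'P', "\u03A4" => 'T',
  "\u03A5" => 'Y', "\u03A7" => 'X'
}.freeze
LATIN_TO_GREEK = GREEK_TO_LATIN.invert.freeze

LATIN_LETTER = /[A-Z]/.freeze
GREEK_LETTER = /[\u0391-\u03A9]/.freeze

# True when every latin letter has a greek look-alike.
def can_write_only_greek?(input)
  input.scan(LATIN_LETTER).all? { |ch| LATIN_TO_GREEK.key?(ch) }
end

# True when every greek letter has a latin look-alike.
def can_write_only_latin?(input)
  input.scan(GREEK_LETTER).all? { |ch| GREEK_TO_LATIN.key?(ch) }
end

def return_in_greek(input)
  input.gsub(LATIN_LETTER) { |ch| LATIN_TO_GREEK.fetch(ch, ch) }
end

def return_in_latin(input)
  input.gsub(GREEK_LETTER) { |ch| GREEK_TO_LATIN.fetch(ch, ch) }
end

# Stand-in for DeBiLinguifier.dbl, following the four cases in the diff.
def dbl_sketch(input, bias = 'greek')
  return input if input !~ LATIN_LETTER || input !~ GREEK_LETTER  # Case 1
  return return_in_greek(input) if can_write_only_greek?(input)   # Case 2
  return return_in_latin(input) if can_write_only_latin?(input)   # Case 3

  words = input.split(/(?<= )/)                                   # Case 4
  return words.map { |w| dbl_sketch(w, bias) }.join if words.length > 1

  case bias
  when 'greek' then return_in_greek(input)
  when 'latin' then return_in_latin(input)
  else input
  end
end

# Returns the strings to search for. Two entries mean the caller should
# take the union of two sub-queries, as the README describes.
def search_terms(input, bias = 'greek')
  if can_write_only_greek?(input) && can_write_only_latin?(input)
    [return_in_greek(input), return_in_latin(input)].uniq
  else
    [dbl_sketch(input, bias)]
  end
end
```

Under these assumptions a phrase such as 'ATIMA' produces two search terms (one per charset), while 'fψaι'.upcase produces a single term that depends on the bias, which is why the README insists on using the same bias for importing and for querying.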

data/Rakefile CHANGED

@@ -11,6 +11,8 @@ rescue Bundler::BundlerError => e
 end
 require 'rake'
 require 'juwelier'
 Juwelier::Tasks.new do |gem|
   # gem is a Gem::Specification... see http://guides.rubygems.org/specification-reference/ for more options
@@ -18,9 +20,11 @@ Juwelier::Tasks.new do |gem|
   gem.homepage = "http://github.com/apapamichalis/debilinguifier"
   gem.license = "MIT"
   gem.summary = %Q{A [greek, latin] debilinguifier}
-  gem.description = %Q{The purpose of this gem is to return a phrase written using two charsets due to user's mistake. The reason behind this is that we have a db we want to migrate populated with such entries and we want to somehow sanitize it. The db contains company and product names in capital letters (e.g. the user might have written "komπολοι".upcase instead of "κομπολοι".upcase", resulting in a string that in capital letters seems to be the same, but in practice is not)}
+  gem.description = %Q{The purpose of this gem is to return a phrase written using two charsets [greek, latin] (uppercase characters only) due to user's mistake.}
   gem.email = "dimxer@hotmail.com"
   gem.authors = ["apapamichalis"]
+  # This gem will work with 1.8.6 or greater...
+  gem.required_ruby_version = '>= 2.4.0'
   # dependencies defined in Gemfile
 end
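
Note: the added `required_ruby_version` constraint is what RubyGems evaluates at install time, so the retained comment about 1.8.6 no longer matches the effective requirement. A small sketch of how the constraint behaves, using the standard `Gem::Requirement` and `Gem::Version` classes; the constant and helper names are hypothetical:

```ruby
require 'rubygems'

# Same constraint string as the one added in this hunk.
REQUIRED_RUBY_SKETCH = Gem::Requirement.new('>= 2.4.0')

# Hypothetical helper: would a given Ruby version satisfy the constraint?
def ruby_supported?(version_string)
  REQUIRED_RUBY_SKETCH.satisfied_by?(Gem::Version.new(version_string))
end
```

For example, `ruby_supported?(RUBY_VERSION)` reports whether the running interpreter meets the requirement.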

data/VERSION CHANGED

@@ -1 +1 @@
-0.1.0
+1.0.0

data/debilinguifier.gemspec CHANGED

@@ -2,17 +2,17 @@
 # DO NOT EDIT THIS FILE DIRECTLY
 # Instead, edit Juwelier::Tasks in Rakefile, and run 'rake gemspec'
 # -*- encoding: utf-8 -*-
-# stub: debilinguifier 0.1.0 ruby lib
+# stub: debilinguifier 1.0.0 ruby lib
 Gem::Specification.new do |s|
   s.name = "debilinguifier".freeze
-  s.version = "0.1.0"
+  s.version = "1.0.0"
   s.required_rubygems_version = Gem::Requirement.new(">= 0".freeze) if s.respond_to? :required_rubygems_version=
   s.require_paths = ["lib".freeze]
   s.authors = ["apapamichalis".freeze]
-  s.date = "2017-03-26"
-  s.description = "The purpose of this gem is to return a phrase written using two charsets due to user's mistake. The reason behind this is that we have a db we want to migrate populated with such entries and we want to somehow sanitize it. The db contains company and product names in capital letters (e.g. the user might have written \"kom\u03C0\u03BF\u03BB\u03BF\u03B9\".upcase instead of \"\u03BA\u03BF\u03BC\u03C0\u03BF\u03BB\u03BF\u03B9\".upcase\", resulting in a string that in capital letters seems to be the same, but in practice is not)".freeze
+  s.date = "2017-04-03"
+  s.description = "The purpose of this gem is to return a phrase written using two charsets [greek, latin] (uppercase characters only) due to user's mistake.".freeze
   s.email = "dimxer@hotmail.com".freeze
   s.extra_rdoc_files = [
     "LICENSE.txt",
@@ -35,6 +35,7 @@ Gem::Specification.new do |s|
   ]
   s.homepage = "http://github.com/apapamichalis/debilinguifier".freeze
   s.licenses = ["MIT".freeze]
+  s.required_ruby_version = Gem::Requirement.new(">= 2.4.0".freeze)
   s.rubygems_version = "2.6.8".freeze
   s.summary = "A [greek, latin] debilinguifier".freeze

data/lib/debilinguifier.rb CHANGED

@@ -15,6 +15,7 @@ module DeBiLinguifier
   GREEK_ALPHABET_PLUS_SYMBOLS = Regexp.new("(^[Α-Ω#{SYMBOLS}]+)+$").freeze
+  ##
   # Only works with latin and greek charsets.
   # An input phrase can only be one of five things:
   # 1)      Already only in greek or only in latin charset.
@@ -22,14 +23,19 @@ module DeBiLinguifier
   # 3)      Written in a mixed charset, but can be written with just the latin charset.
   # 4)      Written in a mixed charset, but cannot be written with only one of the [greek, latin] charsets.
   #           In this case we split the phrase into words and apply the above rules to each word seperately.
-  #           If case 4 applies to a single word, there is nothing more we can do for it than return it "as is".
-  # 5)      Written in a mixed charset, but can be written either with just the greek charset or just the latin charset.
+  #           If case 4 applies to a single word, then we have to return it greek-ified or latin-ified.
+  #           This way we will be able to produce SQL queries in a more deterministic way.
+  #           (Actually, when searching for a phrase that has been processed by our dbl before writting to the db,
+  #            we will also have to process through dbl the phrase we are looking for before quering the db).
+  #
+  # 5)      Written in a mixed charset, but can be written either with just the greek charset or just the latin charset
+  #         (greek bias is the default and only behavior in this case)
   #
   # Note:   We are deliberately ignoring case 5, as it is of no use at the moment as a separate case.
   # It is actually the initersection of cases 2 and 3. Using case 2 instead.
   # @params input [String] the string we want to de-bi-linguify (!)
   # @return       [String] the de-bi-linguized string
-  def dbl(input)
+  def dbl(input, bias='greek')
     if(is_greek_only?(input) || is_latin_only?(input)) # Case 1
       input
     elsif(can_write_only_greek?(input))                # Case 2
@@ -37,7 +43,7 @@ module DeBiLinguifier
     elsif(can_write_only_latin?(input))                # Case 3
       return_in_latin(input)
     else                                               # Case 4
-      return_in_mixed_charset(input)
+      return_in_mixed_charset(input, bias)
     end
   end
@@ -72,13 +78,21 @@ module DeBiLinguifier
   end
   # Return the phrase using both charsets
-  def return_in_mixed_charset(input)
+  def return_in_mixed_charset(input, bias)
     # Split the phrase in words and recursively try to return each word in the "correct" charset
-    # If that is not possible (e.g. a word contains both "Φ" and "C", return it as it was originally
+    # If that is not possible (e.g. a word contains both "Φ" and "C", the word must either be greek-ified (default)
+    # or latin-ified. The reason for this is that we will be able to do SQL queries, as long as the word - or phrase
+    # we are looking for has been passed through dbl.
     # We first split the input phrase, based on the SYMBOLS delimiters
     words_arr = input.split(/(?<=[#{SYMBOLS}])/)
-    if words_arr.length == 1            # If it was only one word, return it.
-      return (words_arr.join.to_s)
+    if words_arr.length == 1            # If it was only one word, return it, according to the bias.
+      if bias == 'greek'
+        return return_in_greek(words_arr.join.to_s)  # If the bias is 'greek', return the word 'greek-ified'
+      elsif bias == 'latin'
+        return return_in_latin(words_arr.join.to_s)  # Else if bias is 'latin' return it 'latinified'.
+      else
+        return (words_arr.join.to_s)                 # Else return it as-is (not advisable!)
+      end
     else                                # Else apply dbl to each word we got after splitting input
       words_arr2 =[]
       words_arr.each do |word|
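
Note: the split in `return_in_mixed_charset` uses a lookbehind, so each delimiter stays attached to the word before it and the pieces can be joined back without losing characters. A minimal sketch of that behaviour, assuming a delimiter set of a single space; the gem's actual SYMBOLS constant is defined outside this hunk, and `SYMBOLS_SKETCH` and `split_keeping_delimiters` are hypothetical names:

```ruby
# Hypothetical delimiter set; the real SYMBOLS constant is not shown in the diff.
SYMBOLS_SKETCH = ' '

# Splits after every delimiter instead of on it, so no character is dropped.
def split_keeping_delimiters(input)
  input.split(/(?<=[#{SYMBOLS_SKETCH}])/)
end
```

A single word yields a one-element array, which is the condition the new code uses to decide when the bias applies.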

data/spec/debilinguifier_spec.rb CHANGED

@@ -12,7 +12,25 @@ describe DeBiLinguifier do
         expect(DeBiLinguifier.dbl(input.upcase)).to eq(output.upcase)
       end
     end
+  end
+  context 'When using the bias' do
+    it 'should work with greek' do
+      expect(DeBiLinguifier.dbl('fψaι'.upcase, 'greek')).to eq('fψαι'.upcase)
+    end
+    it 'should work with latin' do
+      expect(DeBiLinguifier.dbl('fψaι'.upcase, 'latin')).to eq('fψai'.upcase)
+    end
+    it 'should work with nil' do
+      expect(DeBiLinguifier.dbl('fψaι'.upcase, nil)).to eq('fψaι'.upcase)
+    end
+    it 'should work with anything other than ["greek", "latin"] like it was nil' do
+      expect(DeBiLinguifier.dbl('fψaι'.upcase, 'asdf')).to eq('fψaι'.upcase)
+    end
   end
 end

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: debilinguifier
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 1.0.0
 platform: ruby
 authors:
 - apapamichalis
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2017-03-26 00:00:00.000000000 Z
+date: 2017-04-03 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: shoulda
@@ -95,11 +95,7 @@ dependencies:
       - !ruby/object:Gem::Version
         version: '3.5'
 description: The purpose of this gem is to return a phrase written using two charsets
-  due to user's mistake. The reason behind this is that we have a db we want to migrate
-  populated with such entries and we want to somehow sanitize it. The db contains
-  company and product names in capital letters (e.g. the user might have written "komπολοι".upcase
-  instead of "κομπολοι".upcase", resulting in a string that in capital letters seems
-  to be the same, but in practice is not)
+  [greek, latin] (uppercase characters only) due to user's mistake.
 email: dimxer@hotmail.com
 executables: []
 extensions: []
@@ -132,7 +128,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
-      version: '0'
+      version: 2.4.0
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="