debilinguifier 0.1.0 → 1.0.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 4b1d6cc0a60c5fce3deb06f4aa276333dafb0856
4
- data.tar.gz: 38b2458b5f352ec40a857c963d37408319325339
3
+ metadata.gz: f5b8ba510a26476e167291f6eaa489f4e7afe617
4
+ data.tar.gz: 0c979ecf4e2fa145179d17b011597d1ef8c3995d
5
5
  SHA512:
6
- metadata.gz: 42933e09d17df8926f2bd567a94758f735b08574922eaf5bd2228af9962fb40ddfaf86126283af23d79019c534568a86c0c6fbc2860749b53afb9ae873078254
7
- data.tar.gz: 2d7512d15a5770cc59dab2e354347bf2388ecc07ae5956c4798705f29ca1c656e76f1f15be4290d6691866b71bf4c8ddc0eeea3419236533f32b2f9272f37f21
6
+ metadata.gz: 7ceff85e93f06d3f4281aad75abe644bcdad56f286c334652d9bbb8b4ab560f76c24ddca210bbde4a7c012ea795d928a5d8fb42958ddbc7ae8ccbbab25a5e53a
7
+ data.tar.gz: 06c304360041ea89407fbba9c5f43556484c31ffefdb83020a6301f5604c42e328e4c7a14102d15610841b7ff80c57c20225bc8a016a05b957f33f30d54815d1
@@ -18,3 +18,6 @@ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
18
  LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
19
  OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
20
  WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
21
+
22
+ IF YOU ARE READING TO THIS LINE YOU ARE AS CRAZY AS I AM. USE THE
23
+ CODE IN HERE AS YOU PLEASE. IT IS NOT LIKE I REINVENTED THE WHEEL.
@@ -1,25 +1,53 @@
1
1
  = debilinguifier
2
2
 
3
+ <em>Keep in mind that this</em> <b>only</b> <em>works with uppercase characters. Because uppercase Greek and Latin letters look alike, most of the examples are written in lowercase for clarity and readability. Always consider the uppercase equivalent of any such string (in Ruby terms: </em> <tt>string.upcase</tt> <em>).</em>
4
+
5
+ === Explanation (The short version):
3
6
  This is a module to help me sanitize an already populated db. The db contains company names and product descriptions in capital letters. Users populating the db had been careless enough to allow both [greek, latin] characters to be used in a single word or phrase, making it difficult to alphabetically sort them or search for them.
4
7
 
5
8
  The purpose of this gem is to help me import the data into a new db (for a completely different app) in a more deterministic way.
6
9
 
10
+ === Explanation (The long version):
7
11
 
8
- * Ruby version 2.4.0 is the minimum required due to [String supports Unicode case mappings] https://www.ruby-lang.org/en/news/2016/12/25/ruby-2-4-0-released/
12
+ debilinguifier (in Greek: αποδιγλωσσοποιητής): a word that exists in neither English nor Greek and attempts to describe the following behaviour:
9
13
 
10
- * To use this gem, just add it (require 'debilinguifier') and call: DeBiLinguifier.dbl(your_string_to_de-bi-linguified)
14
+ Latin and Greek capital letters have confused the users of an existing system because many of them look identical. Over the years, the users have manually populated a database with entries that lack consistency as a result. By convention, only capital letters are used in the system, so one can find entries such as the following (please consider them in uppercase!): αlpha, vιτα, γama and so on. In every case we have come across, the solution is relatively simple:
15
+ 1. The phrase (or word) is already written in greek-only or latin-only characters. In this case return it as-is.
16
+ 2. The phrase (or word) is written using characters from both charsets, but can be written using only one of them (or even either one). In this case, replace the similar-looking characters and return the result. For example, if it contains characters like 'Φ' XOR 'C', which have no equivalent in the other charset, while the rest of its characters look the same in both charsets, return it in the 'correct' charset. If it can be returned in both charsets (e.g. αυto), return it in the Greek charset (in this example αυτο). <em>This is actually cases 2 and 3 in our code.</em> And here comes the tricky part (there was no such example in our case, but "what if...?"!):
17
+ 3. The phrase contains characters from both charsets that have no equivalent in the other charset (e.g. it contains both 'Ψ' and 'C'). If it is a phrase of several words like 'cv φωta' there is no real problem: you simply split the phrase, apply the above rules to each word and end up with 'cv φωτα'. And everybody is happy.
18
+ 4. But what if a single word is like 'c3ψima'? What kind of query would end up finding this entry in the db? (By the way, the whole idea is that queries will be executed in an AJAX way.) The solution was to make our dbl (short for de-bi-linguifier) return it using a bias, which by default is 'greek': it will return 'c3ψιμα'. (You can also choose a 'latin' bias, or set the bias to anything else to get the initial word back as-is.) <em> 3 and 4 correspond to case 4 in our code.</em>
11
19
 
12
- * Please note that this only works with capital and detoned characters.
20
+ <em>As you may have figured out, the 4th case above attempts to solve a problem that currently does not exist and although it (probably) does not fail miserably, it is outside the scope of our specifications. As such, it only attempts to provide a possible solution for a problem that will probably never appear: it is nearly impossible to find a reason to write a word using simultaneously characters from both charsets on purpose. In any case, if such a problem comes up and the implemented solution is not satisfying, we will revise the code. </em>
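The four cases above can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions, not the gem's implementation: the lookalike table, the helper names <tt>normalise_word</tt> and <tt>normalise_phrase</tt>, and the choice of a single space as the only delimiter are all hypothetical.

```ruby
# Assumed table of lookalike capitals (hypothetical, may differ from the gem's).
# Written in lowercase and upcased so the two charsets are easy to tell apart.
LATIN_LOOKALIKES = 'abezhikmnoptyx'.upcase
GREEK_LOOKALIKES = 'αβεζηικμνορτυχ'.upcase

# Hypothetical helper: normalise a single uppercase word.
def normalise_word(word, bias = 'greek')
  # Case 1: the word already uses a single charset, return it as-is.
  return word if word !~ /\p{Latin}/ || word !~ /\p{Greek}/

  greek = word.tr(LATIN_LOOKALIKES, GREEK_LOOKALIKES)
  latin = word.tr(GREEK_LOOKALIKES, LATIN_LOOKALIKES)

  # Case 2: mixed, but expressible in one charset (Greek wins if both work).
  return greek unless greek =~ /\p{Latin}/
  return latin unless latin =~ /\p{Greek}/

  # Case 4: not expressible in one charset, fall back to the bias.
  case bias
  when 'greek' then greek
  when 'latin' then latin
  else word
  end
end

# Case 3: split after each space (keeping it) and normalise word by word.
def normalise_phrase(phrase, bias = 'greek')
  phrase.split(/(?<= )/).map { |word| normalise_word(word, bias) }.join
end

puts normalise_phrase('cv φωta'.upcase)
puts normalise_phrase('c3ψima'.upcase, 'latin')
```

The lookbehind in <tt>split</tt> keeps each delimiter attached to the word before it, so joining the pieces restores the original spacing.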
21
+
22
+ The above 4 cases do not correspond one-to-one to the 4 cases in the module.
23
+
24
+ What we are accomplishing: we can now search in our new db for something that looked like 'ATIMA', but was written as 'aτιμa'.upcase. To be able to successfully query the db we have to check a couple of things before each query:
25
+ * If the phrase can be written in both greek and latin charsets (in other words, can_write_only_greek?(input) AND can_write_only_latin?(input) returns true); then the query must be a union of the results of two subqueries: one with the result of return_in_greek(input) and one with the result of return_in_latin(input).
26
+ * If the phrase can be written in only one of the two charsets (in other words, can_write_only_greek?(input) XOR can_write_only_latin?(input) returns true): just run the phrase through dbl (in case the user typed it in a mixed charset) and you are good to go with your query.
27
+ * If the phrase cannot be written with a single charset (in other words can_write_only_greek?(input) OR can_write_only_latin?(input) returns false): Run the phrase through dbl with the same bias as the one used when importing the data!
13
28
 
29
+ <em>(Note to myself: I will implement the above functionality in a method later on, when it will be needed. For the time being, I have only documented how this will be done)</em>
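One way the query preparation described above might look is sketched below. It is not part of the gem: <tt>search_terms</tt> and the lookalike table are hypothetical, the input is treated as a single uppercase word, and the returned array simply lists the strings that the subqueries would look for.

```ruby
# Assumed table of lookalike capitals (hypothetical, may differ from the gem's).
LATIN_CAPS = 'abezhikmnoptyx'.upcase
GREEK_CAPS = 'αβεζηικμνορτυχ'.upcase

# Hypothetical helper: which strings should a search for `input` look up?
def search_terms(input, bias = 'greek')
  greek = input.tr(LATIN_CAPS, GREEK_CAPS)
  latin = input.tr(GREEK_CAPS, LATIN_CAPS)
  greek_ok = greek !~ /\p{Latin}/ # writable with Greek letters only
  latin_ok = latin !~ /\p{Greek}/ # writable with Latin letters only

  if greek_ok && latin_ok
    [greek, latin]                # union of two subqueries
  elsif greek_ok
    [greek]
  elsif latin_ok
    [latin]
  elsif bias == 'greek'
    [greek]                       # same bias as used when importing
  elsif bias == 'latin'
    [latin]
  else
    [input]
  end
end

p search_terms('aτιμa'.upcase)
p search_terms('φωta'.upcase)
```

The union is needed in the first branch because an entry stored entirely in one charset is kept as-is on import, so both spellings may exist in the db.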
14
30
 
31
+ ---
32
+ === Notes:
33
+
34
+ * Ruby version 2.4.0 is the minimum required due to: {String supports Unicode case mappings}[https://bugs.ruby-lang.org/issues/10085] found in {Ruby 2.4.0 release announcement}[https://www.ruby-lang.org/en/news/2016/12/25/ruby-2-4-0-released/]
35
+
36
+ * To use this gem, just add it (require 'debilinguifier') and call: DeBiLinguifier.dbl(your_string_to_de-bi-linguified)
37
+ * Please note that this only works with capital and detoned characters.
38
+ * This is my very first gem and I am very proud of it!
39
+ * I used {Juwelier}[https://github.com/flajann2/juwelier] to create it.
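The Ruby 2.4.0 requirement in the notes above matters because <tt>String#upcase</tt> only maps non-ASCII letters from that version on. A small check, assuming Ruby 2.4.0 or newer (the helper name <tt>hex_codepoints</tt> is just for illustration), shows why two strings that look the same in capitals still differ:

```ruby
# Hypothetical helper: list the code points of a string in hex.
def hex_codepoints(str)
  str.codepoints.map { |cp| format('U+%04X', cp) }
end

mixed = 'aτιμa'.upcase # Latin a, Greek τ ι μ, Latin a
greek = 'ατιμα'.upcase # Greek letters only

puts mixed == greek
puts hex_codepoints(mixed).join(' ')
puts hex_codepoints(greek).join(' ')
```

On an older Ruby the Greek letters would be left in lowercase by <tt>upcase</tt>, so the comparison above would not exercise the lookalike capitals at all.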
15
40
 
16
- This is my very first gem and I used Juwelier https://github.com/flajann2/juwelier to create it.
41
+ ---
17
42
 
43
+ To install it, just run: <tt>gem install debilinguifier</tt>
18
44
 
45
+ To add it to your project: <tt>require 'debilinguifier'</tt>
19
46
 
47
+ To use it: <tt>DeBiLinguifier.dbl(<em>your_string</em>)</tt> <em>It will return you the dbl'ed string</em>
20
48
 
21
49
 
22
- == Contributing to debilinguifier
50
+ === Contributing to debilinguifier
23
51
 
24
52
  * Check out the latest master to make sure the feature hasn't been implemented or the bug hasn't been fixed yet.
25
53
  * Check out the issue tracker to make sure someone already hasn't requested it and/or contributed it.
data/Rakefile CHANGED
@@ -11,6 +11,8 @@ rescue Bundler::BundlerError => e
11
11
  end
12
12
  require 'rake'
13
13
 
14
+
15
+
14
16
  require 'juwelier'
15
17
  Juwelier::Tasks.new do |gem|
16
18
  # gem is a Gem::Specification... see http://guides.rubygems.org/specification-reference/ for more options
@@ -18,9 +20,11 @@ Juwelier::Tasks.new do |gem|
18
20
  gem.homepage = "http://github.com/apapamichalis/debilinguifier"
19
21
  gem.license = "MIT"
20
22
  gem.summary = %Q{A [greek, latin] debilinguifier}
21
- gem.description = %Q{The purpose of this gem is to return a phrase written using two charsets due to user's mistake. The reason behind this is that we have a db we want to migrate populated with such entries and we want to somehow sanitize it. The db contains company and product names in capital letters (e.g. the user might have written "komπολοι".upcase instead of "κομπολοι".upcase", resulting in a string that in capital letters seems to be the same, but in practice is not)}
23
+ gem.description = %Q{The purpose of this gem is to return a phrase written using two charsets [greek, latin] (uppercase characters only) due to user's mistake.}
22
24
  gem.email = "dimxer@hotmail.com"
23
25
  gem.authors = ["apapamichalis"]
26
+ # This gem requires Ruby 2.4.0 or greater (Unicode case mappings in String)
27
+ gem.required_ruby_version = '>= 2.4.0'
24
28
 
25
29
  # dependencies defined in Gemfile
26
30
  end
data/VERSION CHANGED
@@ -1 +1 @@
1
- 0.1.0
1
+ 1.0.0
@@ -2,17 +2,17 @@
2
2
  # DO NOT EDIT THIS FILE DIRECTLY
3
3
  # Instead, edit Juwelier::Tasks in Rakefile, and run 'rake gemspec'
4
4
  # -*- encoding: utf-8 -*-
5
- # stub: debilinguifier 0.1.0 ruby lib
5
+ # stub: debilinguifier 1.0.0 ruby lib
6
6
 
7
7
  Gem::Specification.new do |s|
8
8
  s.name = "debilinguifier".freeze
9
- s.version = "0.1.0"
9
+ s.version = "1.0.0"
10
10
 
11
11
  s.required_rubygems_version = Gem::Requirement.new(">= 0".freeze) if s.respond_to? :required_rubygems_version=
12
12
  s.require_paths = ["lib".freeze]
13
13
  s.authors = ["apapamichalis".freeze]
14
- s.date = "2017-03-26"
15
- s.description = "The purpose of this gem is to return a phrase written using two charsets due to user's mistake. The reason behind this is that we have a db we want to migrate populated with such entries and we want to somehow sanitize it. The db contains company and product names in capital letters (e.g. the user might have written \"kom\u03C0\u03BF\u03BB\u03BF\u03B9\".upcase instead of \"\u03BA\u03BF\u03BC\u03C0\u03BF\u03BB\u03BF\u03B9\".upcase\", resulting in a string that in capital letters seems to be the same, but in practice is not)".freeze
14
+ s.date = "2017-04-03"
15
+ s.description = "The purpose of this gem is to return a phrase written using two charsets [greek, latin] (uppercase characters only) due to user's mistake.".freeze
16
16
  s.email = "dimxer@hotmail.com".freeze
17
17
  s.extra_rdoc_files = [
18
18
  "LICENSE.txt",
@@ -35,6 +35,7 @@ Gem::Specification.new do |s|
35
35
  ]
36
36
  s.homepage = "http://github.com/apapamichalis/debilinguifier".freeze
37
37
  s.licenses = ["MIT".freeze]
38
+ s.required_ruby_version = Gem::Requirement.new(">= 2.4.0".freeze)
38
39
  s.rubygems_version = "2.6.8".freeze
39
40
  s.summary = "A [greek, latin] debilinguifier".freeze
40
41
 
@@ -15,6 +15,7 @@ module DeBiLinguifier
15
15
  GREEK_ALPHABET_PLUS_SYMBOLS = Regexp.new("(^[Α-Ω#{SYMBOLS}]+)+$").freeze
16
16
 
17
17
 
18
+ ##
18
19
  # Only works with latin and greek charsets.
19
20
  # An input phrase can only be one of five things:
20
21
  # 1) Already only in greek or only in latin charset.
@@ -22,14 +23,19 @@ module DeBiLinguifier
22
23
  # 3) Written in a mixed charset, but can be written with just the latin charset.
23
24
  # 4) Written in a mixed charset, but cannot be written with only one of the [greek, latin] charsets.
24
25
  # In this case we split the phrase into words and apply the above rules to each word seperately.
25
- # If case 4 applies to a single word, there is nothing more we can do for it than return it "as is".
26
- # 5) Written in a mixed charset, but can be written either with just the greek charset or just the latin charset.
26
+ # If case 4 applies to a single word, then we have to return it greek-ified or latin-ified.
27
+ # This way we will be able to produce SQL queries in a more deterministic way.
28
+ # (Actually, when searching for a phrase that has been processed by our dbl before writing to the db,
29
+ # we will also have to process through dbl the phrase we are looking for before querying the db).
30
+ #
31
+ # 5) Written in a mixed charset, but can be written either with just the greek charset or just the latin charset
32
+ # (greek bias is the default and only behavior in this case)
27
33
  #
28
34
  # Note: We are deliberately ignoring case 5, as it is of no use at the moment as a separate case.
29
35
  # It is actually the intersection of cases 2 and 3. Using case 2 instead.
30
36
  # @param input [String] the string we want to de-bi-linguify (!)
31
37
  # @return [String] the de-bi-linguized string
32
- def dbl(input)
38
+ def dbl(input, bias='greek')
33
39
  if(is_greek_only?(input) || is_latin_only?(input)) # Case 1
34
40
  input
35
41
  elsif(can_write_only_greek?(input)) # Case 2
@@ -37,7 +43,7 @@ module DeBiLinguifier
37
43
  elsif(can_write_only_latin?(input)) # Case 3
38
44
  return_in_latin(input)
39
45
  else # Case 4
40
- return_in_mixed_charset(input)
46
+ return_in_mixed_charset(input, bias)
41
47
  end
42
48
  end
43
49
 
@@ -72,13 +78,21 @@ module DeBiLinguifier
72
78
  end
73
79
 
74
80
  # Return the phrase using both charsets
75
- def return_in_mixed_charset(input)
81
+ def return_in_mixed_charset(input, bias)
76
82
  # Split the phrase in words and recursively try to return each word in the "correct" charset
77
- # If that is not possible (e.g. a word contains both "Φ" and "C", return it as it was originally
83
+ # If that is not possible (e.g. a word contains both "Φ" and "C"), the word must either be greek-ified (default)
84
+ # or latin-ified. The reason for this is that we will be able to do SQL queries, as long as the word - or phrase
85
+ # we are looking for has been passed through dbl.
78
86
  # We first split the input phrase, based on the SYMBOLS delimiters
79
87
  words_arr = input.split(/(?<=[#{SYMBOLS}])/)
80
- if words_arr.length == 1 # If it was only one word, return it.
81
- return (words_arr.join.to_s)
88
+ if words_arr.length == 1 # If it was only one word, return it, according to the bias.
89
+ if bias == 'greek'
90
+ return return_in_greek(words_arr.join.to_s) # If the bias is 'greek', return the word 'greek-ified'
91
+ elsif bias == 'latin'
92
+ return return_in_latin(words_arr.join.to_s) # Else if bias is 'latin' return it 'latinified'.
93
+ else
94
+ return (words_arr.join.to_s) # Else return it as-is (not advisable!)
95
+ end
82
96
  else # Else apply dbl to each word we got after splitting input
83
97
  words_arr2 =[]
84
98
  words_arr.each do |word|
@@ -12,7 +12,25 @@ describe DeBiLinguifier do
12
12
  expect(DeBiLinguifier.dbl(input.upcase)).to eq(output.upcase)
13
13
  end
14
14
  end
15
+ end
16
+
17
+ context 'When using the bias' do
18
+ it 'should work with greek' do
19
+ expect(DeBiLinguifier.dbl('fψaι'.upcase, 'greek')).to eq('fψαι'.upcase)
20
+ end
21
+
22
+ it 'should work with latin' do
23
+ expect(DeBiLinguifier.dbl('fψaι'.upcase, 'latin')).to eq('fψai'.upcase)
24
+ end
15
25
 
26
+ it 'should work with nil' do
27
+ expect(DeBiLinguifier.dbl('fψaι'.upcase, nil)).to eq('fψaι'.upcase)
28
+ end
29
+
30
+ it 'should work with anything other than ["greek", "latin"] like it was nil' do
31
+ expect(DeBiLinguifier.dbl('fψaι'.upcase, 'asdf')).to eq('fψaι'.upcase)
32
+ end
16
33
  end
34
+
17
35
  end
18
36
 
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: debilinguifier
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.0
4
+ version: 1.0.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - apapamichalis
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2017-03-26 00:00:00.000000000 Z
11
+ date: 2017-04-03 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: shoulda
@@ -95,11 +95,7 @@ dependencies:
95
95
  - !ruby/object:Gem::Version
96
96
  version: '3.5'
97
97
  description: The purpose of this gem is to return a phrase written using two charsets
98
- due to user's mistake. The reason behind this is that we have a db we want to migrate
99
- populated with such entries and we want to somehow sanitize it. The db contains
100
- company and product names in capital letters (e.g. the user might have written "komπολοι".upcase
101
- instead of "κομπολοι".upcase", resulting in a string that in capital letters seems
102
- to be the same, but in practice is not)
98
+ [greek, latin] (uppercase characters only) due to user's mistake.
103
99
  email: dimxer@hotmail.com
104
100
  executables: []
105
101
  extensions: []
@@ -132,7 +128,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
132
128
  requirements:
133
129
  - - ">="
134
130
  - !ruby/object:Gem::Version
135
- version: '0'
131
+ version: 2.4.0
136
132
  required_rubygems_version: !ruby/object:Gem::Requirement
137
133
  requirements:
138
134
  - - ">="