debilinguifier 0.1.0 → 1.0.0
- checksums.yaml +4 -4
- data/LICENSE.txt +3 -0
- data/README.rdoc +33 -5
- data/Rakefile +5 -1
- data/VERSION +1 -1
- data/debilinguifier.gemspec +5 -4
- data/lib/debilinguifier.rb +22 -8
- data/spec/debilinguifier_spec.rb +18 -0
- metadata +4 -8
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: f5b8ba510a26476e167291f6eaa489f4e7afe617
+  data.tar.gz: 0c979ecf4e2fa145179d17b011597d1ef8c3995d
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 7ceff85e93f06d3f4281aad75abe644bcdad56f286c334652d9bbb8b4ab560f76c24ddca210bbde4a7c012ea795d928a5d8fb42958ddbc7ae8ccbbab25a5e53a
+  data.tar.gz: 06c304360041ea89407fbba9c5f43556484c31ffefdb83020a6301f5604c42e328e4c7a14102d15610841b7ff80c57c20225bc8a016a05b957f33f30d54815d1
data/LICENSE.txt
CHANGED
@@ -18,3 +18,6 @@ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
 LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
 OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
 WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
+IF YOU ARE READING TO THIS LINE YOU ARE AS CRAZY AS I AM. USE THE
+CODE IN HERE AS YOU PLEASE. IT IS NOT LIKE I REINVENTED THE WHEEL.
data/README.rdoc
CHANGED
@@ -1,25 +1,53 @@
 = debilinguifier
 
+<em>Keep in mind that this</em> <b>only</b> <em>works with uppercase characters. For clarity and better readability, due to the nature of what we are trying to accomplish, most of the examples are in lowercase. Always consider the uppercase equivalent of any such string (in ruby terms: </em> <tt>string.upcase</tt> <em>)</em>
+
+=== Explanation (The short version):
 This is a module to help me sanitize an already populated db. The db contains company names and product descriptions in capital letters. Users populating the db had been careless enough to allow both [greek, latin] characters to be used in a single word or phrase, making it difficult to alphabetically sort them or search for them.
 
 The purpose of this gem is to help me import the data into a new db (for a completely different app) in a more deterministic way.
 
+=== Explanation (The long version):
 
-
+debilinguifier (in greek: αποδιγλωσσοποιητής): A word that exists in neither English nor Greek and attempts to describe the following behaviour:
 
-
+Latin and greek capital letters have confused the users of an existing system, by their identical looks. The users have -over the years- manually populated a database with entries, which are lacking consistency due to those looks. By convention, only capital letters are used in the system, and as a result, one can find entries such as the following (please consider the following in uppercase!): αlpha, vιτα, γama and so on. In every one of these cases we have come across, the solution is relatively simple:
+1. The phrase (or word) is already written in greek-only or latin-only characters. In this case return it as-is.
+2. The phrase (or word) is written using characters from both charsets, but can be written using only one (or even both) charsets. In this case, replace the similar looking characters and return the result (for example, if it contains characters like 'Φ' XOR 'C', which have no equivalent in the other charset, but the rest of the characters look the same in both charsets, return it in the 'correct' charset). If it can be returned in both charsets (e.g. αυto), return it in the greek charset (in this example αυτο). <em>This is actually cases 2 and 3 in our code.</em> And here comes the tricky part (there was no such example in our case, but "what if...?"!):
+3. The word (or phrase) contains characters from both charsets, that have no equivalent in the other charset (eg. contains both 'Ψ' and 'C'). If it is a phrase like 'cv φωta' there is no real problem: you simply split the phrase, apply the above rules to each word and end up with 'cv φωτα'. And everybody is happy.
+4. But what if a word is like 'c3ψima'? What kind of query would end up finding this entry in the db? (By the way, the whole idea is that queries will be executed in an AJAX way). The solution was to make our dbl (abbr. for de-bi-linguifier) return it using a bias, which by default is 'greek': it will return 'c3ψιμα'. (You can also choose to use a 'latin' bias, or set it to anything other than that to return the initial word as-is). <em> 3 and 4 correspond to case 4 in our code.</em>
 
-
+<em>As you may have figured out, the 4th case above attempts to solve a problem that currently does not exist and although it (probably) does not fail miserably, it is outside the scope of our specifications. As such, it only attempts to provide a possible solution for a problem that will probably never appear: it is nearly impossible to find a reason to write a word using simultaneously characters from both charsets on purpose. In any case, if such a problem comes up and the implemented solution is not satisfying, we will revise the code. </em>
+
+The above 4 cases do not correspond to the 4 cases in the module
+
+What we are accomplishing: we can now search in our new db for something that looked like 'ATIMA', but was written as 'aτιμa'.upcase. To be able to successfully query the db we have to check a couple of things before each query:
+* If the phrase can be written in both greek and latin charsets (in other words, can_write_only_greek?(input) AND can_write_only_latin?(input) returns true); then the query must be a union of the results of two subqueries: one with the result of return_in_greek(input) and one with the result of return_in_latin(input).
+* If the phrase can be written in only one of the two charsets (in other words, can_write_only_greek?(input) XOR can_write_only_latin?(input) returns true); just run the phrase through dbl (just in case the user is using mixed charset) and you are good to go with your query.
+* If the phrase cannot be written with a single charset (in other words can_write_only_greek?(input) OR can_write_only_latin?(input) returns false): Run the phrase through dbl with the same bias as the one used when importing the data!
 
+<em>(Note to myself: I will implement the above functionality in a method later on, when it will be needed. For the time being, I have only documented how this will be done)</em>
 
+---
+=== Notes:
+
+* Ruby version 2.4.0 is the minimum required due to: {String supports Unicode case mappings}[https://bugs.ruby-lang.org/issues/10085] found in {Ruby 2.4.0 release announcement}[https://www.ruby-lang.org/en/news/2016/12/25/ruby-2-4-0-released/]
+
+* To use this gem, just add it (require 'debilinguifier') and call: DeBiLinguifier.dbl(your_string_to_de-bi-linguified)
+* Please note that this only works with capital and detoned characters.
+* This is my very first gem and I am very proud of it!
+* I used {Juwelier}[https://github.com/flajann2/juwelier] to create it.
 
-
+---
 
+To install it, just run: <tt>gem install debilinguifier</tt>
 
+To add it to your project: <tt>require 'debilinguifier'</tt>
 
+To use it: <tt>DeBiLinguifier.dbl(<em>your_string</em>)</tt> <em>It will return you the dbl'ed string</em>
 
 
-
+=== Contributing to debilinguifier
 
 * Check out the latest master to make sure the feature hasn't been implemented or the bug hasn't been fixed yet.
 * Check out the issue tracker to make sure someone already hasn't requested it and/or contributed it.
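The three query rules listed in the README hunk above are, per the author's own note, documented but not yet implemented. As a rough illustration only, the sketch below plans the search terms for an input string. The lookalike table, the helper names (greekify, latinify, normalize_word, search_terms) and the space-only word splitting are all assumptions made for this example; the gem's real can_write_only_greek?, return_in_greek and related methods are not shown in this diff and may behave differently.

```ruby
# Sketch only. The lookalike table and every helper name here are assumptions
# made for illustration; they are not taken from the gem's source.
GREEK_TO_LATIN = {
  "\u0391" => 'A', "\u0392" => 'B', "\u0395" => 'E', "\u0396" => 'Z',
  "\u0397" => 'H', "\u0399" => 'I', "\u039A" => 'K', "\u039C" => 'M',
  "\u039D" => 'N', "\u039F" => 'O', "\u03A1" => 'P', "\u03A4" => 'T',
  "\u03A5" => 'Y', "\u03A7" => 'X'
}.freeze
LATIN_TO_GREEK = GREEK_TO_LATIN.invert.freeze

GREEK_LETTER = /[\u0391-\u03A9]/ # Greek capitals alpha..omega
LATIN_LETTER = /[A-Z]/

# Replace every Latin capital that has a Greek lookalike.
def greekify(text)
  text.gsub(LATIN_LETTER) { |ch| LATIN_TO_GREEK.fetch(ch, ch) }
end

# Replace every Greek capital that has a Latin lookalike.
def latinify(text)
  text.gsub(GREEK_LETTER) { |ch| GREEK_TO_LATIN.fetch(ch, ch) }
end

def greek_only_possible?(text)
  !greekify(text).match?(LATIN_LETTER)
end

def latin_only_possible?(text)
  !latinify(text).match?(GREEK_LETTER)
end

# Simplified stand-in for running a single word through dbl.
def normalize_word(word, bias)
  # Already in a single charset: return as-is.
  return word unless word.match?(GREEK_LETTER) && word.match?(LATIN_LETTER)
  return greekify(word) if greek_only_possible?(word)
  return latinify(word) if latin_only_possible?(word)

  case bias
  when 'greek' then greekify(word)
  when 'latin' then latinify(word)
  else word
  end
end

# Returns the list of strings the db should be searched for.
def search_terms(input, bias = 'greek')
  greek_ok = greek_only_possible?(input)
  latin_ok = latin_only_possible?(input)

  if greek_ok && latin_ok
    # Both spellings are possible: query for the union of the two.
    [greekify(input), latinify(input)].uniq
  elsif greek_ok
    [greekify(input)]
  elsif latin_ok
    [latinify(input)]
  else
    # No single charset works: normalize word by word with the import bias.
    [input.split(/(?<= )/).map { |word| normalize_word(word, bias) }.join]
  end
end
```

With this sketch, search_terms('aτιμa'.upcase) yields two terms (the all-Greek and the all-Latin spelling), while an input such as 'fψaι'.upcase yields a single term that depends on the bias, which is why the same bias has to be used at import time and at query time.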
data/Rakefile
CHANGED
@@ -11,6 +11,8 @@ rescue Bundler::BundlerError => e
 end
 require 'rake'
 
+
+
 require 'juwelier'
 Juwelier::Tasks.new do |gem|
   # gem is a Gem::Specification... see http://guides.rubygems.org/specification-reference/ for more options
@@ -18,9 +20,11 @@ Juwelier::Tasks.new do |gem|
   gem.homepage = "http://github.com/apapamichalis/debilinguifier"
   gem.license = "MIT"
   gem.summary = %Q{A [greek, latin] debilinguifier}
-  gem.description = %Q{The purpose of this gem is to return a phrase written using two charsets
+  gem.description = %Q{The purpose of this gem is to return a phrase written using two charsets [greek, latin] (uppercase characters only) due to user's mistake.}
   gem.email = "dimxer@hotmail.com"
   gem.authors = ["apapamichalis"]
+  # This gem requires Ruby 2.4.0 or greater (Unicode case mappings).
+  gem.required_ruby_version = '>= 2.4.0'
 
   # dependencies defined in Gemfile
 end
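The required_ruby_version constraint added in the Rakefile hunk above matches the README note about Unicode case mappings. The following minimal sketch (not part of the gem) shows why this matters: from Ruby 2.4 on, String#upcase also upcases Greek letters, so a mixed-charset string can look identical to its all-Latin counterpart while comparing as different.

```ruby
# Since Ruby 2.4, String#upcase applies Unicode case mappings,
# so Greek lowercase letters are upcased as well.
mixed = "a\u03C4\u03B9\u03BCa".upcase # Latin a, Greek tau/iota/mu, Latin a
latin = 'atima'.upcase

puts mixed                            # renders like ATIMA
puts mixed == latin                   # false: the middle letters are Greek
puts mixed.codepoints.map { |cp| format('U+%04X', cp) }.join(' ')
```

The code point listing makes the difference visible even though both strings render the same on screen.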
data/VERSION
CHANGED
@@ -1 +1 @@
|
|
1
|
-
|
1
|
+
1.0.0
|
data/debilinguifier.gemspec
CHANGED
@@ -2,17 +2,17 @@
 # DO NOT EDIT THIS FILE DIRECTLY
 # Instead, edit Juwelier::Tasks in Rakefile, and run 'rake gemspec'
 # -*- encoding: utf-8 -*-
-# stub: debilinguifier
+# stub: debilinguifier 1.0.0 ruby lib
 
 Gem::Specification.new do |s|
   s.name = "debilinguifier".freeze
-  s.version = "
+  s.version = "1.0.0"
 
   s.required_rubygems_version = Gem::Requirement.new(">= 0".freeze) if s.respond_to? :required_rubygems_version=
   s.require_paths = ["lib".freeze]
   s.authors = ["apapamichalis".freeze]
-  s.date = "2017-03
-  s.description = "The purpose of this gem is to return a phrase written using two charsets
+  s.date = "2017-04-03"
+  s.description = "The purpose of this gem is to return a phrase written using two charsets [greek, latin] (uppercase characters only) due to user's mistake.".freeze
   s.email = "dimxer@hotmail.com".freeze
   s.extra_rdoc_files = [
     "LICENSE.txt",
@@ -35,6 +35,7 @@ Gem::Specification.new do |s|
   ]
   s.homepage = "http://github.com/apapamichalis/debilinguifier".freeze
   s.licenses = ["MIT".freeze]
+  s.required_ruby_version = Gem::Requirement.new(">= 2.4.0".freeze)
   s.rubygems_version = "2.6.8".freeze
   s.summary = "A [greek, latin] debilinguifier".freeze
 
data/lib/debilinguifier.rb
CHANGED
@@ -15,6 +15,7 @@ module DeBiLinguifier
   GREEK_ALPHABET_PLUS_SYMBOLS = Regexp.new("(^[Α-Ω#{SYMBOLS}]+)+$").freeze
 
 
+  ##
   # Only works with latin and greek charsets.
   # An input phrase can only be one of five things:
   # 1) Already only in greek or only in latin charset.
@@ -22,14 +23,19 @@ module DeBiLinguifier
   # 3) Written in a mixed charset, but can be written with just the latin charset.
   # 4) Written in a mixed charset, but cannot be written with only one of the [greek, latin] charsets.
   # In this case we split the phrase into words and apply the above rules to each word separately.
-  # If case 4 applies to a single word,
-  #
+  # If case 4 applies to a single word, then we have to return it greek-ified or latin-ified.
+  # This way we will be able to produce SQL queries in a more deterministic way.
+  # (Actually, when searching for a phrase that has been processed by our dbl before writing to the db,
+  # we will also have to process through dbl the phrase we are looking for before querying the db).
+  #
+  # 5) Written in a mixed charset, but can be written either with just the greek charset or just the latin charset
+  # (greek bias is the default and only behavior in this case)
   #
   # Note: We are deliberately ignoring case 5, as it is of no use at the moment as a separate case.
   # It is actually the intersection of cases 2 and 3. Using case 2 instead.
   # @params input [String] the string we want to de-bi-linguify (!)
   # @return [String] the de-bi-linguized string
-  def dbl(input)
+  def dbl(input, bias='greek')
     if(is_greek_only?(input) || is_latin_only?(input)) # Case 1
       input
     elsif(can_write_only_greek?(input)) # Case 2
@@ -37,7 +43,7 @@ module DeBiLinguifier
     elsif(can_write_only_latin?(input)) # Case 3
       return_in_latin(input)
     else # Case 4
-      return_in_mixed_charset(input)
+      return_in_mixed_charset(input, bias)
     end
   end
 
@@ -72,13 +78,21 @@ module DeBiLinguifier
   end
 
   # Return the phrase using both charsets
-  def return_in_mixed_charset(input)
+  def return_in_mixed_charset(input, bias)
     # Split the phrase in words and recursively try to return each word in the "correct" charset
-    # If that is not possible (e.g. a word contains both "Φ" and "C",
+    # If that is not possible (e.g. a word contains both "Φ" and "C"), the word must either be greek-ified (default)
+    # or latin-ified. The reason for this is that we will be able to do SQL queries, as long as the word - or phrase
+    # we are looking for has been passed through dbl.
     # We first split the input phrase, based on the SYMBOLS delimiters
     words_arr = input.split(/(?<=[#{SYMBOLS}])/)
-    if words_arr.length == 1 # If it was only one word, return it.
-
+    if words_arr.length == 1 # If it was only one word, return it, according to the bias.
+      if bias == 'greek'
+        return return_in_greek(words_arr.join.to_s) # If the bias is 'greek', return the word 'greek-ified'
+      elsif bias == 'latin'
+        return return_in_latin(words_arr.join.to_s) # Else if bias is 'latin' return it 'latinified'.
+      else
+        return (words_arr.join.to_s) # Else return it as-is (not advisable!)
+      end
     else # Else apply dbl to each word we got after splitting input
       words_arr2 =[]
       words_arr.each do |word|
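The split in return_in_mixed_charset above uses a lookbehind, so each delimiter stays attached to the word before it and the processed words can simply be joined back together. The sketch below illustrates that behaviour in isolation. The gem's SYMBOLS constant is not visible in this diff, so SKETCH_SYMBOLS and split_keeping_delimiters are hypothetical names with an assumed delimiter set.

```ruby
# Assumption: the gem's SYMBOLS constant is not shown in this diff, so a
# hypothetical delimiter set (space, dot, comma) is used for illustration.
SKETCH_SYMBOLS = ' .,'

# Split after each delimiter; the lookbehind consumes nothing, so the
# delimiter remains at the end of the preceding chunk.
def split_keeping_delimiters(phrase)
  phrase.split(/(?<=[#{SKETCH_SYMBOLS}])/)
end

phrase = "CV \u03A6\u03A9TA" # 'cv φωta'.upcase
words = split_keeping_delimiters(phrase)

p words                  # => ["CV ", "ΦΩTA"]
p words.join == phrase   # => true
```

Because nothing is dropped by the split, each chunk can be normalized on its own and the results concatenated without having to remember which delimiter separated them.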
data/spec/debilinguifier_spec.rb
CHANGED
@@ -12,7 +12,25 @@ describe DeBiLinguifier do
         expect(DeBiLinguifier.dbl(input.upcase)).to eq(output.upcase)
       end
     end
+  end
+
+  context 'When using the bias' do
+    it 'should work with greek' do
+      expect(DeBiLinguifier.dbl('fψaι'.upcase, 'greek')).to eq('fψαι'.upcase)
+    end
+
+    it 'should work with latin' do
+      expect(DeBiLinguifier.dbl('fψaι'.upcase, 'latin')).to eq('fψai'.upcase)
+    end
 
+    it 'should work with nil' do
+      expect(DeBiLinguifier.dbl('fψaι'.upcase, nil)).to eq('fψaι'.upcase)
+    end
+
+    it 'should work with anything other than ["greek", "latin"] like it was nil' do
+      expect(DeBiLinguifier.dbl('fψaι'.upcase, 'asdf')).to eq('fψaι'.upcase)
+    end
   end
+
 end
 
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: debilinguifier
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version:
|
4
|
+
version: 1.0.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- apapamichalis
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date: 2017-03
|
11
|
+
date: 2017-04-03 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: shoulda
|
@@ -95,11 +95,7 @@ dependencies:
|
|
95
95
|
- !ruby/object:Gem::Version
|
96
96
|
version: '3.5'
|
97
97
|
description: The purpose of this gem is to return a phrase written using two charsets
|
98
|
-
due to user's mistake.
|
99
|
-
populated with such entries and we want to somehow sanitize it. The db contains
|
100
|
-
company and product names in capital letters (e.g. the user might have written "komπολοι".upcase
|
101
|
-
instead of "κομπολοι".upcase", resulting in a string that in capital letters seems
|
102
|
-
to be the same, but in practice is not)
|
98
|
+
[greek, latin] (uppercase characters only) due to user's mistake.
|
103
99
|
email: dimxer@hotmail.com
|
104
100
|
executables: []
|
105
101
|
extensions: []
|
@@ -132,7 +128,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
|
|
132
128
|
requirements:
|
133
129
|
- - ">="
|
134
130
|
- !ruby/object:Gem::Version
|
135
|
-
version:
|
131
|
+
version: 2.4.0
|
136
132
|
required_rubygems_version: !ruby/object:Gem::Requirement
|
137
133
|
requirements:
|
138
134
|
- - ">="
|