turkish_stemmer 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: b12eb0732e274df49f05ff4158522ccfd89f4a7e
4
+ data.tar.gz: 9b0723baf1eb0fb864c82bd17cec6e8966e4fde5
5
+ SHA512:
6
+ metadata.gz: de27efe76f051fd747cf0ee9b59fb47ed61e3664d8b20c234adc00d194b67419a16de3a2eb5c6d5f8135a4cb659f7b59d5114a6e719cb507bac574a2bca08a3d
7
+ data.tar.gz: 3004837f529b305bad6e309a4989603b146aec667059e6e1fd0a281b0ed3e8ee38fe6d6ee13d96c25e9d2343dffdd2802b11bb52ba5940fa1d8b7062969c08ea
data/.gitignore ADDED
@@ -0,0 +1,18 @@
1
+ *.gem
2
+ *.rbc
3
+ .bundle
4
+ .config
5
+ .yardoc
6
+ Gemfile.lock
7
+ InstalledFiles
8
+ _yardoc
9
+ coverage
10
+ doc/
11
+ lib/bundler/man
12
+ pkg
13
+ rdoc
14
+ spec/reports
15
+ test/tmp
16
+ test/version_tmp
17
+ tmp
18
+ bin/
data/.rspec ADDED
@@ -0,0 +1,2 @@
1
+ --color
2
+ --format doc
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in turkish_stemmer.gemspec
4
+ gemspec
data/LICENSE.txt ADDED
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2014 Skroutz SA
2
+
3
+ MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,282 @@
1
+ # TurkishStemmer
2
+
3
+ Stemmer algorithm for the Turkish language.
4
+
5
+ ## Introduction to Turkish language morphology
6
+
7
+ > Turkish is an agglutinative language and has a very rich morphological
8
+ structure. In Turkish, you can form many different words from a single stem by
9
+ appending a sequence of suffixes. For example, the word "doktoruymuşsunuz"
10
+ means "You had been the doctor of him". The stem of the word is "doktor" and
11
+ it takes three different suffixes -sU, -ymUş, and -sUnUz.
12
+
13
+ From "Snowball Description":
14
+
15
+ > Words are usually composed of a stem and of at least two or three affixes
16
+ appended to it.
17
+
18
+ > We can analyze noun suffixes in Turkish in two groups. Noun suffixes (eg.
19
+ "doktor-um" meaning "my doctor") and nominal verb suffixes (eg. "doktor-dur"
20
+ meaning ‘is a doctor’). The words ending with nominal verb suffixes can be
21
+ used as verbs in sentences. There are over thirty different suffixes
22
+ classified in these two general groups of suffixes.
23
+
24
+ > In Turkish, the suffixes are affixed to the stem according to definite
25
+ ordering rules.
26
+
27
+ From "An affix stripping morphological analyzer for Turkish" paper:
28
+
29
+ > Turkish has a special place within the natural languages not only being a
30
+ fully concatenative language but also having the suffixes as the only affix
31
+ type. Another feature of the language is that, someone who knows Turkish can
32
+ easily analyze a word even if he/she does not know its stem.
33
+
34
+ > The phonological rules of Turkish are significant factors that influence
35
+ this feature.
36
+ Ex: (any word)lerim => (any word)-ler-im
37
+ "ler" plural suffix, "im" 1st singular person possessive.
38
+
39
+ ### Rules
40
+
41
+ 1. The only affix type in Turkish is the suffix.
42
+
43
+ 2. A plural suffix cannot follow a possessive suffix.
44
+
45
+ 3. A suffix in Turkish can have multiple allomorphs in order to provide sound
46
+ harmony in the word to which it is affixed.
47
+
48
+ 4. In Turkish each vowel indicates a distinct syllable.
49
+
50
+ 5. In Turkish, single syllable words are mostly the stem itself.
51
+
52
+ 6. If a word has nominal __verb__ suffixes, they always appear at the end of
53
+ the word. They follow __noun__ suffixes or the stem itself in the absence
54
+ of noun suffixes.
55
+
56
+ 7. In Turkish, “-lAr” suffix can be used both as a nominal verb suffix (third
57
+ person plural present tense) and as a noun suffix (plural inflection).
58
+
59
+ 8. In Turkish, words do not end with consonants 'b', 'c', 'd', and 'ğ'.
60
+ However, when a suffix starting with a vowel is affixed to a word ending
61
+ with 'p', 'ç', 't' or 'k', the last consonant is transformed into 'b', 'c',
62
+ 'd', or 'ğ' respectively. The postlude routine transforms last consonants
63
+ 'b', 'c', 'd', or 'ğ' back to 'p', 'ç', 't' or 'k', respectively, after
64
+ stemming is complete.
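
Rules 4 and 8 can be sketched in a few lines of Ruby. This is only an illustration: `syllable_count` and `restore_last_consonant` are hypothetical helper names, not the gem's API, and the sketch assumes lowercase input.

```ruby
# Hypothetical helpers illustrating rules 4 and 8; not part of the gem's API.
VOWELS = %w[a e ı i o ö u ü].freeze

# Rule 8 (postlude): map a voiced final consonant back to its voiceless form.
VOICED_TO_VOICELESS = { "b" => "p", "c" => "ç", "d" => "t", "ğ" => "k" }.freeze

# Rule 4: every vowel marks one syllable, so counting vowels counts syllables.
def syllable_count(word)
  word.each_char.count { |char| VOWELS.include?(char) }
end

# Replace the last character only when it is one of 'b', 'c', 'd', 'ğ'.
def restore_last_consonant(stem)
  replacement = VOICED_TO_VOICELESS[stem[-1]]
  replacement ? stem[0...-1] + replacement : stem
end

syllable_count("kale")           # => 2
restore_last_consonant("kitab")  # => "kitap"
restore_last_consonant("çocuğ")  # => "çocuk"
restore_last_consonant("kale")   # => "kale"
```

A syllable count of one is a cheap signal, per rule 5, that a candidate is probably already a stem.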
65
+
66
+ ### Suffix Classes
67
+
68
+ Class | Type
69
+ -----------------------------|----------------
70
+ Nominal verb suffixes | Inflectional
71
+ Derivational suffixes | Derivational
72
+ Noun suffixes | Inflectional
73
+ Tense & person verb suffixes | Inflectional
74
+ Verb suffixes | Inflectional
75
+
76
+ ### Suffix allomorphs
77
+
78
+ Suffix allomorphs are used to create a good sound harmony. They do not change
79
+ the meaning of the word. If a suffix has a capital letter then it has an
80
+ allomorph. If a suffix has a letter in parentheses then it can be omitted.
81
+ Possible allomorphs are given below:
82
+
83
+ Letter | Allomorph
84
+ -------|------------
85
+ U | ı,i,u,ü
86
+ C | c,ç
87
+ A | a,e
88
+ D | d,t
89
+ I | ı,i
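
One way to read the notation above is as a compact regular expression: a capital letter becomes a character class of its allomorphs and a parenthesised letter becomes optional. The sketch below shows that reading. `suffix_regex` is a hypothetical helper, not the gem's API, and it assumes that `I` stands for 'ı' or 'i'.

```ruby
# Hypothetical helper, not the gem's API: compile suffix notation to a Regexp.
# Assumption: the letter I stands for the allomorphs 'ı' and 'i'.
ALLOMORPHS = {
  "U" => "[ıiuü]",
  "C" => "[cç]",
  "A" => "[ae]",
  "D" => "[dt]",
  "I" => "[ıi]"
}.freeze

def suffix_regex(notation)
  body = notation.sub(/\A[-–]/, "") # drop the leading hyphen or en dash
  pattern = body.gsub(/\((.)\)|(.)/) do
    optional = $1
    char = optional || $2
    expanded = ALLOMORPHS.fetch(char) { Regexp.escape(char) }
    optional ? "#{expanded}?" : expanded # "(y)" becomes "y?"
  end
  Regexp.new("#{pattern}\\z") # anchor at the end of the word
end

suffix_regex("-(y)Um").source  # => "y?[ıiuü]m\\z"
```

Such a pattern only detects a candidate suffix; it does not check vowel harmony, so a match still has to be validated before anything is stripped.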
90
+
91
+ ### Nominal Verb Suffixes
92
+
93
+ No. | Suffix
94
+ ----|------------------
95
+ 1 | –(y)Um
96
+ 2 | –sUn
97
+ 3 | –(y)Uz
98
+ 4 | –sUnUz
99
+ 5 | –lAr
100
+ 6 | –m
101
+ 7 | –n
102
+ 8 | –k
103
+ 9 | –nUz
104
+ 10 | –DUr
105
+ 11 | –cAsInA
106
+ 12 | –(y)DU
107
+ 13 | –(y)sA
108
+ 14 | –(y)mUş
109
+ 15 | –(y)ken
110
+
111
+ Suffix transition ordering for nominal verbs is described in reference [5].
112
+
113
+ ### Noun Suffixes
114
+
115
+ No. | Suffixes
116
+ ----|-------------
117
+ 1 | –lAr
118
+ 2 | –(U)m
119
+ 3 | –(U)mUz
120
+ 4 | –(U)n
121
+ 5 | –(U)nUz
122
+ 6 | –(s)U
123
+ 7 | –lArI
124
+ 8 | –(y)U
125
+ 9 | –nU
126
+ 10 | –(n)Un
127
+ 11 | –(y)A
128
+ 12 | –nA
129
+ 13 | –DA
130
+ 14 | –nDA
131
+ 15 | –DAn
132
+ 16 | –nDAn
133
+ 17 | –(y)lA
134
+ 18 | –ki
135
+ 19 | –(n)cA
136
+
137
+ Suffix transition ordering for nouns is described in reference [5].
138
+
139
+ ### Derivational Suffixes
140
+
141
+ No. | Suffixes
142
+ ----|----------
143
+ 1 | –lUk
144
+ 2 | –CU
145
+ 3 | –CUk
146
+ 4 | –lAş
147
+ 5 | –lA
148
+ 6 | –lAn
149
+ 7 | –CA
150
+ 8 | –lU
151
+ 9 | –sUz
152
+
153
+ Initially, we will handle only a small subset of the above suffixes which are
154
+ more common in our domain.
155
+
156
+ ### Vowel Harmony
157
+
158
+ This routine checks whether __the last two__ vowels of the word obey vowel
159
+ harmony rules. A brief description of Turkish vowel harmony follows.
160
+
161
+ Turkish vowel harmony is a two dimensional vowel harmony system, where vowels
162
+ are characterised by two features named frontness and roundness. There are
163
+ vowel harmony rules for each feature.
164
+
165
+ 1. Vowel harmony rule for frontness: Vowels in Turkish are grouped into two
166
+ according to where they are produced. Front produced vowels are formed at
167
+ the front of the mouth ('e', 'i', 'ö', 'ü') and back produced vowels are
168
+ produced nearer to the throat ('a', 'ı', 'o', 'u'). According to the vowel
169
+ harmony rule, words cannot contain both front and back vowels. This is one
170
+ of the reasons why suffixes containing vowels can take different forms to
171
+ obey vowel harmony.
172
+
173
+ 2. Vowel harmony rule for roundness: Vowels in Turkish are grouped into two
174
+ according to whether the lips are rounded while producing them. 'o', 'ö', 'u' and
175
+ 'ü' are rounded vowels whereas 'a', 'e', 'ı' and 'i' are unrounded.
176
+ According to the vowel harmony rules, if the vowel of a syllable is
177
+ unrounded, the following vowel is unrounded as well. If the vowel of a
178
+ syllable is rounded, the following vowels are 'a', 'e', 'u' or 'ü'.
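
The two rules above can be sketched as a single check on the last two vowels of a word. `vowel_harmony?` is a hypothetical name used for illustration, not necessarily how the gem implements the routine, and the sketch assumes lowercase input.

```ruby
# Hypothetical sketch of the vowel harmony check; not the gem's actual code.
VOWELS        = %w[a e ı i o ö u ü].freeze
FRONT_VOWELS  = %w[e i ö ü].freeze
ROUNDED       = %w[o ö u ü].freeze
AFTER_ROUNDED = %w[a e u ü].freeze # vowels allowed after a rounded vowel

def vowel_harmony?(word)
  vowels = word.each_char.select { |char| VOWELS.include?(char) }
  return true if vowels.size < 2 # nothing to compare

  first, second = vowels.last(2)

  # Frontness: both vowels must be front, or both must be back.
  frontness = FRONT_VOWELS.include?(first) == FRONT_VOWELS.include?(second)

  # Roundness: unrounded is followed by unrounded; rounded by a, e, u or ü.
  roundness = if ROUNDED.include?(first)
                AFTER_ROUNDED.include?(second)
              else
                !ROUNDED.include?(second)
              end

  frontness && roundness
end

vowel_harmony?("gözlük")  # => true
vowel_harmony?("kaleler") # => true
vowel_harmony?("lityum")  # => false ('i' is front, 'u' is back)
```

Loanwords often fail the check, which is why it is useful for spotting stems that should not be stripped any further.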
179
+
180
+ ### Last consonant
181
+
182
+ Another interesting case in detecting suffixes in Turkish is that, for some
183
+ suffixes, if the word ends with a vowel, a consonant is inserted between the
184
+ rest of the word and the suffix. These merging consonants can be 'y', 'n' or
185
+ 's'. When a merging consonant can be inserted before the suffix, the
186
+ representation of the suffix starts with the optional consonant surrounded by
187
+ parentheses (eg. –(y)Um, -(n)cA). For these kinds of suffixes, if existence of
188
+ a merging consonant is considered, the candidate stem is checked whether it
189
+ ends with a vowel.
190
+
191
+ If there is no 'y' consonant before the suffix, only the real part of the
192
+ suffix (eg. -Um) is marked for stemming. If there is a 'y' consonant and it is
193
+ preceded by a vowel, 'y' is treated as a merging consonant and both 'y' and
194
+ the candidate suffix (eg. -Um) are marked for stemming. If there is a consonant
195
+ just before 'y', the decision is that the consonant 'y' and the candidate
196
+ suffix are really a part of the stem. In such a case, the cursor is not advanced
197
+ to prevent over-stemming. The last case can occur especially when the stem
198
+ originates from another language like in 'lityum' (meaning the element
199
+ Lithium). If the check for vowel harmony was not made, the word would be
200
+ stemmed to 'lit', for '–(y)Um' would be treated as a suffix affixed to it. But
201
+ according to morphological rules of Turkish, the final word would be 'litim',
202
+ not 'lityum' if 'lit' were really the stem of the word and the suffix '–(y)Um'
203
+ were affixed to it. So detecting 'lit' as the stem of the word would be
204
+ over-stemming.
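
The three cases for the '–(y)Um' suffix can be sketched as follows. `strip_y_um` is a hypothetical helper written for this illustration only: it handles a single suffix, assumes lowercase input, and skips the vowel harmony check, so it is a simplification of the routine described above.

```ruby
# Hypothetical sketch of merging-consonant handling for –(y)Um only.
VOWELS = %w[a e ı i o ö u ü].freeze

def strip_y_um(word)
  match = word.match(/[ıiuü]m\z/)
  return word unless match # no -Um ending at all

  rest = word[0...match.begin(0)]
  return rest unless rest.end_with?("y") # case 1: no 'y', strip only -Um

  before_y = rest[-2]
  if before_y && VOWELS.include?(before_y)
    rest[0...-1] # case 2: vowel + 'y', so 'y' is a merging consonant
  else
    word         # case 3: consonant + 'y', so the ending belongs to the stem
  end
end

strip_y_um("doktorum")   # => "doktor"
strip_y_um("öğrenciyim") # => "öğrenci"
strip_y_um("lityum")     # => "lityum"
```

Returning the word unchanged in the third case corresponds to not advancing the cursor.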
205
+
206
+ ### Merging Vowel
207
+
208
+ Similar to merging consonants, there are merging vowels for some suffixes
209
+ starting with consonants. They can be preceded by merging vowels like in the
210
+ '-(U)mUz' suffix when they are affixed to a stem ending with a consonant. In such a
211
+ case, a U vowel ('ı', 'i', 'u' or 'ü' depending on vowel harmony) is inserted
212
+ between the stem and real suffix (e.g. '-mUz') for ease of pronunciation.
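
The merging vowel is easiest to see in the generation direction, which is the reverse of what a stemmer does. The sketch below attaches '-(U)mUz' to a stem. `add_muz` and `HARMONIC_U` are hypothetical names, the input is assumed to be lowercase, and consonant softening (rule 8) is deliberately ignored.

```ruby
# Hypothetical sketch, not the gem's API: attach -(U)mUz to a stem.
# Maps the stem's last vowel to the harmonising form of U.
HARMONIC_U = {
  "a" => "ı", "ı" => "ı",
  "e" => "i", "i" => "i",
  "o" => "u", "u" => "u",
  "ö" => "ü", "ü" => "ü"
}.freeze

def add_muz(stem)
  last_vowel = stem.each_char.select { |char| HARMONIC_U.key?(char) }.last
  return stem unless last_vowel # no vowel to harmonise with

  u = HARMONIC_U.fetch(last_vowel)
  # The merging vowel is only needed when the stem ends with a consonant.
  merging_vowel = HARMONIC_U.key?(stem[-1]) ? "" : u
  "#{stem}#{merging_vowel}m#{u}z"
end

add_muz("kale") # => "kalemiz"
add_muz("okul") # => "okulumuz"
add_muz("göz")  # => "gözümüz"
add_muz("kız")  # => "kızımız"
```

When stemming, the same rule runs backwards: after '-mUz' is removed, a remaining U vowel that follows a consonant is a candidate merging vowel.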
213
+
214
+ ### Some examples
215
+
216
+ Word / Analysis | Meaning / Stem
217
+ ------------------------------ |--------------------------------
218
+ Kalelerimizdekilerden | From the ones at one of our castles
219
+ Kale-lAr-UmUz-DA-ki-lAr-DAn | Kale
220
+ Çocuğuymuşumcasına | As if I were her child
221
+ Çocuk-(s)U-(y)mUş-(y)Um-cAsInA | Çocuk
222
+ Kedileriyle | With their cats
223
+ Kedi-lAr-(s)U-(y)lA | Kedi
224
+ Çocuklarımmış | Someone told me that they were my children
225
+ Çocuk-lAr-(U)m-(y)mUş | Çocuk
226
+ Kitabımızdı | It was our book
227
+ Kitap-UmUz-(y)DU | Kitap
228
+
229
+ ## Future Work
230
+
231
+ * Add more verb suffixes.
232
+ * Add more derivational suffixes.
233
+
234
+ ## References
235
+
236
+ 1. [Turkish Stemmer used in Lucene](http://snowball.tartarus.org/algorithms/turkish/stemmer.html)
237
+ 2. [Java Implementation](http://snowball.tartarus.org/archives/snowball-discuss/att-0875/02-TurkishStemmer.java)
238
+ 3. [Snowball Implementation](http://snowball.tartarus.org/algorithms/turkish/stem_Unicode.sbl)
239
+ 4. [Snowball Description](http://snowball.tartarus.org/algorithms/turkish/accompanying_paper.doc)
240
+ 5. [An affix stripping morphological analyzer for Turkish](http://web.itu.edu.tr/~gulsenc/papers/iasted.pdf)
241
+ 6. [Lead Generation](https://en.wikipedia.org/wiki/Lead_generation)
242
+ 7. [Vowel Harmony](https://en.wikipedia.org/wiki/Vowel_harmony#Turkish)
243
+ 8. [Turkish Suffixes](https://en.wiktionary.org/wiki/Appendix:Turkish_suffixes)
244
+ 9. [Turkish Grammar](https://en.wikipedia.org/wiki/Turkish_grammar)
245
+ 10. [Turkish Language](https://en.wikipedia.org/wiki/Turkish_language)
246
+ 11. [Tartarus](http://tartarus.org/)
247
+ 12. [Information Retrieval on Turkish Texts](http://www.users.muohio.edu/canf/papers/JASIST2008offPrint.pdf)
248
+
249
+ ## Installation
250
+
251
+ Add this line to your application's Gemfile:
252
+
253
+ gem 'turkish_stemmer'
254
+
255
+ And then execute:
256
+
257
+ $ bundle
258
+
259
+ Or install it yourself as:
260
+
261
+ $ gem install turkish_stemmer
262
+
263
+ ## Usage
264
+
265
+ ```ruby
266
+ require 'turkish_stemmer'
267
+
268
+ TurkishStemmer.stem("gözlükler") # => "gözlük"
269
+ ```
270
+
271
+ ## Contributing
272
+
273
+ 1. Fork it ( http://github.com/<my-github-username>/turkish_stemmer/fork )
274
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
275
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
276
+ 4. Push to the branch (`git push origin my-new-feature`)
277
+ 5. Create new Pull Request
278
+
279
+ ## License
280
+
281
+ turkish_stemmer is licensed under MIT. See [LICENSE](LICENSE.txt).
282
+
data/Rakefile ADDED
@@ -0,0 +1,21 @@
1
+ # coding: utf-8
2
+ require "bundler/gem_tasks"
3
+
4
+ desc "Update the stems of the sample words"
5
+ task :update_stemming_samples do
6
+ require 'turkish_stemmer'
7
+ words = []
8
+ filename = "benchmarks/stemming_samples.txt"
9
+ File.open(filename, "r") do |sample|
10
+ while(line = sample.gets)
11
+ word, _ = line.split(",")
12
+ words << word
13
+ end
14
+ end
15
+
16
+ File.open(filename, "w") do |sample|
17
+ words.each do |word|
18
+ sample.puts "#{word},#{TurkishStemmer.stem(word)}"
19
+ end
20
+ end
21
+ end
@@ -0,0 +1,16 @@
1
+ require 'benchmark'
2
+ require 'turkish_stemmer'
3
+ require 'lingua/stemmer'
4
+
5
+ Benchmark.bmbm(7) do |x|
6
+
7
+ lingua_stemmer = Lingua::Stemmer.new(:language => "tr")
8
+
9
+ x.report('Stem using turkish_stemmer gem') do
10
+ 1_000.times { TurkishStemmer.stem("telephonlar") }
11
+ end
12
+
13
+ x.report('Stem using ruby-stemmer gem') do
14
+ 1_000.times { lingua_stemmer.stem("telephonlar") }
15
+ end
16
+ end