turkish_stemmer 0.1.2 → 0.1.4

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 4329d09e97cff22cb43a831f47e8f64ca0e5e0ae
4
- data.tar.gz: 005c00062f4545e5169ad286cf9843cf11c9c194
3
+ metadata.gz: 8e751c59ef166e54c09608770ca44a9b64592cc9
4
+ data.tar.gz: 7e01352a9394b33effcead5bd849ecf8d051c568
5
5
  SHA512:
6
- metadata.gz: b55ebf06d0c3431fc751993c6bb15c067f56fede1711cc304cffd74e55338259ad7a5251125f47b3c5fed0de2b1914e02aab55c61eb9b8ff81a51dafd7b16d15
7
- data.tar.gz: 3048aff4dd75a1ab76a7e9c065e56be863a750ca881f6d0dcf205cc2c8d15b2247167fcb36a6b6e9fd2a7facf74e214b4ae8405b2cad2688c1ff664dbe1923da
6
+ metadata.gz: 4fdebc0a91152e7fc60ec60eb76961fdd36272e9d6caea913b43ae4ac9948974df1e4e6822ff7033bbd51041470fbdb560268978858ae2cf51f94da02f0715f8
7
+ data.tar.gz: ef9219aa61d37a0098e1ee23d5ad8e360a1c04278335562cf56f7c885b4fd9c69b469493f7ed5af0749bde0ee1a366a6689cc3b75e61323f870ee5f4e9cc4d3b
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: turkish_stemmer
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.2
4
+ version: 0.1.4
5
5
  platform: ruby
6
6
  authors:
7
7
  - Tasos Stathopoulos
@@ -9,7 +9,7 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2014-04-02 00:00:00.000000000 Z
12
+ date: 2015-06-16 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: activesupport
@@ -102,33 +102,7 @@ email:
102
102
  executables: []
103
103
  extensions: []
104
104
  extra_rdoc_files: []
105
- files:
106
- - ".gitignore"
107
- - ".rspec"
108
- - Gemfile
109
- - LICENSE.txt
110
- - README.md
111
- - Rakefile
112
- - benchmarks/stemmers_comparison.rb
113
- - benchmarks/stemming_samples.txt
114
- - benchmarks/turkish_word_recognition.rb
115
- - config/derivational_states.yml
116
- - config/derivational_suffixes.yml
117
- - config/nominal_verb_states.yml
118
- - config/nominal_verb_suffixes.yml
119
- - config/noun_states.yml
120
- - config/noun_suffixes.yml
121
- - config/stemmer.yml
122
- - lib/turkish_stemmer.rb
123
- - lib/turkish_stemmer/version.rb
124
- - spec/fixtures/simple_state.yml
125
- - spec/fixtures/simple_state_02.yml
126
- - spec/fixtures/simple_suffix.yml
127
- - spec/fixtures/simple_transition.yml
128
- - spec/spec_helper.rb
129
- - spec/support/fixtures.csv
130
- - spec/turkish_stemmer_spec.rb
131
- - turkish_stemmer.gemspec
105
+ files: []
132
106
  homepage: https://github.com/skroutz/turkish_stemmer
133
107
  licenses:
134
108
  - MIT
@@ -149,15 +123,9 @@ required_rubygems_version: !ruby/object:Gem::Requirement
149
123
  version: '0'
150
124
  requirements: []
151
125
  rubyforge_project:
152
- rubygems_version: 2.1.11
126
+ rubygems_version: 2.4.7
153
127
  signing_key:
154
128
  specification_version: 4
155
129
  summary: A simple Turkish stemmer
156
- test_files:
157
- - spec/fixtures/simple_state.yml
158
- - spec/fixtures/simple_state_02.yml
159
- - spec/fixtures/simple_suffix.yml
160
- - spec/fixtures/simple_transition.yml
161
- - spec/spec_helper.rb
162
- - spec/support/fixtures.csv
163
- - spec/turkish_stemmer_spec.rb
130
+ test_files: []
131
+ has_rdoc:
data/.gitignore DELETED
@@ -1,18 +0,0 @@
1
- *.gem
2
- *.rbc
3
- .bundle
4
- .config
5
- .yardoc
6
- Gemfile.lock
7
- InstalledFiles
8
- _yardoc
9
- coverage
10
- doc/
11
- lib/bundler/man
12
- pkg
13
- rdoc
14
- spec/reports
15
- test/tmp
16
- test/version_tmp
17
- tmp
18
- bin/
data/.rspec DELETED
@@ -1,2 +0,0 @@
1
- --color
2
- --format doc
data/Gemfile DELETED
@@ -1,4 +0,0 @@
1
- source 'https://rubygems.org'
2
-
3
- # Specify your gem's dependencies in turkish_stemmer.gemspec
4
- gemspec
data/LICENSE.txt DELETED
@@ -1,22 +0,0 @@
1
- Copyright (c) 2014 Skroutz SA
2
-
3
- MIT License
4
-
5
- Permission is hereby granted, free of charge, to any person obtaining
6
- a copy of this software and associated documentation files (the
7
- "Software"), to deal in the Software without restriction, including
8
- without limitation the rights to use, copy, modify, merge, publish,
9
- distribute, sublicense, and/or sell copies of the Software, and to
10
- permit persons to whom the Software is furnished to do so, subject to
11
- the following conditions:
12
-
13
- The above copyright notice and this permission notice shall be
14
- included in all copies or substantial portions of the Software.
15
-
16
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
- EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
- MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
- NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
- LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
- OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
- WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md DELETED
@@ -1,282 +0,0 @@
1
- # TurkishStemmer
2
-
3
- Stemmer algorithm for the Turkish language.
4
-
5
- ## Introduction to Turkish language morphology
6
-
7
- > Turkish is an agglutinative language and has a very rich morphological
8
- structure. In Turkish, you can form many different words from a single stem by
9
- appending a sequence of suffixes. For example, the word "doktoruymuşsunuz"
10
- means "You had been the doctor of him". The stem of the word is "doktor" and
11
- it takes three different suffixes -sU, -ymUş, and -sUnUz.
12
-
13
- From "Snowball Description":
14
-
15
- > Words are usually composed of a stem and of at least two or three affixes
16
- appended to it.
17
-
18
- > We can analyze noun suffixes in Turkish in two groups. Noun suffixes (eg.
19
- "doktor-um" meaning "my doctor") and nominal verb suffixes (eg. "doktor-dur"
20
- meaning ‘is a doctor’). The words ending with nominal verb suffixes can be
21
- used as verbs in sentences. There are over thirty different suffixes
22
- classified in these two general groups of suffixes.
23
-
24
- > In Turkish, the suffixes are affixed to the stem according to definite
25
- ordering rules.
26
-
27
- From "An affix stripping morphological analyzer for Turkish" paper:
28
-
29
- > Turkish has a special place within the natural languages not only being a
30
- fully concatenative language but also having the suffixes as the only affix
31
- type. Another feature of the language is that, someone who knows Turkish can
32
- easily analyze a word even if he/she does not know its stem.
33
-
34
- > The phonological rules of Turkish are significant factors that influence
35
- this feature.
36
- Ex: (any word)lerim => (any word)-ler-im
37
- "ler" plural suffix, "im" 1st singular person possessive.
38
-
39
- ### Rules
40
-
41
- 1. The only affix type in Turkish is the suffix.
42
-
43
- 2. A plural suffix cannot follow a possessive suffix.
44
-
45
- 3. A suffix in Turkish can have multiple allomorphs in order to provide sound
46
- harmony in the word to which it is affixed.
47
-
48
- 4. In Turkish each vowel indicates a distinct syllable.
49
-
50
- 5. In Turkish, single-syllable words are mostly the stem itself.
51
-
52
- 6. If a word has nominal __verb__ suffixes, they always appear at the end of
53
- the word. They follow __noun__ suffixes or the stem itself in the absence
54
- of noun suffixes.
55
-
56
- 7. In Turkish, “-lAr” suffix can be used both as a nominal verb suffix (third
57
- person plural present tense) and as a noun suffix (plural inflection).
58
-
59
- 8. In Turkish, words do not end with consonants 'b', 'c', 'd', and 'ğ'.
60
- However, when a suffix starting with a vowel is affixed to a word ending
61
- with 'p', 'ç', 't' or 'k', the last consonant is transformed into 'b', 'c',
62
- 'd', or 'ğ' respectively. The postlude routine transforms last consonants
63
- 'b', 'c', 'd', or 'ğ' back to 'p', 'ç', 't' or 'k', respectively, after
64
- stemming is complete.
65
-
66
- ### Suffix Classes
67
-
68
- Class | Type
69
- -----------------------------|----------------
70
- Nominal verb suffixes | Inflectional
71
- Derivational suffixes | Derivational
72
- Noun suffixes | Inflectional
73
- Tense & person verb suffixes | Inflectional
74
- Verb suffixes | Inflectional
75
-
76
- ### Suffix allomorphs
77
-
78
- Suffix allomorphs are used to create a good sound harmony. They do not change
79
- the meaning of the word. If a suffix has a capital letter then it has an
80
- allomorph. If a suffix has a letter in parentheses then it can be omitted.
81
- Possible allomorphs are given below:
82
-
83
- Letter | Allomorph
84
- -------|------------
85
- U | ı,i,u,ü
86
- C | c,ç
87
- A | a,e
88
- D | d,t
89
- I | ı,i
90
-
91
- ### Nominal Verb Suffixes
92
-
93
- No. | Suffix
94
- ----|------------------
95
- 1 | –(y)Um
96
- 2 | –sUn
97
- 3 | –(y)Uz
98
- 4 | –sUnUz
99
- 5 | –lAr
100
- 6 | –m
101
- 7 | –n
102
- 8 | –k
103
- 9 | –nUz
104
- 10 | –DUr
105
- 11 | –cAsInA
106
- 12 | –(y)DU
107
- 13 | –(y)sA
108
- 14 | –(y)mUş
109
- 15 | –(y)ken
110
-
111
- Suffix transition ordering for nominal verbs can be seen in References[5]
112
-
113
- ### Noun Suffixes
114
-
115
- No. | Suffixes
116
- ----|-------------
117
- 1 | –lAr
118
- 2 | –(U)m
119
- 3 | –(U)mUz
120
- 4 | –(U)n
121
- 5 | –(U)nUz
122
- 6 | –(s)U
123
- 7 | –lArI
124
- 8 | –(y)U
125
- 9 | –nU
126
- 10 | –(n)Un
127
- 11 | –(y)A
128
- 12 | –nA
129
- 13 | –DA
130
- 14 | –nDA
131
- 15 | –DAn
132
- 16 | –nDAn
133
- 17 | –(y)lA
134
- 18 | –ki
135
- 19 | –(n)cA
136
-
137
- Suffix transition ordering for nouns can be seen in References[5]
138
-
139
- ### Derivational Suffixes
140
-
141
- No. | Suffixes
142
- ----|----------
143
- 1 | –lUk
144
- 2 | –CU
145
- 3 | –CUk
146
- 4 | –lAş
147
- 5 | –lA
148
- 6 | –lAn
149
- 7 | –CA
150
- 8 | –lU
151
- 9 | –sUz
152
-
153
- Initially, we will handle only a small subset of the above suffixes which are
154
- more common in our domain.
155
-
156
- ### Vowel Harmony
157
-
158
- This routine checks whether __the last two__ vowels of the word obey vowel
159
- harmony rules. A brief description of Turkish vowel harmony follows.
160
-
161
- Turkish vowel harmony is a two dimensional vowel harmony system, where vowels
162
- are characterised by two features named frontness and roundness. There are
163
- vowel harmony rules for each feature.
164
-
165
- 1. Vowel harmony rule for frontness: Vowels in Turkish are grouped into two
166
- according to where they are produced. Front produced vowels are formed at
167
- the front of the mouth ('e', 'i', 'ö', 'ü') and back produced vowels are
168
- produced nearer to throat ('a', 'ı', 'o', 'u'). According to the vowel
169
- harmony rule, words cannot contain both front and back vowels. This is one
170
- of the reasons why suffixes containing vowels can take different forms to
171
- obey vowel harmony.
172
-
173
- 2. Vowel harmony rule for roundness: Vowels in Turkish are grouped into two
174
- according to whether the lips are rounded while producing them. 'o', 'ö', 'u' and
175
- 'ü' are rounded vowels whereas 'a', 'e', 'ı' and 'i' are unrounded.
176
- According to the vowel harmony rules, if the vowel of a syllable is
177
- unrounded, the following vowel is unrounded as well. If the vowel of a
178
- syllable is rounded, the following vowels are 'a', 'e', 'u' or 'ü'.
179
-
180
- ### Last consonant
181
-
182
- Another interesting case in detecting suffixes in Turkish is that, for some
183
- suffixes, if the word ends with a vowel, a consonant is inserted between the
184
- rest of the word and the suffix. These merging consonants can be 'y', 'n' or
185
- 's'. When a merging consonant can be inserted before the suffix, the
186
- representation of the suffix starts with the optional consonant surrounded by
187
- parentheses (e.g. –(y)Um, –(n)cA). For these kinds of suffixes, if existence of
188
- a merging consonant is considered, the candidate stem is checked whether it
189
- ends with a vowel.
190
-
191
- If there is no 'y' consonant before the suffix, only the real part of the
192
- suffix (eg. -Um) is marked for stemming. If there is a 'y' consonant and it is
193
- preceded by a vowel, 'y' is treated as a merging consonant and both 'y' and
194
- the candidate suffix (e.g. -Um) are marked for stemming. If there is a consonant
195
- just before 'y', the decision is that the consonant 'y' and the candidate
196
- suffix are really a part of the stem. In such a case, cursor is not advanced
197
- to prevent over-stemming. The last case can occur especially when the stem
198
- originates from another language like in 'lityum' (meaning the element
199
- Lithium). If the check for vowel harmony was not made, the word would be
200
- stemmed to 'lit', for '–(y)Um' would be treated as a suffix affixed to it. But
201
- according to morphological rules of Turkish, the final word would be 'litim',
202
- not 'lityum' if 'lit' were really the stem of the word and the suffix '–(y)Um'
203
- were affixed to it. So detecting 'lit' as the stem of the word would be
204
- over-stemming.
205
-
206
- ### Merging Vowel
207
-
208
- Similar to merging consonants, there are merging vowels for some suffixes
209
- starting with consonants. They can be preceded by merging vowels like in
210
- '-(U)mUz' suffix when they are affixed to a stem ending with a consonant. In such a
211
- case, a U vowel ('ı', 'i', 'u' or 'ü' depending on vowel harmony) is inserted
212
- between the stem and real suffix (e.g. '-mUz') for ease of pronunciation.
213
-
214
- ### Some examples
215
-
216
- Word / Analysis | Meaning / Stem
217
- ------------------------------ |--------------------------------
218
- Kalelerimizdekilerden | From the ones at one of our castles
219
- Kale-lAr-UmUz-DA-ki-lAr-DAn | Kale
220
- Çocuğuymuşumcasına | As if I were her child
221
- Çocuk-(s)U-(y)mUş-(y)Um-cAsInA | Çocuk
222
- Kedileriyle | With their cats
223
- Kedi-lAr-(s)U-(y)lA | Kedi
224
- Çocuklarımmış | Someone told me that they were my children
225
- çocuk-lAr-(U)m-(y)mUş | Çocuk
226
- Kitabımızdı | It was our book
227
- kitap-UmUz-(y)DU | Kitap
228
-
229
- ## Future Work
230
-
231
- * Add more verb suffixes.
232
- * Add more derivational suffixes.
233
-
234
- ## References
235
-
236
- 1. [Turkish Stemmer used in Lucene](http://snowball.tartarus.org/algorithms/turkish/stemmer.html)
237
- 2. [Java Implementation](http://snowball.tartarus.org/archives/snowball-discuss/att-0875/02-TurkishStemmer.java)
238
- 3. [Snowball Implementation](http://snowball.tartarus.org/algorithms/turkish/stem_Unicode.sbl)
239
- 4. [Snowball Description](http://snowball.tartarus.org/algorithms/turkish/accompanying_paper.doc)
240
- 5. [An affix stripping morphological analyzer for Turkish](http://web.itu.edu.tr/~gulsenc/papers/iasted.pdf)
241
- 6. [Lead Generation](https://en.wikipedia.org/wiki/Lead_generation)
242
- 7. [Vowel Harmony](https://en.wikipedia.org/wiki/Vowel_harmony#Turkish)
243
- 8. [Turkish Suffixes](https://en.wiktionary.org/wiki/Appendix:Turkish_suffixes)
244
- 9. [Turkish Grammar](https://en.wikipedia.org/wiki/Turkish_grammar)
245
- 10. [Turkish Language](https://en.wikipedia.org/wiki/Turkish_language)
246
- 11. [Tartarus](http://tartarus.org/)
247
- 12. [Information Retrieval on Turkish Texts](http://www.users.muohio.edu/canf/papers/JASIST2008offPrint.pdf)
248
-
249
- ## Installation
250
-
251
- Add this line to your application's Gemfile:
252
-
253
- gem 'turkish_stemmer'
254
-
255
- And then execute:
256
-
257
- $ bundle
258
-
259
- Or install it yourself as:
260
-
261
- $ gem install turkish_stemmer
262
-
263
- ## Usage
264
-
265
- ```ruby
266
- require 'turkish_stemmer'
267
-
268
- TurkishStemmer.stem("gözlükler") # => "gözlük"
269
- ```
270
-
271
- ## Contributing
272
-
273
- 1. Fork it ( http://github.com/<my-github-username>/turkish_stemmer/fork )
274
- 2. Create your feature branch (`git checkout -b my-new-feature`)
275
- 3. Commit your changes (`git commit -am 'Add some feature'`)
276
- 4. Push to the branch (`git push origin my-new-feature`)
277
- 5. Create new Pull Request
278
-
279
- ## License
280
-
281
- turkish_stemmer is licensed under MIT. See [LICENSE](LICENSE.txt).
282
-
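Two of the rules described in the deleted README, the check on the last two vowels ("Vowel Harmony") and the postlude from rule 8, can be illustrated with a short Ruby sketch. This is only an illustration under stated assumptions: the constant and method names below are hypothetical and are not taken from the gem's source, the vowel sets are copied from the README text, and input is assumed to be already lowercase (Ruby's default `downcase` does not map 'I' to 'ı').

```ruby
# Hypothetical helpers illustrating two README rules; not the gem's code.
FRONT_VOWELS     = %w[e i ö ü].freeze
BACK_VOWELS      = %w[a ı o u].freeze
ROUNDED_VOWELS   = %w[o ö u ü].freeze
UNROUNDED_VOWELS = %w[a e ı i].freeze
VOWELS           = (FRONT_VOWELS + BACK_VOWELS).freeze
# Vowels the README allows after a rounded vowel.
AFTER_ROUNDED    = %w[a e u ü].freeze

# Returns true when the last two vowels of a lowercase word satisfy both
# the frontness rule and the roundness rule as stated in the README.
def vowel_harmony?(word)
  first, second = word.each_char.select { |c| VOWELS.include?(c) }.last(2)
  return true if second.nil? # fewer than two vowels: nothing to check

  same_frontness = FRONT_VOWELS.include?(first) == FRONT_VOWELS.include?(second)
  roundness_ok =
    if ROUNDED_VOWELS.include?(first)
      AFTER_ROUNDED.include?(second)
    else
      UNROUNDED_VOWELS.include?(second)
    end

  same_frontness && roundness_ok
end

# Rule 8 postlude: a stem ending in 'b', 'c', 'd' or 'ğ' gets its last
# consonant turned back into 'p', 'ç', 't' or 'k' respectively.
LAST_CONSONANT = { "b" => "p", "c" => "ç", "d" => "t", "ğ" => "k" }.freeze

def restore_last_consonant(stem)
  return stem if stem.empty?

  last = stem[-1]
  LAST_CONSONANT.key?(last) ? stem[0...-1] + LAST_CONSONANT[last] : stem
end
```

With these definitions, `vowel_harmony?("lityum")` is false because 'i' is a front vowel and 'u' is a back vowel, which matches the README's point that 'lityum' does not behave like a native stem plus suffix. `restore_last_consonant("kitab")` gives "kitap", the stem listed for "Kitabımızdı" in the examples table.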
data/Rakefile DELETED
@@ -1,21 +0,0 @@
1
- # coding: utf-8
2
- require "bundler/gem_tasks"
3
-
4
- desc "Update the stems of the sample words"
5
- task :update_stemming_samples do
6
- require 'turkish_stemmer'
7
- words = []
8
- filename = "benchmarks/stemming_samples.txt"
9
- File.open(filename, "r") do |sample|
10
- while(line = sample.gets)
11
- word, _ = line.split(",")
12
- words << word
13
- end
14
- end
15
-
16
- File.open(filename, "w") do |sample|
17
- words.each do |word|
18
- sample.puts "#{word},#{TurkishStemmer.stem(word)}"
19
- end
20
- end
21
- end
data/benchmarks/stemmers_comparison.rb DELETED
@@ -1,16 +0,0 @@
1
- require 'benchmark'
2
- require 'turkish_stemmer'
3
- require 'lingua/stemmer'
4
-
5
- Benchmark.bmbm(7) do |x|
6
-
7
- lingua_stemmer = Lingua::Stemmer.new(:language => "tr")
8
-
9
- x.report('Stem using turkish_stemmer gem') do
10
- 1_000.times { TurkishStemmer.stem("telephonlar") }
11
- end
12
-
13
- x.report('Stem using ruby-stemmer gem') do
14
- 1_000.times { lingua_stemmer.stem("telephonlar") }
15
- end
16
- end