estem 0.2.4 → 0.2.5
- data/ChangeLog +37 -10
- data/README.rdoc +7 -38
- data/examples/usage.rb +0 -2
- data/examples/usage.rb~ +11 -0
- data/lib/estem.rb +59 -60
- data/lib/estem.rb~ +271 -0
- data/test/diffs_ISO88591.txt +28390 -0
- data/test/{diffs.txt → diffs_UTF8.txt} +0 -0
- data/test/test_estem.rb +15 -10
- data/test/test_estem.rb~ +27 -0
- metadata +10 -5
- data/bin/es_stem.rb +0 -179
data/ChangeLog
CHANGED
@@ -1,3 +1,30 @@
|
|
1
|
+
Version 0.2.5
|
2
|
+
|
3
|
+
2012-09-02 MaG <maguilamo.c@gmail.com>
|
4
|
+
*
|
5
|
+
- bin/ directory, removed.
|
6
|
+
|
7
|
+
* README.rdoc:
|
8
|
+
- cleanups
|
9
|
+
- Thanks section, removed.
|
10
|
+
|
11
|
+
* examples/usage.rb:
|
12
|
+
- cleanups
|
13
|
+
|
14
|
+
* estem.rb:
|
15
|
+
- (es_stem): rewritten.
|
16
|
+
- (safe_es_stem): deprecated Iconv, removed.
|
17
|
+
|
18
|
+
* bin/es_stem.rb:
|
19
|
+
- removed.
|
20
|
+
|
21
|
+
* test/:
|
22
|
+
- new test file added.
|
23
|
+
- rename file diffs.txt.
|
24
|
+
|
25
|
+
* test/test_estem.rb:
|
26
|
+
- one more test added.
|
27
|
+
|
1
28
|
Version 0.2.4
|
2
29
|
|
3
30
|
2012-06-25 MaG <maguilamo.c@gmail.com>
|
@@ -9,19 +36,19 @@ Version 0.2.4
|
|
9
36
|
- examples/usage.rb: new file
|
10
37
|
|
11
38
|
* README.rdoc:
|
12
|
-
|
13
|
-
|
14
|
-
|
39
|
+
- max 80 cols per line.
|
40
|
+
- recomendation about using safe_es_stem().
|
41
|
+
- Fix Spanish typos.
|
15
42
|
|
16
43
|
* estem.gemspec:
|
17
|
-
|
18
|
-
|
44
|
+
- cleanups.
|
45
|
+
- (required_ruby_version): Ruby 1.9.1.
|
19
46
|
|
20
47
|
* bin/es_stem.rb:
|
21
|
-
|
22
|
-
|
23
|
-
|
24
|
-
|
48
|
+
- chmod a+x .
|
49
|
+
- (es_stem.rb:80): fix case sensitive comparation.
|
50
|
+
- (es_stem.rb:25): removed .rb ext.
|
51
|
+
- (es_stem.rb:29): new version.
|
25
52
|
|
26
53
|
* estem.rb:
|
27
|
-
|
54
|
+
- (safe_es_stem): new method.
|
data/README.rdoc
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
= Spanish Stem Gem
|
2
2
|
|
3
3
|
== Description
|
4
|
-
This gem
|
4
|
+
This gem reduces Spanish words to their respective roots. It uses an algorithm
|
5
5
|
based on Martin Porter's specifications.
|
6
6
|
|
7
7
|
For more information, visit:
|
@@ -21,15 +21,14 @@ or
|
|
21
21
|
$ gem install estem
|
22
22
|
|
23
23
|
== Usage
|
24
|
-
As a reminder, take in consideration that the Spanish language
|
24
|
+
As a reminder, take in consideration that the Spanish language has several non
|
25
25
|
US-ASCII characters, and because of that, the same data may varied from one
|
26
26
|
codeset to another.
|
27
27
|
|
28
28
|
Please remember to use a UTF-8 compatible encoding while using EStem. Please do
|
29
|
-
not use String#force_encoding
|
30
|
-
|
31
|
-
|
32
|
-
type is unknown.
|
29
|
+
not use String#force_encoding to convert from one codeset to another, you may
|
30
|
+
try using String#encode alone but, instead, consider using String#safe_es_stem
|
31
|
+
when handling incompatibles codesets or the codeset type varies.
|
33
32
|
|
34
33
|
require 'estem'
|
35
34
|
|
@@ -41,19 +40,6 @@ type is unknown.
|
|
41
40
|
puts "HaBiTaCiOnEs".es_stem # ==> "HaBiT"
|
42
41
|
puts "Hacinamiento".es_stem # ==> "Hacin"
|
43
42
|
|
44
|
-
You can use <tt>EStem</tt> as a command line tool:
|
45
|
-
$ es_stem --in-enc ISO-8859-1 -f input_file.txt
|
46
|
-
|
47
|
-
for more information type
|
48
|
-
$ es_stem --help
|
49
|
-
|
50
|
-
The <tt>es_stem</tt> program do his best trying to tokenized the lines from
|
51
|
-
the file, you might consider finding an Spanish tokenizer, either way this
|
52
|
-
program do what it is suppose to do, stem Spanish words.
|
53
|
-
|
54
|
-
NOTE: For excellent results, consider replacing one word per line on the files
|
55
|
-
the program handles.
|
56
|
-
|
57
43
|
== Uso
|
58
44
|
Como recordatorio, ten en cosideración que el Castellano posee muchos
|
59
45
|
carácteres que están fuera del código ASCII, y por esta razón, los datos pueden
|
@@ -62,9 +48,8 @@ variar de un conjunto de codificación a otro.
|
|
62
48
|
Por favor recuerda utilizar sistemas de condificación compatibles con UTF-8
|
63
49
|
cuando se trabaje con EStem. Por favor no use String#force_encoding para
|
64
50
|
convertir de un conjunto de codificación a otro, podría utilizar String#encode
|
65
|
-
pero
|
66
|
-
|
67
|
-
o se desconoce el tipo.
|
51
|
+
solo, pero en su lugar, considere utilizar String#safe_es_stem() si está
|
52
|
+
manejando conjuntos de codificación incompatibles o se desconoce el tipo.
|
68
53
|
|
69
54
|
require 'estem'
|
70
55
|
|
@@ -75,17 +60,6 @@ o se desconoce el tipo.
|
|
75
60
|
puts "ALbeRGues".es_stem # ==> "ALbeRG"
|
76
61
|
puts "HaBiTaCiOnEs".es_stem # ==> "HaBiT"
|
77
62
|
puts "Hacinamiento".es_stem # ==> "Hacin"
|
78
|
-
|
79
|
-
Para más información ejecuta:
|
80
|
-
$ es_stem --help
|
81
|
-
|
82
|
-
El programa <tt>es_stem</tt> hará lo posible para separar las palabras de cada
|
83
|
-
línea del fichero. Sería sensato utilizar otro programa más especializado para
|
84
|
-
este propósito, de todas maneras, es_stem hace lo que se supone debe hacer,
|
85
|
-
optener las raíces de las palabras.
|
86
|
-
|
87
|
-
NOTA: Para resultados excelentes, considere poner una palabra por línea en los
|
88
|
-
ficheros que pasará el programa.
|
89
63
|
|
90
64
|
== Test
|
91
65
|
|
@@ -101,11 +75,6 @@ Incluye 28390 palabras de prueba con sus resultado esperados. Para realizar
|
|
101
75
|
la prueba, ejecuta:
|
102
76
|
rake test
|
103
77
|
|
104
|
-
== Thanks -- Agradecimientos
|
105
|
-
|
106
|
-
Ray Pereda https://github.com/raypereda/stemmify/ I used his gem as a guide to
|
107
|
-
package mine. http://guides.rubygems.org/make-your-own-gem/ as well.
|
108
|
-
|
109
78
|
== License -- Licencia
|
110
79
|
|
111
80
|
Copyright (c) 2012 Manuel A. Güílamo
|
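Note on the README encoding advice: the Usage text above warns against String#force_encoding for converting between codesets and points to String#encode (or String#safe_es_stem) instead. The sketch below uses only core Ruby and does not call the gem. The sample bytes and variable names are illustrative assumptions, not taken from the package.

```ruby
# ISO-8859-1 bytes for the word "nino" with n-tilde: the tilde letter is the single byte 0xF1.
latin1 = [0x6E, 0x69, 0xF1, 0x6F].pack('C*').force_encoding('ISO-8859-1')

# force_encoding only relabels the existing bytes. 0xF1 followed by "o"
# is not a valid UTF-8 sequence, so the relabeled copy is broken.
relabeled = latin1.dup.force_encoding('UTF-8')
puts relabeled.valid_encoding?   # false

# encode transcodes the bytes, so the tilde letter becomes a two-byte UTF-8 sequence.
converted = latin1.encode('UTF-8')
puts converted.valid_encoding?   # true
puts converted.bytes.length      # 5
```

A string converted this way is in the form the README asks for before calling es_stem. When the source codeset is unknown, the README recommends safe_es_stem instead.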
data/examples/usage.rb
CHANGED
data/examples/usage.rb~
ADDED
@@ -0,0 +1,11 @@
|
|
1
|
+
require 'estem'
|
2
|
+
|
3
|
+
hsh = Hash.new
|
4
|
+
|
5
|
+
words = ['albergues','habitaciones','Albergues','ALbeRGues','HaBiTaCiOnEs',
|
6
|
+
'Hacinamiento','mujeres','muchedumbre','ocasionalmente']
|
7
|
+
|
8
|
+
words.each do|w|
|
9
|
+
stem = w.es_stem
|
10
|
+
puts "Word: #{w}\nStem: #{stem}\n\n"
|
11
|
+
end
|
data/lib/estem.rb
CHANGED
@@ -22,8 +22,6 @@
|
|
22
22
|
# * Manuel A. Güílamo maguilamo.c@gmail.com
|
23
23
|
#
|
24
24
|
|
25
|
-
require 'iconv'
|
26
|
-
|
27
25
|
module EStem
|
28
26
|
##
|
29
27
|
# For more information, please refer to <b>String#es_stem</b> method, also <b>EStem</b>.
|
@@ -38,61 +36,59 @@ module EStem
|
|
38
36
|
# "HaBiTaCiOnEs".es_stem # ==> "HaBiT"
|
39
37
|
# "Hacinamiento".es_stem # ==> "Hacin"
|
40
38
|
#
|
41
|
-
#If you are not aware of the codeset the data
|
39
|
+
#If you are not aware of the codeset the data have, try using
|
42
40
|
#String#safe_es_stem instead.
|
43
41
|
#
|
44
42
|
#:call-seq:
|
45
43
|
# str.es_stem => "new_str"
|
46
44
|
def es_stem
|
47
45
|
str = self.dup
|
48
|
-
|
49
|
-
|
50
|
-
|
51
|
-
|
52
|
-
|
53
|
-
unless tmp = step2a(str)
|
54
|
-
tmp = step2b(str)
|
55
|
-
str = tmp ? tmp : str
|
56
|
-
else
|
57
|
-
str = tmp
|
58
|
-
end
|
46
|
+
case str.length
|
47
|
+
when 0
|
48
|
+
return str
|
49
|
+
when 1
|
50
|
+
return remove_accent(str)
|
59
51
|
end
|
60
|
-
|
61
|
-
str
|
52
|
+
|
53
|
+
step0(str)
|
54
|
+
unless step1(str)
|
55
|
+
step2b(str) unless step2a(str)
|
56
|
+
end
|
57
|
+
|
58
|
+
step3(str)
|
62
59
|
remove_accent(str)
|
63
60
|
end
|
64
61
|
|
65
62
|
##
|
66
63
|
#Use this method in case you are not aware of the codeset the data being
|
67
|
-
#handle
|
68
|
-
#the original. Be aware that this method is slower than String#es_stem
|
64
|
+
#handle have. This method returns a new string with the same codeset as
|
65
|
+
#the original. Be aware that this method is a bit slower than String#es_stem
|
69
66
|
#:call-seq:
|
70
67
|
# str.safe_es_stem => "new_str"
|
71
68
|
def safe_es_stem
|
72
|
-
|
73
|
-
|
74
|
-
|
75
|
-
|
76
|
-
str = self.dup.force_encoding('UTF-8')
|
77
|
-
|
78
|
-
if str.valid_encoding?
|
79
|
-
begin
|
80
|
-
tmp = str.es_stem
|
81
|
-
return tmp.force_encoding(default_enc)
|
82
|
-
rescue
|
83
|
-
end
|
69
|
+
if self.encoding == Encoding::UTF_8
|
70
|
+
# remove invalid characters
|
71
|
+
return self.chars.select{|c| c.valid_encoding? }.join.es_stem
|
84
72
|
end
|
85
73
|
|
86
|
-
|
87
|
-
|
88
|
-
|
89
|
-
|
74
|
+
unless self.valid_encoding?
|
75
|
+
tmp = self.dup
|
76
|
+
if tmp.force_encoding('UTF-8').valid_encoding?
|
77
|
+
begin
|
78
|
+
return tmp.es_stem
|
79
|
+
rescue
|
80
|
+
end
|
90
81
|
end
|
91
82
|
end
|
92
83
|
|
84
|
+
default_enc = self.encoding.name
|
85
|
+
str = self.chars.select{|c| c.valid_encoding? }.join
|
86
|
+
|
87
|
+
return nil if str.empty?
|
88
|
+
|
93
89
|
begin
|
94
|
-
tmp =
|
95
|
-
return
|
90
|
+
tmp = str.encode('UTF-8', str.encoding.name).es_stem
|
91
|
+
return tmp.encode(default_enc, 'UTF-8');
|
96
92
|
rescue
|
97
93
|
return nil
|
98
94
|
end
|
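Note on the new safe_es_stem: the hunk above replaces the Iconv-based cleanup with the expression `self.chars.select{|c| c.valid_encoding? }.join`, which drops characters that are not valid in the string's current encoding. The sketch below isolates that expression using core Ruby only. The helper name `drop_invalid_chars` and the sample bytes are hypothetical, chosen for illustration.

```ruby
# Hypothetical helper name; the diff inlines this expression inside safe_es_stem.
def drop_invalid_chars(text)
  # String#chars yields each invalid byte as its own one-character string,
  # so filtering on valid_encoding? removes only the broken bytes.
  text.chars.select { |c| c.valid_encoding? }.join
end

# "cafe" with a stray 0xFF byte, which can never appear in valid UTF-8.
dirty = [0x63, 0x61, 0x66, 0xFF, 0x65].pack('C*').force_encoding('UTF-8')

puts dirty.valid_encoding?        # false
clean = drop_invalid_chars(dirty)
puts clean.valid_encoding?        # true
puts clean                        # cafe
```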
@@ -145,8 +141,9 @@ module EStem
|
|
145
141
|
[r1,r2]
|
146
142
|
end
|
147
143
|
|
144
|
+
#=> true or false
|
148
145
|
def step0(str)
|
149
|
-
return
|
146
|
+
return false unless str =~ /(se(l[ao]s?)?|l([aeo]s?)|me|nos)$/i
|
150
147
|
|
151
148
|
suffix = $&
|
152
149
|
rv_text = str[rv(str)..-1]
|
@@ -154,21 +151,21 @@ module EStem
|
|
154
151
|
case rv_text
|
155
152
|
when %r{((?<=i[éÉ]ndo|[áÁ]ndo|[áéíÁÉÍ]r)#{suffix})$}ui
|
156
153
|
str[%r{#$&$}]=''
|
157
|
-
str
|
158
|
-
return
|
154
|
+
str.replace(remove_accent(str))
|
155
|
+
return true
|
159
156
|
when %r{((?<=iendo|ando|[aei]r)#{suffix})$}i
|
160
157
|
str[%r{#$&$}]=''
|
161
|
-
return
|
158
|
+
return true
|
162
159
|
end
|
163
160
|
|
164
161
|
if rv_text =~ /yendo/i and str =~ /uyendo/i
|
165
162
|
str[suffix]=''
|
166
|
-
return
|
163
|
+
return true
|
167
164
|
end
|
168
|
-
|
165
|
+
false
|
169
166
|
end
|
170
167
|
|
171
|
-
#=>
|
168
|
+
#=> true or false
|
172
169
|
def step1(str)
|
173
170
|
r1,r2 = r12(str)
|
174
171
|
r1_text = str[r1..-1]
|
@@ -177,46 +174,46 @@ module EStem
|
|
177
174
|
case r2_text
|
178
175
|
when /(anzas?|ic[oa]s?|ismos?|[ai]bles?|istas?|os[oa]s?|[ai]mientos?)$/i
|
179
176
|
str[%r{#$&$}]=''
|
180
|
-
return
|
177
|
+
return true
|
181
178
|
when /(ic)?(ador([ae]s?)?|aci[óÓ]n|aciones|antes?|ancias?)$/ui
|
182
179
|
str[%r{#$&$}]=''
|
183
|
-
return
|
180
|
+
return true
|
184
181
|
when /log[íÍ]as?/ui
|
185
182
|
str[%r{#$&$}]='log'
|
186
|
-
return
|
183
|
+
return true
|
187
184
|
when /(uci([óÓ]n|ones))$/ui
|
188
185
|
str[%r{#$&$}]='u'
|
189
|
-
return
|
186
|
+
return true
|
190
187
|
when /(encias?)$/i
|
191
188
|
str[%r{#$&$}]='ente'
|
192
|
-
return
|
189
|
+
return true
|
193
190
|
end
|
194
191
|
|
195
192
|
if r2_text =~ /(ativ|iv|os|ic|ad)amente$/i or r1_text =~ /amente$/i
|
196
193
|
str[%r{#$&$}]=''
|
197
|
-
return
|
194
|
+
return true
|
198
195
|
end
|
199
196
|
|
200
197
|
case r2_text
|
201
198
|
when /((ante|[ai]ble)?mente)$/i, /((abil|i[cv])?idad(es)?)$/i, /((at)?iv[ao]s?)$/i
|
202
199
|
str[%r{#$&$}]=''
|
203
|
-
return
|
200
|
+
return true
|
204
201
|
end
|
205
|
-
|
202
|
+
false
|
206
203
|
end
|
207
204
|
|
208
|
-
#=>
|
205
|
+
#=> true or false
|
209
206
|
def step2a(str)
|
210
207
|
rv_pos = rv(str)
|
211
208
|
idx = str[rv_pos..-1] =~ /(y[oóÓ]|ye(ron|ndo)|y[ae][ns]?|ya(is|mos))$/ui
|
212
209
|
|
213
|
-
return
|
210
|
+
return false unless idx
|
214
211
|
|
215
212
|
if 'u' == str[rv_pos+idx-1].downcase
|
216
213
|
str[%r{#$&$}] = ''
|
217
|
-
return
|
214
|
+
return true
|
218
215
|
end
|
219
|
-
|
216
|
+
false
|
220
217
|
end
|
221
218
|
|
222
219
|
STEP2B_REGEXP = /(
|
@@ -229,6 +226,7 @@ module EStem
|
|
229
226
|
en|es|[éÉ]is|emos
|
230
227
|
)$/xiu
|
231
228
|
|
229
|
+
#=> true or false
|
232
230
|
def step2b(str)
|
233
231
|
rv_pos = rv(str)
|
234
232
|
|
@@ -240,27 +238,28 @@ module EStem
|
|
240
238
|
else
|
241
239
|
str[%r{#{suffix}$}]=''
|
242
240
|
end
|
243
|
-
return
|
241
|
+
return true
|
244
242
|
end
|
245
|
-
|
243
|
+
false
|
246
244
|
end
|
247
245
|
|
246
|
+
#=> true or false
|
248
247
|
def step3(str)
|
249
248
|
rv_pos = rv(str)
|
250
249
|
rv_text = str[rv_pos..-1]
|
251
250
|
|
252
251
|
if rv_text =~ /(os|[aoáíóÁÍÓ])$/ui
|
253
252
|
str[%r{#$&$}]=''
|
254
|
-
return
|
253
|
+
return true
|
255
254
|
elsif idx = rv_text =~ /(u?[eéÉ])$/i
|
256
255
|
if $&[0].downcase == 'u' and str[rv_pos+idx-1].downcase == 'g'
|
257
256
|
str[%r{#$&$}]=''
|
258
257
|
else
|
259
258
|
str.chop!
|
260
259
|
end
|
261
|
-
return
|
260
|
+
return true
|
262
261
|
end
|
263
|
-
|
262
|
+
false
|
264
263
|
end
|
265
264
|
|
266
265
|
VOWEL = 'aeiouáéíóúüAEIOUÁÉÍÓÚÜ'
|
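Note on the true/false convention: in this version each stepN method edits the string in place and returns true or false, which lets es_stem chain them as `unless step1(str) ... step2b(str) unless step2a(str)`. The sketch below mimics that control flow with toy suffix rules. The helper `strip_suffix!`, the `toy_*` methods and their regexes are hypothetical stand-ins, not the gem's real rules, and the R1/R2/RV region checks are omitted.

```ruby
# Hypothetical helper: remove the matched suffix in place and
# report whether anything was removed.
def strip_suffix!(str, pattern)
  return false unless str =~ pattern
  str[pattern] = ''
  true
end

# Toy stand-ins for step1, step2a and step2b (NOT the gem's rules).
def toy_step1(str);  strip_suffix!(str, /amente\z/); end
def toy_step2a(str); strip_suffix!(str, /yendo\z/);  end
def toy_step2b(str); strip_suffix!(str, /ando\z/);   end

def toy_stem(word)
  str = word.dup                 # never mutate the caller's string
  unless toy_step1(str)          # step 2 runs only if step 1 removed nothing
    toy_step2b(str) unless toy_step2a(str)
  end
  str
end

puts toy_stem("rapidamente")  # rapid
puts toy_stem("cantando")     # cant
puts toy_stem("casa")         # casa
```

Because every step reports success as a boolean, the caller no longer needs the temporary variable juggling that the removed 0.2.4 lines used.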
data/lib/estem.rb~
ADDED
@@ -0,0 +1,271 @@
|
|
1
|
+
# encoding: UTF-8
|
2
|
+
#
|
3
|
+
# :title: Spanish Stemming
|
4
|
+
# = Description
|
5
|
+
# This gem is for reducing Spanish words to their roots. It uses an algorithm
|
6
|
+
# based on Martin Porter's specifications.
|
7
|
+
#
|
8
|
+
# For more information, visit:
|
9
|
+
# http://snowball.tartarus.org/algorithms/spanish/stemmer.html
|
10
|
+
#
|
11
|
+
# = Descripción
|
12
|
+
# Esta gema está para reducir las palabras del Español en sus respectivas raíces,
|
13
|
+
# para ello ultiliza un algoritmo basado en las especificaciones de Martin Porter
|
14
|
+
#
|
15
|
+
# Para más información, visite:
|
16
|
+
# http://snowball.tartarus.org/algorithms/spanish/stemmer.html
|
17
|
+
#
|
18
|
+
# = License -- Licencia
|
19
|
+
# This code is provided under the terms of the {MIT License.}[http://www.opensource.org/licenses/mit-license.php]
|
20
|
+
#
|
21
|
+
# = Authors
|
22
|
+
# * Manuel A. Güílamo maguilamo.c@gmail.com
|
23
|
+
#
|
24
|
+
|
25
|
+
module EStem
|
26
|
+
##
|
27
|
+
# For more information, please refer to <b>String#es_stem</b> method, also <b>EStem</b>.
|
28
|
+
# :method: estem
|
29
|
+
|
30
|
+
##
|
31
|
+
#This method stem Spanish words.
|
32
|
+
#
|
33
|
+
# "albergues".es_stem # ==> "alberg"
|
34
|
+
# "habitaciones".es_stem # ==> "habit"
|
35
|
+
# "ALbeRGues".es_stem # ==> "ALbeRG"
|
36
|
+
# "HaBiTaCiOnEs".es_stem # ==> "HaBiT"
|
37
|
+
# "Hacinamiento".es_stem # ==> "Hacin"
|
38
|
+
#
|
39
|
+
#If you are not aware of the codeset the data have, try using
|
40
|
+
#String#safe_es_stem instead.
|
41
|
+
#
|
42
|
+
#:call-seq:
|
43
|
+
# str.es_stem => "new_str"
|
44
|
+
def es_stem
|
45
|
+
str = self.dup
|
46
|
+
case str.length
|
47
|
+
when 0
|
48
|
+
return str
|
49
|
+
when 1
|
50
|
+
return remove_accent(str)
|
51
|
+
end
|
52
|
+
|
53
|
+
step0(str)
|
54
|
+
unless step1(str)
|
55
|
+
step2b(str) unless step2a(str)
|
56
|
+
end
|
57
|
+
|
58
|
+
step3(str)
|
59
|
+
remove_accent(str)
|
60
|
+
end
|
61
|
+
|
62
|
+
##
|
63
|
+
#Use this method in case you are not aware of the codeset the data being
|
64
|
+
#handle have. This method returns a new string with the same codeset as
|
65
|
+
#the original. Be aware that this method is a bit slower than String#es_stem
|
66
|
+
#:call-seq:
|
67
|
+
# str.safe_es_stem => "new_str"
|
68
|
+
def safe_es_stem
|
69
|
+
if str.encoding == Encoding::UTF_8
|
70
|
+
# remove invalid characters
|
71
|
+
return self.chars.select{|c| c.valid_encoding? }.join.es_stem
|
72
|
+
end
|
73
|
+
|
74
|
+
unless self.valid_encoding?
|
75
|
+
tmp = self.dup
|
76
|
+
if tmp.force_encoding('UTF-8').valid_encoding?
|
77
|
+
begin
|
78
|
+
return tmp.es_stem
|
79
|
+
rescue
|
80
|
+
end
|
81
|
+
end
|
82
|
+
end
|
83
|
+
|
84
|
+
default_enc = self.encoding.name
|
85
|
+
str = self.chars.select{|c| c.valid_encoding? }.join
|
86
|
+
|
87
|
+
return nil if str.empty?
|
88
|
+
|
89
|
+
begin
|
90
|
+
tmp = str.encode('UTF-8', str.encoding.name).es_stem
|
91
|
+
return tmp.encode(default_enc, 'UTF-8');
|
92
|
+
rescue
|
93
|
+
return nil
|
94
|
+
end
|
95
|
+
end
|
96
|
+
|
97
|
+
# :stopdoc:
|
98
|
+
|
99
|
+
private
|
100
|
+
|
101
|
+
def vowel?(c)
|
102
|
+
VOWEL.include?(c)
|
103
|
+
end
|
104
|
+
|
105
|
+
def consonant?(c)
|
106
|
+
CONSONANT.include?(c)
|
107
|
+
end
|
108
|
+
|
109
|
+
def remove_accent(str)
|
110
|
+
str.tr('áéíóúÁÉÍÓÚ','aeiouAEIOU')
|
111
|
+
end
|
112
|
+
|
113
|
+
def rv(str)
|
114
|
+
if consonant? str[1]
|
115
|
+
i=2
|
116
|
+
i+=1 while str[i] and consonant? str[i]
|
117
|
+
return str.nil? ? str.length-1 : i+1
|
118
|
+
end
|
119
|
+
|
120
|
+
if vowel? str[0] and vowel? str[1]
|
121
|
+
i=2
|
122
|
+
i+=1 while str[i] and vowel? str[i]
|
123
|
+
return str.nil? ? str.length-1 : i+1
|
124
|
+
end
|
125
|
+
|
126
|
+
return 3 if consonant? str[0] and vowel? str[1]
|
127
|
+
|
128
|
+
str.length - 1
|
129
|
+
end
|
130
|
+
|
131
|
+
def r(str, i=0)
|
132
|
+
i+=1 while str[i] and consonant?(str[i])
|
133
|
+
i+=1
|
134
|
+
i+=1 while str[i] and vowel? str[i]
|
135
|
+
str[i].nil? ? str.length : i+1
|
136
|
+
end
|
137
|
+
|
138
|
+
def r12(str)
|
139
|
+
r1 = r(str)
|
140
|
+
r2 = r(str,r1)
|
141
|
+
[r1,r2]
|
142
|
+
end
|
143
|
+
|
144
|
+
#=> true or false
|
145
|
+
def step0(str)
|
146
|
+
return false unless str =~ /(se(l[ao]s?)?|l([aeo]s?)|me|nos)$/i
|
147
|
+
|
148
|
+
suffix = $&
|
149
|
+
rv_text = str[rv(str)..-1]
|
150
|
+
|
151
|
+
case rv_text
|
152
|
+
when %r{((?<=i[éÉ]ndo|[áÁ]ndo|[áéíÁÉÍ]r)#{suffix})$}ui
|
153
|
+
str[%r{#$&$}]=''
|
154
|
+
str.replace(remove_accent(str))
|
155
|
+
return true
|
156
|
+
when %r{((?<=iendo|ando|[aei]r)#{suffix})$}i
|
157
|
+
str[%r{#$&$}]=''
|
158
|
+
return true
|
159
|
+
end
|
160
|
+
|
161
|
+
if rv_text =~ /yendo/i and str =~ /uyendo/i
|
162
|
+
str[suffix]=''
|
163
|
+
return true
|
164
|
+
end
|
165
|
+
false
|
166
|
+
end
|
167
|
+
|
168
|
+
#=> true or false
|
169
|
+
def step1(str)
|
170
|
+
r1,r2 = r12(str)
|
171
|
+
r1_text = str[r1..-1]
|
172
|
+
r2_text = str[r2..-1]
|
173
|
+
|
174
|
+
case r2_text
|
175
|
+
when /(anzas?|ic[oa]s?|ismos?|[ai]bles?|istas?|os[oa]s?|[ai]mientos?)$/i
|
176
|
+
str[%r{#$&$}]=''
|
177
|
+
return true
|
178
|
+
when /(ic)?(ador([ae]s?)?|aci[óÓ]n|aciones|antes?|ancias?)$/ui
|
179
|
+
str[%r{#$&$}]=''
|
180
|
+
return true
|
181
|
+
when /log[íÍ]as?/ui
|
182
|
+
str[%r{#$&$}]='log'
|
183
|
+
return true
|
184
|
+
when /(uci([óÓ]n|ones))$/ui
|
185
|
+
str[%r{#$&$}]='u'
|
186
|
+
return true
|
187
|
+
when /(encias?)$/i
|
188
|
+
str[%r{#$&$}]='ente'
|
189
|
+
return true
|
190
|
+
end
|
191
|
+
|
192
|
+
if r2_text =~ /(ativ|iv|os|ic|ad)amente$/i or r1_text =~ /amente$/i
|
193
|
+
str[%r{#$&$}]=''
|
194
|
+
return true
|
195
|
+
end
|
196
|
+
|
197
|
+
case r2_text
|
198
|
+
when /((ante|[ai]ble)?mente)$/i, /((abil|i[cv])?idad(es)?)$/i, /((at)?iv[ao]s?)$/i
|
199
|
+
str[%r{#$&$}]=''
|
200
|
+
return true
|
201
|
+
end
|
202
|
+
false
|
203
|
+
end
|
204
|
+
|
205
|
+
#=> true or false
|
206
|
+
def step2a(str)
|
207
|
+
rv_pos = rv(str)
|
208
|
+
idx = str[rv_pos..-1] =~ /(y[oóÓ]|ye(ron|ndo)|y[ae][ns]?|ya(is|mos))$/ui
|
209
|
+
|
210
|
+
return false unless idx
|
211
|
+
|
212
|
+
if 'u' == str[rv_pos+idx-1].downcase
|
213
|
+
str[%r{#$&$}] = ''
|
214
|
+
return true
|
215
|
+
end
|
216
|
+
false
|
217
|
+
end
|
218
|
+
|
219
|
+
STEP2B_REGEXP = /(
|
220
|
+
ar([áÁ][ns]?|a(n|s|is)?|on)? | ar([éÉ]is|emos|é|É) | ar[íÍ]a(n|s|is|mos)? |
|
221
|
+
er([áÁ][sn]?|[éÉ](is)?|emos|[íÍ]a(n|s|is|mos)?)? |
|
222
|
+
ir([íÍ]a(s|n|is|mos)?|[áÁ][ns]?|emos|[éÉ]|éis)? | aba(s|n|is)? |
|
223
|
+
ad([ao]s?)? | ed | id(a|as|o|os)? | [íÍ]a(n|s|is|mos)? | [íÍ]s |
|
224
|
+
as(e[ns]?|te|eis|teis)? | [áÁ](is|bamos|semos|ramos) | a(n|ndo|mos) |
|
225
|
+
ie(ra|se|ran|sen|ron|ndo|ras|ses|rais|seis) | i(ste|steis|[óÓ]|mos|[éÉ]ramos|[éÉ]semos) |
|
226
|
+
en|es|[éÉ]is|emos
|
227
|
+
)$/xiu
|
228
|
+
|
229
|
+
#=> true or false
|
230
|
+
def step2b(str)
|
231
|
+
rv_pos = rv(str)
|
232
|
+
|
233
|
+
if idx = str[rv_pos..-1] =~ STEP2B_REGEXP
|
234
|
+
suffix = $&
|
235
|
+
if suffix =~ /^(en|es|[éÉ]is|emos)$/ui
|
236
|
+
str[%r{#{suffix}$}]=''
|
237
|
+
str[rv_pos+idx-1]='' if str[rv_pos+idx-2] =~ /g/i and str[rv_pos+idx-1] =~ /u/i
|
238
|
+
else
|
239
|
+
str[%r{#{suffix}$}]=''
|
240
|
+
end
|
241
|
+
return true
|
242
|
+
end
|
243
|
+
false
|
244
|
+
end
|
245
|
+
|
246
|
+
#=> true or false
|
247
|
+
def step3(str)
|
248
|
+
rv_pos = rv(str)
|
249
|
+
rv_text = str[rv_pos..-1]
|
250
|
+
|
251
|
+
if rv_text =~ /(os|[aoáíóÁÍÓ])$/ui
|
252
|
+
str[%r{#$&$}]=''
|
253
|
+
return true
|
254
|
+
elsif idx = rv_text =~ /(u?[eéÉ])$/i
|
255
|
+
if $&[0].downcase == 'u' and str[rv_pos+idx-1].downcase == 'g'
|
256
|
+
str[%r{#$&$}]=''
|
257
|
+
else
|
258
|
+
str.chop!
|
259
|
+
end
|
260
|
+
return true
|
261
|
+
end
|
262
|
+
false
|
263
|
+
end
|
264
|
+
|
265
|
+
VOWEL = 'aeiouáéíóúüAEIOUÁÉÍÓÚÜ'
|
266
|
+
CONSONANT = "bcdfghjklmnñpqrstvwxyzABCDEFGHIJKLMNÑOPQRSTUVWXYZ"
|
267
|
+
end
|
268
|
+
|
269
|
+
class String
|
270
|
+
include EStem
|
271
|
+
end
|
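Note on remove_accent: the helper shown in estem.rb~ maps acute-accented vowels to plain vowels with String#tr and leaves every other character alone, including ü and ñ. The sketch below reproduces that one call outside the module. The method name `strip_accents` and the sample words are assumptions made for illustration.

```ruby
# Same character sets as the tr call in remove_accent above.
ACCENTED = 'áéíóúÁÉÍÓÚ'
PLAIN    = 'aeiouAEIOU'

# Hypothetical standalone name; the gem defines this as a private remove_accent.
def strip_accents(str)
  # tr returns a new string; the receiver is left untouched.
  str.tr(ACCENTED, PLAIN)
end

puts strip_accents("habitación")  # habitacion
puts strip_accents("ÁRBOL")       # ARBOL
puts strip_accents("pingüino")    # pingüino (ü is not in the set)
```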