estem 0.2.3 → 0.2.4

data/COPYRIGHT ADDED
@@ -0,0 +1,19 @@
1
+ Copyright (c) 2012 Manuel A. Güílamo
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining a copy of
4
+ this software and associated documentation files (the "Software"), to deal in
5
+ the Software without restriction, including without limitation the rights to
6
+ use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
7
+ of the Software, and to permit persons to whom the Software is furnished to do
8
+ so, subject to the following conditions:
9
+
10
+ The above copyright notice and this permission notice shall be included in all
11
+ copies or substantial portions of the Software.
12
+
13
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
14
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
15
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
16
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
17
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
18
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
19
+ SOFTWARE.
data/ChangeLog ADDED
@@ -0,0 +1,27 @@
1
+ Version 0.2.4
2
+
3
+ 2012-06-25 MaG <maguilamo.c@gmail.com>
4
+
5
+ *
6
+ - ChangeLog new file.
7
+ - fix README.rdoc added.
8
+ - fix COPYRIGHT added.
9
+ - examples/usage.rb: new file
10
+
11
+ * README.rdoc:
12
+ - max 80 cols per line.
13
+ - recommendation about using safe_es_stem().
14
+ - Fix Spanish typos.
15
+
16
+ * estem.gemspec:
17
+ - cleanups.
18
+ - (required_ruby_version): Ruby 1.9.1.
19
+
20
+ * bin/es_stem.rb:
21
+ - chmod a+x .
22
+ - (es_stem.rb:80): fix case sensitive comparison.
23
+ - (es_stem.rb:25): removed .rb ext.
24
+ - (es_stem.rb:29): new version.
25
+
26
+ * estem.rb:
27
+ - (safe_es_stem): new method.
data/README.rdoc ADDED
@@ -0,0 +1,130 @@
1
+ = Spanish Stem Gem
2
+
3
+ == Description
4
+ This gem is for reducing Spanish words to their roots. It uses an algorithm
5
+ based on Martin Porter's specifications.
6
+
7
+ For more information, visit:
8
+ http://snowball.tartarus.org/algorithms/spanish/stemmer.html
9
+
10
+ == Descripción
11
+ Esta gema está para reducir las palabras del Español en sus respectivas raíces,
12
+ para ello utiliza un algoritmo basado en las especificaciones de Martin Porter.
13
+
14
+ Para más información, visite:
15
+ http://snowball.tartarus.org/algorithms/spanish/stemmer.html
16
+
17
+ == Install -- Instalar
18
+
19
+ $ sudo gem install estem
20
+ or
21
+ $ gem install estem
22
+
23
+ == Usage
24
+ As a reminder, keep in mind that the Spanish language has several
25
+ non-US-ASCII characters, so the same data may vary from one codeset
26
+ to another.
27
+
28
+ Please remember to use a UTF-8 compatible encoding while using EStem. Do not
29
+ use String#force_encoding() to convert from one codeset to another. You might
30
+ try String#encode(), but it is more likely to fail. Consider using
31
+ String#safe_es_stem() when handling incompatible codesets or when the codeset
32
+ type is unknown.
33
+
34
+ require 'estem'
35
+
36
+ puts "albergues".es_stem # ==> "alberg"
37
+ puts "habitaciones".es_stem # ==> "habit"
38
+
39
+ # EStem will never make unnecessary changes to your input data.
40
+ puts "ALbeRGues".es_stem # ==> "ALbeRG"
41
+ puts "HaBiTaCiOnEs".es_stem # ==> "HaBiT"
42
+ puts "Hacinamiento".es_stem # ==> "Hacin"
43
+
44
+ You can use <tt>EStem</tt> as a command line tool:
45
+ $ es_stem --in-enc ISO-8859-1 -f input_file.txt
46
+
47
+ for more information type
48
+ $ es_stem --help
49
+
50
+ The <tt>es_stem</tt> program does its best to tokenize the lines from the
51
+ file. You might consider finding a Spanish tokenizer; either way, this
52
+ program does what it is supposed to do: stem Spanish words.
53
+
54
+ NOTE: For best results, consider placing one word per line in the files
55
+ the program handles.
56
+
57
+ == Uso
58
+ Como recordatorio, ten en consideración que el Castellano posee muchos
59
+ caracteres que están fuera del código ASCII, y por esta razón, los datos pueden
60
+ variar de un conjunto de codificación a otro.
61
+
62
+ Por favor recuerda utilizar sistemas de codificación compatibles con UTF-8
63
+ cuando se trabaje con EStem. Por favor no use String#force_encoding para
64
+ convertir de un conjunto de codificación a otro, podría utilizar String#encode
65
+ pero este último es más probable que falle en el intento, considere utilizar
66
+ String#safe_es_stem() si está manejando conjuntos de codificación incompatibles
67
+ o se desconoce el tipo.
68
+
69
+ require 'estem'
70
+
71
+ puts "albergues".es_stem # ==> "alberg"
72
+ puts "habitaciones".es_stem # ==> "habit"
73
+
74
+ # EStem nunca hará cambios innecesarios a tus datos.
75
+ puts "ALbeRGues".es_stem # ==> "ALbeRG"
76
+ puts "HaBiTaCiOnEs".es_stem # ==> "HaBiT"
77
+ puts "Hacinamiento".es_stem # ==> "Hacin"
78
+
79
+ Para más información ejecuta:
80
+ $ es_stem --help
81
+
82
+ El programa <tt>es_stem</tt> hará lo posible para separar las palabras de cada
83
+ línea del fichero. Sería sensato utilizar otro programa más especializado para
84
+ este propósito, de todas maneras, es_stem hace lo que se supone debe hacer,
85
+ obtener las raíces de las palabras.
86
+
87
+ NOTA: Para resultados excelentes, considere poner una palabra por línea en los
88
+ ficheros que pasará el programa.
89
+
90
+ == Test
91
+
92
+ This test is based on the sample input and output text from Martin Porter's
93
+ website. It includes 28390 test words and their expected stem results.
94
+ To run the test, just type:
95
+ rake test
96
+
97
+ == Pruebas
98
+
99
+ Esta prueba está basada en un archivo de prueba provisto por Martin Porter.
100
+ Incluye 28390 palabras de prueba con sus resultados esperados. Para realizar
101
+ la prueba, ejecuta:
102
+ rake test
103
+
104
+ == Thanks -- Agradecimientos
105
+
106
+ Ray Pereda (https://github.com/raypereda/stemmify/): I used his gem as a guide
107
+ to package mine, along with http://guides.rubygems.org/make-your-own-gem/.
108
+
109
+ == License -- Licencia
110
+
111
+ Copyright (c) 2012 Manuel A. Güílamo
112
+
113
+ Permission is hereby granted, free of charge, to any person obtaining
114
+ a copy of this software and associated documentation files (the
115
+ "Software"), to deal in the Software without restriction, including
116
+ without limitation the rights to use, copy, modify, merge, publish,
117
+ distribute, sublicense, and/or sell copies of the Software, and to
118
+ permit persons to whom the Software is furnished to do so, subject to
119
+ the following conditions:
120
+
121
+ The above copyright notice and this permission notice shall be
122
+ included in all copies or substantial portions of the Software.
123
+
124
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
125
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
126
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
127
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
128
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
129
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
130
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
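Note on the encoding advice in README.rdoc above: the difference between String#force_encoding and String#encode can be shown with the standard library alone. What follows is a minimal sketch and not part of the gem; the sample word and the variable names are illustrative assumptions.

```ruby
# "más" stored as ISO-8859-1 bytes: the "á" is the single byte 0xE1.
latin1 = "m\xE1s".dup.force_encoding(Encoding::ISO_8859_1)

# force_encoding only relabels the existing bytes; nothing is converted,
# so the lone 0xE1 byte is not a valid UTF-8 sequence.
relabeled = latin1.dup.force_encoding(Encoding::UTF_8)
relabeled_valid = relabeled.valid_encoding?

# encode transcodes the bytes into the target encoding.
transcoded = latin1.encode(Encoding::UTF_8)
transcoded_valid = transcoded.valid_encoding?
same_text = (transcoded == "m\u00E1s")

puts "relabeled valid?  #{relabeled_valid}"
puts "transcoded valid? #{transcoded_valid}"
puts "same text?        #{same_text}"
```

Relabeling leaves an invalid UTF-8 string here, which is why the README warns against using force_encoding as a conversion tool.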
data/bin/es_stem.rb CHANGED
@@ -1,5 +1,6 @@
1
1
  #!/usr/bin/env ruby
2
2
  # encoding: UTF-8
3
+ # :stopdoc:
3
4
 
4
5
  # Copyright (c) 2012 Manuel A. Güílamo
5
6
  #
@@ -21,11 +22,11 @@
21
22
  # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
22
23
  # SOFTWARE.
23
24
 
24
- require 'estem.rb'
25
+ require 'estem'
25
26
  require 'getoptlong'
26
27
  require 'iconv'
27
28
 
28
- $version = "0.1.9"
29
+ $version = "0.1.10"
29
30
 
30
31
  def usage(error=false)
31
32
  out = error ? $stderr : $stdout
@@ -76,7 +77,7 @@ end
76
77
 
77
78
  if filename
78
79
  begin
79
- if ienc and ienc!='UTF-8'
80
+ if ienc and ienc.upcase !='UTF-8'
80
81
  file = File.open(filename, "r:#{ienc}:UTF-8")
81
82
  else
82
83
  file = File.open(filename, 'r:UTF-8')
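Note on the comparison change at es_stem.rb:80: calling upcase makes the check against 'UTF-8' case-insensitive. An alternative, shown here only as a sketch and not as what the gem does, is to let Ruby resolve the name with Encoding.find, which accepts encoding names in any letter case. The helper name needs_transcoding? is hypothetical.

```ruby
# Hypothetical helper (not part of es_stem): true when the named input
# encoding is something other than UTF-8. Encoding.find resolves names
# case-insensitively and raises ArgumentError for unknown names.
def needs_transcoding?(name)
  Encoding.find(name) != Encoding::UTF_8
end

puts needs_transcoding?('utf-8')
puts needs_transcoding?('ISO-8859-1')
```

One design consequence: an unknown encoding name fails early with ArgumentError instead of surfacing later when the file is opened.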
data/examples/usage.rb ADDED
@@ -0,0 +1,11 @@
1
+ require 'estem'
2
+
3
+ hsh = Hash.new
4
+
5
+ words = ['albergues','habitaciones','Albergues','ALbeRGues','HaBiTaCiOnEs',
6
+ 'Hacinamiento','mujeres','muchedumbre','ocasionalmente']
7
+
8
+ words.each do|w|
9
+ stem = w.es_stem
10
+ puts "Word: #{w}\nStem: #{stem}\n\n"
11
+ end
data/lib/estem.rb CHANGED
@@ -19,27 +19,30 @@
19
19
  # This code is provided under the terms of the {MIT License.}[http://www.opensource.org/licenses/mit-license.php]
20
20
  #
21
21
  # = Authors
22
- # * Manuel A. Güílamo
22
+ # * Manuel A. Güílamo maguilamo.c@gmail.com
23
23
  #
24
24
 
25
+ require 'iconv'
26
+
25
27
  module EStem
26
28
  ##
29
+ # For more information, please refer to <b>String#es_stem</b> method, also <b>EStem</b>.
27
30
  # :method: estem
28
- # For more information, please see <b>String#es_stem</b> method, also <b>EStem</b>.
29
-
30
-
31
- ##
32
- #This method stem Spanish words.
33
- #
34
- # "albergues".es_stem # ==> "alberg"
35
- # "habitaciones".es_stem # ==> "habit"
36
- # "ALbeRGues".es_stem # ==> "ALbeRG"
37
- # "HaBiTaCiOnEs".es_stem # ==> "HaBiT"
38
- # "Hacinamiento".es_stem # ==> "Hacin"
39
- #
40
- #:call-seq:
41
- # str.es_stem => "new_str"
42
31
 
32
+ ##
33
+ #This method stems Spanish words.
34
+ #
35
+ # "albergues".es_stem # ==> "alberg"
36
+ # "habitaciones".es_stem # ==> "habit"
37
+ # "ALbeRGues".es_stem # ==> "ALbeRG"
38
+ # "HaBiTaCiOnEs".es_stem # ==> "HaBiT"
39
+ # "Hacinamiento".es_stem # ==> "Hacin"
40
+ #
41
+ #If you do not know the codeset of the data, use
42
+ #String#safe_es_stem instead.
43
+ #
44
+ #:call-seq:
45
+ # str.es_stem => "new_str"
43
46
  def es_stem
44
47
  str = self.dup
45
48
  return remove_accent(str) if str.length == 1
@@ -59,6 +62,42 @@ module EStem
59
62
  remove_accent(str)
60
63
  end
61
64
 
65
+ ##
66
+ #Use this method when you do not know the codeset of the data being
67
+ #handled. It returns a new string, normally in the same codeset as the
68
+ #original, or nil if conversion fails. It is slower than String#es_stem().
69
+ #:call-seq:
70
+ # str.safe_es_stem => "new_str"
71
+ def safe_es_stem
72
+ return self.es_stem if self.encoding == Encoding::UTF_8
73
+
74
+ default_enc = self.encoding.name
75
+
76
+ str = self.dup.force_encoding('UTF-8')
77
+
78
+ if str.valid_encoding?
79
+ begin
80
+ tmp = str.es_stem
81
+ return tmp.force_encoding(default_enc)
82
+ rescue
83
+ end
84
+ end
85
+
86
+ if enc = Encoding.compatible?(self, VOWEL)
87
+ begin
88
+ return self.encode(enc).es_stem
89
+ rescue
90
+ end
91
+ end
92
+
93
+ begin
94
+ tmp = Iconv.conv('UTF-8', self.encoding.name, self).es_stem
95
+ return Iconv.conv(default_enc, 'UTF-8', tmp);
96
+ rescue
97
+ return nil
98
+ end
99
+ end
100
+
62
101
  # :stopdoc:
63
102
 
64
103
  private
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: estem
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.2.3
4
+ version: 0.2.4
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,7 +9,7 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2012-05-20 00:00:00.000000000 Z
12
+ date: 2012-06-25 00:00:00.000000000 Z
13
13
  dependencies: []
14
14
  description: Spanish stemming. Based on Martin Porter's specifications. See README
15
15
  file for more information.
@@ -21,7 +21,10 @@ files:
21
21
  - Rakefile
22
22
  - bin/es_stem.rb
23
23
  - lib/estem.rb
24
- - lib/estem.rb~
24
+ - examples/usage.rb
25
+ - COPYRIGHT
26
+ - README.rdoc
27
+ - ChangeLog
25
28
  - test/diffs.txt
26
29
  - test/test_estem.rb
27
30
  homepage: https://github.com/MaG21/estem
@@ -35,7 +38,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
35
38
  requirements:
36
39
  - - ! '>='
37
40
  - !ruby/object:Gem::Version
38
- version: 1.9.2
41
+ version: 1.9.1
39
42
  required_rubygems_version: !ruby/object:Gem::Requirement
40
43
  none: false
41
44
  requirements:
data/lib/estem.rb~ DELETED
@@ -1,233 +0,0 @@
1
- # encoding: UTF-8
2
- #
3
- # :title: Spanish Stemming
4
- # = Description
5
- # This gem is for reducing Spanish words to their roots. It uses an algorithm
6
- # based on Martin Porter's specifications.
7
- #
8
- # For more information, visit:
9
- # http://snowball.tartarus.org/algorithms/spanish/stemmer.html
10
- #
11
- # = Descripción
12
- # Esta gema está para reducir las palabras del Español en sus respectivas raíces,
13
- # para ello ultiliza un algoritmo basado en las especificaciones de Martin Porter
14
- #
15
- # Para más información, visite:
16
- # http://snowball.tartarus.org/algorithms/spanish/stemmer.html
17
- #
18
- # = License -- Licencia
19
- # This code is provided under the terms of the {MIT License.}[http://www.opensource.org/licenses/mit-license.php]
20
- #
21
- # = Authors
22
- # * Manuel A. Güílamo
23
- #
24
-
25
- module EStem
26
- ##
27
- # :method: estem
28
- # For more information, please see <b>String#es_stem</b> method, also <b>EStem</b>.
29
-
30
-
31
- ##
32
- #This method reduces Spanish words to their root.
33
- #
34
- # "albergues".es_stem # ==> "alberg"
35
- # "habitaciones".es_stem # ==> "habit"
36
- # "ALbeRGues".es_stem # ==> "ALbeRG"
37
- # "HaBiTaCiOnEs".es_stem # ==> "HaBiT"
38
- # "Hacinamiento".es_stem # ==> "Hacin"
39
- #
40
- #:call-seq:
41
- # str.es_stem => "new_str"
42
-
43
- def es_stem
44
- str = self.dup
45
- return remove_accent(str) if str.length == 1
46
- tmp = step0(str)
47
- str = tmp ? tmp : str
48
-
49
- unless tmp = step1(str)
50
- unless tmp = step2a(str)
51
- tmp = step2b(str)
52
- str = tmp ? tmp : str
53
- else
54
- str = tmp
55
- end
56
- end
57
- tmp = step3(str)
58
- str = tmp.nil? ? str : tmp
59
- remove_accent(str)
60
- end
61
-
62
- # :stopdoc:
63
-
64
- private
65
-
66
- def vowel?(c)
67
- VOWEL.include?(c)
68
- end
69
-
70
- def consonant?(c)
71
- CONSONANT.include?(c)
72
- end
73
-
74
- def remove_accent(str)
75
- str.tr('áéíóúÁÉÍÓÚ','aeiouAEIOU')
76
- end
77
-
78
- def rv(str)
79
- if consonant? str[1]
80
- i=2
81
- i+=1 while str[i] and consonant? str[i]
82
- return str.nil? ? str.length-1 : i+1
83
- end
84
-
85
- if vowel? str[0] and vowel? str[1]
86
- i=2
87
- i+=1 while str[i] and vowel? str[i]
88
- return str.nil? ? str.length-1 : i+1
89
- end
90
-
91
- return 3 if consonant? str[0] and vowel? str[1]
92
-
93
- str.length - 1
94
- end
95
-
96
- def r(str, i=0)
97
- i+=1 while str[i] and consonant?(str[i])
98
- i+=1
99
- i+=1 while str[i] and vowel? str[i]
100
- str[i].nil? ? str.length : i+1
101
- end
102
-
103
- def r12(str)
104
- r1 = r(str)
105
- r2 = r(str,r1)
106
- [r1,r2]
107
- end
108
-
109
- def step0(str)
110
- return nil unless str =~ /(se(l[ao]s?)?|l([aeo]s?)|me|nos)$/i
111
-
112
- suffix = $&
113
- rv_text = str[rv(str)..-1]
114
-
115
- case rv_text
116
- when %r{((?<=i[éÉ]ndo|[áÁ]ndo|[áéíÁÉÍ]r)#{suffix})$}ui
117
- str[%r{#$&$}]=''
118
- str = remove_accent(str)
119
- return str
120
- when %r{((?<=iendo|ando|[aei]r)#{suffix})$}i
121
- str[%r{#$&$}]=''
122
- return str
123
- end
124
-
125
- if rv_text =~ /yendo/i and str =~ /uyendo/i
126
- str[suffix]=''
127
- return str
128
- end
129
- nil
130
- end
131
-
132
- #=> new_str or nil
133
- def step1(str)
134
- r1,r2 = r12(str)
135
- r1_text = str[r1..-1]
136
- r2_text = str[r2..-1]
137
-
138
- case r2_text
139
- when /(anzas?|ic[oa]s?|ismos?|[ai]bles?|istas?|os[oa]s?|[ai]mientos?)$/i
140
- str[%r{#$&$}]=''
141
- return str
142
- when /(ic)?(ador([ae]s?)?|aci[óÓ]n|aciones|antes?|ancias?)$/ui
143
- str[%r{#$&$}]=''
144
- return str
145
- when /log[íÍ]as?/ui
146
- str[%r{#$&$}]='log'
147
- return str
148
- when /(uci([óÓ]n|ones))$/ui
149
- str[%r{#$&$}]='u'
150
- return str
151
- when /(encias?)$/i
152
- str[%r{#$&$}]='ente'
153
- return str
154
- end
155
-
156
- if r2_text =~ /(ativ|iv|os|ic|ad)amente$/i or r1_text =~ /amente$/i
157
- str[%r{#$&$}]=''
158
- return str
159
- end
160
-
161
- case r2_text
162
- when /((ante|[ai]ble)?mente)$/i, /((abil|i[cv])?idad(es)?)$/i, /((at)?iv[ao]s?)$/i
163
- str[%r{#$&$}]=''
164
- return str
165
- end
166
- nil
167
- end
168
-
169
- #=> nil or new_str
170
- def step2a(str)
171
- rv_pos = rv(str)
172
- idx = str[rv_pos..-1] =~ /(y[oóÓ]|ye(ron|ndo)|y[ae][ns]?|ya(is|mos))$/ui
173
-
174
- return nil unless idx
175
-
176
- if 'u' == str[rv_pos+idx-1].downcase
177
- str[%r{#$&$}] = ''
178
- return str
179
- end
180
- nil
181
- end
182
-
183
- STEP2B_REGEXP = /(
184
- ar([áÁ][ns]?|a(n|s|is)?|on)? | ar([éÉ]is|emos|é|É) | ar[íÍ]a(n|s|is|mos)? |
185
- er([áÁ][sn]?|[éÉ](is)?|emos|[íÍ]a(n|s|is|mos)?)? |
186
- ir([íÍ]a(s|n|is|mos)?|[áÁ][ns]?|emos|[éÉ]|éis)? | aba(s|n|is)? |
187
- ad([ao]s?)? | ed | id(a|as|o|os)? | [íÍ]a(n|s|is|mos)? | [íÍ]s |
188
- as(e[ns]?|te|eis|teis)? | [áÁ](is|bamos|semos|ramos) | a(n|ndo|mos) |
189
- ie(ra|se|ran|sen|ron|ndo|ras|ses|rais|seis) | i(ste|steis|[óÓ]|mos|[éÉ]ramos|[éÉ]semos) |
190
- en|es|[éÉ]is|emos
191
- )$/xiu
192
-
193
- def step2b(str)
194
- rv_pos = rv(str)
195
-
196
- if idx = str[rv_pos..-1] =~ STEP2B_REGEXP
197
- suffix = $&
198
- if suffix =~ /^(en|es|[éÉ]is|emos)$/ui
199
- str[%r{#{suffix}$}]=''
200
- str[rv_pos+idx-1]='' if str[rv_pos+idx-2] =~ /g/i and str[rv_pos+idx-1] =~ /u/i
201
- else
202
- str[%r{#{suffix}$}]=''
203
- end
204
- return str
205
- end
206
- nil
207
- end
208
-
209
- def step3(str)
210
- rv_pos = rv(str)
211
- rv_text = str[rv_pos..-1]
212
-
213
- if rv_text =~ /(os|[aoáíóÁÍÓ])$/ui
214
- str[%r{#$&$}]=''
215
- return str
216
- elsif idx = rv_text =~ /(u?[eéÉ])$/i
217
- if $&[0].downcase == 'u' and str[rv_pos+idx-1].downcase == 'g'
218
- str[%r{#$&$}]=''
219
- else
220
- str.chop!
221
- end
222
- return str
223
- end
224
- nil
225
- end
226
-
227
- VOWEL = 'aeiouáéíóúüAEIOUÁÉÍÓÚÜ'
228
- CONSONANT = "bcdfghjklmnñpqrstvwxyzABCDEFGHIJKLMNÑOPQRSTUVWXYZ"
229
- end
230
-
231
- class String
232
- include EStem
233
- end