estem 0.2.3 → 0.2.4

data/COPYRIGHT ADDED
@@ -0,0 +1,19 @@
1
+ Copyright (c) 2012 Manuel A. Güílamo
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining a copy of
4
+ this software and associated documentation files (the "Software"), to deal in
5
+ the Software without restriction, including without limitation the rights to
6
+ use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
7
+ of the Software, and to permit persons to whom the Software is furnished to do
8
+ so, subject to the following conditions:
9
+
10
+ The above copyright notice and this permission notice shall be included in all
11
+ copies or substantial portions of the Software.
12
+
13
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
14
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
15
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
16
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
17
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
18
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
19
+ SOFTWARE.
data/ChangeLog ADDED
@@ -0,0 +1,27 @@
1
+ Version 0.2.4
2
+
3
+ 2012-06-25 MaG <maguilamo.c@gmail.com>
4
+
5
+ *
6
+ - ChangeLog new file.
7
+ - fix README.rdoc added.
8
+ - fix COPYRIGHT added.
9
+ - examples/usage.rb: new file
10
+
11
+ * README.rdoc:
12
+ - max 80 cols per line.
13
+ - recommendation about using safe_es_stem().
14
+ - Fix Spanish typos.
15
+
16
+ * estem.gemspec:
17
+ - cleanups.
18
+ - (required_ruby_version): Ruby 1.9.1.
19
+
20
+ * bin/es_stem.rb:
21
+ - chmod a+x .
22
+ - (es_stem.rb:80): fix case-sensitive comparison.
23
+ - (es_stem.rb:25): removed .rb ext.
24
+ - (es_stem.rb:29): new version.
25
+
26
+ * estem.rb:
27
+ - (safe_es_stem): new method.
data/README.rdoc ADDED
@@ -0,0 +1,130 @@
1
+ = Spanish Stem Gem
2
+
3
+ == Description
4
+ This gem is for reducing Spanish words to their roots. It uses an algorithm
5
+ based on Martin Porter's specifications.
6
+
7
+ For more information, visit:
8
+ http://snowball.tartarus.org/algorithms/spanish/stemmer.html
9
+
10
+ == Descripción
11
+ Esta gema está para reducir las palabras del Español en sus respectivas raíces,
12
+ para ello utiliza un algoritmo basado en las especificaciones de Martin Porter.
13
+
14
+ Para más información, visite:
15
+ http://snowball.tartarus.org/algorithms/spanish/stemmer.html
16
+
17
+ == Install -- Instalar
18
+
19
+ $ sudo gem install estem
20
+ or
21
+ $ gem install estem
22
+
23
+ == Usage
24
+ As a reminder, keep in mind that the Spanish language has several non
25
+ US-ASCII characters, and because of that, the same data may vary from one
26
+ codeset to another.
27
+
28
+ Please remember to use a UTF-8 compatible encoding while using EStem. Please do
29
+ not use String#force_encoding() to convert from one codeset to another. You
30
+ might try String#encode(), but the latter is more likely to fail. Consider
31
+ using String#safe_es_stem() when handling incompatible codesets or when the
32
+ codeset is unknown.
33
+
34
+ require 'estem'
35
+
36
+ puts "albergues".es_stem # ==> "alberg"
37
+ puts "habitaciones".es_stem # ==> "habit"
38
+
39
+ # EStem will never make unnecessary changes to your input data.
40
+ puts "ALbeRGues".es_stem # ==> "ALbeRG"
41
+ puts "HaBiTaCiOnEs".es_stem # ==> "HaBiT"
42
+ puts "Hacinamiento".es_stem # ==> "Hacin"
43
+
44
+ You can use <tt>EStem</tt> as a command line tool:
45
+ $ es_stem --in-enc ISO-8859-1 -f input_file.txt
46
+
47
+ For more information, type:
48
+ $ es_stem --help
49
+
50
+ The <tt>es_stem</tt> program does its best to tokenize the lines from the
51
+ file, but you might prefer a dedicated Spanish tokenizer. Either way, this
52
+ program does what it is supposed to do: stem Spanish words.
53
+
54
+ NOTE: For best results, consider placing one word per line in the files
55
+ the program handles.
56
+
57
+ == Uso
58
+ Como recordatorio, ten en consideración que el Castellano posee muchos
59
+ caracteres que están fuera del código ASCII, y por esta razón, los datos pueden
60
+ variar de un conjunto de codificación a otro.
61
+
62
+ Por favor recuerda utilizar sistemas de codificación compatibles con UTF-8
63
+ cuando se trabaje con EStem. Por favor no use String#force_encoding para
64
+ convertir de un conjunto de codificación a otro, podría utilizar String#encode
65
+ pero este último es más probable que falle en el intento, considere utilizar
66
+ String#safe_es_stem() si está manejando conjuntos de codificación incompatibles
67
+ o se desconoce el tipo.
68
+
69
+ require 'estem'
70
+
71
+ puts "albergues".es_stem # ==> "alberg"
72
+ puts "habitaciones".es_stem # ==> "habit"
73
+
74
+ # EStem nunca hará cambios innecesarios a tus datos.
75
+ puts "ALbeRGues".es_stem # ==> "ALbeRG"
76
+ puts "HaBiTaCiOnEs".es_stem # ==> "HaBiT"
77
+ puts "Hacinamiento".es_stem # ==> "Hacin"
78
+
79
+ Para más información ejecuta:
80
+ $ es_stem --help
81
+
82
+ El programa <tt>es_stem</tt> hará lo posible para separar las palabras de cada
83
+ línea del fichero. Sería sensato utilizar otro programa más especializado para
84
+ este propósito, de todas maneras, es_stem hace lo que se supone debe hacer,
85
+ obtener las raíces de las palabras.
86
+
87
+ NOTA: Para resultados excelentes, considere poner una palabra por línea en los
88
+ ficheros que pasará el programa.
89
+
90
+ == Test
91
+
92
+ This test is based on the sample input and output text from Martin Porter's
93
+ website. It includes 28390 test words and their expected stem results.
94
+ To run the test, just type:
95
+ rake test
96
+
97
+ == Pruebas
98
+
99
+ Esta prueba está basada en un archivo de prueba provisto por Martin Porter.
100
+ Incluye 28390 palabras de prueba con sus resultados esperados. Para realizar
101
+ la prueba, ejecuta:
102
+ rake test
103
+
104
+ == Thanks -- Agradecimientos
105
+
106
+ Ray Pereda https://github.com/raypereda/stemmify/ I used his gem as a guide to
107
+ package mine. http://guides.rubygems.org/make-your-own-gem/ as well.
108
+
109
+ == License -- Licencia
110
+
111
+ Copyright (c) 2012 Manuel A. Güílamo
112
+
113
+ Permission is hereby granted, free of charge, to any person obtaining
114
+ a copy of this software and associated documentation files (the
115
+ "Software"), to deal in the Software without restriction, including
116
+ without limitation the rights to use, copy, modify, merge, publish,
117
+ distribute, sublicense, and/or sell copies of the Software, and to
118
+ permit persons to whom the Software is furnished to do so, subject to
119
+ the following conditions:
120
+
121
+ The above copyright notice and this permission notice shall be
122
+ included in all copies or substantial portions of the Software.
123
+
124
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
125
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
126
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
127
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
128
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
129
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
130
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
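Reviewer note on the README's Usage section: the NOTE there recommends feeding es_stem files with one word per line. A minimal sketch of such a pre-processing step follows, assuming the input is already UTF-8. The helper name `one_word_per_line` is hypothetical and not part of estem; it treats any run of Unicode letters as a word, which is a simplification and not a full Spanish tokenizer.

```ruby
# Hypothetical pre-processing helper, not part of estem: split UTF-8 text into
# words (runs of Unicode letters) and emit one word per line.
def one_word_per_line(text)
  text.scan(/\p{L}+/).join("\n")
end

# "\u00BF" is the inverted question mark; "\u00E1" is "a" with an acute accent.
puts one_word_per_line("Los albergues, \u00BFest\u00E1n llenos?")
```

The output could be written to a file and then passed to the command line tool with `-f`.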
data/bin/es_stem.rb CHANGED
@@ -1,5 +1,6 @@
1
1
  #!/usr/bin/env ruby
2
2
  # encoding: UTF-8
3
+ # :stopdoc:
3
4
 
4
5
  # Copyright (c) 2012 Manuel A. Güílamo
5
6
  #
@@ -21,11 +22,11 @@
21
22
  # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
22
23
  # SOFTWARE.
23
24
 
24
- require 'estem.rb'
25
+ require 'estem'
25
26
  require 'getoptlong'
26
27
  require 'iconv'
27
28
 
28
- $version = "0.1.9"
29
+ $version = "0.1.10"
29
30
 
30
31
  def usage(error=false)
31
32
  out = error ? $stderr : $stdout
@@ -76,7 +77,7 @@ end
76
77
 
77
78
  if filename
78
79
  begin
79
- if ienc and ienc!='UTF-8'
80
+ if ienc and ienc.upcase !='UTF-8'
80
81
  file = File.open(filename, "r:#{ienc}:UTF-8")
81
82
  else
82
83
  file = File.open(filename, 'r:UTF-8')
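Reviewer note on the change at line 80: the fix upcases the `--in-enc` value before comparing it with 'UTF-8'. As a sketch of an alternative (not what the gem does), Ruby's `Encoding.find` resolves encoding names case-insensitively, so the check can compare Encoding objects instead of strings. The helper name `utf8_name?` is hypothetical.

```ruby
# Hypothetical helper, not part of estem: true when +name+ denotes UTF-8.
# Encoding.find matches encoding names case-insensitively and raises
# ArgumentError for names Ruby does not know.
def utf8_name?(name)
  Encoding.find(name) == Encoding::UTF_8
rescue ArgumentError
  false
end

puts utf8_name?('UTF-8')       # true
puts utf8_name?('utf-8')       # true
puts utf8_name?('ISO-8859-1')  # false
puts utf8_name?('no-such-enc') # false
```

Treating an unknown name as "not UTF-8" is a design choice made for this sketch; a command line tool might prefer to report the bad name to the user.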
data/examples/usage.rb ADDED
@@ -0,0 +1,11 @@
1
+ require 'estem'
2
+
3
+ hsh = Hash.new
4
+
5
+ words = ['albergues','habitaciones','Albergues','ALbeRGues','HaBiTaCiOnEs',
6
+ 'Hacinamiento','mujeres','muchedumbre','ocasionalmente']
7
+
8
+ words.each do|w|
9
+ stem = w.es_stem
10
+ puts "Word: #{w}\nStem: #{stem}\n\n"
11
+ end
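Reviewer note on examples/usage.rb: the script creates `hsh` but never uses it. One possible use, offered only as a guess, is grouping the input words by their stem. The sketch below assumes a hypothetical `group_by_stem` helper and uses a toy stand-in stemmer so that it runs without the gem; with estem loaded, the stemmer could be `->(w) { w.es_stem }`.

```ruby
# Group words by their case-folded stem. +stemmer+ is any callable that
# returns a stem for a word.
def group_by_stem(words, stemmer)
  words.group_by { |w| stemmer.call(w).downcase }
end

# Toy stand-in so the sketch runs without estem: drop a trailing "es" or "s".
# This is NOT the Spanish stemming algorithm.
toy_stemmer = ->(w) { w.sub(/(es|s)\z/i, '') }

groups = group_by_stem(%w[albergues Albergues ALbeRGues mujeres], toy_stemmer)
groups.each { |stem, ws| puts "#{stem}: #{ws.join(', ')}" }
```

The stems are case-folded before grouping because, per the README, es_stem preserves the case of its input.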
data/lib/estem.rb CHANGED
@@ -19,27 +19,30 @@
19
19
  # This code is provided under the terms of the {MIT License.}[http://www.opensource.org/licenses/mit-license.php]
20
20
  #
21
21
  # = Authors
22
- # * Manuel A. Güílamo
22
+ # * Manuel A. Güílamo maguilamo.c@gmail.com
23
23
  #
24
24
 
25
+ require 'iconv'
26
+
25
27
  module EStem
26
28
  ##
29
+ # For more information, please refer to <b>String#es_stem</b> method, also <b>EStem</b>.
27
30
  # :method: estem
28
- # For more information, please see <b>String#es_stem</b> method, also <b>EStem</b>.
29
-
30
-
31
- ##
32
- #This method stem Spanish words.
33
- #
34
- # "albergues".es_stem # ==> "alberg"
35
- # "habitaciones".es_stem # ==> "habit"
36
- # "ALbeRGues".es_stem # ==> "ALbeRG"
37
- # "HaBiTaCiOnEs".es_stem # ==> "HaBiT"
38
- # "Hacinamiento".es_stem # ==> "Hacin"
39
- #
40
- #:call-seq:
41
- # str.es_stem => "new_str"
42
31
 
32
+ ##
33
+ #This method stems Spanish words.
34
+ #
35
+ # "albergues".es_stem # ==> "alberg"
36
+ # "habitaciones".es_stem # ==> "habit"
37
+ # "ALbeRGues".es_stem # ==> "ALbeRG"
38
+ # "HaBiTaCiOnEs".es_stem # ==> "HaBiT"
39
+ # "Hacinamiento".es_stem # ==> "Hacin"
40
+ #
41
+ #If you are not aware of the codeset the data has, then use
42
+ #String#safe_es_stem instead.
43
+ #
44
+ #:call-seq:
45
+ # str.es_stem => "new_str"
43
46
  def es_stem
44
47
  str = self.dup
45
48
  return remove_accent(str) if str.length == 1
@@ -59,6 +62,42 @@ module EStem
59
62
  remove_accent(str)
60
63
  end
61
64
 
65
+ ##
66
+ #Use this method when you do not know the codeset of the data being
67
+ #handled. This method returns a new string with the same codeset as
68
+ #the original. Be aware that this method is slower than String#es_stem().
69
+ #:call-seq:
70
+ # str.safe_es_stem => "new_str"
71
+ def safe_es_stem
72
+ return self.es_stem if self.encoding == Encoding::UTF_8
73
+
74
+ default_enc = self.encoding.name
75
+
76
+ str = self.dup.force_encoding('UTF-8')
77
+
78
+ if str.valid_encoding?
79
+ begin
80
+ tmp = str.es_stem
81
+ return tmp.force_encoding(default_enc)
82
+ rescue
83
+ end
84
+ end
85
+
86
+ if enc = Encoding.compatible?(self, VOWEL)
87
+ begin
88
+ return self.encode(enc).es_stem
89
+ rescue
90
+ end
91
+ end
92
+
93
+ begin
94
+ tmp = Iconv.conv('UTF-8', self.encoding.name, self).es_stem
95
+ return Iconv.conv(default_enc, 'UTF-8', tmp);
96
+ rescue
97
+ return nil
98
+ end
99
+ end
100
+
62
101
  # :stopdoc:
63
102
 
64
103
  private
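Reviewer note on safe_es_stem: the last fallback relies on Iconv, which was removed from Ruby's standard library in Ruby 2.0. The following is a minimal sketch of the same round trip using only `String#encode`. The helper `stem_via_utf8` is hypothetical, and the block passed to it stands in for `es_stem` so that the example runs without the gem.

```ruby
# Hypothetical helper, not part of estem: transcode +str+ to UTF-8, hand it
# to the block (a stand-in for String#es_stem), then transcode the result
# back to the original encoding. Returns nil if transcoding fails, mirroring
# the nil that safe_es_stem returns on failure.
def stem_via_utf8(str)
  original = str.encoding
  utf8 = str.encode(Encoding::UTF_8)
  stemmed = yield(utf8)
  stemmed.encode(original)
rescue EncodingError
  nil
end

# "mañanas" as ISO-8859-1 bytes (0xF1 is "ñ" in that codeset).
latin1 = "ma\xF1anas".dup.force_encoding('ISO-8859-1')

# Relabeling the bytes as UTF-8 does not convert them: a lone 0xF1 is invalid.
puts latin1.dup.force_encoding('UTF-8').valid_encoding?  # false

# Stand-in "stemmer" that only drops a trailing "s".
result = stem_via_utf8(latin1) { |word| word.sub(/s\z/, '') }
puts result.encoding            # ISO-8859-1
puts result.bytes.to_a.inspect  # [109, 97, 241, 97, 110, 97]
```

The `valid_encoding?` check illustrates why the README advises against `String#force_encoding` for conversion: it only changes the label on the bytes.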
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: estem
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.2.3
4
+ version: 0.2.4
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,7 +9,7 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2012-05-20 00:00:00.000000000 Z
12
+ date: 2012-06-25 00:00:00.000000000 Z
13
13
  dependencies: []
14
14
  description: Spanish stemming. Based on Martin Porter's specifications. See README
15
15
  file for more information.
@@ -21,7 +21,10 @@ files:
21
21
  - Rakefile
22
22
  - bin/es_stem.rb
23
23
  - lib/estem.rb
24
- - lib/estem.rb~
24
+ - examples/usage.rb
25
+ - COPYRIGHT
26
+ - README.rdoc
27
+ - ChangeLog
25
28
  - test/diffs.txt
26
29
  - test/test_estem.rb
27
30
  homepage: https://github.com/MaG21/estem
@@ -35,7 +38,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
35
38
  requirements:
36
39
  - - ! '>='
37
40
  - !ruby/object:Gem::Version
38
- version: 1.9.2
41
+ version: 1.9.1
39
42
  required_rubygems_version: !ruby/object:Gem::Requirement
40
43
  none: false
41
44
  requirements:
data/lib/estem.rb~ DELETED
@@ -1,233 +0,0 @@
1
- # encoding: UTF-8
2
- #
3
- # :title: Spanish Stemming
4
- # = Description
5
- # This gem is for reducing Spanish words to their roots. It uses an algorithm
6
- # based on Martin Porter's specifications.
7
- #
8
- # For more information, visit:
9
- # http://snowball.tartarus.org/algorithms/spanish/stemmer.html
10
- #
11
- # = Descripción
12
- # Esta gema está para reducir las palabras del Español en sus respectivas raíces,
13
- # para ello ultiliza un algoritmo basado en las especificaciones de Martin Porter
14
- #
15
- # Para más información, visite:
16
- # http://snowball.tartarus.org/algorithms/spanish/stemmer.html
17
- #
18
- # = License -- Licencia
19
- # This code is provided under the terms of the {MIT License.}[http://www.opensource.org/licenses/mit-license.php]
20
- #
21
- # = Authors
22
- # * Manuel A. Güílamo
23
- #
24
-
25
- module EStem
26
- ##
27
- # :method: estem
28
- # For more information, please see <b>String#es_stem</b> method, also <b>EStem</b>.
29
-
30
-
31
- ##
32
- #This method reduces Spanish words to their root.
33
- #
34
- # "albergues".es_stem # ==> "alberg"
35
- # "habitaciones".es_stem # ==> "habit"
36
- # "ALbeRGues".es_stem # ==> "ALbeRG"
37
- # "HaBiTaCiOnEs".es_stem # ==> "HaBiT"
38
- # "Hacinamiento".es_stem # ==> "Hacin"
39
- #
40
- #:call-seq:
41
- # str.es_stem => "new_str"
42
-
43
- def es_stem
44
- str = self.dup
45
- return remove_accent(str) if str.length == 1
46
- tmp = step0(str)
47
- str = tmp ? tmp : str
48
-
49
- unless tmp = step1(str)
50
- unless tmp = step2a(str)
51
- tmp = step2b(str)
52
- str = tmp ? tmp : str
53
- else
54
- str = tmp
55
- end
56
- end
57
- tmp = step3(str)
58
- str = tmp.nil? ? str : tmp
59
- remove_accent(str)
60
- end
61
-
62
- # :stopdoc:
63
-
64
- private
65
-
66
- def vowel?(c)
67
- VOWEL.include?(c)
68
- end
69
-
70
- def consonant?(c)
71
- CONSONANT.include?(c)
72
- end
73
-
74
- def remove_accent(str)
75
- str.tr('áéíóúÁÉÍÓÚ','aeiouAEIOU')
76
- end
77
-
78
- def rv(str)
79
- if consonant? str[1]
80
- i=2
81
- i+=1 while str[i] and consonant? str[i]
82
- return str.nil? ? str.length-1 : i+1
83
- end
84
-
85
- if vowel? str[0] and vowel? str[1]
86
- i=2
87
- i+=1 while str[i] and vowel? str[i]
88
- return str.nil? ? str.length-1 : i+1
89
- end
90
-
91
- return 3 if consonant? str[0] and vowel? str[1]
92
-
93
- str.length - 1
94
- end
95
-
96
- def r(str, i=0)
97
- i+=1 while str[i] and consonant?(str[i])
98
- i+=1
99
- i+=1 while str[i] and vowel? str[i]
100
- str[i].nil? ? str.length : i+1
101
- end
102
-
103
- def r12(str)
104
- r1 = r(str)
105
- r2 = r(str,r1)
106
- [r1,r2]
107
- end
108
-
109
- def step0(str)
110
- return nil unless str =~ /(se(l[ao]s?)?|l([aeo]s?)|me|nos)$/i
111
-
112
- suffix = $&
113
- rv_text = str[rv(str)..-1]
114
-
115
- case rv_text
116
- when %r{((?<=i[éÉ]ndo|[áÁ]ndo|[áéíÁÉÍ]r)#{suffix})$}ui
117
- str[%r{#$&$}]=''
118
- str = remove_accent(str)
119
- return str
120
- when %r{((?<=iendo|ando|[aei]r)#{suffix})$}i
121
- str[%r{#$&$}]=''
122
- return str
123
- end
124
-
125
- if rv_text =~ /yendo/i and str =~ /uyendo/i
126
- str[suffix]=''
127
- return str
128
- end
129
- nil
130
- end
131
-
132
- #=> new_str or nil
133
- def step1(str)
134
- r1,r2 = r12(str)
135
- r1_text = str[r1..-1]
136
- r2_text = str[r2..-1]
137
-
138
- case r2_text
139
- when /(anzas?|ic[oa]s?|ismos?|[ai]bles?|istas?|os[oa]s?|[ai]mientos?)$/i
140
- str[%r{#$&$}]=''
141
- return str
142
- when /(ic)?(ador([ae]s?)?|aci[óÓ]n|aciones|antes?|ancias?)$/ui
143
- str[%r{#$&$}]=''
144
- return str
145
- when /log[íÍ]as?/ui
146
- str[%r{#$&$}]='log'
147
- return str
148
- when /(uci([óÓ]n|ones))$/ui
149
- str[%r{#$&$}]='u'
150
- return str
151
- when /(encias?)$/i
152
- str[%r{#$&$}]='ente'
153
- return str
154
- end
155
-
156
- if r2_text =~ /(ativ|iv|os|ic|ad)amente$/i or r1_text =~ /amente$/i
157
- str[%r{#$&$}]=''
158
- return str
159
- end
160
-
161
- case r2_text
162
- when /((ante|[ai]ble)?mente)$/i, /((abil|i[cv])?idad(es)?)$/i, /((at)?iv[ao]s?)$/i
163
- str[%r{#$&$}]=''
164
- return str
165
- end
166
- nil
167
- end
168
-
169
- #=> nil or new_str
170
- def step2a(str)
171
- rv_pos = rv(str)
172
- idx = str[rv_pos..-1] =~ /(y[oóÓ]|ye(ron|ndo)|y[ae][ns]?|ya(is|mos))$/ui
173
-
174
- return nil unless idx
175
-
176
- if 'u' == str[rv_pos+idx-1].downcase
177
- str[%r{#$&$}] = ''
178
- return str
179
- end
180
- nil
181
- end
182
-
183
- STEP2B_REGEXP = /(
184
- ar([áÁ][ns]?|a(n|s|is)?|on)? | ar([éÉ]is|emos|é|É) | ar[íÍ]a(n|s|is|mos)? |
185
- er([áÁ][sn]?|[éÉ](is)?|emos|[íÍ]a(n|s|is|mos)?)? |
186
- ir([íÍ]a(s|n|is|mos)?|[áÁ][ns]?|emos|[éÉ]|éis)? | aba(s|n|is)? |
187
- ad([ao]s?)? | ed | id(a|as|o|os)? | [íÍ]a(n|s|is|mos)? | [íÍ]s |
188
- as(e[ns]?|te|eis|teis)? | [áÁ](is|bamos|semos|ramos) | a(n|ndo|mos) |
189
- ie(ra|se|ran|sen|ron|ndo|ras|ses|rais|seis) | i(ste|steis|[óÓ]|mos|[éÉ]ramos|[éÉ]semos) |
190
- en|es|[éÉ]is|emos
191
- )$/xiu
192
-
193
- def step2b(str)
194
- rv_pos = rv(str)
195
-
196
- if idx = str[rv_pos..-1] =~ STEP2B_REGEXP
197
- suffix = $&
198
- if suffix =~ /^(en|es|[éÉ]is|emos)$/ui
199
- str[%r{#{suffix}$}]=''
200
- str[rv_pos+idx-1]='' if str[rv_pos+idx-2] =~ /g/i and str[rv_pos+idx-1] =~ /u/i
201
- else
202
- str[%r{#{suffix}$}]=''
203
- end
204
- return str
205
- end
206
- nil
207
- end
208
-
209
- def step3(str)
210
- rv_pos = rv(str)
211
- rv_text = str[rv_pos..-1]
212
-
213
- if rv_text =~ /(os|[aoáíóÁÍÓ])$/ui
214
- str[%r{#$&$}]=''
215
- return str
216
- elsif idx = rv_text =~ /(u?[eéÉ])$/i
217
- if $&[0].downcase == 'u' and str[rv_pos+idx-1].downcase == 'g'
218
- str[%r{#$&$}]=''
219
- else
220
- str.chop!
221
- end
222
- return str
223
- end
224
- nil
225
- end
226
-
227
- VOWEL = 'aeiouáéíóúüAEIOUÁÉÍÓÚÜ'
228
- CONSONANT = "bcdfghjklmnñpqrstvwxyzABCDEFGHIJKLMNÑOPQRSTUVWXYZ"
229
- end
230
-
231
- class String
232
- include EStem
233
- end