RubyGems - estem - Versions diffs - 0.2.3 → 0.2.4 - Mend

estem 0.2.3 → 0.2.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (8)

data/COPYRIGHT ADDED

@@ -0,0 +1,19 @@
+Copyright (c) 2012 Manuel A. Güílamo
+Permission is hereby granted, free of charge, to any person obtaining a copy of
+this software and associated documentation files (the "Software"), to deal in
+the Software without restriction, including without limitation the rights to
+use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
+of the Software, and to permit persons to whom the Software is furnished to do
+so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

data/ChangeLog ADDED

@@ -0,0 +1,27 @@
+Version 0.2.4
+2012-06-25  MaG <maguilamo.c@gmail.com>
+*
+  - ChangeLog new file.
+  - fix README.rdoc added.
+  - fix COPYRIGHT added.
+  - examples/usage.rb: new file
+* README.rdoc:
+	- max 80 cols per line.
+	- recommendation about using safe_es_stem().
+	- Fix Spanish typos.
+* estem.gemspec:
+	- cleanups.
+	- (required_ruby_version): Ruby 1.9.1.
+* bin/es_stem.rb:
+	- chmod a+x .
+	- (es_stem.rb:80): fix case-sensitive comparison.
+	- (es_stem.rb:25): removed .rb ext.
+	- (es_stem.rb:29): new version.
+* estem.rb:
+	- (safe_es_stem): new method.

data/README.rdoc ADDED

@@ -0,0 +1,130 @@
+= Spanish Stem Gem
+== Description
+This gem is for reducing Spanish words to their roots. It uses an algorithm
+based on Martin Porter's specifications.
+For more information, visit:
+http://snowball.tartarus.org/algorithms/spanish/stemmer.html
+== Descripción
+Esta gema está para reducir las palabras del Español en sus respectivas raíces,
+para ello ultiliza un algoritmo basado en las especificaciones de Martin Porter
+Para más información, visite:
+http://snowball.tartarus.org/algorithms/spanish/stemmer.html
+== Install -- Instalar
+  $ sudo gem install estem
+or
+  $ gem install estem
+== Usage
+As a reminder, keep in mind that the Spanish language has several non
+US-ASCII characters, and because of that, the same data may vary from one
+codeset to another.
+Please remember to use a UTF-8 compatible encoding while using EStem. Do not
+use String#force_encoding() to convert from one codeset to another. You can
+try String#encode(), but it is more likely to fail; consider using
+String#safe_es_stem() when handling incompatible codesets or when the codeset
+is unknown.
+  require 'estem'
+  puts "albergues".es_stem      # ==> "alberg"
+  puts "habitaciones".es_stem   # ==> "habit"
+  # EStem will never make unnecessary changes to your input data.
+  puts "ALbeRGues".es_stem      # ==> "ALbeRG"
+  puts "HaBiTaCiOnEs".es_stem   # ==> "HaBiT"
+  puts "Hacinamiento".es_stem   # ==> "Hacin"
+You can use <tt>EStem</tt> as a command line tool:
+  $ es_stem --in-enc ISO-8859-1 -f input_file.txt
+for more information type
+  $ es_stem --help
+The <tt>es_stem</tt> program does its best to tokenize the lines from
+the file. You might consider finding a Spanish tokenizer; either way, this
+program does what it is supposed to do: stem Spanish words.
+NOTE: For best results, consider placing one word per line in the files
+the program handles.
+== Uso
+Como recordatorio, ten en cosideración que el Castellano posee muchos
+carácteres que están fuera del código ASCII, y por esta razón, los datos pueden
+variar de un conjunto de codificación a otro.
+Por favor recuerda utilizar sistemas de condificación compatibles con UTF-8
+cuando se trabaje con EStem. Por favor no use String#force_encoding para
+convertir de un conjunto de codificación a otro, podría utilizar String#encode
+pero este último es más probable que falle en el intento, considere utilizar
+String#safe_es_stem() si está manejando conjuntos de codificación incompatibles
+o se desconoce el tipo.
+  require 'estem'
+  puts "albergues".es_stem      # ==> "alberg"
+  puts "habitaciones".es_stem   # ==> "habit"
+  # EStem nunca hará cambios innecesarios a tus datos.
+  puts "ALbeRGues".es_stem      # ==> "ALbeRG"
+  puts "HaBiTaCiOnEs".es_stem   # ==> "HaBiT"
+  puts "Hacinamiento".es_stem   # ==> "Hacin"
+Para más información ejecuta:
+  $ es_stem --help
+El programa <tt>es_stem</tt> hará lo posible para separar las palabras de cada
+línea del fichero. Sería sensato utilizar otro programa más especializado para
+este propósito, de todas maneras, es_stem hace lo que se supone debe hacer,
+optener las raíces de las palabras.
+NOTA: Para resultados excelentes, considere poner una palabra por línea en los
+ficheros que pasará el programa.
+== Test
+This test is based on the sample input and output text from Martin Porter's
+website. It includes 28390 test words and their expected stem results.
+To run the test, just type:
+  rake test
+== Pruebas
+Esta prueba está basada en un archivo de prueba provisto por Martin Porter.
+Incluye 28390 palabras de prueba con sus resultado esperados. Para realizar
+la prueba, ejecuta:
+  rake test
+== Thanks -- Agradecimientos
+Ray Pereda https://github.com/raypereda/stemmify/ I used his gem as a guide to
+package mine. http://guides.rubygems.org/make-your-own-gem/ as well.
+== License -- Licencia
+Copyright (c) 2012 Manuel A. Güílamo
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
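
The Usage section of the README above warns against using String#force_encoding() to convert between codesets. A minimal Ruby sketch of why, using only core String methods; the sample word and the ISO-8859-1 source encoding are illustrative assumptions, not taken from the gem:

```ruby
# Illustrative data: "años" written with an escape so the source stays ASCII.
utf8_word = "a\u00F1os"                       # UTF-8 string, 5 bytes
latin1    = utf8_word.encode("ISO-8859-1")    # transcoded copy, 4 bytes

# force_encoding only relabels the existing bytes; it converts nothing.
relabelled = latin1.dup.force_encoding("UTF-8")
puts relabelled.valid_encoding?               # false: 0xF1 is not valid here

# encode rewrites the bytes for the target encoding.
converted = latin1.encode("UTF-8")
puts converted.valid_encoding?                # true
puts converted == utf8_word                   # true
```

String#encode can still raise when the source label is wrong or a character has no mapping, which is the failure case the README mentions.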

data/bin/es_stem.rb CHANGED

@@ -1,5 +1,6 @@
 #!/usr/bin/env ruby
 # encoding: UTF-8
+# :stopdoc:
 # Copyright (c) 2012 Manuel A. Güílamo
 #
@@ -21,11 +22,11 @@
 # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 # SOFTWARE.
-require 'estem.rb'
+require 'estem'
 require 'getoptlong'
 require 'iconv'
-$version = "0.1.9"
+$version = "0.1.10"
 def usage(error=false)
 	out = error ? $stderr : $stdout
@@ -76,7 +77,7 @@ end
 if filename
 	begin
-		if ienc and ienc!='UTF-8'
+		if ienc and ienc.upcase !='UTF-8'
 			file = File.open(filename, "r:#{ienc}:UTF-8")
 		else
 			file = File.open(filename, 'r:UTF-8')
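
The last hunk above makes the check on the input encoding name case-insensitive, so that `utf-8` is treated like `UTF-8`. A minimal sketch of that comparison, with an alternative that lets Ruby resolve the name; both helper names are hypothetical and not part of es_stem.rb:

```ruby
# Hypothetical helper mirroring the comparison in the hunk above.
def needs_transcoding?(ienc)
  !ienc.nil? && ienc.upcase != 'UTF-8'
end

# Alternative sketch: Encoding.find matches names case-insensitively and
# raises ArgumentError for names Ruby does not know.
def needs_transcoding_via_find?(ienc)
  return false if ienc.nil?
  Encoding.find(ienc) != Encoding::UTF_8
end

puts needs_transcoding?('utf-8')              # false
puts needs_transcoding?('ISO-8859-1')         # true
puts needs_transcoding_via_find?('Utf-8')     # false
```

The second form also rejects misspelled encoding names early, at the cost of an exception the caller has to handle.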

data/examples/usage.rb ADDED

@@ -0,0 +1,11 @@
+require 'estem'
+hsh = Hash.new
+words = ['albergues','habitaciones','Albergues','ALbeRGues','HaBiTaCiOnEs',
+         'Hacinamiento','mujeres','muchedumbre','ocasionalmente']
+words.each do|w|
+	stem = w.es_stem
+	puts "Word: #{w}\nStem: #{stem}\n\n"
+end

data/lib/estem.rb CHANGED

@@ -19,27 +19,30 @@
 # This code is provided under the terms of the {MIT License.}[http://www.opensource.org/licenses/mit-license.php]
 #
 # = Authors
-#   * Manuel A. Güílamo
+#   * Manuel A. Güílamo maguilamo.c@gmail.com
 #
+require 'iconv'
 module EStem
 	##
+	# For more information, please refer to <b>String#es_stem</b> method, also <b>EStem</b>.
 	# :method: estem
-	# For more information, please see <b>String#es_stem</b> method, also <b>EStem</b>.
-##
-#This method stem Spanish words.
-#
-#   "albergues".es_stem      # ==> "alberg"
-#   "habitaciones".es_stem   # ==> "habit"
-#   "ALbeRGues".es_stem      # ==> "ALbeRG"
-#   "HaBiTaCiOnEs".es_stem   # ==> "HaBiT"
-#   "Hacinamiento".es_stem   # ==> "Hacin"
-#
-#:call-seq:
-# str.es_stem    => "new_str"
+	##
+	#This method stems Spanish words.
+	#
+	#   "albergues".es_stem      # ==> "alberg"
+	#   "habitaciones".es_stem   # ==> "habit"
+	#   "ALbeRGues".es_stem      # ==> "ALbeRG"
+	#   "HaBiTaCiOnEs".es_stem   # ==> "HaBiT"
+	#   "Hacinamiento".es_stem   # ==> "Hacin"
+	#
+	#If you are not aware of the codeset the data has, then use
+	#String#safe_es_stem instead.
+	#
+	#:call-seq:
+	# str.es_stem    => "new_str"
 	def es_stem
 		str = self.dup
 		return remove_accent(str) if str.length == 1
@@ -59,6 +62,42 @@ module EStem
 		remove_accent(str)
 	end
+	##
+	#Use this method when you do not know the codeset of the data being
+	#handled. This method returns a new string with the same codeset as
+	#the original. Be aware that this method is slower than String#es_stem()
+	#:call-seq:
+	# str.safe_es_stem    => "new_str"
+	def safe_es_stem
+		return self.es_stem if self.encoding == Encoding::UTF_8
+		default_enc = self.encoding.name
+		str = self.dup.force_encoding('UTF-8')
+		if str.valid_encoding?
+			begin
+				tmp = str.es_stem
+				return tmp.force_encoding(default_enc)
+			rescue
+			end
+		end
+		if enc = Encoding.compatible?(self, VOWEL)
+			begin
+				return self.encode(enc).es_stem
+			rescue
+			end
+		end
+		begin
+			tmp = Iconv.conv('UTF-8', self.encoding.name, self).es_stem
+			return Iconv.conv(default_enc, 'UTF-8', tmp);
+		rescue
+			return nil
+		end
+	end
 # :stopdoc:
 	private
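
The new safe_es_stem method above tries a sequence of fallbacks: stem directly when the string is UTF-8, relabel when the bytes are already valid UTF-8, and otherwise transcode and convert the result back. The following is a simplified sketch of that idea, not the gem's code: it swaps Iconv (removed from Ruby's standard library in 2.0) for String#encode, and both `STAND_IN_STEM` and `stem_any_encoding` are hypothetical names introduced here so the example runs without the gem.

```ruby
# Stand-in for String#es_stem (assumption: it only strips a trailing "s";
# the real stemmer does far more).
STAND_IN_STEM = ->(s) { s.sub(/s\z/, '') }

# Hypothetical helper: stem a string of any encoding and return the result
# in the original encoding, or nil when no conversion works.
def stem_any_encoding(str, stemmer = STAND_IN_STEM)
  return stemmer.call(str) if str.encoding == Encoding::UTF_8

  original = str.encoding
  # 1. The bytes may already be valid UTF-8 under a different label.
  relabelled = str.dup.force_encoding(Encoding::UTF_8)
  if relabelled.valid_encoding?
    return stemmer.call(relabelled).force_encoding(original)
  end
  # 2. Otherwise transcode, stem, and transcode back.
  stemmer.call(str.encode(Encoding::UTF_8)).encode(original)
rescue EncodingError
  nil
end

latin1 = "a\u00F1os".encode("ISO-8859-1")
result = stem_any_encoding(latin1)
puts result.encoding                          # ISO-8859-1
puts result.encode("UTF-8") == "a\u00F1o"     # true
```

Returning nil on failure follows the method in the diff; callers that prefer an exception could drop the rescue clause.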

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: estem
 version: !ruby/object:Gem::Version
-  version: 0.2.3
+  version: 0.2.4
   prerelease:
 platform: ruby
 authors:
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-05-20 00:00:00.000000000 Z
+date: 2012-06-25 00:00:00.000000000 Z
 dependencies: []
 description: Spanish stemming. Based on Martin Porter's specifications. See README
   file for more information.
@@ -21,7 +21,10 @@ files:
 - Rakefile
 - bin/es_stem.rb
 - lib/estem.rb
-- lib/estem.rb~
+- examples/usage.rb
+- COPYRIGHT
+- README.rdoc
+- ChangeLog
 - test/diffs.txt
 - test/test_estem.rb
 homepage: https://github.com/MaG21/estem
@@ -35,7 +38,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
   - - ! '>='
     - !ruby/object:Gem::Version
-      version: 1.9.2
+      version: 1.9.1
 required_rubygems_version: !ruby/object:Gem::Requirement
   none: false
   requirements:
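
The metadata change above relaxes required_ruby_version from `>= 1.9.2` to `>= 1.9.1`. A small sketch of what that means in practice, using the RubyGems requirement classes directly; the version being checked is an illustrative assumption:

```ruby
require 'rubygems'

old_requirement = Gem::Requirement.new('>= 1.9.2')   # estem 0.2.3
new_requirement = Gem::Requirement.new('>= 1.9.1')   # estem 0.2.4

ruby_191 = Gem::Version.new('1.9.1')                 # illustrative interpreter version

puts old_requirement.satisfied_by?(ruby_191)         # false
puts new_requirement.satisfied_by?(ruby_191)         # true
```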

data/lib/estem.rb~ DELETED

@@ -1,233 +0,0 @@
-# encoding: UTF-8
-#
-# :title: Spanish Stemming
-# = Description
-# This gem is for reducing Spanish words to their roots. It uses an algorithm
-# based on Martin Porter's specifications.
-#
-# For more information, visit:
-# http://snowball.tartarus.org/algorithms/spanish/stemmer.html
-#
-# = Descripción
-# Esta gema está para reducir las palabras del Español en sus respectivas raíces,
-# para ello ultiliza un algoritmo basado en las especificaciones de Martin Porter
-#
-# Para más información, visite:
-# http://snowball.tartarus.org/algorithms/spanish/stemmer.html
-#
-# = License -- Licencia
-# This code is provided under the terms of the {MIT License.}[http://www.opensource.org/licenses/mit-license.php]
-#
-# = Authors
-#   * Manuel A. Güílamo
-#
-module EStem
-	##
-	# :method: estem
-	# For more information, please see <b>String#es_stem</b> method, also <b>EStem</b>.
-##
-#This method reduces Spanish words to their root.
-#
-#   "albergues".es_stem      # ==> "alberg"
-#   "habitaciones".es_stem   # ==> "habit"
-#   "ALbeRGues".es_stem      # ==> "ALbeRG"
-#   "HaBiTaCiOnEs".es_stem   # ==> "HaBiT"
-#   "Hacinamiento".es_stem   # ==> "Hacin"
-#
-#:call-seq:
-# str.es_stem    => "new_str"
-	def es_stem
-		str = self.dup
-		return remove_accent(str) if str.length == 1
-		tmp = step0(str)
-		str = tmp ? tmp : str
-		unless tmp = step1(str)
-			unless tmp = step2a(str)
-				tmp = step2b(str)
-				str = tmp ? tmp : str
-			else
-				str = tmp
-			end
-		end
-		tmp = step3(str)
-		str = tmp.nil? ? str : tmp
-		remove_accent(str)
-	end
-# :stopdoc:
-	private
-	def vowel?(c)
-		VOWEL.include?(c)
-	end
-	def consonant?(c)
-		CONSONANT.include?(c)
-	end
-	def remove_accent(str)
-		str.tr('áéíóúÁÉÍÓÚ','aeiouAEIOU')
-	end
-	def rv(str)
-		if consonant? str[1]
-			i=2
-			i+=1 while str[i] and consonant? str[i]
-			return str.nil? ? str.length-1 : i+1
-		end
-		if vowel? str[0] and vowel? str[1]
-			i=2
-			i+=1 while str[i] and vowel? str[i]
-			return str.nil? ? str.length-1 : i+1
-		end
-		return 3 if consonant? str[0] and vowel? str[1]
-		str.length - 1
-	end
-	def r(str, i=0)
-		i+=1 while str[i] and consonant?(str[i])
-		i+=1
-		i+=1 while str[i] and vowel? str[i]
-		str[i].nil? ?  str.length : i+1
-	end
-	def r12(str)
-		r1 = r(str)
-		r2 = r(str,r1)
-		[r1,r2]
-	end
-	def step0(str)
-		return nil unless str =~ /(se(l[ao]s?)?|l([aeo]s?)|me|nos)$/i
-		suffix = $&
-		rv_text = str[rv(str)..-1]
-		case rv_text
-		when %r{((?<=i[éÉ]ndo|[áÁ]ndo|[áéíÁÉÍ]r)#{suffix})$}ui
-			str[%r{#$&$}]=''
-			str = remove_accent(str)
-			return str
-		when %r{((?<=iendo|ando|[aei]r)#{suffix})$}i
-			str[%r{#$&$}]=''
-			return str
-		end
-		if rv_text =~ /yendo/i and str =~ /uyendo/i
-		      str[suffix]=''
-		      return str
-		end
-		nil
-	end
-	#=> new_str or nil
-	def step1(str)
-		r1,r2 = r12(str)
-		r1_text = str[r1..-1]
-		r2_text = str[r2..-1]
-		case r2_text
-		when /(anzas?|ic[oa]s?|ismos?|[ai]bles?|istas?|os[oa]s?|[ai]mientos?)$/i
-			str[%r{#$&$}]=''
-			return str
-		when /(ic)?(ador([ae]s?)?|aci[óÓ]n|aciones|antes?|ancias?)$/ui
-			str[%r{#$&$}]=''
-			return str
-		when /log[íÍ]as?/ui
-			str[%r{#$&$}]='log'
-			return str
-		when /(uci([óÓ]n|ones))$/ui
-			str[%r{#$&$}]='u'
-			return str
-		when /(encias?)$/i
-			str[%r{#$&$}]='ente'
-			return str
-		end
-		if r2_text =~ /(ativ|iv|os|ic|ad)amente$/i or r1_text =~ /amente$/i
-			str[%r{#$&$}]=''
-			return str
-		end
-		case r2_text
-		when /((ante|[ai]ble)?mente)$/i, /((abil|i[cv])?idad(es)?)$/i, /((at)?iv[ao]s?)$/i
-			str[%r{#$&$}]=''
-			return str
-		end
-		nil
-	end
-	#=> nil or new_str
-	def step2a(str)
-		rv_pos = rv(str)
-		idx = str[rv_pos..-1] =~ /(y[oóÓ]|ye(ron|ndo)|y[ae][ns]?|ya(is|mos))$/ui
-		return nil unless idx
-		if 'u' == str[rv_pos+idx-1].downcase
-			str[%r{#$&$}] = ''
-			return str
-		end
-		nil
-	end
-	STEP2B_REGEXP = /(
-		ar([áÁ][ns]?|a(n|s|is)?|on)? | ar([éÉ]is|emos|é|É) | ar[íÍ]a(n|s|is|mos)? |
-		er([áÁ][sn]?|[éÉ](is)?|emos|[íÍ]a(n|s|is|mos)?)? |
-		ir([íÍ]a(s|n|is|mos)?|[áÁ][ns]?|emos|[éÉ]|éis)? | aba(s|n|is)? |
-		ad([ao]s?)? | ed | id(a|as|o|os)? | [íÍ]a(n|s|is|mos)? | [íÍ]s |
-		as(e[ns]?|te|eis|teis)? | [áÁ](is|bamos|semos|ramos) | a(n|ndo|mos) |
-		ie(ra|se|ran|sen|ron|ndo|ras|ses|rais|seis) | i(ste|steis|[óÓ]|mos|[éÉ]ramos|[éÉ]semos) |
-		en|es|[éÉ]is|emos
-	)$/xiu
-	def step2b(str)
-		rv_pos =  rv(str)
-		if idx = str[rv_pos..-1] =~ STEP2B_REGEXP
-			suffix = $&
-			if suffix =~ /^(en|es|[éÉ]is|emos)$/ui
-				str[%r{#{suffix}$}]=''
-				str[rv_pos+idx-1]='' if str[rv_pos+idx-2] =~ /g/i and  str[rv_pos+idx-1] =~ /u/i
-			else
-				str[%r{#{suffix}$}]=''
-			end
-			return str
-		end
-		nil
-	end
-	def step3(str)
-		rv_pos = rv(str)
-		rv_text = str[rv_pos..-1]
-		if rv_text =~ /(os|[aoáíóÁÍÓ])$/ui
-			str[%r{#$&$}]=''
-			return str
-		elsif idx = rv_text =~ /(u?[eéÉ])$/i
-			if $&[0].downcase == 'u' and str[rv_pos+idx-1].downcase == 'g'
-				str[%r{#$&$}]=''
-			else
-				str.chop!
-			end
-			return str
-		end
-		nil
-	end
-	VOWEL = 'aeiouáéíóúüAEIOUÁÉÍÓÚÜ'
-	CONSONANT = "bcdfghjklmnñpqrstvwxyzABCDEFGHIJKLMNÑOPQRSTUVWXYZ"
-end
-class String
-	include EStem
-end