RubyGems - estem - Versions diffs - 0.2.4 → 0.2.5 - Mend

estem 0.2.4 → 0.2.5

Files changed (12)

data/ChangeLog +37 -10
data/README.rdoc +7 -38
data/examples/usage.rb +0 -2
data/examples/usage.rb~ +11 -0
data/lib/estem.rb +59 -60
data/lib/estem.rb~ +271 -0
data/test/diffs_ISO88591.txt +28390 -0
data/test/{diffs.txt → diffs_UTF8.txt} +0 -0
data/test/test_estem.rb +15 -10
data/test/test_estem.rb~ +27 -0
metadata +10 -5
data/bin/es_stem.rb +0 -179

data/ChangeLog CHANGED

@@ -1,3 +1,30 @@
+Version 0.2.5
+2012-09-02  MaG <maguilamo.c@gmail.com>
+*
+  - bin/ directory, removed.
+* README.rdoc:
+  - cleanups
+  - Thanks section, removed.
+* examples/usage.rb:
+  - cleanups
+* estem.rb:
+  - (es_stem): rewritten.
+  - (safe_es_stem): deprecated Iconv, removed.
+* bin/es_stem.rb:
+  - removed.
+* test/:
+  - new test file added.
+  - rename file diffs.txt.
+* test/test_estem.rb:
+  - one more test added.
 Version 0.2.4
 2012-06-25  MaG <maguilamo.c@gmail.com>
@@ -9,19 +36,19 @@ Version 0.2.4
   - examples/usage.rb: new file
 * README.rdoc:
-	- max 80 cols per line.
-	- recomendation about using safe_es_stem().
-	- Fix Spanish typos.
+  - max 80 cols per line.
+  - recomendation about using safe_es_stem().
+  - Fix Spanish typos.
 * estem.gemspec:
-	- cleanups.
-	- (required_ruby_version): Ruby 1.9.1.
+  - cleanups.
+  - (required_ruby_version): Ruby 1.9.1.
 * bin/es_stem.rb:
-	- chmod a+x .
-	- (es_stem.rb:80): fix case sensitive comparation.
-	- (es_stem.rb:25): removed .rb ext.
-	- (es_stem.rb:29): new version.
+  - chmod a+x .
+  - (es_stem.rb:80): fix case sensitive comparation.
+  - (es_stem.rb:25): removed .rb ext.
+  - (es_stem.rb:29): new version.
 * estem.rb:
-	- (safe_es_stem): new method.
+  - (safe_es_stem): new method.

data/README.rdoc CHANGED

@@ -1,7 +1,7 @@
 = Spanish Stem Gem
 == Description
-This gem is for reducing Spanish words to their roots. It uses an algorithm
+This gem reduces Spanish words to their respective roots. It uses an algorithm
 based on Martin Porter's specifications.
 For more information, visit:
@@ -21,15 +21,14 @@ or
   $ gem install estem
 == Usage
-As a reminder, take in consideration that the Spanish language have several non
+As a reminder, take in consideration that the Spanish language has several non
 US-ASCII characters, and because of that, the same data may varied from one
 codeset to another.
 Please remember to use a UTF-8 compatible encoding while using EStem. Please do
-not use String#force_encoding() to convert from one codeset to another, you
-might try using String#encode() but this later is more likely to fail, consider
-using String#safe_es_stem() when handling incompatibles codesets or the codeset
-type is unknown.
+not use String#force_encoding to convert from one codeset to another, you may
+try using String#encode alone but, instead, consider using String#safe_es_stem
+when handling incompatibles codesets or the codeset type varies.
   require 'estem'
@@ -41,19 +40,6 @@ type is unknown.
   puts "HaBiTaCiOnEs".es_stem   # ==> "HaBiT"
   puts "Hacinamiento".es_stem   # ==> "Hacin"
-You can use <tt>EStem</tt> as a command line tool:
-  $ es_stem --in-enc ISO-8859-1 -f input_file.txt
-for more information type
-  $ es_stem --help
-The <tt>es_stem</tt> program do his best trying to tokenized the lines from
-the file, you might consider finding an Spanish tokenizer, either way this
-program do what it is suppose to do, stem Spanish words.
-NOTE: For excellent results, consider replacing one word per line on the files
-the program handles.
 == Uso
 Como recordatorio, ten en cosideración que el Castellano posee muchos
 carácteres que están fuera del código ASCII, y por esta razón, los datos pueden
@@ -62,9 +48,8 @@ variar de un conjunto de codificación a otro.
 Por favor recuerda utilizar sistemas de condificación compatibles con UTF-8
 cuando se trabaje con EStem. Por favor no use String#force_encoding para
 convertir de un conjunto de codificación a otro, podría utilizar String#encode
-pero este último es más probable que falle en el intento, considere utilizar
-String#safe_es_stem() si está manejando conjuntos de codificación incompatibles
-o se desconoce el tipo.
+solo, pero en su lugar, considere utilizar String#safe_es_stem() si está
+manejando conjuntos de codificación incompatibles o se desconoce el tipo.
   require 'estem'
@@ -75,17 +60,6 @@ o se desconoce el tipo.
   puts "ALbeRGues".es_stem      # ==> "ALbeRG"
   puts "HaBiTaCiOnEs".es_stem   # ==> "HaBiT"
   puts "Hacinamiento".es_stem   # ==> "Hacin"
-Para más información ejecuta:
-  $ es_stem --help
-El programa <tt>es_stem</tt> hará lo posible para separar las palabras de cada
-línea del fichero. Sería sensato utilizar otro programa más especializado para
-este propósito, de todas maneras, es_stem hace lo que se supone debe hacer,
-optener las raíces de las palabras.
-NOTA: Para resultados excelentes, considere poner una palabra por línea en los
-ficheros que pasará el programa.
 == Test
@@ -101,11 +75,6 @@ Incluye 28390 palabras de prueba con sus resultado esperados. Para realizar
 la prueba, ejecuta:
   rake test
-== Thanks -- Agradecimientos
-Ray Pereda https://github.com/raypereda/stemmify/ I used his gem as a guide to
-package mine. http://guides.rubygems.org/make-your-own-gem/ as well.
 == License -- Licencia
 Copyright (c) 2012 Manuel A. Güílamo
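
Reviewer note on the encoding guidance in the README hunks above: the advice against String#force_encoding comes down to the difference between relabeling bytes and transcoding them. The following is a minimal sketch using only core Ruby (it does not load estem); the sample word and variable names are illustrative assumptions, not taken from the gem.

```ruby
# Bytes for "café" in ISO-8859-1: 0xE9 is "é" in that codeset.
latin1 = [0x63, 0x61, 0x66, 0xE9].pack('C*').force_encoding('ISO-8859-1')

# force_encoding only changes the label on the bytes; nothing is converted,
# so the lone 0xE9 byte is not a valid UTF-8 sequence.
relabeled = latin1.dup.force_encoding('UTF-8')
puts relabeled.valid_encoding?

# encode transcodes the bytes into real UTF-8.
converted = latin1.encode('UTF-8')
puts converted.valid_encoding?
puts converted == "caf\u00E9"
```

Only the transcoded string is safe to pass to a method that expects UTF-8 input.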

data/examples/usage.rb CHANGED

@@ -1,7 +1,5 @@
 require 'estem'
-hsh = Hash.new
 words = ['albergues','habitaciones','Albergues','ALbeRGues','HaBiTaCiOnEs',
          'Hacinamiento','mujeres','muchedumbre','ocasionalmente']

data/examples/usage.rb~ ADDED

@@ -0,0 +1,11 @@
+require 'estem'
+hsh = Hash.new
+words = ['albergues','habitaciones','Albergues','ALbeRGues','HaBiTaCiOnEs',
+         'Hacinamiento','mujeres','muchedumbre','ocasionalmente']
+words.each do|w|
+	stem = w.es_stem
+	puts "Word: #{w}\nStem: #{stem}\n\n"
+end

data/lib/estem.rb CHANGED

@@ -22,8 +22,6 @@
 #   * Manuel A. Güílamo maguilamo.c@gmail.com
 #
-require 'iconv'
 module EStem
 	##
 	# For more information, please refer to <b>String#es_stem</b> method, also <b>EStem</b>.
@@ -38,61 +36,59 @@ module EStem
 	#   "HaBiTaCiOnEs".es_stem   # ==> "HaBiT"
 	#   "Hacinamiento".es_stem   # ==> "Hacin"
 	#
-	#If you are not aware of the codeset the data has, then use
+	#If you are not aware of the codeset the data have, try using
 	#String#safe_es_stem instead.
 	#
 	#:call-seq:
 	# str.es_stem    => "new_str"
 	def es_stem
 		str = self.dup
-		return remove_accent(str) if str.length == 1
-		tmp = step0(str)
-		str = tmp ? tmp : str
-		unless tmp = step1(str)
-			unless tmp = step2a(str)
-				tmp = step2b(str)
-				str = tmp ? tmp : str
-			else
-				str = tmp
-			end
+		case str.length
+		when 0
+			return str
+		when 1
+			return remove_accent(str)
 		end
-		tmp = step3(str)
-		str = tmp.nil? ? str : tmp
+		step0(str)
+		unless step1(str)
+			step2b(str) unless step2a(str)
+		end
+		step3(str)
 		remove_accent(str)
 	end
 	##
 	#Use this method in case you are not aware of the codeset the data being
-	#handle has. This method returns a new string with the same codeset as
-	#the original. Be aware that this method is slower than String#es_stem()
+	#handle have. This method returns a new string with the same codeset as
+	#the original. Be aware that this method is a bit slower than String#es_stem
 	#:call-seq:
 	# str.safe_es_stem    => "new_str"
 	def safe_es_stem
-		return self.es_stem if self.encoding == Encoding::UTF_8
-		default_enc = self.encoding.name
-		str = self.dup.force_encoding('UTF-8')
-		if str.valid_encoding?
-			begin
-				tmp = str.es_stem
-				return tmp.force_encoding(default_enc)
-			rescue
-			end
+		if self.encoding == Encoding::UTF_8
+			# remove invalid characters
+			return self.chars.select{|c| c.valid_encoding? }.join.es_stem
 		end
-		if enc = Encoding.compatible?(self, VOWEL)
-			begin
-				return self.encode(enc).es_stem
-			rescue
+		unless self.valid_encoding?
+			tmp = self.dup
+			if tmp.force_encoding('UTF-8').valid_encoding?
+				begin
+					return tmp.es_stem
+				rescue
+				end
 			end
 		end
+		default_enc = self.encoding.name
+		str = self.chars.select{|c| c.valid_encoding? }.join
+		return nil if str.empty?
 		begin
-			tmp = Iconv.conv('UTF-8', self.encoding.name, self).es_stem
-			return Iconv.conv(default_enc, 'UTF-8', tmp);
+			tmp = str.encode('UTF-8', str.encoding.name).es_stem
+			return tmp.encode(default_enc, 'UTF-8');
 		rescue
 			return nil
 		end
@@ -145,8 +141,9 @@ module EStem
 		[r1,r2]
 	end
+	#=> true or false
 	def step0(str)
-		return nil unless str =~ /(se(l[ao]s?)?|l([aeo]s?)|me|nos)$/i
+		return false unless str =~ /(se(l[ao]s?)?|l([aeo]s?)|me|nos)$/i
 		suffix = $&
 		rv_text = str[rv(str)..-1]
@@ -154,21 +151,21 @@ module EStem
 		case rv_text
 		when %r{((?<=i[éÉ]ndo|[áÁ]ndo|[áéíÁÉÍ]r)#{suffix})$}ui
 			str[%r{#$&$}]=''
-			str = remove_accent(str)
-			return str
+			str.replace(remove_accent(str))
+			return true
 		when %r{((?<=iendo|ando|[aei]r)#{suffix})$}i
 			str[%r{#$&$}]=''
-			return str
+			return true
 		end
 		if rv_text =~ /yendo/i and str =~ /uyendo/i
 		      str[suffix]=''
-		      return str
+		      return true
 		end
-		nil
+		false
 	end
-	#=> new_str or nil
+	#=> true or false
 	def step1(str)
 		r1,r2 = r12(str)
 		r1_text = str[r1..-1]
@@ -177,46 +174,46 @@ module EStem
 		case r2_text
 		when /(anzas?|ic[oa]s?|ismos?|[ai]bles?|istas?|os[oa]s?|[ai]mientos?)$/i
 			str[%r{#$&$}]=''
-			return str
+			return true
 		when /(ic)?(ador([ae]s?)?|aci[óÓ]n|aciones|antes?|ancias?)$/ui
 			str[%r{#$&$}]=''
-			return str
+			return true
 		when /log[íÍ]as?/ui
 			str[%r{#$&$}]='log'
-			return str
+			return true
 		when /(uci([óÓ]n|ones))$/ui
 			str[%r{#$&$}]='u'
-			return str
+			return true
 		when /(encias?)$/i
 			str[%r{#$&$}]='ente'
-			return str
+			return true
 		end
 		if r2_text =~ /(ativ|iv|os|ic|ad)amente$/i or r1_text =~ /amente$/i
 			str[%r{#$&$}]=''
-			return str
+			return true
 		end
 		case r2_text
 		when /((ante|[ai]ble)?mente)$/i, /((abil|i[cv])?idad(es)?)$/i, /((at)?iv[ao]s?)$/i
 			str[%r{#$&$}]=''
-			return str
+			return true
 		end
-		nil
+		false
 	end
-	#=> nil or new_str
+	#=> true or false
 	def step2a(str)
 		rv_pos = rv(str)
 		idx = str[rv_pos..-1] =~ /(y[oóÓ]|ye(ron|ndo)|y[ae][ns]?|ya(is|mos))$/ui
-		return nil unless idx
+		return false unless idx
 		if 'u' == str[rv_pos+idx-1].downcase
 			str[%r{#$&$}] = ''
-			return str
+			return true
 		end
-		nil
+		false
 	end
 	STEP2B_REGEXP = /(
@@ -229,6 +226,7 @@ module EStem
 		en|es|[éÉ]is|emos
 	)$/xiu
+	#=> true or false
 	def step2b(str)
 		rv_pos =  rv(str)
@@ -240,27 +238,28 @@ module EStem
 			else
 				str[%r{#{suffix}$}]=''
 			end
-			return str
+			return true
 		end
-		nil
+		false
 	end
+	#=> true or false
 	def step3(str)
 		rv_pos = rv(str)
 		rv_text = str[rv_pos..-1]
 		if rv_text =~ /(os|[aoáíóÁÍÓ])$/ui
 			str[%r{#$&$}]=''
-			return str
+			return true
 		elsif idx = rv_text =~ /(u?[eéÉ])$/i
 			if $&[0].downcase == 'u' and str[rv_pos+idx-1].downcase == 'g'
 				str[%r{#$&$}]=''
 			else
 				str.chop!
 			end
-			return str
+			return true
 		end
-		nil
+		false
 	end
 	VOWEL = 'aeiouáéíóúüAEIOUÁÉÍÓÚÜ'
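
Reviewer note on the safe_es_stem hunk above: the Iconv calls are replaced by a String#encode round trip, preceded by a filter that drops characters with invalid encoding. The sketch below isolates that pattern with core Ruby only. The helper name utf8_roundtrip and the block standing in for es_stem are hypothetical; this is not the gem's implementation.

```ruby
# Hypothetical helper (not part of estem): transcode to UTF-8, run a block,
# then transcode the result back to the string's original encoding.
def utf8_roundtrip(str)
  original = str.encoding
  # Drop characters that are invalid in the string's declared encoding.
  clean = str.chars.select { |c| c.valid_encoding? }.join
  return nil if clean.empty?

  result = yield clean.encode('UTF-8', original)
  result.encode(original, 'UTF-8')
rescue EncodingError
  # Conversion failures are reported as nil, mirroring the diff above.
  nil
end

# "niños" as ISO-8859-1 bytes (0xF1 is "ñ" in that codeset).
latin1 = [0x6E, 0x69, 0xF1, 0x6F, 0x73].pack('C*').force_encoding('ISO-8859-1')

# String#chop stands in for the stemmer here.
out = utf8_roundtrip(latin1) { |utf8| utf8.chop }
puts out.encoding
puts out.bytes.inspect
```

The result keeps the caller's original encoding, which matches the documented contract of safe_es_stem ("returns a new string with the same codeset as the original").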

data/lib/estem.rb~ ADDED

@@ -0,0 +1,271 @@
+# encoding: UTF-8
+#
+# :title: Spanish Stemming
+# = Description
+# This gem is for reducing Spanish words to their roots. It uses an algorithm
+# based on Martin Porter's specifications.
+#
+# For more information, visit:
+# http://snowball.tartarus.org/algorithms/spanish/stemmer.html
+#
+# = Descripción
+# Esta gema está para reducir las palabras del Español en sus respectivas raíces,
+# para ello ultiliza un algoritmo basado en las especificaciones de Martin Porter
+#
+# Para más información, visite:
+# http://snowball.tartarus.org/algorithms/spanish/stemmer.html
+#
+# = License -- Licencia
+# This code is provided under the terms of the {MIT License.}[http://www.opensource.org/licenses/mit-license.php]
+#
+# = Authors
+#   * Manuel A. Güílamo maguilamo.c@gmail.com
+#
+module EStem
+	##
+	# For more information, please refer to <b>String#es_stem</b> method, also <b>EStem</b>.
+	# :method: estem
+	##
+	#This method stem Spanish words.
+	#
+	#   "albergues".es_stem      # ==> "alberg"
+	#   "habitaciones".es_stem   # ==> "habit"
+	#   "ALbeRGues".es_stem      # ==> "ALbeRG"
+	#   "HaBiTaCiOnEs".es_stem   # ==> "HaBiT"
+	#   "Hacinamiento".es_stem   # ==> "Hacin"
+	#
+	#If you are not aware of the codeset the data have, try using
+	#String#safe_es_stem instead.
+	#
+	#:call-seq:
+	# str.es_stem    => "new_str"
+	def es_stem
+		str = self.dup
+		case str.length
+		when 0
+			return str
+		when 1
+			return remove_accent(str)
+		end
+		step0(str)
+		unless step1(str)
+			step2b(str) unless step2a(str)
+		end
+		step3(str)
+		remove_accent(str)
+	end
+	##
+	#Use this method in case you are not aware of the codeset the data being
+	#handle have. This method returns a new string with the same codeset as
+	#the original. Be aware that this method is a bit slower than String#es_stem
+	#:call-seq:
+	# str.safe_es_stem    => "new_str"
+	def safe_es_stem
+		if str.encoding == Encoding::UTF_8
+			# remove invalid characters
+			return self.chars.select{|c| c.valid_encoding? }.join.es_stem
+		end
+		unless self.valid_encoding?
+			tmp = self.dup
+			if tmp.force_encoding('UTF-8').valid_encoding?
+				begin
+					return tmp.es_stem
+				rescue
+				end
+			end
+		end
+		default_enc = self.encoding.name
+		str = self.chars.select{|c| c.valid_encoding? }.join
+		return nil if str.empty?
+		begin
+			tmp = str.encode('UTF-8', str.encoding.name).es_stem
+			return tmp.encode(default_enc, 'UTF-8');
+		rescue
+			return nil
+		end
+	end
+# :stopdoc:
+	private
+	def vowel?(c)
+		VOWEL.include?(c)
+	end
+	def consonant?(c)
+		CONSONANT.include?(c)
+	end
+	def remove_accent(str)
+		str.tr('áéíóúÁÉÍÓÚ','aeiouAEIOU')
+	end
+	def rv(str)
+		if consonant? str[1]
+			i=2
+			i+=1 while str[i] and consonant? str[i]
+			return str.nil? ? str.length-1 : i+1
+		end
+		if vowel? str[0] and vowel? str[1]
+			i=2
+			i+=1 while str[i] and vowel? str[i]
+			return str.nil? ? str.length-1 : i+1
+		end
+		return 3 if consonant? str[0] and vowel? str[1]
+		str.length - 1
+	end
+	def r(str, i=0)
+		i+=1 while str[i] and consonant?(str[i])
+		i+=1
+		i+=1 while str[i] and vowel? str[i]
+		str[i].nil? ?  str.length : i+1
+	end
+	def r12(str)
+		r1 = r(str)
+		r2 = r(str,r1)
+		[r1,r2]
+	end
+	#=> true or false
+	def step0(str)
+		return false unless str =~ /(se(l[ao]s?)?|l([aeo]s?)|me|nos)$/i
+		suffix = $&
+		rv_text = str[rv(str)..-1]
+		case rv_text
+		when %r{((?<=i[éÉ]ndo|[áÁ]ndo|[áéíÁÉÍ]r)#{suffix})$}ui
+			str[%r{#$&$}]=''
+			str.replace(remove_accent(str))
+			return true
+		when %r{((?<=iendo|ando|[aei]r)#{suffix})$}i
+			str[%r{#$&$}]=''
+			return true
+		end
+		if rv_text =~ /yendo/i and str =~ /uyendo/i
+		      str[suffix]=''
+		      return true
+		end
+		false
+	end
+	#=> true or false
+	def step1(str)
+		r1,r2 = r12(str)
+		r1_text = str[r1..-1]
+		r2_text = str[r2..-1]
+		case r2_text
+		when /(anzas?|ic[oa]s?|ismos?|[ai]bles?|istas?|os[oa]s?|[ai]mientos?)$/i
+			str[%r{#$&$}]=''
+			return true
+		when /(ic)?(ador([ae]s?)?|aci[óÓ]n|aciones|antes?|ancias?)$/ui
+			str[%r{#$&$}]=''
+			return true
+		when /log[íÍ]as?/ui
+			str[%r{#$&$}]='log'
+			return true
+		when /(uci([óÓ]n|ones))$/ui
+			str[%r{#$&$}]='u'
+			return true
+		when /(encias?)$/i
+			str[%r{#$&$}]='ente'
+			return true
+		end
+		if r2_text =~ /(ativ|iv|os|ic|ad)amente$/i or r1_text =~ /amente$/i
+			str[%r{#$&$}]=''
+			return true
+		end
+		case r2_text
+		when /((ante|[ai]ble)?mente)$/i, /((abil|i[cv])?idad(es)?)$/i, /((at)?iv[ao]s?)$/i
+			str[%r{#$&$}]=''
+			return true
+		end
+		false
+	end
+	#=> true or false
+	def step2a(str)
+		rv_pos = rv(str)
+		idx = str[rv_pos..-1] =~ /(y[oóÓ]|ye(ron|ndo)|y[ae][ns]?|ya(is|mos))$/ui
+		return false unless idx
+		if 'u' == str[rv_pos+idx-1].downcase
+			str[%r{#$&$}] = ''
+			return true
+		end
+		false
+	end
+	STEP2B_REGEXP = /(
+		ar([áÁ][ns]?|a(n|s|is)?|on)? | ar([éÉ]is|emos|é|É) | ar[íÍ]a(n|s|is|mos)? |
+		er([áÁ][sn]?|[éÉ](is)?|emos|[íÍ]a(n|s|is|mos)?)? |
+		ir([íÍ]a(s|n|is|mos)?|[áÁ][ns]?|emos|[éÉ]|éis)? | aba(s|n|is)? |
+		ad([ao]s?)? | ed | id(a|as|o|os)? | [íÍ]a(n|s|is|mos)? | [íÍ]s |
+		as(e[ns]?|te|eis|teis)? | [áÁ](is|bamos|semos|ramos) | a(n|ndo|mos) |
+		ie(ra|se|ran|sen|ron|ndo|ras|ses|rais|seis) | i(ste|steis|[óÓ]|mos|[éÉ]ramos|[éÉ]semos) |
+		en|es|[éÉ]is|emos
+	)$/xiu
+	#=> true or false
+	def step2b(str)
+		rv_pos =  rv(str)
+		if idx = str[rv_pos..-1] =~ STEP2B_REGEXP
+			suffix = $&
+			if suffix =~ /^(en|es|[éÉ]is|emos)$/ui
+				str[%r{#{suffix}$}]=''
+				str[rv_pos+idx-1]='' if str[rv_pos+idx-2] =~ /g/i and  str[rv_pos+idx-1] =~ /u/i
+			else
+				str[%r{#{suffix}$}]=''
+			end
+			return true
+		end
+		false
+	end
+	#=> true or false
+	def step3(str)
+		rv_pos = rv(str)
+		rv_text = str[rv_pos..-1]
+		if rv_text =~ /(os|[aoáíóÁÍÓ])$/ui
+			str[%r{#$&$}]=''
+			return true
+		elsif idx = rv_text =~ /(u?[eéÉ])$/i
+			if $&[0].downcase == 'u' and str[rv_pos+idx-1].downcase == 'g'
+				str[%r{#$&$}]=''
+			else
+				str.chop!
+			end
+			return true
+		end
+		false
+	end
+	VOWEL = 'aeiouáéíóúüAEIOUÁÉÍÓÚÜ'
+	CONSONANT = "bcdfghjklmnñpqrstvwxyzABCDEFGHIJKLMNÑOPQRSTUVWXYZ"
+end
+class String
+	include EStem
+end
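
Reviewer note on the es_stem rewrite: in 0.2.5 each step mutates the working copy in place and returns true or false, which lets the caller chain fallbacks with plain `unless` instead of juggling a temporary. The sketch below shows that control-flow pattern with two toy rules. The method names and rules are hypothetical stand-ins, not the Spanish stemming rules from estem.rb.

```ruby
# Hypothetical toy step: strip a trailing "s" in place.
# Returns true when the rule fired, false otherwise.
def drop_plural!(str)
  return false unless str =~ /s$/
  str[/s$/] = ''
  true
end

# Hypothetical toy step: strip a trailing a/e/o in place.
def drop_final_vowel!(str)
  return false unless str =~ /[aeo]$/
  str[/[aeo]$/] = ''
  true
end

def toy_stem(word)
  str = word.dup                 # work on a copy; the caller's string is untouched
  return str if str.empty?
  # The second rule runs only if the first one did not fire.
  drop_final_vowel!(str) unless drop_plural!(str)
  str
end

puts toy_stem("casas")
puts toy_stem("casa")
```

Because the steps share one mutable string, a boolean is all the caller needs to decide whether the next fallback should run.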