RubyGems - estem - Versions diffs - 0.2.4 → 0.2.5 - Mend

estem 0.2.4 → 0.2.5

Files changed (12)

data/ChangeLog +37 -10
data/README.rdoc +7 -38
data/examples/usage.rb +0 -2
data/examples/usage.rb~ +11 -0
data/lib/estem.rb +59 -60
data/lib/estem.rb~ +271 -0
data/test/diffs_ISO88591.txt +28390 -0
data/test/{diffs.txt → diffs_UTF8.txt} +0 -0
data/test/test_estem.rb +15 -10
data/test/test_estem.rb~ +27 -0
metadata +10 -5
data/bin/es_stem.rb +0 -179

data/ChangeLog CHANGED

@@ -1,3 +1,30 @@
+Version 0.2.5
+2012-09-02  MaG <maguilamo.c@gmail.com>
+*
+  - bin/ directory, removed.
+* README.rdoc:
+  - cleanups
+  - Thanks section, removed.
+* examples/usage.rb:
+  - cleanups
+* estem.rb:
+  - (es_stem): rewritten.
+  - (safe_es_stem): deprecated Iconv, removed.
+* bin/es_stem.rb:
+  - removed.
+* test/:
+  - new test file added.
+  - rename file diffs.txt.
+* test/test_estem.rb:
+  - one more test added.
 Version 0.2.4
 2012-06-25  MaG <maguilamo.c@gmail.com>
@@ -9,19 +36,19 @@ Version 0.2.4
   - examples/usage.rb: new file
 * README.rdoc:
-	- max 80 cols per line.
-	- recomendation about using safe_es_stem().
-	- Fix Spanish typos.
+  - max 80 cols per line.
+  - recomendation about using safe_es_stem().
+  - Fix Spanish typos.
 * estem.gemspec:
-	- cleanups.
-	- (required_ruby_version): Ruby 1.9.1.
+  - cleanups.
+  - (required_ruby_version): Ruby 1.9.1.
 * bin/es_stem.rb:
-	- chmod a+x .
-	- (es_stem.rb:80): fix case sensitive comparation.
-	- (es_stem.rb:25): removed .rb ext.
-	- (es_stem.rb:29): new version.
+  - chmod a+x .
+  - (es_stem.rb:80): fix case sensitive comparation.
+  - (es_stem.rb:25): removed .rb ext.
+  - (es_stem.rb:29): new version.
 * estem.rb:
-	- (safe_es_stem): new method.
+  - (safe_es_stem): new method.

data/README.rdoc CHANGED

@@ -1,7 +1,7 @@
 = Spanish Stem Gem
 == Description
-This gem is for reducing Spanish words to their roots. It uses an algorithm
+This gem reduces Spanish words to their respective roots. It uses an algorithm
 based on Martin Porter's specifications.
 For more information, visit:
@@ -21,15 +21,14 @@ or
   $ gem install estem
 == Usage
-As a reminder, take in consideration that the Spanish language have several non
+As a reminder, take in consideration that the Spanish language has several non
 US-ASCII characters, and because of that, the same data may varied from one
 codeset to another.
 Please remember to use a UTF-8 compatible encoding while using EStem. Please do
-not use String#force_encoding() to convert from one codeset to another, you
-might try using String#encode() but this later is more likely to fail, consider
-using String#safe_es_stem() when handling incompatibles codesets or the codeset
-type is unknown.
+not use String#force_encoding to convert from one codeset to another, you may
+try using String#encode alone but, instead, consider using String#safe_es_stem
+when handling incompatibles codesets or the codeset type varies.
   require 'estem'
@@ -41,19 +40,6 @@ type is unknown.
   puts "HaBiTaCiOnEs".es_stem   # ==> "HaBiT"
   puts "Hacinamiento".es_stem   # ==> "Hacin"
-You can use <tt>EStem</tt> as a command line tool:
-  $ es_stem --in-enc ISO-8859-1 -f input_file.txt
-for more information type
-  $ es_stem --help
-The <tt>es_stem</tt> program do his best trying to tokenized the lines from
-the file, you might consider finding an Spanish tokenizer, either way this
-program do what it is suppose to do, stem Spanish words.
-NOTE: For excellent results, consider replacing one word per line on the files
-the program handles.
 == Uso
 Como recordatorio, ten en cosideración que el Castellano posee muchos
 carácteres que están fuera del código ASCII, y por esta razón, los datos pueden
@@ -62,9 +48,8 @@ variar de un conjunto de codificación a otro.
 Por favor recuerda utilizar sistemas de condificación compatibles con UTF-8
 cuando se trabaje con EStem. Por favor no use String#force_encoding para
 convertir de un conjunto de codificación a otro, podría utilizar String#encode
-pero este último es más probable que falle en el intento, considere utilizar
-String#safe_es_stem() si está manejando conjuntos de codificación incompatibles
-o se desconoce el tipo.
+solo, pero en su lugar, considere utilizar String#safe_es_stem() si está
+manejando conjuntos de codificación incompatibles o se desconoce el tipo.
   require 'estem'
@@ -75,17 +60,6 @@ o se desconoce el tipo.
   puts "ALbeRGues".es_stem      # ==> "ALbeRG"
   puts "HaBiTaCiOnEs".es_stem   # ==> "HaBiT"
   puts "Hacinamiento".es_stem   # ==> "Hacin"
-Para más información ejecuta:
-  $ es_stem --help
-El programa <tt>es_stem</tt> hará lo posible para separar las palabras de cada
-línea del fichero. Sería sensato utilizar otro programa más especializado para
-este propósito, de todas maneras, es_stem hace lo que se supone debe hacer,
-optener las raíces de las palabras.
-NOTA: Para resultados excelentes, considere poner una palabra por línea en los
-ficheros que pasará el programa.
 == Test
@@ -101,11 +75,6 @@ Incluye 28390 palabras de prueba con sus resultado esperados. Para realizar
 la prueba, ejecuta:
   rake test
-== Thanks -- Agradecimientos
-Ray Pereda https://github.com/raypereda/stemmify/ I used his gem as a guide to
-package mine. http://guides.rubygems.org/make-your-own-gem/ as well.
 == License -- Licencia
 Copyright (c) 2012 Manuel A. Güílamo
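
The encoding advice in the README hunks above (avoid String#force_encoding for conversion, prefer a real transcoding step) can be illustrated with core String methods alone. This is a minimal sketch and not part of the gem; the sample word and variable names are assumptions chosen for illustration.

```ruby
# "canción" stored as ISO-8859-1: the "ó" is the single byte 0xF3.
latin1 = "canci\xF3n".dup.force_encoding(Encoding::ISO_8859_1)

# force_encoding only relabels the existing bytes; nothing is converted.
relabeled = latin1.dup.force_encoding(Encoding::UTF_8)
relabeled_valid = relabeled.valid_encoding?   # 0xF3 followed by "n" is not valid UTF-8

# encode transcodes the bytes from the source encoding into UTF-8.
converted = latin1.encode(Encoding::UTF_8)
converted_valid = converted.valid_encoding?
same_word = (converted == "canci\u00F3n")

puts "relabeled valid? #{relabeled_valid}"
puts "converted valid? #{converted_valid}, same word? #{same_word}"
```

Relabeling leaves a byte sequence that is not valid UTF-8, which is why the README steers users away from it when the data comes from another codeset.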

data/examples/usage.rb CHANGED

@@ -1,7 +1,5 @@
 require 'estem'
-hsh = Hash.new
 words = ['albergues','habitaciones','Albergues','ALbeRGues','HaBiTaCiOnEs',
          'Hacinamiento','mujeres','muchedumbre','ocasionalmente']

data/examples/usage.rb~ ADDED

@@ -0,0 +1,11 @@
+require 'estem'
+hsh = Hash.new
+words = ['albergues','habitaciones','Albergues','ALbeRGues','HaBiTaCiOnEs',
+         'Hacinamiento','mujeres','muchedumbre','ocasionalmente']
+words.each do|w|
+	stem = w.es_stem
+	puts "Word: #{w}\nStem: #{stem}\n\n"
+end

data/lib/estem.rb CHANGED

@@ -22,8 +22,6 @@
 #   * Manuel A. Güílamo maguilamo.c@gmail.com
 #
-require 'iconv'
 module EStem
 	##
 	# For more information, please refer to <b>String#es_stem</b> method, also <b>EStem</b>.
@@ -38,61 +36,59 @@ module EStem
 	#   "HaBiTaCiOnEs".es_stem   # ==> "HaBiT"
 	#   "Hacinamiento".es_stem   # ==> "Hacin"
 	#
-	#If you are not aware of the codeset the data has, then use
+	#If you are not aware of the codeset the data have, try using
 	#String#safe_es_stem instead.
 	#
 	#:call-seq:
 	# str.es_stem    => "new_str"
 	def es_stem
 		str = self.dup
-		return remove_accent(str) if str.length == 1
-		tmp = step0(str)
-		str = tmp ? tmp : str
-		unless tmp = step1(str)
-			unless tmp = step2a(str)
-				tmp = step2b(str)
-				str = tmp ? tmp : str
-			else
-				str = tmp
-			end
+		case str.length
+		when 0
+			return str
+		when 1
+			return remove_accent(str)
 		end
-		tmp = step3(str)
-		str = tmp.nil? ? str : tmp
+		step0(str)
+		unless step1(str)
+			step2b(str) unless step2a(str)
+		end
+		step3(str)
 		remove_accent(str)
 	end
 	##
 	#Use this method in case you are not aware of the codeset the data being
-	#handle has. This method returns a new string with the same codeset as
-	#the original. Be aware that this method is slower than String#es_stem()
+	#handle have. This method returns a new string with the same codeset as
+	#the original. Be aware that this method is a bit slower than String#es_stem
 	#:call-seq:
 	# str.safe_es_stem    => "new_str"
 	def safe_es_stem
-		return self.es_stem if self.encoding == Encoding::UTF_8
-		default_enc = self.encoding.name
-		str = self.dup.force_encoding('UTF-8')
-		if str.valid_encoding?
-			begin
-				tmp = str.es_stem
-				return tmp.force_encoding(default_enc)
-			rescue
-			end
+		if self.encoding == Encoding::UTF_8
+			# remove invalid characters
+			return self.chars.select{|c| c.valid_encoding? }.join.es_stem
 		end
-		if enc = Encoding.compatible?(self, VOWEL)
-			begin
-				return self.encode(enc).es_stem
-			rescue
+		unless self.valid_encoding?
+			tmp = self.dup
+			if tmp.force_encoding('UTF-8').valid_encoding?
+				begin
+					return tmp.es_stem
+				rescue
+				end
 			end
 		end
+		default_enc = self.encoding.name
+		str = self.chars.select{|c| c.valid_encoding? }.join
+		return nil if str.empty?
 		begin
-			tmp = Iconv.conv('UTF-8', self.encoding.name, self).es_stem
-			return Iconv.conv(default_enc, 'UTF-8', tmp);
+			tmp = str.encode('UTF-8', str.encoding.name).es_stem
+			return tmp.encode(default_enc, 'UTF-8');
 		rescue
 			return nil
 		end
@@ -145,8 +141,9 @@ module EStem
 		[r1,r2]
 	end
+	#=> true or false
 	def step0(str)
-		return nil unless str =~ /(se(l[ao]s?)?|l([aeo]s?)|me|nos)$/i
+		return false unless str =~ /(se(l[ao]s?)?|l([aeo]s?)|me|nos)$/i
 		suffix = $&
 		rv_text = str[rv(str)..-1]
@@ -154,21 +151,21 @@ module EStem
 		case rv_text
 		when %r{((?<=i[éÉ]ndo|[áÁ]ndo|[áéíÁÉÍ]r)#{suffix})$}ui
 			str[%r{#$&$}]=''
-			str = remove_accent(str)
-			return str
+			str.replace(remove_accent(str))
+			return true
 		when %r{((?<=iendo|ando|[aei]r)#{suffix})$}i
 			str[%r{#$&$}]=''
-			return str
+			return true
 		end
 		if rv_text =~ /yendo/i and str =~ /uyendo/i
 		      str[suffix]=''
-		      return str
+		      return true
 		end
-		nil
+		false
 	end
-	#=> new_str or nil
+	#=> true or false
 	def step1(str)
 		r1,r2 = r12(str)
 		r1_text = str[r1..-1]
@@ -177,46 +174,46 @@ module EStem
 		case r2_text
 		when /(anzas?|ic[oa]s?|ismos?|[ai]bles?|istas?|os[oa]s?|[ai]mientos?)$/i
 			str[%r{#$&$}]=''
-			return str
+			return true
 		when /(ic)?(ador([ae]s?)?|aci[óÓ]n|aciones|antes?|ancias?)$/ui
 			str[%r{#$&$}]=''
-			return str
+			return true
 		when /log[íÍ]as?/ui
 			str[%r{#$&$}]='log'
-			return str
+			return true
 		when /(uci([óÓ]n|ones))$/ui
 			str[%r{#$&$}]='u'
-			return str
+			return true
 		when /(encias?)$/i
 			str[%r{#$&$}]='ente'
-			return str
+			return true
 		end
 		if r2_text =~ /(ativ|iv|os|ic|ad)amente$/i or r1_text =~ /amente$/i
 			str[%r{#$&$}]=''
-			return str
+			return true
 		end
 		case r2_text
 		when /((ante|[ai]ble)?mente)$/i, /((abil|i[cv])?idad(es)?)$/i, /((at)?iv[ao]s?)$/i
 			str[%r{#$&$}]=''
-			return str
+			return true
 		end
-		nil
+		false
 	end
-	#=> nil or new_str
+	#=> true or false
 	def step2a(str)
 		rv_pos = rv(str)
 		idx = str[rv_pos..-1] =~ /(y[oóÓ]|ye(ron|ndo)|y[ae][ns]?|ya(is|mos))$/ui
-		return nil unless idx
+		return false unless idx
 		if 'u' == str[rv_pos+idx-1].downcase
 			str[%r{#$&$}] = ''
-			return str
+			return true
 		end
-		nil
+		false
 	end
 	STEP2B_REGEXP = /(
@@ -229,6 +226,7 @@ module EStem
 		en|es|[éÉ]is|emos
 	)$/xiu
+	#=> true or false
 	def step2b(str)
 		rv_pos =  rv(str)
@@ -240,27 +238,28 @@ module EStem
 			else
 				str[%r{#{suffix}$}]=''
 			end
-			return str
+			return true
 		end
-		nil
+		false
 	end
+	#=> true or false
 	def step3(str)
 		rv_pos = rv(str)
 		rv_text = str[rv_pos..-1]
 		if rv_text =~ /(os|[aoáíóÁÍÓ])$/ui
 			str[%r{#$&$}]=''
-			return str
+			return true
 		elsif idx = rv_text =~ /(u?[eéÉ])$/i
 			if $&[0].downcase == 'u' and str[rv_pos+idx-1].downcase == 'g'
 				str[%r{#$&$}]=''
 			else
 				str.chop!
 			end
-			return str
+			return true
 		end
-		nil
+		false
 	end
 	VOWEL = 'aeiouáéíóúüAEIOUÁÉÍÓÚÜ'
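
The hunks above change the convention of the step helpers: instead of returning a new string or nil, each step now edits its argument in place and returns true or false, which lets es_stem chain them with plain `unless`. The pattern can be sketched with two made-up rules. `strip_es!`, `strip_s!` and `toy_stem` are hypothetical names for illustration only; they are not the gem's steps and do not implement the Snowball algorithm.

```ruby
# Hypothetical step: remove a trailing "es" in place.
# Returns true when the string was changed, false otherwise.
def strip_es!(str)
  return false unless str =~ /es$/i
  str[/es$/i] = ''   # String#[]= with a Regexp replaces the matched part in place
  true
end

# Hypothetical fallback step: remove a trailing "s" in place.
def strip_s!(str)
  return false unless str =~ /s$/i
  str[/s$/i] = ''
  true
end

def toy_stem(word)
  str = word.dup                        # work on a copy; the caller's string is untouched
  strip_s!(str) unless strip_es!(str)   # the fallback runs only if the first step did nothing
  str
end

puts toy_stem("flores")
puts toy_stem("casas")
```

Because every step mutates the same working copy, the caller no longer needs the temporary variable and nil checks that the removed lines carried.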

data/lib/estem.rb~ ADDED

@@ -0,0 +1,271 @@
+# encoding: UTF-8
+#
+# :title: Spanish Stemming
+# = Description
+# This gem is for reducing Spanish words to their roots. It uses an algorithm
+# based on Martin Porter's specifications.
+#
+# For more information, visit:
+# http://snowball.tartarus.org/algorithms/spanish/stemmer.html
+#
+# = Descripción
+# Esta gema está para reducir las palabras del Español en sus respectivas raíces,
+# para ello ultiliza un algoritmo basado en las especificaciones de Martin Porter
+#
+# Para más información, visite:
+# http://snowball.tartarus.org/algorithms/spanish/stemmer.html
+#
+# = License -- Licencia
+# This code is provided under the terms of the {MIT License.}[http://www.opensource.org/licenses/mit-license.php]
+#
+# = Authors
+#   * Manuel A. Güílamo maguilamo.c@gmail.com
+#
+module EStem
+	##
+	# For more information, please refer to <b>String#es_stem</b> method, also <b>EStem</b>.
+	# :method: estem
+	##
+	#This method stem Spanish words.
+	#
+	#   "albergues".es_stem      # ==> "alberg"
+	#   "habitaciones".es_stem   # ==> "habit"
+	#   "ALbeRGues".es_stem      # ==> "ALbeRG"
+	#   "HaBiTaCiOnEs".es_stem   # ==> "HaBiT"
+	#   "Hacinamiento".es_stem   # ==> "Hacin"
+	#
+	#If you are not aware of the codeset the data have, try using
+	#String#safe_es_stem instead.
+	#
+	#:call-seq:
+	# str.es_stem    => "new_str"
+	def es_stem
+		str = self.dup
+		case str.length
+		when 0
+			return str
+		when 1
+			return remove_accent(str)
+		end
+		step0(str)
+		unless step1(str)
+			step2b(str) unless step2a(str)
+		end
+		step3(str)
+		remove_accent(str)
+	end
+	##
+	#Use this method in case you are not aware of the codeset the data being
+	#handle have. This method returns a new string with the same codeset as
+	#the original. Be aware that this method is a bit slower than String#es_stem
+	#:call-seq:
+	# str.safe_es_stem    => "new_str"
+	def safe_es_stem
+		if str.encoding == Encoding::UTF_8
+			# remove invalid characters
+			return self.chars.select{|c| c.valid_encoding? }.join.es_stem
+		end
+		unless self.valid_encoding?
+			tmp = self.dup
+			if tmp.force_encoding('UTF-8').valid_encoding?
+				begin
+					return tmp.es_stem
+				rescue
+				end
+			end
+		end
+		default_enc = self.encoding.name
+		str = self.chars.select{|c| c.valid_encoding? }.join
+		return nil if str.empty?
+		begin
+			tmp = str.encode('UTF-8', str.encoding.name).es_stem
+			return tmp.encode(default_enc, 'UTF-8');
+		rescue
+			return nil
+		end
+	end
+# :stopdoc:
+	private
+	def vowel?(c)
+		VOWEL.include?(c)
+	end
+	def consonant?(c)
+		CONSONANT.include?(c)
+	end
+	def remove_accent(str)
+		str.tr('áéíóúÁÉÍÓÚ','aeiouAEIOU')
+	end
+	def rv(str)
+		if consonant? str[1]
+			i=2
+			i+=1 while str[i] and consonant? str[i]
+			return str.nil? ? str.length-1 : i+1
+		end
+		if vowel? str[0] and vowel? str[1]
+			i=2
+			i+=1 while str[i] and vowel? str[i]
+			return str.nil? ? str.length-1 : i+1
+		end
+		return 3 if consonant? str[0] and vowel? str[1]
+		str.length - 1
+	end
+	def r(str, i=0)
+		i+=1 while str[i] and consonant?(str[i])
+		i+=1
+		i+=1 while str[i] and vowel? str[i]
+		str[i].nil? ?  str.length : i+1
+	end
+	def r12(str)
+		r1 = r(str)
+		r2 = r(str,r1)
+		[r1,r2]
+	end
+	#=> true or false
+	def step0(str)
+		return false unless str =~ /(se(l[ao]s?)?|l([aeo]s?)|me|nos)$/i
+		suffix = $&
+		rv_text = str[rv(str)..-1]
+		case rv_text
+		when %r{((?<=i[éÉ]ndo|[áÁ]ndo|[áéíÁÉÍ]r)#{suffix})$}ui
+			str[%r{#$&$}]=''
+			str.replace(remove_accent(str))
+			return true
+		when %r{((?<=iendo|ando|[aei]r)#{suffix})$}i
+			str[%r{#$&$}]=''
+			return true
+		end
+		if rv_text =~ /yendo/i and str =~ /uyendo/i
+		      str[suffix]=''
+		      return true
+		end
+		false
+	end
+	#=> true or false
+	def step1(str)
+		r1,r2 = r12(str)
+		r1_text = str[r1..-1]
+		r2_text = str[r2..-1]
+		case r2_text
+		when /(anzas?|ic[oa]s?|ismos?|[ai]bles?|istas?|os[oa]s?|[ai]mientos?)$/i
+			str[%r{#$&$}]=''
+			return true
+		when /(ic)?(ador([ae]s?)?|aci[óÓ]n|aciones|antes?|ancias?)$/ui
+			str[%r{#$&$}]=''
+			return true
+		when /log[íÍ]as?/ui
+			str[%r{#$&$}]='log'
+			return true
+		when /(uci([óÓ]n|ones))$/ui
+			str[%r{#$&$}]='u'
+			return true
+		when /(encias?)$/i
+			str[%r{#$&$}]='ente'
+			return true
+		end
+		if r2_text =~ /(ativ|iv|os|ic|ad)amente$/i or r1_text =~ /amente$/i
+			str[%r{#$&$}]=''
+			return true
+		end
+		case r2_text
+		when /((ante|[ai]ble)?mente)$/i, /((abil|i[cv])?idad(es)?)$/i, /((at)?iv[ao]s?)$/i
+			str[%r{#$&$}]=''
+			return true
+		end
+		false
+	end
+	#=> true or false
+	def step2a(str)
+		rv_pos = rv(str)
+		idx = str[rv_pos..-1] =~ /(y[oóÓ]|ye(ron|ndo)|y[ae][ns]?|ya(is|mos))$/ui
+		return false unless idx
+		if 'u' == str[rv_pos+idx-1].downcase
+			str[%r{#$&$}] = ''
+			return true
+		end
+		false
+	end
+	STEP2B_REGEXP = /(
+		ar([áÁ][ns]?|a(n|s|is)?|on)? | ar([éÉ]is|emos|é|É) | ar[íÍ]a(n|s|is|mos)? |
+		er([áÁ][sn]?|[éÉ](is)?|emos|[íÍ]a(n|s|is|mos)?)? |
+		ir([íÍ]a(s|n|is|mos)?|[áÁ][ns]?|emos|[éÉ]|éis)? | aba(s|n|is)? |
+		ad([ao]s?)? | ed | id(a|as|o|os)? | [íÍ]a(n|s|is|mos)? | [íÍ]s |
+		as(e[ns]?|te|eis|teis)? | [áÁ](is|bamos|semos|ramos) | a(n|ndo|mos) |
+		ie(ra|se|ran|sen|ron|ndo|ras|ses|rais|seis) | i(ste|steis|[óÓ]|mos|[éÉ]ramos|[éÉ]semos) |
+		en|es|[éÉ]is|emos
+	)$/xiu
+	#=> true or false
+	def step2b(str)
+		rv_pos =  rv(str)
+		if idx = str[rv_pos..-1] =~ STEP2B_REGEXP
+			suffix = $&
+			if suffix =~ /^(en|es|[éÉ]is|emos)$/ui
+				str[%r{#{suffix}$}]=''
+				str[rv_pos+idx-1]='' if str[rv_pos+idx-2] =~ /g/i and  str[rv_pos+idx-1] =~ /u/i
+			else
+				str[%r{#{suffix}$}]=''
+			end
+			return true
+		end
+		false
+	end
+	#=> true or false
+	def step3(str)
+		rv_pos = rv(str)
+		rv_text = str[rv_pos..-1]
+		if rv_text =~ /(os|[aoáíóÁÍÓ])$/ui
+			str[%r{#$&$}]=''
+			return true
+		elsif idx = rv_text =~ /(u?[eéÉ])$/i
+			if $&[0].downcase == 'u' and str[rv_pos+idx-1].downcase == 'g'
+				str[%r{#$&$}]=''
+			else
+				str.chop!
+			end
+			return true
+		end
+		false
+	end
+	VOWEL = 'aeiouáéíóúüAEIOUÁÉÍÓÚÜ'
+	CONSONANT = "bcdfghjklmnñpqrstvwxyzABCDEFGHIJKLMNÑOPQRSTUVWXYZ"
+end
+class String
+	include EStem
+end
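
The Iconv-free path in safe_es_stem shown in the listing above follows a general shape: drop invalid characters, transcode to UTF-8, run the UTF-8-only work, then transcode back to the original encoding. A minimal sketch of that shape, using only core String methods, follows. `with_utf8` is a hypothetical helper that is not part of the gem, and the block passed to it stands in for es_stem.

```ruby
# Hypothetical helper: run a UTF-8-only transformation on a string in any
# encoding and return the result in the string's original encoding.
# Returns nil when nothing valid remains or when transcoding fails.
def with_utf8(str)
  original = str.encoding
  # Keep only characters that are valid in the string's own encoding.
  clean = str.chars.select { |c| c.valid_encoding? }.join
  return nil if clean.empty?
  result = yield clean.encode(Encoding::UTF_8)
  result.encode(original)
rescue EncodingError
  nil
end

latin1 = "canci\u00F3n".encode(Encoding::ISO_8859_1)
# Stand-in for the stemmer: replace the accented vowel, as remove_accent does.
stemmed = with_utf8(latin1) { |s| s.tr("\u00F3", 'o') }
puts stemmed.encoding
```

Rescuing EncodingError covers both invalid byte sequences and characters that have no equivalent in the target encoding, the cases the bare `rescue` in the listing is there to absorb.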