indonesian_stemmer 0.0.1 → 0.1.0 (RubyGems)

Files changed (5)

data/README.md +28 -2
data/lib/indonesian_stemmer/morphological_utility.rb +126 -21
data/lib/indonesian_stemmer/version.rb +1 -1
data/spec/indonesian_stemmer/morphological_utility_spec.rb +4 -0
metadata +4 -4

data/README.md CHANGED

@@ -1,6 +1,13 @@
 # IndonesianStemmer
-TODO: Write a gem description
+[![Gem Version](https://badge.fury.io/rb/indonesian_stemmer.png)](http://badge.fury.io/rb/indonesian_stemmer)
+[![Build Status](https://secure.travis-ci.org/apraditya/indonesian_stemmer.png)](http://travis-ci.org/apraditya/indonesian_stemmer)
+[![Dependency Status](https://gemnasium.com/apraditya/indonesian_stemmer.png)](https://gemnasium.com/apraditya/indonesian_stemmer)
+[![Code Climate](https://codeclimate.com/github/apraditya/indonesian_stemmer.png)](https://codeclimate.com/github/apraditya/indonesian_stemmer)
+Stems Indonesian words based on Porter Stemmer, with the algorithm presented in [**A Study of Stemming Effects on Information Retrieval in Bahasa Indonesia**](http://www.illc.uva.nl/Publications/ResearchReports/MoL-2003-02.text.pdf), by Fadillah Z Tala.
 ## Installation
@@ -18,9 +25,28 @@ Or install it yourself as:
 ## Usage
-TODO: Write usage instructions here
+    require 'rubygems'
+    require 'indonesian_stemmer'
+    IndonesianStemmer.stem('mendengarkan')  # => "dengar"
+    'beriman'.stem                          # => "iman"
+## Known Problems
+This gem is in active development, don't rely on this for your analysis or datamining projects. Currently there's no problems stemming Indonesian words. Please [submit a ticket](https://github.com/apraditya/indonesian_stemmer/issues/new) if you find one.
 ## Contributing
+Initially, this gem is based on [Apache Lucene](http://lucene.apache.org/). Currently it's just a ruby port from its analyzer for Indonesian. Its stemmer library only analyze the word length, therefore some modifications added in order to get the actual stemmed word. Feel free to download Lucene's source code under `analysis/common/src/java/org/apache/lucene/analysis/id/`.
+### References
+Some references to help your contribution:
+1. [The Official Kamus Bahasa Indonesia](http://bahasa.kemdiknas.go.id/kbbi/index.php)
+2. To search Indonesian words and their roots, use the [Unofficial Kamus Besar Bahasa Indonesia](http://www.kamusbesar.com/)
+3. Wikipedia's [Prefiks dalam Bahasa Indonesia](http://id.wikipedia.org/wiki/Prefiks_dalam_bahasa_Indonesia)
+### Steps
 1. Fork it
 2. Create your feature branch (`git checkout -b my-new-feature`)

data/lib/indonesian_stemmer/morphological_utility.rb CHANGED

@@ -7,12 +7,65 @@ module IndonesianStemmer
   POSSESSIVE_PRONOUN_CHARACTERS               = %w( ku mu nya )
   FIRST_ORDER_PREFIX_CHARACTERS               = %w( meng meny men mem me
                                                     peng peny pen pem di ter ke )
-  SPECIAL_FIRST_ORDER_PREFIX_CHARACTERS       = %w( meny peny pen )
+  SPECIAL_FIRST_ORDER_PREFIX_CHARACTERS       = %w( meng peng meny peny men pen
+                                                    mem pem )
   SECOND_ORDER_PREFIX_CHARACTERS              = %w( ber be per pe )
-  SPECIAL_SECOND_ORDER_PREFIX_CHARACTERS      = %w( be )
   NON_SPECIAL_SECOND_ORDER_PREFIX_CHARACTERS  = %w( ber per pe )
   SPECIAL_SECOND_ORDER_PREFIX_WORDS           = %w( belajar pelajar belunjur )
   SUFFIX_CHARACTERS                           = %w( kan an i )
+  WITH_VOWEL_SUBSTITUTION_PREFIX_CHARACTERS   = %w( meny peny men pen )
+  IRREGULARS_FOR_WORDS_BEGINS_WITH_K  = %w(
+    aget alah andung ata ejar eluar embali empis emuka ena enal encang endali ering
+    erja erut etahui etik ibar irim uasai uliti umpul unci unjung unyah upas urang )
+  IRREGULARS_FOR_WORDS_BEGINS_WITH_P  = %w(
+    adam ahat akai amer anas ancang anggang anggil anjat antul asang asti atuhi
+    ecah ecat elihara eluk ercik eriksa erintah esan ikir ilah ilih injam inta
+    isah otong otret uja uji ukul usat utar-balik utus )
+  IRREGULARS_FOR_WORDS_BEGINS_WITH_N = %w( aas ada adi afi afsu aif aik akal akoda
+    alar ama anti arasi asab asib asional atif asehat asihat atural etral ikah )
+  IRREGULARS_FOR_WORDS_BEGINS_WITH_R = %w( aba abak aban abas abat abet abit
+    abuk abun abung abut acak acau acik acuh acun adah adai adak adang adiasi
+    adikal adio adu aga agam agas agi agu aguk ahap ahasia ahat ahim ahmat aih
+    aja ajah ajalela ajam ajang ajin ajuk ajut akap akat akit aksi akuk akus
+    akut akyat alat alip amah amahtamah amah-tamah amai amal ambah ambai ambak
+    amban ambang ambat ambeh ambu ambut amin ampai ampak ampang ampas ampat
+    amping ampok ampung ampus amu amus anap anca ancah ancak ancang ancap
+    ancu ancung anda andai andak andat andau andek anduk andung angah angai
+    angak anggah asa usak )
+  IRREGULAR_PREFIX_CHARACTERS_ON_WORDS = {
+    'meng' => IRREGULARS_FOR_WORDS_BEGINS_WITH_K,
+    'peng' => IRREGULARS_FOR_WORDS_BEGINS_WITH_K,
+    'mem' => IRREGULARS_FOR_WORDS_BEGINS_WITH_P,
+    'pem' => IRREGULARS_FOR_WORDS_BEGINS_WITH_P,  }
+  IRREGULAR_WORDS_ENDS_WITH_COMMON_CHARACTERS = {
+    'kah' => %w(  bengkah berkah bingkah bongkah cekah firkah halakah halkah
+                  harakah ingkah jangkah jerkah kalah kekah kelakah kerakah kerkah
+                  khalikah langkah lukah markah mukah musyarakah nafkah naskah
+                  nikah pangkah rakah rekah rengkah sedekah sekah serakah serkah
+                  sungkah takah tekah telingkah tingkah tongkah ),
+    'lah' => %w(  balah belah beslah bilah celah galah islah istilah jumlah
+                  kalah kelah kilah lalah lelah makalah malah masalah
+                  muamalah mujadalah mukabalah olah onslah oplah pecahbelah
+                  pecah-belah pilah milah sekolah rihlah risalah salah serlah
+                  silsilah sudah sulalah telah tulah ulah uzlah walah wasilah ),
+    'pun' => %w(  ampun depun himpun lapun rapun rumpun ),
+    'ku'  => %w(  awabeku baku bangku beku beluku biku buku ceku ciku cuku deku
+                  jibaku kaku laku leku liku luku paku pangku peku perilaku saku
+                  siku suku teleku terungku tungku waluku ),
+    'mu'  => %w(  ilmu jamu jemu kemu ramu selumu tamu temu ),
+    'nya' => %w(  tanya  ),
+  }
   REMOVED_KE    = 1
   REMOVED_PENG  = 2
@@ -56,19 +109,10 @@ module IndonesianStemmer
       def remove_first_order_prefix(word)
         @number_of_syllables ||= total_syllables(word)
-        word_size = word.size
-        SPECIAL_FIRST_ORDER_PREFIX_CHARACTERS.each do |characters|
-          characters_size = characters.size
-          if starts_with?(word, word_size, characters) && word_size > characters_size && is_vowel?(word[characters_size])
-            @flags ||= collection_for(characters, 'removed')
-            reduce_syllable
-            word = substitute_word_character(word, characters)
-            slice_word_at_position( word,
-                                    characters_size-1,
-                                    :start )
-            return word
-          end
-        end
+        previous_word = word.dup
+        remove_and_substitute_characters_matching_collection(
+            word, collection_for(:special_first_order_prefix), :start )
+        return word if previous_word != word
         remove_characters_matching_collection( word,
                                               collection_for(:first_order_prefix),
@@ -149,10 +193,12 @@ module IndonesianStemmer
         def remove_characters_matching_collection(word, collection, position)
           collection.each do |characters|
             if send("#{position}s_with?", word, word.size, characters)
-              @flags ||= collection_for(characters, 'removed')
-              reduce_syllable
-              slice_word_at_position(word, characters.size, position)
-              return word
+              unless ambiguous_with_characters?(word, characters, position)
+                @flags ||= collection_for(characters, 'removed')
+                reduce_syllable
+                slice_word_at_position(word, characters.size, position)
+                return word
+              end
             end
           end
@@ -164,17 +210,76 @@ module IndonesianStemmer
           word.slice!( multiplier*characters_size, characters_size)
         end
+        def remove_and_substitute_characters_matching_collection(word, collection, position)
+          word_size = word.size
+          collection.each do |characters|
+            characters_size = characters.size
+            if send("#{position}s_with?", word, word_size, characters) &&
+                  word_size > characters_size && is_vowel?(word[characters_size])
+              if WITH_VOWEL_SUBSTITUTION_PREFIX_CHARACTERS.include?(characters) ||
+                    contains_irregular_prefix?(word, characters)
+                @flags ||= collection_for(characters, 'removed')
+                reduce_syllable
+                word = substitute_word_character(word, characters)
+                slice_word_at_position( word,
+                                        characters_size-1,
+                                        :start )
+                return word
+              end
+            end
+          end
+        end
+        def contains_irregular_prefix?(word, characters)
+          if IRREGULAR_PREFIX_CHARACTERS_ON_WORDS.keys.include?(characters)
+            chopped_word_match_words_collection?(
+              word[characters.size, word.size],
+              IRREGULAR_PREFIX_CHARACTERS_ON_WORDS[characters] )
+          end
+        end
+        def chopped_word_match_words_collection?(chopped_word, collection)
+          collection.any? { |w| starts_with?(chopped_word, chopped_word.size, w) }
+        end
         def substitute_word_character(word, characters)
           substitute_char = case
           when %w(meny peny).include?(characters)
             's'
-          when characters == 'pen'
-            't'
+          when %w(men pen).include?(characters)
+            (chopped_word_match_words_collection?(
+                word[characters.size, word.size], IRREGULARS_FOR_WORDS_BEGINS_WITH_N
+              )
+            )? 'n' : 't'
+          when %w(meng peng).include?(characters)
+            'k'
+          when %w(mem pem).include?(characters)
+            'p'
           end
           word[characters.size-1] = substitute_char if substitute_char
           word
         end
+        def ambiguous_with_characters?(word, characters, position)
+          if position == :start
+            if characters == 'per'
+              chopped_word_match_words_collection?(word[3..-1],
+                  IRREGULARS_FOR_WORDS_BEGINS_WITH_R )
+            else
+              return false
+            end
+          else
+            IRREGULAR_WORDS_ENDS_WITH_COMMON_CHARACTERS[characters].any? do |ambiguous_word|
+              # To differentiate 'mobilmu' with 'berilmu'
+              return false unless %w(me be pe).include?(word[0,2])
+              # The rest is ok
+              ends_with?(word, word.size, ambiguous_word)
+            end
+          end
+        end
         def reduce_syllable
           @number_of_syllables -= 1
         end
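
The main behavioural change in this file is how nasal prefixes are undone: when `meng`/`peng`, `meny`/`peny`, `men`/`pen` or `mem`/`pem` is followed by a vowel, the root's first consonant was assimilated and has to be put back (`k`, `s`, `t` or `n`, `p`). The following is a minimal standalone sketch of that idea, not the gem's actual code. The method name `restore_nasal_prefix` and the three short word lists are hypothetical; the lists are small excerpts of the `IRREGULARS_FOR_WORDS_BEGINS_WITH_*` constants above. Unlike the gem, the sketch returns the word unchanged when no substitution applies, rather than falling back to plain prefix removal, and it ignores syllable counting and flags.

```ruby
# Hypothetical excerpts of the irregular-root lists from the diff
# (each entry is a root with its first consonant dropped).
ROOTS_STARTING_WITH_K = %w( irim ejar unci )  # kirim, kejar, kunci
ROOTS_STARTING_WITH_P = %w( akai ukul ilih )  # pakai, pukul, pilih
ROOTS_STARTING_WITH_N = %w( ikah ama )        # nikah, nama

VOWELS = %w( a e i o u )

# Same order as SPECIAL_FIRST_ORDER_PREFIX_CHARACTERS in 0.1.0:
# 'meng' and 'meny' are tried before the shorter 'men'.
NASAL_PREFIXES = %w( meng peng meny peny men pen mem pem )

# Returns the word with its nasal prefix replaced by the restored
# first consonant, or the word unchanged if no rule applies.
def restore_nasal_prefix(word)
  NASAL_PREFIXES.each do |prefix|
    next unless word.start_with?(prefix) && word.size > prefix.size

    rest = word[prefix.size..-1]
    # Substitution only makes sense when the prefix is followed by a vowel.
    next unless VOWELS.include?(rest[0])

    initial =
      case prefix
      when 'meny', 'peny'
        's'
      when 'men', 'pen'
        # 't' by default, 'n' for roots known to start with 'n'.
        ROOTS_STARTING_WITH_N.any? { |r| rest.start_with?(r) } ? 'n' : 't'
      when 'meng', 'peng'
        # Only restore 'k' for roots known to start with 'k'.
        ROOTS_STARTING_WITH_K.any? { |r| rest.start_with?(r) } ? 'k' : nil
      when 'mem', 'pem'
        # Only restore 'p' for roots known to start with 'p'.
        ROOTS_STARTING_WITH_P.any? { |r| rest.start_with?(r) } ? 'p' : nil
      end

    return initial + rest if initial
  end
  word
end

puts restore_nasal_prefix('menyapu')
puts restore_nasal_prefix('mengirim')
```

The word lists are what make `meng` and `mem` safe to handle: without them, a vowel-initial root such as `ambil` in `mengambil` would wrongly gain a leading `k`.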

data/lib/indonesian_stemmer/version.rb CHANGED

@@ -1,3 +1,3 @@
 module IndonesianStemmer
-  VERSION = "0.0.1"
+  VERSION = "0.1.0"
 end

data/spec/indonesian_stemmer/morphological_utility_spec.rb CHANGED

@@ -581,6 +581,10 @@ describe IndonesianStemmer::MorphologicalUtility do
   describe '#remove_suffix' do
     describe "words with these suffix characters" do
+      before do
+        IndonesianStemmer.instance_variable_set("@flags", 0)
+      end
       describe "at the end of the word, should remove the suffix characters" do
         it "'kan'" do
           should_transform(:remove_suffix, 'katakan', 'kata')

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: indonesian_stemmer
 version: !ruby/object:Gem::Version
-  version: 0.0.1
+  version: 0.1.0
   prerelease:
 platform: ruby
 authors:
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2013-03-30 00:00:00.000000000 Z
+date: 2013-04-04 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rake
@@ -82,7 +82,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
       version: '0'
       segments:
       - 0
-      hash: 1838075541569491639
+      hash: -3033082141051403298
 required_rubygems_version: !ruby/object:Gem::Requirement
   none: false
   requirements:
@@ -91,7 +91,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
       segments:
       - 0
-      hash: 1838075541569491639
+      hash: -3033082141051403298
 requirements: []
 rubyforge_project:
 rubygems_version: 1.8.25