chinese_vocab 0.8.6 → 0.9.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/ChangeLog.md +13 -0
- data/README.md +66 -29
- data/lib/chinese_vocab/modules/helper_methods.rb +1 -1
- data/lib/chinese_vocab/scraper.rb +22 -6
- data/lib/chinese_vocab/version.rb +1 -1
- data/lib/chinese_vocab/vocab.rb +96 -12
- metadata +2 -2
data/ChangeLog.md
CHANGED
`````diff
@@ -1,3 +1,16 @@
+## Version 0.9.0 (April 20, 2012)
+
+#### Others
+* `Vocab`:
+  * Added `#word_frequency`.
+  * Added `#find_minimum_sentences`: new and faster algorithm to calculate the minimum number of required sentences.
+* `Scraper`:
+  * Removed timeout restriction.
+
+### Bug Fixes
+* `Scraper`: Don't impose a minimum sentence length if this constraint would exclude all sentences.
+
+
 ## Version 0.8.6 (April 13, 2012)
 
 ### Other
`````
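For readers who want to try the two `Vocab` additions right away, here is a minimal usage sketch. It follows the initialization pattern from the README below; the three-word list and the printed counts are made up for illustration, and the call hits the online dictionaries, so it needs a network connection.

```` ruby
require 'chinese_vocab'

# Chinese::Vocab.new takes a word list as an array of strings (see the README).
anki = Chinese::Vocab.new(["爱人", "安静", "经常"], :compact => true)

# min_sentences downloads sentences and internally uses the new, faster
# #find_minimum_sentences to pick the smallest covering set.
anki.min_sentences(:thread_count => 10)

# New in 0.9.0: count how often each word occurs in the selected sentences.
p anki.word_frequency
# => {"爱人" => 1, "安静" => 1, "经常" => 2}   (illustrative values)
````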
data/README.md
CHANGED
`````diff
@@ -10,19 +10,44 @@
 
 You can then export the sentences as well as additional tags provided by `Chinese::Vocab` to [Anki](http://ankisrs.net/).
 
+
 ## Features
 
 * Downloads sentences for each word in a Chinese vocabulary list and selects the __minimum required number of sentences__ to represent all words.
-* With the option key `:compact` set to `true` on initialization, all single character words that also appear in at least one multi character word are removed. The reason behind this option is to __remove redundancy in meaning__ and focus on learning distinct words.
+* With the option key `:compact` set to `true` on initialization, all single character words that also appear in at least one multi character word are removed. The reason behind this option is to __remove redundancy in meaning__ and focus on learning distinct words.
+  Example: (["看", "看书"] => [看书])
 * Adds additional __tags__ to every sentence that can be used in [Anki](http://ankisrs.net/):
-  * __Pinyin__: By default the pinyin representation is added to each sentence.
-
-  *
+  * __Pinyin__: By default the pinyin representation is added to each sentence.
+    Example: "除了这张大钞以外,我没有其他零票了。" => "chú le zhè zhāng dà chāo yĭ wài ,wŏ méi yŏu qí tā líng piào le 。"
+  * __Number of target words__: The number of words from the vocabulary that are covered by a sentence.
+    Example: "除了这张大钞以外,我没有其他零票了。" => "3_words"
+  * __List of target words__: A list of the words from the vocabulary that are covered by a sentence.
+    Example: "除了这张大钞以外,我没有其他零票了。" => "[我, 他, 除了 以外]"
 * Export data to csv for easy import from [Anki](http://ankisrs.net/).
 
 
+## Installation
+
+```` bash
+$ gem install chinese_vocab
+````
+
+## The Dictionaries
+`Chinese::Vocab` uses the following online dictionaries to download the Chinese sentences:
+
+* [Nciku](http://www.nciku.com/): This is a fantastic English-Chinese dictionary with tons of useful features and a great community.
+* [Jukuu](http://jukuu.com/): This one is special. It searches the Internet for example sentences and thus is able to return results even for more esoteric technical terms. Search results are returned extremely quickly.
+
+I *highly recommend* both sites for daily use, and suggest you bookmark them right away.
+
+### __Important Note of Caution__
+In order to save precious bandwidth for these great sites, please __only use this gem when you really need the Chinese sentences for your studies__!
+
+
 ## Real World Example (Using the Traditional HSK Word List)
 
+__Note__: The number of required sentences to cover all words could be reduced by about __39%__.
+
 ```` ruby
 # Import words from source.
 # First argument: path to file
@@ -32,43 +57,52 @@ words = Chinese::Vocab.parse_words('../old_hsk_level_8828_chars_1_word_edited.cs
 p words.take(6)
 # => ["啊", "啊", "矮", "爱", "爱人", "安静"]
 
+
 # Initialize an object.
 # First argument: word list as an array of strings.
 # Options:
 # :compact (defaults to false)
 anki = Chinese::Vocab.new(words, :compact => true)
 
+
 # Options:
 # :source (defaults to :nciku)
 # :size (defaults to :short)
 # :with_pinyin (defaults to true)
 anki.min_sentences(:thread_count => 10)
 # Sample output:
-# [{:
-# :pinyin=>"
-# :english=>"
-#
-#
-# :
+# [{:chinese=>"小红经常向别人夸示自己有多贤惠。",
+#   :pinyin=>"xiăo hóng jīng cháng xiàng bié rén kuā shì zì jĭ yŏu duō xián huì 。",
+#   :english=>"Xiaohong always boasts that she is genial and prudent.",
+#   :target_words=>["别人", "经常", "自己", "贤惠"]},
+#  {:chinese=>"一年一度的圣诞节购买礼物的热潮.",
+#   :pinyin=>"yī nián yī dù de shèng dàn jié gòu măi lĭ wù de rè cháo yī",
+#   :english=>"the annual Christmas gift-buying jag",
+#   :target_words=>["礼物", "购买", "圣诞节", "热潮", "一度"]}]
 
 # Save data to csv.
 # First parameter: path to file
 # Options:
 # Any supported option of Ruby's CSV library
 anki.to_csv('in_the_wild_test.csv')
-# Sample output
-
-
-#
-#
-#
-# "
-#
-
+# Sample output: 2 sentences (csv rows) of 4431 sentences total
+# (Note that we started out with 7248 sentences):
+
+# 小红经常向别人夸示自己有多贤惠。,
+# xiăo hóng jīng cháng xiàng bié rén kuā shì zì jĭ yŏu duō xián huì 。,
+# Xiaohong always boasts that she is genial and prudent.,
+# 4_words,"[别人, 经常, 自己, 贤惠]"
+#
+# 一年一度的圣诞节购买礼物的热潮.,
+# yī nián yī dù de shèng dàn jié gòu măi lĭ wù de rè cháo yī,
+# the annual Christmas gift-buying jag,
+# 5_words,"[一度, 圣诞节, 热潮, 礼物, 购买]"
+
+
+
 
 #### Additional methods
 
-```` ruby
 # List all words
 p anki.words.take(6)
 # => ["啊", "啊", "矮", "爱", "爱人", "安静"]
@@ -77,22 +111,25 @@ p anki.words.size
 # => 7251
 
 p anki.stored_sentences.take(2)
-# [{:
-# :pinyin=>"
-# :english=>"
-#
-#
-# :
-
-#
+# [{:chinese=>"小红经常向别人夸示自己有多贤惠。",
+#   :pinyin=>"xiăo hóng jīng cháng xiàng bié rén kuā shì zì jĭ yŏu duō xián huì 。",
+#   :english=>"Xiaohong always boasts that she is genial and prudent.",
+#   :target_words=>["别人", "经常", "自己", "贤惠"]},
+#  {:chinese=>"一年一度的圣诞节购买礼物的热潮.",
+#   :pinyin=>"yī nián yī dù de shèng dàn jié gòu măi lĭ wù de rè cháo yī",
+#   :english=>"the annual Christmas gift-buying jag",
+#   :target_words=>["礼物", "购买", "圣诞节", "热潮", "一度"]}]
+
+# Words not found in either online dictionary.
 p anki.not_found
 # ["来回来去", "来看来讲", "深美"]
 
 # Number of unique characters in the selected sentences
 p anki.sentences_unique_chars.size
-# =>
+# => 3232
 ````
 
+
 ## Documentation
 * [parse_words](http://rubydoc.info/github/bytesource/chinese_vocab/master/Chinese/Vocab.parse_words) - How to read in the Chinese words and correctly set the column number. Options:
 * The [supported options](http://ruby-doc.org/stdlib-1.9.3/libdoc/csv/rdoc/CSV.html#method-c-new) of Ruby's CSV library as well as the `:encoding` parameter. __Note__: `:encoding` is always set to `utf-8` and `:skip_blanks` to `true` internally.
`````
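The `:compact` option described in the README removes single-character words that are already covered by a multi-character word. As a rough, self-contained sketch of that rule (a reader-facing approximation, not the gem's internal code):

```` ruby
# Drop every single-character word that appears inside at least one
# multi-character word of the same list.
def compact_words(words)
  multi = words.select { |word| word.size > 1 }
  words.reject { |word| word.size == 1 && multi.any? { |m| m.include?(word) } }
end

p compact_words(["看", "看书"])
# => ["看书"]   (matches the README example)
````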
data/lib/chinese_vocab/modules/helper_methods.rb
CHANGED
`````diff
@@ -25,7 +25,7 @@ module Chinese
       word.scan(/\p{Word}+/) # Returns an array of characters that belong together.
     end
 
-    # Return true if every
+    # Return true if every distinct word as defined by {#distinct_words}
     # can be found in the given sentence.
     def include_every_char?(word, sentence)
       characters = distinct_words(word)
`````
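To make the documented semantics concrete, here is a stand-alone sketch of the two helpers. `distinct_words` is copied from the context shown above; the body of `include_every_char?` is an assumption, since the diff only shows its first line.

```` ruby
# Splits a vocabulary entry into the word groups that belong together.
def distinct_words(word)
  word.scan(/\p{Word}+/)
end

# Assumed body: true when every distinct word occurs in the sentence.
def include_every_char?(word, sentence)
  distinct_words(word).all? { |w| sentence.include?(w) }
end

p distinct_words("除了 以外")
# => ["除了", "以外"]
p include_every_char?("除了 以外", "除了这张大钞以外,我没有其他零票了。")
# => true
````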
data/lib/chinese_vocab/scraper.rb
CHANGED
`````diff
@@ -54,7 +54,8 @@ module Chinese
       # http://stackoverflow.com/questions/377768/string-concatenation-and-ruby/378258#378258
       url = source[:url] + CGI.escape(word)
       # http://ruby-doc.org/stdlib-1.9.2/libdoc/timeout/rdoc/Timeout.html#method-c-timeout
-      content = Timeout.timeout(
+      # content = Timeout.timeout(30) { open(url) }
+      content = open(url)
       main_node = Nokogiri::HTML(content).css(source[:parent_sel]) # Returns a single node.
       return [] if main_node.to_a.empty?
 
@@ -91,7 +92,18 @@ module Chinese
       # 北边 => 树林边的河流向北方
       sentence_pairs = sentence_pairs.select { |cn, _| include_every_char?(word, cn) }
 
-
+      # Only select Chinese sentences that are at least x times longer than the word (counting character length),
+      # as sometimes only the word itself is listed as a sentence (or a short expression that does not really
+      # count as a sentence).
+      # Exception: If the result is an empty array (= none of the sentences fulfill the length constraint),
+      # then just return the sentences selected so far.
+      sentence_pairs_selected_by_length_factor = sentence_pairs.select { |cn, _| sentence_times_longer_than_word?(cn, word, 2.2) }
+
+      unless sentence_pairs_selected_by_length_factor.empty?
+        sentence_pairs_selected_by_length_factor
+      else
+        sentence_pairs
+      end
     end
 
     def self.sentence(word, options={})
@@ -119,11 +131,15 @@ module Chinese
         pair[0].empty? || pair[1].empty?
       end
 
-
-
-
+
+    def self.sentence_times_longer_than_word?(sentence, word, factor)
+      sentence_chars = sentence.scan(/\p{Word}/)
+      word_chars = word.scan(/\p{Word}/)
+      sentence_chars.size >= (factor * word_chars.size)
+    end
+
     def self.shortest_size(sentence_pairs)
-      sentence_pairs.sort_by {|(cn,_)| cn.length }.
+      sentence_pairs.sort_by {|(cn,_)| cn.length }.first
     end
 
     def self.longest_size(sentence_pairs)
`````
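The new length filter counts `\p{Word}` characters on both sides, so punctuation is ignored. A stand-alone copy of the predicate from the diff shows how the factor of `2.2` plays out:

```` ruby
def sentence_times_longer_than_word?(sentence, word, factor)
  sentence_chars = sentence.scan(/\p{Word}/)
  word_chars = word.scan(/\p{Word}/)
  sentence_chars.size >= (factor * word_chars.size)
end

p sentence_times_longer_than_word?("他的嗜好是收集邮票。", "嗜好", 2.2)
# => true  (9 word characters >= 2.2 * 2; the 。 is not counted)
p sentence_times_longer_than_word?("嗜好", "嗜好", 2.2)
# => false (the "sentence" is just the word itself, which is what the filter rejects)
````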
data/lib/chinese_vocab/vocab.rb
CHANGED
`````diff
@@ -4,6 +4,7 @@ require 'open-uri'
 require 'nokogiri'
 require 'cgi'
 require 'csv'
+require 'set'
 require 'with_validations'
 require 'string_to_pinyin'
 require 'chinese_vocab/scraper'
@@ -20,7 +21,7 @@ module Chinese
     #
     # * Removing parentheses (with the content inside each parenthesis).
     # * Removing any slash (/) and only keeping the longest part.
-    # * Removing '儿'
+    # * Removing trailing '儿' from any word longer than two characters.
     # * Removing non-word characters such as points and commas.
     # * Removing duplicate words.
     # @return [Array<String>]
@@ -163,9 +164,11 @@ module Chinese
       @not_found = convert(not_found)
       size_a = words.size
       size_b = to_queue.size
-
-
-
+      puts "Size(@not_found) = #{@not_found.size}"
+      puts "Size(words) = #{size_a}"
+      puts "Size(to_queue) = #{size_b}"
+      puts "Size(words+queue) = #{size_a+size_b}"
+      puts "Size(sentences) = #{to_queue.size}"
 
       # Remove file
       File.unlink(file_name)
@@ -183,7 +186,7 @@ module Chinese
 
       begin
         local_result = select_sentence(word, options)
-        puts "Processing word: #{word}"
+        puts "Processing word: #{word} (#{from_queue.size} words left)"
       # rescue SocketError, Timeout::Error, Errno::ETIMEDOUT,
       #  Errno::ECONNREFUSED, Errno::ECONNRESET, EOFError => e
       rescue Exception => e
@@ -268,21 +271,69 @@ module Chinese
       thread_count = validate { :thread_count }
       sentences = sentences(options)
 
-
+      # Remove those words that don't have a sentence.
+      words = @words - @not_found
+      puts "Determining the target words for every sentence..."
+      sentences = add_target_words(sentences, words)
+
+      minimum_sentences = find_minimum_sentences(sentences, words)
+
       # :uwc = 'unique words count'
-      with_uwc_tag
+      with_uwc_tag = add_key(minimum_sentences, :uwc) {|row| uwc_tag(row[:target_words]) }
       # :uws = 'unique words string'
       with_uwc_uws_tags = add_key(with_uwc_tag, :uws) do |row|
         words = row[:target_words].sort.join(', ')
         "[" + words + "]"
       end
       # Remove those keys we don't need anymore.
-      result
+      result = remove_keys(with_uwc_uws_tags, :target_words, :word)
       @stored_sentences = result
       @stored_sentences
     end
 
 
+    def find_minimum_sentences(sentences, words)
+      min_sentences = []
+      # At the start the variable 'remaining_words' contains all
+      # target words - minus those with no sentence found.
+      remaining_words = Set.new(words.dup)
+
+
+      # On every round:
+      # Find the sentence with the most target words ('best sentence').
+      # Add that sentence to the result array.
+      # Delete all target words from the remaining words that are part of
+      # the best sentence.
+      while(!remaining_words.empty?) do
+        puts "Number of remaining_words: #{remaining_words.size}"
+        # puts "Take five: #{remaining_words.take(5)}"
+
+        # Sort so that the sentence with the largest number of target words comes first.
+        sentences = sentences.sort_by do |row|
+          # Set#intersection returns a new set containing the elements common to
+          # the set and the given enumerable, with no duplicates.
+          words_left = remaining_words.intersection(row[:target_words])
+
+          # Sort by the number of words left first (in descending order);
+          # if equal, sort by the length of the Chinese sentence (in ascending order).
+          [-words_left.size, row[:chinese].size]
+        end
+
+        best_sentence = sentences.first
+
+        # Add the sentence with the largest number of
+        # target words to the result array.
+        min_sentences << best_sentence
+        # Remove the target words that are part of the
+        # best sentence from the remaining words.
+        remaining_words = remaining_words - best_sentence[:target_words]
+      end
+
+      # puts "Number of minimum sentences: #{min_sentences.size}"
+      min_sentences
+    end
+
+
     # Finds the unique Chinese characters from either the data in {#stored_sentences} or an
     # array of Chinese sentences passed as an argument.
     # @overload sentences_unique_chars(sentences)
@@ -458,12 +509,12 @@ module Chinese
     end
 
 
-    def add_target_words(hash_array)
+    def add_target_words(hash_array, words)
       from_queue = Queue.new
       to_queue = Queue.new
       # semaphore = Mutex.new
       result = []
-      words = @words
+      # words = @words
       hash_array.each {|hash| from_queue << hash}
 
       10.times.map {
@@ -502,9 +553,27 @@ module Chinese
       # first.nonzero? || (a[:chinese].size <=> b[:chinese].size) }
     end
 
+    # Calculates the number of occurrences of every word of {#words} in {#stored_sentences}.
+    # @return [Hash] Keys are the words in {#words}, with the values indicating the number of
+    #   occurrences in {#stored_sentences}.
+    def word_frequency
 
+      words.reduce({}) do |acc, word|
+        acc[word] = 0 # Set key with a default value of zero.
+
+        stored_sentences.each do |row|
+          sentence = row[:chinese]
+          acc[word] += 1 if include_every_char?(word, sentence)
+        end
+        acc
+      end
+    end
+
+
+    # @deprecated This method has been replaced by {#find_minimum_sentences}.
     def select_minimum_necessary_sentences(sentences)
-
+      words = @words - @not_found
+      with_target_words = add_target_words(sentences, words)
       rows = sort_by_target_word_count(with_target_words)
 
       selected_rows = []
@@ -529,6 +598,13 @@ module Chinese
     end
 
 
+    def occurrence_count(word_array, frequency)
+      word_array.reduce(0) do |acc, word|
+        acc + frequency[word]
+      end
+    end
+
+
     def remove_keys(hash_array, *keys)
       hash_array.map { |row| row.delete_keys(*keys) }
     end
@@ -572,7 +648,15 @@ module Chinese
         acc
       end
 
-      matched_words.size == @words.size
+      # matched_words.size == @words.size
+
+      if matched_words.size == @words.size
+        true
+      else
+        puts "Words not found in sentences:"
+        p @words - matched_words
+        false
+      end
     end
 
 
`````
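The new `#find_minimum_sentences` is a greedy set-cover heuristic: on each round it takes the sentence covering the most still-uncovered words. Greedy selection does not guarantee the true minimum (minimum set cover is NP-hard), but it is fast and works well on this kind of data. Here is a self-contained toy version of the same loop, on fabricated data with just the two hash keys the loop reads; the function name and sample rows are made up for the example.

```` ruby
require 'set'

# Greedy cover: pick the sentence covering the most remaining words;
# break ties by preferring the shorter Chinese sentence.
# Note: words without any covering sentence must be removed beforehand
# (the gem does this via @words - @not_found), or the loop never ends.
def greedy_minimum_sentences(sentences, words)
  min_sentences   = []
  remaining_words = Set.new(words)

  until remaining_words.empty?
    best = sentences.min_by do |row|
      covered = remaining_words.intersection(row[:target_words])
      [-covered.size, row[:chinese].size]
    end
    min_sentences << best
    remaining_words -= best[:target_words]
  end
  min_sentences
end

sentences = [
  { :chinese => "我看书", :target_words => ["我", "看书"] },
  { :chinese => "他看书", :target_words => ["他", "看书"] },
  { :chinese => "我和他", :target_words => ["我", "他"] },
]
p greedy_minimum_sentences(sentences, ["我", "他", "看书"]).map { |row| row[:chinese] }
# => ["我看书", "他看书"]   (two sentences cover all three words)
````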
metadata
CHANGED
`````diff
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: chinese_vocab
 version: !ruby/object:Gem::Version
-  version: 0.8.6
+  version: 0.9.0
 prerelease:
 platform: ruby
 authors:
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-04-13 00:00:00.000000000 Z
+date: 2012-04-20 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: with_validations
`````