chinese_vocab 0.8.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/Gemfile ADDED
@@ -0,0 +1,4 @@
+ source 'https://rubygems.org'
+
+ # Specify your gem's dependencies in chinese_vocab.gemspec
+ gemspec
data/LICENSE ADDED
@@ -0,0 +1,22 @@
+ Copyright (c) 2012 Stefan Rohlfing
+
+ MIT License
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,103 @@
+ # Chinese::Vocab
+
+ `Chinese::Vocab` is meant to make life easier for any Chinese language student who:
+
+ * Prefers to learn vocabulary from Chinese sentences.
+ * Needs to memorize a lot of words on a __tight time schedule__.
+ * Uses the spaced repetition flashcard program [Anki](http://ankisrs.net/).
+
+ `Chinese::Vocab` addresses all of the above requirements by downloading sentences for each word and selecting the __minimum required number of Chinese sentences__ (and English translations) to __represent all words__.
+
+ You can then export the sentences as well as additional tags provided by `Chinese::Vocab` to Anki.
+
+ ## Features
+
+ * Downloads sentences for each word in a Chinese vocabulary list and selects the __minimum required number of sentences__ to represent all words.
+ * With the option key `:compact` set to `true` on initialization, all single character words that also appear in at least one multi character word are removed. The reason behind this option is to __remove redundancy in meaning__ and focus on learning distinct words. Example: (["看", "看书"] => [看书])
+ * Adds additional __tags__ to every sentence that can be used in *Anki*:
+   * __Pinyin__: By default the pinyin representation is added to each sentence. Example: "除了这张大钞以外,我没有其他零票了。" => "chú le zhè zhāng dà chāo yĭ wài ,wŏ méi yŏu qí tā líng piào le 。"
+   * __Number of target words__: The number of words from the vocabulary that are covered by a sentence. Example: "除了这张大钞以外,我没有其他零票了。" => "3_words"
+   * __List of target words__: A list of the words from the vocabulary that are covered by a sentence. Example: "除了这张大钞以外,我没有其他零票了。" => "[我, 他, 除了 以外]"
+ * Exports data to CSV for easy import into *Anki*.
+
+
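The selection rule behind `:compact` can be sketched in a few lines of Ruby (`compact_words` is a hypothetical helper name for illustration, not part of the gem's API):

```ruby
# Hypothetical sketch of the :compact filtering rule (not the gem's code):
# drop every single-character word that appears inside a longer word.
def compact_words(words)
  multi_char = words.select { |w| w.size > 1 }
  words.reject { |w| w.size == 1 && multi_char.any? { |m| m.include?(w) } }
end

p compact_words(["看", "看书", "好"])
# => ["看书", "好"]
```

Note that "好" survives because it is not contained in any multi-character word of the list.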
+ ## Real World Example (using the Traditional HSK word list)
+
+ ```` ruby
+ # Import words from source.
+ # First argument:  path to file
+ # Second argument: column number of the word column (counting starts at 1)
+ words = Chinese::Vocab.parse_words('../old_hsk_level_8828_chars_1_word_edited.csv', 4)
+ # Sample output:
+ words.take(6)
+ # => ["啊", "啊", "矮", "爱", "爱人", "安静"]
+
+
+ # Initialize an object.
+ # First argument: word list as an array of strings.
+ # Options:
+ # :compact (defaults to false)
+ anki = Chinese::Vocab.new(words, :compact => true)
+
+ # List all words
+ p anki.words.take(6)
+ # => ["啊", "啊", "矮", "爱", "爱人", "安静"]
+ p anki.words.size
+ # => 7251
+
+ # Options:
+ # :source      (defaults to :nciku)
+ # :size        (defaults to :short)
+ # :with_pinyin (defaults to true)
+ anki.min_sentences(:thread_count => 10)
+
+ p anki.stored_sentences.take(2)
+ # [{:word=>"吧", :chinese=>"放心吧,他做事向来把牢。",
+ #   :pinyin=>"fàng xīn ba ,tā zuò shì xiàng lái bă láo 。",
+ #   :english=>"Take it easy. You can always count on him."},
+ #  {:word=>"喝", :chinese=>"喝酒挂红的人一般都很能喝。",
+ #   :pinyin=>"hē jiŭ guà hóng de rén yī bān dōu hĕn néng hē 。",
+ #   :english=>"Those whose face turn red after drinking are normally heavy drinkers."}]
+
+ # Words not found
+ p anki.not_found
+ # ["来回来去", "来看来讲", "深美"]
+
+ # Number of unique characters in the selected sentences
+ p anki.sentences_unique_chars.size
+ # => 3290
+
+ # Save data to CSV.
+ # First parameter: path to file
+ # Options:
+ # Any supported option of Ruby's CSV library
+ anki.to_csv('in_the_wild_test.csv')
+ # Sample output (2 sentences/lines out of 4511):
+
+ # 舞台上正在上演的是吕剧。,wŭ tái shàng zhèng zài shàng yăn de shì lǚ jù 。,
+ # What is being performed on the stage is Lv opera (a local opera of Shandong Province).
+ # ,2_words,"[正在, 舞台]"
+ # 古代官员上朝都要穿朝靴。,gŭ dài guān yuán shàng cháo dōu yào chuān cháo xuē 。,
+ # "In ancient times, all courtiers had to wear special boots to enter the court.",
+ # 2_words,"[古代, 官员]"
+
+ ````
+
+ ## Documentation
+ * [parse_words](http://rubydoc.info/github/bytesource/chinese_vocab/master/Chinese/Vocab.parse_words) - How to read in the Chinese words and correctly set the column number. Options:
+   * The [supported options](http://ruby-doc.org/stdlib-1.9.3/libdoc/csv/rdoc/CSV.html#method-c-new) of Ruby's CSV library as well as the `:encoding` parameter. __Note__: `:encoding` is always set to `utf-8` and `:skip_blanks` to `true` internally.
+ * [initialize](http://rubydoc.info/github/bytesource/chinese_vocab/master/Chinese/Vocab:initialize) - How to write composite expressions such as "除了。。以外". Options:
+   * `:compact` (`Boolean`): Whether or not to remove all single character words that also appear in at least one multi character word. Example: (["看", "看书"] => [看书]). The reason behind this option is to remove redundancy in meaning and focus on learning distinct words.
+ * [words](http://rubydoc.info/github/bytesource/chinese_vocab/master/Chinese/Vocab:words) - Learn how words are edited internally.
+ * [min_sentences](http://rubydoc.info/github/bytesource/chinese_vocab/master/Chinese/Vocab:min_sentences) - Options:
+   * `:source` (`Symbol`): The online dictionary to download the sentences from, either [:nciku](http://www.nciku.com) or [:jukuu](http://www.jukuu.com). Defaults to `:nciku`. __Note__: Regardless of the download source chosen (by using the default or setting the `:source` option), if a word was not found on the first site, the second site is used as an alternative.
+   * `:with_pinyin` (`Boolean`): Whether or not to return the pinyin representation of a sentence. Defaults to `true`.
+   * `:size` (`Symbol`): The size of the sentence to return from a possible set of several sentences. Supports the values `:short`, `:average`, and `:long`. Defaults to `:short`.
+   * `:thread_count` (`Integer`): The number of threads used to download the sentences. Defaults to `8`.
+ * [sentences_unique_chars](http://rubydoc.info/github/bytesource/chinese_vocab/master/Chinese/Vocab:sentences_unique_chars) - The list of unique Chinese *characters* (single character words) found in the selected sentences.
+ * [to_csv](http://rubydoc.info/github/bytesource/chinese_vocab/master/Chinese/Vocab:to_csv) - Options:
+   * All [supported options](http://ruby-doc.org/stdlib-1.9.3/libdoc/csv/rdoc/CSV.html#method-c-new) of Ruby's CSV library.
+
+
+
data/Rakefile ADDED
@@ -0,0 +1,22 @@
+ # require 'spec/rake/spectask' # deprecated
+ require 'rspec/core/rake_task'
+ # require 'rake/gempackagetask' # deprecated
+ require 'rubygems/package_task'
+ require 'rdoc/task'
+
+ # Build gem: rake gem
+ # Push gem:  rake push
+
+ task :default => [ :spec, :gem ]
+
+ RSpec::Core::RakeTask.new(:spec)
+
+ gem_spec = eval(File.read('chinese_vocab.gemspec'))
+
+ Gem::PackageTask.new( gem_spec ) do |t|
+   t.need_zip = true
+ end
+
+ task :push => :gem do |t|
+   sh "gem push pkg/#{gem_spec.name}-#{gem_spec.version}.gem"
+ end
data/lib/chinese.rb ADDED
@@ -0,0 +1,11 @@
+ # encoding: utf-8
+ require 'chinese/vocab'
+ require 'chinese/scraper'
+ require 'chinese/version'
+ require 'chinese/core_ext/array'
+ require 'chinese/core_ext/hash'
+ require 'chinese/core_ext/queue'
+ require 'chinese/modules/helper_methods'
+
+ module Chinese
+ end
data/lib/chinese/core_ext/array.rb ADDED
@@ -0,0 +1,14 @@
+ # encoding: utf-8
+
+ class Array
+
+   # Input:  [1, 2, 3, 4, 5]
+   # Output: [[1, 2], [2, 3], [3, 4], [4, 5]]
+   def overlap_pairs
+     second = self.dup.drop(1)
+     self.each_with_index.inject([]) {|acc, (item, i)|
+       acc << [item, second[i]] unless second[i].nil?
+       acc
+     }
+   end
+ end
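Run standalone, the core extension above pairs each element with its successor, i.e. a sliding window of size two (re-stated here as a sketch so it can be tried outside the gem):

```ruby
# Sketch of the Array#overlap_pairs core extension:
# pair each element with its successor (sliding window of size 2).
class Array
  def overlap_pairs
    second = drop(1)
    each_with_index.inject([]) { |acc, (item, i)|
      acc << [item, second[i]] unless second[i].nil?
      acc
    }
  end
end

p [1, 2, 3, 4, 5].overlap_pairs
# => [[1, 2], [2, 3], [3, 4], [4, 5]]
```

Empty and one-element arrays yield `[]`, which is what the scraper relies on when a result page contains fewer than two target nodes.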
data/lib/chinese/core_ext/hash.rb ADDED
@@ -0,0 +1,37 @@
+ # encoding: utf-8
+
+ class Hash
+
+   # Returns a copy of self with *keys removed.
+   def delete_keys(*keys)
+     hash = self.dup
+
+     keys.each do |key|
+       hash.delete(key)
+     end
+     hash
+   end
+
+   # Removes *keys from self.
+   def delete_keys!(*keys)
+     keys.each do |key|
+       self.delete(key)
+     end
+   end
+
+   # Creates a sub-hash from `self` with the keys from `keys`.
+   # @note Keys in `keys` not present in `self` are silently ignored.
+   # @return [Hash] a copy of `self`.
+   def slice(*keys)
+     self.select { |k, v| keys.include?(k) }
+   end
+
+   def slice!(*keys)
+     sub_hash = self.select { |k, v| keys.include?(k) }
+     # Remove 'keys' from self:
+     self.delete_keys!(*sub_hash.keys)
+     sub_hash
+   end
+ end
+
+
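A standalone sketch of the destructive variant above. Note that this `slice!` removes the given keys from the receiver and returns them as a sub-hash (closer to ActiveSupport's `extract!` than to its `slice!`):

```ruby
# Sketch of the Hash#slice! core extension: return the sub-hash for the
# given keys and delete those keys from the receiver.
class Hash
  def delete_keys!(*keys)
    keys.each { |key| delete(key) }
  end

  def slice!(*keys)
    sub_hash = select { |k, _| keys.include?(k) }
    delete_keys!(*sub_hash.keys)
    sub_hash
  end
end

h = { :a => 1, :b => 2, :c => 3 }
p h.slice!(:a, :c)  # => {:a=>1, :c=>3}
p h                 # => {:b=>2}
```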
data/lib/chinese/core_ext/queue.rb ADDED
@@ -0,0 +1,25 @@
+ # encoding: utf-8
+
+ require 'thread'
+
+ class Queue
+
+   def to_a
+     @que
+   end
+
+   # Returns nil if the queue is empty (instead of raising ThreadError).
+   def pop!
+     pop(true) # non-blocking pop
+   rescue ThreadError => e
+     case e.message
+     when /queue empty/
+       nil
+     else
+       raise
+     end
+   end
+
+ end
+
+
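The `pop!` extension above drives the worker loop in `Vocab#sentences`: each thread pops words until `pop!` returns `nil`. A standalone sketch (rescue simplified; `Queue` is core in modern Ruby, so no `require 'thread'` is needed here):

```ruby
# Sketch of the Queue#pop! extension: a non-blocking pop that returns
# nil when the queue is empty instead of raising ThreadError.
class Queue
  def pop!
    pop(true) # true = non-blocking
  rescue ThreadError
    nil
  end
end

q = Queue.new
q << "除了 以外"
p q.pop!  # => "除了 以外"
p q.pop!  # => nil
```

Returning `nil` lets `while (word = from_queue.pop!)` terminate naturally once the work queue is drained.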
data/lib/chinese/modules/helper_methods.rb ADDED
@@ -0,0 +1,38 @@
+ # encoding: utf-8
+
+ module Chinese
+   module HelperMethods
+
+     def self.included(klass)
+       klass.extend(self)
+     end
+
+     def is_unicode?(word)
+       # Remove all non-ascii and non-unicode word characters.
+       word = distinct_words(word).join
+       # English text at this point only contains characters that are matched by \w.
+       # Chinese text at this point contains mostly/only unicode word characters that are not matched by \w.
+       # In case of Chinese text the size of 'char_arr' therefore has to be smaller than the size of 'word'.
+       char_arr = word.scan(/\w/)
+       char_arr.size < word.size
+     end
+
+     # Input:  "除了。。。 以外。。。"
+     # Output: ["除了", "以外"]
+     def distinct_words(word)
+       # http://stackoverflow.com/a/3976004
+       # Alternative: /[[:word:]]+/
+       word.scan(/\p{Word}+/) # Returns an array of characters that belong together.
+     end
+
+     # Returns true if every distinct word (as defined by #distinct_words)
+     # can be found in the given sentence.
+     def include_every_char?(word, sentence)
+       characters = distinct_words(word)
+       characters.all? {|char| sentence.include?(char) }
+     end
+
+   end
+ end
+
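The two central helpers above can be exercised standalone (module context dropped for brevity). `distinct_words` splits a composite expression on anything that is not a word character; `is_unicode?` exploits the fact that Ruby's `/\w/` matches only ASCII word characters, while `/\p{Word}/` also matches CJK characters:

```ruby
# Standalone sketch of the HelperMethods helpers.
def distinct_words(word)
  word.scan(/\p{Word}+/) # split on punctuation and whitespace
end

def is_unicode?(word)
  joined = distinct_words(word).join
  # ASCII text: every character matches \w, so the counts are equal.
  # Chinese text: CJK characters do not match \w, so the count is smaller.
  joined.scan(/\w/).size < joined.size
end

p distinct_words("除了。。。 以外。。。")  # => ["除了", "以外"]
p is_unicode?("除了以外")                 # => true
p is_unicode?("hello")                    # => false
```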
data/lib/chinese/scraper.rb ADDED
@@ -0,0 +1,143 @@
+ # encoding: utf-8
+ require 'cgi'
+ require 'open-uri'
+ require 'nokogiri'
+ require 'timeout'
+ require 'chinese/core_ext/array'
+ require 'with_validations'
+ require 'chinese/modules/helper_methods'
+
+ module Chinese
+   class Scraper
+     include WithValidations
+     include HelperMethods
+
+     attr_reader   :source, :word
+     attr_accessor :sentences
+
+     Sources = {
+       nciku:
+       {:url         => "http://www.nciku.com/search/all/examples/",
+        :parent_sel  => "div.examples_box > dl",
+        :cn_sel      => "//dt/span[1]",
+        :en_sel      => "//dd/span[@class='tc_sub']",
+        # Only cn/en sentence pairs where the second node has a class 'tc_sub' belong together.
+        :select_pair => lambda { |node1, node2| node1['class'] != "tc_sub" && node2['class'] == "tc_sub" },
+        # Just return the text stored in the node. :text_sel is mainly intended for jukuu (see below).
+        :text_sel    => "text()",
+        # We want cn first, en second, but nciku does not return cn/en sentence pairs in a strict order.
+        :reorder     => lambda { |text1, text2| if is_unicode?(text2) then [text2, text1] else [text1, text2] end }},
+       jukuu:
+       {:url         => "http://www.jukuu.com/search.php?q=",
+        :parent_sel  => "table#Table1 table[width = '680']",
+        :cn_sel      => "//tr[@class='c']",
+        :en_sel      => "//tr[@class='e']",
+        # Only cn/en sentence pairs where the first node has a class 'e' belong together.
+        :select_pair => lambda { |node1, node2| node1['class'] == "e" && node2['class'] != "e" },
+        :text_sel    => "td[2]",
+        :reorder     => lambda { |text1, text2| [text2, text1] }}
+     }
+
+     OPTIONS = {:source => [:nciku,   lambda {|value| Sources.keys.include?(value) }],
+                :size   => [:average, lambda {|value| [:short, :average, :long].include?(value) }]}
+
+
+     # Options:
+     #   :size => [:short, :average, :long], default = :average
+     def self.sentences(word, options={})
+       download_source = validate { :source }
+
+       source = Sources[download_source]
+
+       CGI.accept_charset = 'UTF-8'
+       # Note: Use + because << changes the object on its left hand side, but + doesn't:
+       # http://stackoverflow.com/questions/377768/string-concatenation-and-ruby/378258#378258
+       url     = source[:url] + CGI.escape(word)
+       # http://ruby-doc.org/stdlib-1.9.2/libdoc/timeout/rdoc/Timeout.html#method-c-timeout
+       content = Timeout.timeout(20) { open(url) }
+       main_node = Nokogiri::HTML(content).css(source[:parent_sel]) # Returns a single node.
+       return [] if main_node.to_a.empty?
+
+       # CSS selector:   Returns the tags in the order they are specified.
+       # XPath selector: Returns the tags in the order they appear in the document (that's what we want here).
+       # Source: http://stackoverflow.com/questions/5825136/nokogiri-and-finding-element-by-name/5845985#5845985
+       target_nodes = main_node.search("#{source[:cn_sel]} | #{source[:en_sel]}")
+       return [] if target_nodes.to_a.empty?
+
+       # In order to make sure we only return text that also has a translation,
+       # we need to first group each target node with Array#overlap_pairs like this:
+       # Input:  [cn1, cn2, en2, cn3, en3, cn4]
+       # Output: [[cn1,cn2],[cn2,en2],[en2,cn3],[cn3,en3],[en3,cn4]]
+       # and then select the correct pairs: [[cn2,en2],[cn3,en3]].
+       # Regarding #to_a: Nokogiri::XML::NodeSet => Array
+       sentence_pairs = target_nodes.to_a.overlap_pairs.select {|(node1,node2)| source[:select_pair].call(node1,node2) }
+       sentence_pairs = sentence_pairs.reduce([]) do |acc,(cn_node,en_node)|
+         cn   = cn_node.css(source[:text_sel]).text.strip # 'text' returns an empty string when 'css' returns an empty array.
+         en   = en_node.css(source[:text_sel]).text.strip
+         pair = [cn,en]
+         # Ensure that both the Chinese and the English selector have text
+         # (sometimes they don't).
+         acc << pair unless pair_with_empty_string?(pair)
+         acc
+       end
+       # Switch position of each pair if the first entry is the translation,
+       # as we always return an array of [cn_sentence,en_sentence] pairs.
+       # The following step is necessary because:
+       # 1) Jukuu returns sentences in the order English first, Chinese second.
+       # 2) Nciku mostly returns sentences in the order Chinese first, English second
+       #    (but sometimes it is the other way round).
+       sentence_pairs = sentence_pairs.map {|node1,node2| source[:reorder].call(node1,node2) }
+       # Only select Chinese sentences that don't separate words, e.g., skip all sentences like the following:
+       # 北边 => 树林边的河流向北方
+       sentence_pairs = sentence_pairs.select { |cn, _| include_every_char?(word, cn) }
+
+       sentence_pairs
+     end
+
+     def self.sentence(word, options={})
+       value = validate { :size }
+
+       scraped_sentences = sentences(word, options)
+       return [] if scraped_sentences.empty?
+
+       case value
+       when :short
+         shortest_size(scraped_sentences)
+       when :average
+         average_size(scraped_sentences)
+       when :long
+         longest_size(scraped_sentences)
+       end
+     end
+
+
+     # ===================
+     # Helper methods
+     # ===================
+
+     def self.pair_with_empty_string?(pair)
+       pair[0].empty? || pair[1].empty?
+     end
+
+     # Despite its name, returns the SECOND shortest sentence,
+     # as the shortest result often is not a real sentence,
+     # but a definition.
+     def self.shortest_size(sentence_pairs)
+       sentence_pairs.sort_by {|(cn,_)| cn.length }.take(2).last
+     end
+
+     def self.longest_size(sentence_pairs)
+       sentence_pairs.sort_by {|(cn,_)| cn.length }.last
+     end
+
+     def self.average_size(sentence_pairs)
+       sorted = sentence_pairs.sort_by {|(cn,_)| cn.length }
+       # Return the pair with the median Chinese sentence length.
+       sorted[sorted.length / 2]
+     end
+
+
+
+   end
+ end
+
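The size-selection helpers at the end of the scraper can be tried standalone with plain `[chinese, english]` string pairs instead of scraped data (a sketch, with the median pick for `average_size` as described above):

```ruby
# Standalone sketch of the sentence-size helpers.
def shortest_size(pairs)
  # Second shortest on purpose: the shortest hit is often a dictionary
  # definition rather than a real sentence.
  pairs.sort_by { |cn, _| cn.length }.take(2).last
end

def longest_size(pairs)
  pairs.sort_by { |cn, _| cn.length }.last
end

def average_size(pairs)
  sorted = pairs.sort_by { |cn, _| cn.length }
  sorted[sorted.length / 2] # median-length pair
end

pairs = [["好", "good"],
         ["你好吗", "how are you"],
         ["我们都很好", "we are all fine"]]

p shortest_size(pairs)  # => ["你好吗", "how are you"]
p longest_size(pairs)   # => ["我们都很好", "we are all fine"]
```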
data/lib/chinese/version.rb ADDED
@@ -0,0 +1,3 @@
+ module Chinese
+   VERSION = "0.8.0"
+ end
data/lib/chinese/vocab.rb ADDED
@@ -0,0 +1,595 @@
+ # encoding: utf-8
+ require 'thread'
+ require 'open-uri'
+ require 'nokogiri'
+ require 'cgi'
+ require 'csv'
+ require 'with_validations'
+ require 'string_to_pinyin'
+ require 'chinese/scraper'
+ require 'chinese/modules/helper_methods'
+ require 'chinese/core_ext/hash'
+ require 'chinese/core_ext/queue'
+
+ module Chinese
+   class Vocab
+     include WithValidations
+     include HelperMethods
+
+     # The list of Chinese words after calling {#edit_vocab}. Editing includes:
+     #
+     # * Removing parentheses (with the content inside each parenthesis).
+     # * Removing any slash (/) and only keeping the longest part.
+     # * Removing '儿' for any word longer than two characters.
+     # * Removing non-word characters such as points and commas.
+     # * Removing any duplicate words.
+     # @return [Array<String>]
+     attr_reader :words
+     # @return [Boolean] the value of the `:compact` option key.
+     attr_reader :compact
+     # @return [Array<String>] holds those Chinese words from {#words} that could not be found in any
+     #   of the supported online dictionaries during a call to either {#sentences} or {#min_sentences}.
+     #   Defaults to `[]`.
+     attr_reader :not_found
+     # @return [Boolean] the value of the `:with_pinyin` option key.
+     attr_reader :with_pinyin
+     # @return [Array<Hash>] holds the return value of either {#sentences} or {#min_sentences},
+     #   whichever was called last. Defaults to `[]`.
+     attr_reader :stored_sentences
+
+     # Mandatory constant for the [WithValidations](http://rubydoc.info/github/bytesource/with_validations/file/README.md) module. Each key-value pair is of the following type:
+     # `option_key => [default_value, validation]`
+     OPTIONS = {:compact      => [false, lambda {|value| is_boolean?(value) }],
+                :with_pinyin  => [true,  lambda {|value| is_boolean?(value) }],
+                :thread_count => [8,     lambda {|value| value.kind_of?(Integer) }]}
+
+     # Initializes an object.
+     # @note Words that are composite expressions must be written with at least one non-word
+     #   character (such as whitespace) between each sub-expression. Example: "除了 以外" or
+     #   "除了。。以外" instead of "除了以外".
+     # @overload initialize(word_array, options)
+     #   @param [Array<String>] word_array An array of Chinese words that is stored in {#words} after
+     #     all non-ascii, non-unicode characters have been stripped and double entries removed.
+     #   @param [Hash] options The options to customize the following feature.
+     #   @option options [Boolean] :compact Whether or not to remove all single character words that
+     #     also appear in at least one multi character word. Example: (["看", "看书"] => [看书])
+     #     The reason behind this option is to remove redundancy in meaning and focus on learning distinct words.
+     #     Defaults to `false`.
+     # @overload initialize(word_array)
+     #   @param [Array<String>] word_array An array of Chinese words that is stored in {#words} after
+     #     all non-ascii, non-unicode characters have been stripped and double entries removed.
+     # @example (see #sentences_unique_chars)
+     def initialize(word_array, options={})
+       @compact = validate { :compact }
+       @words   = edit_vocab(word_array)
+       @words   = remove_redundant_single_char_words(@words) if @compact
+       @chinese = is_unicode?(@words[0])
+       @not_found        = []
+       @stored_sentences = []
+     end
+
+
+     # Extracts the vocabulary column from a CSV file as an array of strings. The array is
+     # normally provided as an argument to {#initialize}.
+     # @note (see #initialize)
+     # @overload parse_words(path_to_csv, word_col, options)
+     #   @param [String] path_to_csv The relative or full path to the CSV file.
+     #   @param [Integer] word_col The column number of the vocabulary column (counting starts at 1).
+     #   @param [Hash] options The [supported options](http://ruby-doc.org/stdlib-1.9.3/libdoc/csv/rdoc/CSV.html#method-c-new) of Ruby's CSV library as well as the `:encoding` parameter.
+     #     Exceptions: `:encoding` is always set to `utf-8` and `:skip_blanks` to `true` internally.
+     # @overload parse_words(path_to_csv, word_col)
+     #   @param [String] path_to_csv The relative or full path to the CSV file.
+     #   @param [Integer] word_col The column number of the vocabulary column (counting starts at 1).
+     # @return [Array<String>] The vocabulary (Chinese words).
+     # @example (see #sentences_unique_chars)
+     def self.parse_words(path_to_csv, word_col, options={})
+       # Enforced options:
+       #   encoding:    utf-8 (necessary for parsing Chinese characters)
+       #   skip_blanks: true
+       options.merge!({:encoding => 'utf-8', :skip_blanks => true})
+       csv = CSV.read(path_to_csv, options)
+
+       raise ArgumentError, "Column number (#{word_col}) out of range." unless within_range?(word_col, csv[0])
+       # word_col counting starts at 1, but CSV.read returns an array,
+       # where counting starts at 0.
+       col = word_col - 1
+       csv.reduce([]) {|words, row|
+         word = row[col]
+         # If word_col contains no data, CSV::read returns nil.
+         # We also want to skip empty strings or strings that only contain whitespace.
+         words << word unless word.nil? || word.strip.empty?
+         words
+       }
+     end
+
+
+     # For every Chinese word in {#words} fetches a Chinese sentence and its English translation
+     # from an online dictionary.
+     # @note Normally you only call this method directly if you really need one sentence
+     #   per Chinese word (even if these words might appear in more than one of the sentences).
+     # @note (see #min_sentences)
+     # @overload sentences(options)
+     #   @param [Hash] options The options to customize the following features.
+     #   @option options [Symbol] :source The online dictionary to download the sentences from,
+     #     either [:nciku](http://www.nciku.com) or [:jukuu](http://www.jukuu.com).
+     #     Defaults to `:nciku`.
+     #   @option options [Symbol] :size The size of the sentence to return from a possible set of
+     #     several sentences. Supports the values `:short`, `:average`, and `:long`.
+     #     Defaults to `:short`.
+     #   @option options [Boolean] :with_pinyin Whether or not to return the pinyin representation
+     #     of a sentence.
+     #     Defaults to `true`.
+     #   @option options [Integer] :thread_count The number of threads used to download the sentences.
+     #     Defaults to `8`.
+     # @return [Array<Hash>] By default each hash holds the following key-value pairs:
+     #
+     #   * :chinese => Chinese sentence
+     #   * :english => English translation
+     #   * :pinyin  => Pinyin
+     #
+     #   The return value is also stored in {#stored_sentences}.
+     # @example
+     #   require 'chinese_vocab'
+     #
+     #   # Extract the Chinese words from a CSV file.
+     #   words = Chinese::Vocab.parse_words('path/to/file/hsk.csv', 4)
+     #
+     #   # Initialize Chinese::Vocab with the word array.
+     #   # :compact => true means single character words that also appear in multi-character
+     #   # words are removed from the word array (["看", "看书"] => [看书])
+     #   vocabulary = Chinese::Vocab.new(words, :compact => true)
+     #
+     #   # Return a sentence for each word.
+     #   vocabulary.sentences(:size => :short)
+     def sentences(options={})
+       puts "Fetching sentences..."
+       # Always run this method.
+
+       # We assign all options to a variable here (also those that are passed on)
+       # as we need them in order to calculate the id.
+       @with_pinyin = validate { :with_pinyin }
+       thread_count = validate { :thread_count }
+       id           = make_hash(@words, options.to_a.sort)
+       words        = @words
+
+       from_queue = Queue.new
+       to_queue   = Queue.new
+       file_name  = id
+
+       if File.exist?(file_name)
+         puts "Examining file..."
+         words, sentences, not_found = File.open(file_name) { |f| f.readlines }
+         words = convert(words)
+         convert(sentences).each { |s| to_queue << s }
+         @not_found = convert(not_found)
+         size_a = words.size
+         size_b = to_queue.size
+         # puts "Size(words)       = #{size_a}"
+         # puts "Size(to_queue)    = #{size_b}"
+         # puts "Size(words+queue) = #{size_a+size_b}"
+
+         # Remove file
+         File.unlink(file_name)
+       end
+
+       words.each {|word| from_queue << word }
+       result = []
+
+       Thread.abort_on_exception = false
+
+       1.upto(thread_count).map {
+         Thread.new do
+
+           while(word = from_queue.pop!) do
+
+             begin
+               local_result = select_sentence(word, options)
+               puts "Processing word: #{word}"
+             # rescue SocketError, Timeout::Error, Errno::ETIMEDOUT,
+             #        Errno::ECONNREFUSED, Errno::ECONNRESET, EOFError => e
+             rescue Exception => e
+               puts " #{e.message}."
+               puts "Please DO NOT abort, but wait for either the program to continue or all threads"
+               puts "to terminate (in which case the data will be saved to disk for fast retrieval on the next run.)"
+               puts "Number of running threads: #{Thread.list.size - 1}."
+               raise
+
+             ensure
+               from_queue << word if $!
+               puts "Wrote '#{word}' back to queue" if $!
+             end
+
+             to_queue << local_result unless local_result.nil?
+
+           end
+         end
+       }.each {|thread| thread.join }
+
+       @stored_sentences = to_queue.to_a
+       @stored_sentences
+
+     ensure
+       if $!
+         while(Thread.list.size > 1) do # Wait for all child threads to terminate.
+           sleep 5
+         end
+
+         File.open(file_name, 'w') do |f|
+           p "============================="
+           p "Writing data to file..."
+           f.write from_queue.to_a
+           f.puts
+           f.write to_queue.to_a
+           f.puts
+           f.write @not_found
+           puts "Finished writing data."
+           puts "Please run the program again after solving the (connection) problem."
+         end
+       end
+     end
+
+
+ # For every Chinese word in {#words} fetches a Chinese sentence and its English translation
232
+ # from an online dictionary, then calculates and the minimum number of sentences
233
+ # necessary to cover every word in {#words} at least once.
234
+ # The calculation is based on the fact that many words occur in more than one sentence.
235
+ #
236
+ # @note In case of a network error during dowloading the sentences the data fetched
237
+ # so far is automatically copied to a file after several retries. This data is read and
238
+ # processed on the next run to reduce the time spend with downloading the sentences
239
+ # (which is by far the most time-consuming part).
240
+ # @note Despite the download source chosen (by using the default or setting the `:source` options), if a word was not found on the first site, the second site is used as an alternative.
241
+ # @overload min_sentences(options)
242
+ # @param [Hash] options The options to customize the following features.
243
+ # @option options [Symbol] :source The online dictionary to download the sentences from,
244
+ # either [:nciku](http://www.nciku.com) or [:jukuu](http://www.jukuu.com).
245
+ # Defaults to `:nciku`.
246
+ # @option options [Symbol] :size The size of the sentence to return from a possible set of
247
+ # several sentences. Supports the values `:short`, `:average`, and `:long`.
248
+ # Defaults to `:short`.
249
+ # @option options [Boolean] :with_pinyin Whether or not to return the pinyin representation
250
+ # of a sentence.
251
+ # Defaults to `true`.
252
+ # @option options [Integer] :thread_count The number of threads used to download the sentences.
253
+ # Defaults to `8`.
254
+ # @return [Array<Hash>, []] By default each hash holds the following key-value pairs (The return value is also stored in {#stored_sentences}.):
255
+ #
256
+ # * :chinese => Chinese sentence
257
+ # * :english => English translation
258
+ # * :pinyin => Pinyin
259
+ # * :uwc => Unique words count tag (String) of the form "x_words",
260
+ # where *x* denotes the number of unique words from {#words} found in the sentence.
261
+ # * :uws => Unique words string tag (String) of the form "[词语1,词语2,...]",
262
+ # where *词语* denotes the actual word(s) from {#words} found in the sentence.
263
+ # The return value is also stored in {#stored_sentences}.
264
+ # @example (see #sentences_unique_chars)
265
+ def min_sentences(options = {})
266
+ @with_pinyin = validate { :with_pinyin }
267
+ # Always run this method.
268
+ thread_count = validate { :thread_count }
269
+ sentences = sentences(options)
270
+
271
+ minimum_sentences = select_minimum_necessary_sentences(sentences)
272
+ # :uwc = 'unique words count'
273
+ with_uwc_tag = add_key(minimum_sentences, :uwc) {|row| uwc_tag(row[:target_words]) }
274
+ # :uws = 'unique words string'
275
+ with_uwc_uws_tags = add_key(with_uwc_tag, :uws) do |row|
276
+ words = row[:target_words].sort.join(', ')
277
+ "[" + words + "]"
278
+ end
279
+ # Remove those keys we don't need anymore
280
+ result = remove_keys(with_uwc_uws_tags, :target_words, :word)
281
+ @stored_sentences = result
282
+ @stored_sentences
283
+ end
284
+
285
+
286
+ # Finds the unique Chinese characters from either the data in {#stored_sentences} or an
287
+ # array of Chinese sentences passed as an argument.
288
+ # @overload sentences_unique_chars(sentences)
289
+ # @param [Array<String>, Array<Hash>] sentences An array of chinese sentences or an array of hashes with the key `:chinese`.
290
+ # @note If no argument is passed, the data from {#stored_sentences} is used as input
291
+ # @return [Array<String>] The unique Chinese characters
292
+ # @example
293
+ # require 'chinese_vocab'
294
+ #
295
+ # # Extract the Chinese words from a CSV file.
296
+ # words = Chinese::Vocab.parse_words('path/to/file/hsk.csv', 4)
297
+ #
298
+ # # Initialize Chinese::Vocab with word array
299
+ # # :compact => true means single character words are that also appear in multi-character
300
+ # # words are removed from the word array (["看", "看书"] => [看书])
301
+ # vocabulary = Chinese::Vocab.new(words, :compact => true)
302
+ #
303
+ # # Return minimum necessary sentences.
304
+ # vocabulary.min_sentences(:size => small)
305
+ #
306
+ # # See how what are the unique characters in all these sentences.
307
+ # vocabulary.sentences_unique_chars(my_sentences)
308
+ # # => ["我", "们", "跟", "他", "是", "好", "朋", "友", ...]
309
+ #
310
+ # # Save to file
311
+ # vocabulary.to_csv('path/to_file/vocab_sentences.csv')
312
+ def sentences_unique_chars(sentences = stored_sentences)
313
+ # If the argument is an array of hashes, then it must be the data from @stored_sentences
314
+ sentences = sentences.map { |hash| hash[:chinese] } if sentences[0].kind_of?(Hash)
315
+
316
+ sentences.reduce([]) do |acc, row|
317
+ acc | row.scan(/\p{Word}/) # only return characters, skip punctuation marks
319
+ end
320
+ end
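As a standalone illustration of the extraction idea above (the sample sentences are made up, not taken from the gem's data), `scan(/\p{Word}/)` pulls out the characters while skipping punctuation, and `Array#|` accumulates them without duplicates:

```ruby
# Extract the unique CJK characters from a list of sentences.
# \p{Word} matches word characters but skips punctuation such as 。 and ！
sentences = ["我们跟他是好朋友。", "我们是朋友！"]

unique_chars = sentences.reduce([]) do |acc, sentence|
  acc | sentence.scan(/\p{Word}/) # Array#| keeps order and drops duplicates
end

puts unique_chars.inspect
# => ["我", "们", "跟", "他", "是", "好", "朋", "友"]
```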
321
+
322
+
323
+ # Saves the data stored in {#stored_sentences} to disk.
324
+ # @overload to_csv(path_to_file, options)
325
+ # @param [String] path_to_file file name and path of where to save the file.
326
+ # @param [Hash] options The [supported options](http://ruby-doc.org/stdlib-1.9.3/libdoc/csv/rdoc/CSV.html#method-c-new) of Ruby's CSV library.
327
+ # @overload to_csv(path_to_file)
328
+ # @param [String] path_to_file file name and path of where to save the file.
329
+ # @return [void]
330
+ # @example (see #sentences_unique_chars)
331
+ def to_csv(path_to_file, options = {})
332
+
333
+ CSV.open(path_to_file, "w", options) do |csv|
334
+ @stored_sentences.each do |row|
335
+ csv << row.values
336
+ end
337
+ end
338
+ end
339
+
340
+
341
+ # Helper functions
342
+ # -----------------
343
+ def remove_parens(word)
344
+ # 1) Remove all ASCII parens and all data in between.
345
+ # 2) Remove all Chinese parens and all data in between.
346
+ word.gsub(/\(.*?\)/, '').gsub(/（.*?）/, '')
347
+ end
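A quick standalone check of the intended behaviour (the sample word is made up): both ASCII parentheses and fullwidth Chinese parentheses, together with their contents, should be stripped:

```ruby
# Remove ASCII parens (...) and fullwidth parens （...） plus their contents.
word = "便宜(piányi)东西（商品）"
cleaned = word.gsub(/\(.*?\)/, '').gsub(/（.*?）/, '')

puts cleaned
# => "便宜东西"
```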
348
+
349
+
350
+ def is_boolean?(value)
351
+ # Only true for either 'false' or 'true'
352
+ !!value == value
353
+ end
354
+
355
+
356
+ # Remove all non-word characters
357
+ def edit_vocab(word_array)
358
+
359
+ word_array.map {|word|
360
+ edited = remove_parens(word)
361
+ edited = remove_slash(edited)
362
+ edited = remove_er_character_from_end(edited)
363
+ distinct_words(edited).join(' ')
364
+ }.uniq
365
+ end
366
+
367
+
368
+ def remove_er_character_from_end(word)
369
+ if word.size > 2
370
+ word.gsub(/儿$/, '')
371
+ else # Don't remove "儿" from words like 女儿
372
+ word
373
+ end
374
+ end
375
+
376
+
377
+ def remove_slash(word)
378
+ if word.match(/\//)
379
+ word.split(/\//).sort_by { |w| w.size }.last
380
+ else
381
+ word
382
+ end
383
+ end
384
+
385
+
386
+ def make_hash(*data)
387
+ require 'digest'
388
+ data = data.reduce("") { |acc, item| acc << item.to_s }
389
+ Digest::SHA2.hexdigest(data)[0..6]
390
+ end
391
+
392
+
393
+ # Input: ["看", "书", "看书"]
394
+ # Output: ["看书"]
395
+ def remove_redundant_single_char_words(words)
396
+ puts "Removing redundant single character words from the vocabulary..."
397
+
398
+ single_char_words, multi_char_words = words.partition {|word| word.length == 1 }
399
+ return single_char_words if multi_char_words.empty?
400
+
401
+ non_redundant_single_char_words = single_char_words.reduce([]) do |acc, single_c|
402
+
403
+ already_found = multi_char_words.find do |multi_c|
404
+ multi_c.include?(single_c)
405
+ end
406
+ # Add single char word to array if it is not part of any of the multi char words.
407
+ acc << single_c unless already_found
408
+ acc
409
+ end
410
+
411
+ non_redundant_single_char_words + multi_char_words
412
+ end
413
+
414
+
415
+ # Uses options passed from #sentences
416
+ def select_sentence(word, options)
417
+ sentence_pair = Scraper.sentence(word, options)
418
+
419
+ sources = Scraper::Sources.keys
420
+ sentence_pair = try_alternate_download_sources(sources, word, options) if sentence_pair.empty?
421
+
422
+ if sentence_pair.empty?
423
+ @not_found << word
424
+ return nil
425
+ else
426
+ chinese, english = sentence_pair
427
+
428
+ result = Hash.new
429
+ result.merge!(word: word)
430
+ result.merge!(chinese: chinese)
431
+ result.merge!(pinyin: chinese.to_pinyin) if @with_pinyin
432
+ result.merge!(english: english)
433
+ end
434
+ end
435
+
436
+
437
+ def try_alternate_download_sources(alternate_sources, word, options)
438
+ sources = alternate_sources.dup
439
+ sources.delete(options[:source])
440
+
441
+ result = sources.find do |s|
442
+ options = options.merge(:source => s)
443
+ sentence = Scraper.sentence(word, options)
444
+ sentence.empty? ? nil : sentence
445
+ end
446
+
447
+ if result
448
+ options = options.merge(:source => result)
449
+ Scraper.sentence(word, options)
450
+ else
451
+ []
452
+ end
453
+ end
454
+
455
+
456
+ def convert(text)
457
+ # Caution: Kernel#eval runs arbitrary Ruby code; only call this on trusted input.
+ eval(text.chomp)
458
+ end
459
+
460
+
461
+ def add_target_words(hash_array)
462
+ from_queue = Queue.new
463
+ to_queue = Queue.new
464
+ # semaphore = Mutex.new
465
+ result = []
466
+ words = @words
467
+ hash_array.each {|hash| from_queue << hash}
468
+
469
+ 10.times.map {
470
+ Thread.new(words) do
471
+
472
+ while(row = from_queue.pop!)
473
+ sentence = row[:chinese]
474
+ target_words = target_words_per_sentence(sentence, words)
475
+
476
+ to_queue << row.merge(:target_words => target_words)
477
+
478
+ end
479
+ end
480
+ }.map {|thread| thread.join}
481
+
482
+ to_queue.to_a
483
+
484
+ end
485
+
486
+
487
+ def target_words_per_sentence(sentence, words)
488
+ words.select {|w| include_every_char?(w, sentence) }
489
+ end
490
+
491
+
492
+ def sort_by_target_word_count(with_target_words)
493
+
494
+ # First sort by size of unique word array (from large to short)
495
+ # If the unique word count is equal, sort by the length of the sentence (from small to large)
496
+ with_target_words.sort_by {|row|
497
+ [-row[:target_words].size, row[:chinese].size] }
498
+
499
+ # The above is the same as:
500
+ # with_target_words.sort {|a,b|
501
+ # first = -(a[:target_words].size <=> b[:target_words].size)
502
+ # first.nonzero? || (a[:chinese].size <=> b[:chinese].size) }
503
+ end
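As a standalone sketch with made-up rows, the two-key `sort_by` above puts the row covering the most target words first, breaking ties with the shorter sentence:

```ruby
rows = [
  { target_words: ["看书"],         chinese: "我在看书" },
  { target_words: ["看书", "朋友"], chinese: "我和朋友看书" },
  { target_words: ["朋友"],         chinese: "他是我的好朋友" }
]

# Negating the word count sorts it descending; sentence length stays ascending.
sorted = rows.sort_by { |row| [-row[:target_words].size, row[:chinese].size] }

puts sorted.map { |row| row[:chinese] }.inspect
# => ["我和朋友看书", "我在看书", "他是我的好朋友"]
```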
504
+
505
+
506
+ def select_minimum_necessary_sentences(sentences)
507
+ with_target_words = add_target_words(sentences)
508
+ rows = sort_by_target_word_count(with_target_words)
509
+
510
+ selected_rows = []
511
+ unmatched_words = @words.dup
512
+ matched_words = []
513
+
514
+ rows.each do |row|
515
+ words = row[:target_words].dup
516
+ # Delete all words from 'words' that have already been encountered
517
+ # (and are included in 'matched_words').
518
+ words = words - matched_words
519
+
520
+ if words.size > 0 # Words that were not deleted above have to be part of 'unmatched_words'.
521
+ selected_rows << row # Select this row.
522
+
523
+ # When a row is selected, its 'words' are no longer unmatched but matched.
524
+ unmatched_words = unmatched_words - words
525
+ matched_words = matched_words + words
526
+ end
527
+ end
528
+ selected_rows
529
+ end
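The selection loop above is a greedy cover: a row is kept only if it still matches at least one word that no previously selected row has covered. A standalone sketch of that idea with made-up data:

```ruby
rows = [
  { target_words: ["我", "朋友"], chinese: "我有朋友" },
  { target_words: ["朋友"],       chinese: "他是朋友" }, # already covered -> skipped
  { target_words: ["看书"],       chinese: "我在看书" }
]

matched_words = []
selected = rows.select do |row|
  new_words = row[:target_words] - matched_words # words this row newly covers
  matched_words += new_words
  !new_words.empty? # keep the row only if it adds coverage
end

puts selected.map { |row| row[:chinese] }.inspect
# => ["我有朋友", "我在看书"]
```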
530
+
531
+
532
+ def remove_keys(hash_array, *keys)
533
+ hash_array.map { |row| row.delete_keys(*keys) }
534
+ end
535
+
536
+
537
+ def add_key(hash_array, key, &block)
538
+ hash_array.map do |row|
539
+ if block
540
+ row.merge({key => block.call(row)})
541
+ else
542
+ row
543
+ end
544
+ end
545
+ end
546
+
547
+
548
+ def uwc_tag(string)
549
+ size = string.length
550
+ case size
551
+ when 1
552
+ "1_word"
553
+ else
554
+ "#{size}_words"
555
+ end
556
+ end
557
+
558
+
559
+ def contains_all_target_words?(selected_rows, sentence_key)
560
+
561
+ matched_words = @words.reduce([]) do |acc, word|
562
+
563
+ result = selected_rows.find do |row|
564
+ sentence = row[sentence_key]
565
+ include_every_char?(word, sentence)
566
+ end
567
+
568
+ if result
569
+ acc << word
570
+ end
571
+
572
+ acc
573
+ end
574
+
575
+ matched_words.size == @words.size
576
+ end
577
+
578
+
579
+ # Input:
580
+ # column: word column number (counting from 1)
581
+ # row : Array of the processed CSV data that contains our word column.
582
+ def self.within_range?(column, row)
583
+ no_of_cols = row.size
584
+ column >= 1 && column <= no_of_cols
585
+ end
586
+
587
+
588
+ def alternate_source(sources, selection)
589
+ sources = sources.dup
590
+ sources.delete(selection)
591
+ sources.pop
592
+ end
593
+
594
+ end
595
+ end
metadata ADDED
@@ -0,0 +1,120 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: chinese_vocab
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.8.0
5
+ prerelease:
6
+ platform: ruby
7
+ authors:
8
+ - Stefan Rohlfing
9
+ autorequire:
10
+ bindir: bin
11
+ cert_chain: []
12
+ date: 2012-04-13 00:00:00.000000000Z
13
+ dependencies:
14
+ - !ruby/object:Gem::Dependency
15
+ name: with_validations
16
+ requirement: &13638320 !ruby/object:Gem::Requirement
17
+ none: false
18
+ requirements:
19
+ - - ! '>='
20
+ - !ruby/object:Gem::Version
21
+ version: '0'
22
+ type: :runtime
23
+ prerelease: false
24
+ version_requirements: *13638320
25
+ - !ruby/object:Gem::Dependency
26
+ name: nokogiri
27
+ requirement: &13637840 !ruby/object:Gem::Requirement
28
+ none: false
29
+ requirements:
30
+ - - ! '>='
31
+ - !ruby/object:Gem::Version
32
+ version: '0'
33
+ type: :runtime
34
+ prerelease: false
35
+ version_requirements: *13637840
36
+ - !ruby/object:Gem::Dependency
37
+ name: string_to_pinyin
38
+ requirement: &13637400 !ruby/object:Gem::Requirement
39
+ none: false
40
+ requirements:
41
+ - - ! '>='
42
+ - !ruby/object:Gem::Version
43
+ version: '0'
44
+ type: :runtime
45
+ prerelease: false
46
+ version_requirements: *13637400
47
+ - !ruby/object:Gem::Dependency
48
+ name: rspec
49
+ requirement: &13636980 !ruby/object:Gem::Requirement
50
+ none: false
51
+ requirements:
52
+ - - ! '>='
53
+ - !ruby/object:Gem::Version
54
+ version: '0'
55
+ type: :development
56
+ prerelease: false
57
+ version_requirements: *13636980
58
+ description: ! '=== Chinese::Vocab
59
+
60
+ This gem is meant to make life easier for any Chinese language student who:
61
+
62
+ * Prefers to learn vocabulary from Chinese sentences.
63
+
64
+ * Needs to memorize a lot of words on a _tight_ _time_ _schedule_.
65
+
66
+ * Uses the spaced repetition flashcard program {Anki}[http://ankisrs.net/].
67
+
68
+
69
+ Chinese::Vocab addresses all of the above requirements by downloading sentences
70
+ for each word and
71
+
72
+ selecting the *minimum* *required* *number* *of* *Chinese* *sentences* (and English
73
+ translations)
74
+
75
+ to *represent* *all* *words*.
76
+
77
+ '
78
+ email: stefan.rohlfing@gmail.com
79
+ executables: []
80
+ extensions: []
81
+ extra_rdoc_files: []
82
+ files:
83
+ - lib/chinese.rb
84
+ - lib/chinese/version.rb
85
+ - lib/chinese/core_ext/hash.rb
86
+ - lib/chinese/core_ext/array.rb
87
+ - lib/chinese/core_ext/queue.rb
88
+ - lib/chinese/modules/helper_methods.rb
89
+ - lib/chinese/vocab.rb
90
+ - lib/chinese/scraper.rb
91
+ - README.md
92
+ - Rakefile
93
+ - LICENSE
94
+ - Gemfile
95
+ homepage: http://github.com/bytesource/chinese_vocab
96
+ licenses: []
97
+ post_install_message:
98
+ rdoc_options: []
99
+ require_paths:
100
+ - lib
101
+ required_ruby_version: !ruby/object:Gem::Requirement
102
+ none: false
103
+ requirements:
104
+ - - ! '>='
105
+ - !ruby/object:Gem::Version
106
+ version: 1.9.1
107
+ required_rubygems_version: !ruby/object:Gem::Requirement
108
+ none: false
109
+ requirements:
110
+ - - ! '>='
111
+ - !ruby/object:Gem::Version
112
+ version: '0'
113
+ requirements: []
114
+ rubyforge_project: chinese_vocab
115
+ rubygems_version: 1.8.15
116
+ signing_key:
117
+ specification_version: 3
118
+ summary: Chinese::Vocab - Downloading and selecting the minimum required number of
119
+ sentences for your Chinese vocabulary list
120
+ test_files: []