chinese_vocab 0.8.0
- data/Gemfile +4 -0
- data/LICENSE +22 -0
- data/README.md +103 -0
- data/Rakefile +22 -0
- data/lib/chinese.rb +11 -0
- data/lib/chinese/core_ext/array.rb +14 -0
- data/lib/chinese/core_ext/hash.rb +37 -0
- data/lib/chinese/core_ext/queue.rb +25 -0
- data/lib/chinese/modules/helper_methods.rb +38 -0
- data/lib/chinese/scraper.rb +143 -0
- data/lib/chinese/version.rb +3 -0
- data/lib/chinese/vocab.rb +595 -0
- metadata +120 -0
data/Gemfile
ADDED
data/LICENSE
ADDED
@@ -0,0 +1,22 @@
Copyright (c) 2012 Stefan Rohlfing

MIT License

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,103 @@
# Chinese::Vocab

`Chinese::Vocab` is meant to make life easier for any Chinese language student who:

* Prefers to learn vocabulary from Chinese sentences.
* Needs to memorize a lot of words on a __tight time schedule__.
* Uses the spaced repetition flashcard program [Anki](http://ankisrs.net/).

`Chinese::Vocab` addresses all of the above requirements by downloading sentences for each word and selecting the __minimum required number of Chinese sentences__ (and English translations) to __represent all words__.

You can then export the sentences as well as additional tags provided by `Chinese::Vocab` to Anki.

## Features

* Downloads sentences for each word in a Chinese vocabulary list and selects the __minimum required number of sentences__ to represent all words.
* With the option key `:compact` set to `true` on initialization, all single-character words that also appear in at least one multi-character word are removed. The reason behind this option is to __remove redundancy in meaning__ and focus on learning distinct words. Example: `["看", "看书"] => ["看书"]`
* Adds additional __tags__ to every sentence that can be used in *Anki*:
  * __Pinyin__: By default the pinyin representation is added to each sentence. Example: "除了这张大钞以外,我没有其他零票了。" => "chú le zhè zhāng dà chāo yĭ wài ,wŏ méi yŏu qí tā líng piào le 。"
  * __Number of target words__: The number of words from the vocabulary that are covered by a sentence. Example: "除了这张大钞以外,我没有其他零票了。" => "3_words"
  * __List of target words__: A list of the words from the vocabulary that are covered by a sentence. Example: "除了这张大钞以外,我没有其他零票了。" => "[我, 他, 除了 以外]"
* Export data to CSV for easy import into *Anki*.


## Real World Example (using the Traditional HSK word list)

```` ruby
# Import words from source.
# First argument:  path to file
# Second argument: column number of the word column (counting starts at 1)
words = Chinese::Vocab.parse_words('../old_hsk_level_8828_chars_1_word_edited.csv', 4)
# Sample output:
words.take(6)
# => ["啊", "啊", "矮", "爱", "爱人", "安静"]


# Initialize an object.
# First argument: word list as an array of strings.
# Options:
# :compact (defaults to false)
anki = Chinese::Vocab.new(words, :compact => true)

# List all words
p anki.words.take(6)
# => ["啊", "啊", "矮", "爱", "爱人", "安静"]
p anki.words.size
# => 7251

# Options:
# :source (defaults to :nciku)
# :size (defaults to :short)
# :with_pinyin (defaults to true)
anki.min_sentences(:thread_count => 10)

p anki.stored_sentences.take(2)
# [{:word=>"吧", :chinese=>"放心吧,他做事向来把牢。",
#   :pinyin=>"fàng xīn ba ,tā zuò shì xiàng lái bă láo 。",
#   :english=>"Take it easy. You can always count on him."},
#  {:word=>"喝", :chinese=>"喝酒挂红的人一般都很能喝。",
#   :pinyin=>"hē jiŭ guà hóng de rén yī bān dōu hĕn néng hē 。",
#   :english=>"Those whose face turn red after drinking are normally heavy drinkers."}]

# Words not found
p anki.not_found
# ["来回来去", "来看来讲", "深美"]

# Number of unique characters in the selected sentences
p anki.sentences_unique_chars.size
# => 3290

# Save data to CSV.
# First parameter: path to file
# Options:
# Any supported option of Ruby's CSV library
anki.to_csv('in_the_wild_test.csv')
# Sample output (2 sentences/lines out of 4511):

# 舞台上正在上演的是吕剧。,wŭ tái shàng zhèng zài shàng yăn de shì lǚ jù 。,
# What is being performed on the stage is Lv opera (a local opera of Shandong Province).
# ,2_words,"[正在, 舞台]"
# 古代官员上朝都要穿朝靴。,gŭ dài guān yuán shàng cháo dōu yào chuān cháo xuē 。,
# "In ancient times, all courtiers had to wear special boots to enter the court.",
# 2_words,"[古代, 官员]"

````

## Documentation
* [parse_words](http://rubydoc.info/github/bytesource/chinese_vocab/master/Chinese/Vocab.parse_words) - How to read in the Chinese words and correctly set the column number. Options:
  * The [supported options](http://ruby-doc.org/stdlib-1.9.3/libdoc/csv/rdoc/CSV.html#method-c-new) of Ruby's CSV library as well as the `:encoding` parameter. __Note__: `:encoding` is always set to `utf-8` and `:skip_blanks` to `true` internally.
* [initialize](http://rubydoc.info/github/bytesource/chinese_vocab/master/Chinese/Vocab:initialize) - How to write composite expressions such as "除了。。以外". Options:
  * `:compact` (`Boolean`): Whether or not to remove all single-character words that also appear in at least one multi-character word. Example: `["看", "看书"] => ["看书"]`. The reason behind this option is to remove redundancy in meaning and focus on learning distinct words.
* [words](http://rubydoc.info/github/bytesource/chinese_vocab/master/Chinese/Vocab:words) - Learn how words are edited internally.
* [min_sentences](http://rubydoc.info/github/bytesource/chinese_vocab/master/Chinese/Vocab:min_sentences) - Options:
  * `:source` (`Symbol`): The online dictionary to download the sentences from, either [:nciku](http://www.nciku.com) or [:jukuu](http://www.jukuu.com). Defaults to `:nciku`. __Note__: Regardless of the download source chosen (by using the default or setting the `:source` option), if a word was not found on the first site, the second site is used as an alternative.
  * `:with_pinyin` (`Boolean`): Whether or not to return the pinyin representation of a sentence. Defaults to `true`.
  * `:size` (`Symbol`): The size of the sentence to return from a possible set of several sentences. Supports the values `:short`, `:average`, and `:long`. Defaults to `:short`.
  * `:thread_count` (`Integer`): The number of threads used to download the sentences. Defaults to `8`.
* [sentences_unique_chars](http://rubydoc.info/github/bytesource/chinese_vocab/master/Chinese/Vocab:sentences_unique_chars) - Lists the unique Chinese *characters* (single-character words) found in the selected sentences.
* [to_csv](http://rubydoc.info/github/bytesource/chinese_vocab/master/Chinese/Vocab:to_csv) - Options:
  * All [supported options](http://ruby-doc.org/stdlib-1.9.3/libdoc/csv/rdoc/CSV.html#method-c-new) of Ruby's CSV library.
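A note on the CSV options mentioned above: both `parse_words` and `to_csv` simply forward them to Ruby's CSV library. A minimal sketch, assuming a hypothetical semicolon-separated input file with the words in column 2 (file names here are made up for illustration):

```` ruby
require 'chinese_vocab'

# Hypothetical input file, parsed with the standard CSV option :col_sep.
words = Chinese::Vocab.parse_words('my_vocab.csv', 2, :col_sep => ';')

vocab = Chinese::Vocab.new(words, :compact => true)
vocab.min_sentences(:source => :jukuu, :size => :average, :thread_count => 4)

# Tab-separated output, which Anki's import dialog accepts.
vocab.to_csv('anki_import.txt', :col_sep => "\t")
````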
data/Rakefile
ADDED
@@ -0,0 +1,22 @@
```` ruby
# require 'spec/rake/spectask' # deprecated
require 'rspec/core/rake_task'
# require 'rake/gempackagetask' # deprecated
require 'rubygems/package_task'
require 'rdoc/task'

# Build gem: rake gem
# Push gem:  rake push

task :default => [ :spec, :gem ]

RSpec::Core::RakeTask.new(:spec)

gem_spec = eval(File.read('chinese_vocab.gemspec'))

Gem::PackageTask.new(gem_spec) do |t|
  t.need_zip = true
end

task :push => :gem do |t|
  sh "gem push pkg/#{gem_spec.name}-#{gem_spec.version}.gem"
end
````
data/lib/chinese.rb
ADDED
@@ -0,0 +1,11 @@
```` ruby
# encoding: utf-8
require 'chinese/vocab'
require 'chinese/scraper'
require 'chinese/version'
require 'chinese/core_ext/array'
require 'chinese/core_ext/hash'
require 'chinese/core_ext/queue'
require 'chinese/modules/helper_methods'

module Chinese
end
````
data/lib/chinese/core_ext/array.rb
ADDED
@@ -0,0 +1,14 @@
```` ruby
# encoding: utf-8

class Array

  # Input:  [1,2,3,4,5]
  # Output: [[1, 2], [2, 3], [3, 4], [4, 5]]
  def overlap_pairs
    second = self.dup.drop(1)
    self.each_with_index.inject([]) {|acc,(item,i)|
      acc << [item,second[i]] unless second[i].nil?
      acc
    }
  end
end
````
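As the Input/Output comment indicates, `Array#overlap_pairs` turns a flat list into consecutive pairs; the scraper later filters these pairs down to real cn/en matches. A quick irb check of the expected behavior:

```` ruby
[1, 2, 3, 4, 5].overlap_pairs
# => [[1, 2], [2, 3], [3, 4], [4, 5]]

# Arrays with fewer than two elements yield no pairs.
[1].overlap_pairs # => []
[].overlap_pairs  # => []
````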
data/lib/chinese/core_ext/hash.rb
ADDED
@@ -0,0 +1,37 @@
```` ruby
# encoding: utf-8

class Hash

  # Returns a copy of self with *keys removed.
  def delete_keys(*keys)
    hash = self.dup

    keys.each do |key|
      hash.delete(key)
    end
    hash
  end

  # Remove *keys from self
  def delete_keys!(*keys)
    keys.each do |key|
      self.delete(key)
    end
  end

  # Creates a sub-hash from `self` with the keys from `keys`
  # @note keys in `keys` not present in `self` are silently ignored.
  # @return [Hash] a copy of `self`.
  def slice(*keys)
    self.select { |k,v| keys.include?(k) }
  end

  def slice!(*keys)
    sub_hash = self.select { |k,v| keys.include?(k) }
    # Remove 'keys' from self:
    self.delete_keys!(*sub_hash.keys)
    sub_hash
  end
end
````
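Expected behavior of these extensions (note that Ruby 2.5+ ships its own `Hash#slice` with the same copy semantics as the version defined here):

```` ruby
h = { :a => 1, :b => 2, :c => 3 }

h.delete_keys(:a, :c) # => {:b=>2}  (returns a copy; h is unchanged)
h.slice(:a, :x)       # => {:a=>1}  (keys missing from h are ignored)

h.slice!(:a)          # => {:a=>1}
h                     # => {:b=>2, :c=>3}  (sliced keys removed from h itself)
````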
data/lib/chinese/core_ext/queue.rb
ADDED
@@ -0,0 +1,25 @@
```` ruby
# encoding: utf-8

require 'thread'

class Queue

  def to_a
    @que
  end

  # Non-blocking pop: return nil if the queue is empty
  # (instead of raising ThreadError).
  def pop!
    pop(true)
  rescue ThreadError => e
    case e.message
    when /queue empty/
      nil
    else
      raise
    end
  end

end
````
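The nil return is what lets the download workers use the queue itself as a termination condition. A quick check:

```` ruby
require 'thread'

q = Queue.new
q << :word1
q << :word2

q.pop! # => :word1
q.pop! # => :word2
q.pop! # => nil (instead of blocking or raising ThreadError)

# This nil is what terminates the `while (word = from_queue.pop!)`
# worker loops in Chinese::Vocab once the queue runs dry.
````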
data/lib/chinese/modules/helper_methods.rb
ADDED
@@ -0,0 +1,38 @@
```` ruby
# encoding: utf-8

module Chinese
  module HelperMethods

    def self.included(klass)
      klass.extend(self)
    end

    def is_unicode?(word)
      # Remove all non-ascii and non-unicode word characters
      word = distinct_words(word).join
      # English text at this point only contains characters that are matched by \w.
      # Chinese text at this point contains mostly/only unicode word characters that are not matched by \w.
      # In case of Chinese text the size of 'char_arr' therefore has to be smaller than the size of 'word'.
      char_arr = word.scan(/\w/)
      char_arr.size < word.size
    end

    # Input:  "除了。。。 以外。。。"
    # Output: ["除了", "以外"]
    def distinct_words(word)
      # http://stackoverflow.com/a/3976004
      # Alternative: /[[:word:]]+/
      word.scan(/\p{Word}+/) # Returns an array of characters that belong together.
    end

    # Return true if every distinct word (as defined by #distinct_words)
    # can be found in the given sentence.
    def include_every_char?(word, sentence)
      characters = distinct_words(word)
      characters.all? {|char| sentence.include?(char) }
    end

  end
end
````
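A sketch of what these helpers return, assuming the module is mixed in (as `self.included` above shows, including it also extends the includer); the sample sentence is the one used in the README:

```` ruby
include Chinese::HelperMethods

is_unicode?("除了以外") # => true  (no \w-matchable ASCII word characters)
is_unicode?("besides")  # => false

distinct_words("除了。。。 以外。。。") # => ["除了", "以外"]

# True only if every part of a composite word occurs in the sentence:
include_every_char?("除了 以外", "除了这张大钞以外,我没有其他零票了。") # => true
include_every_char?("除了 以外", "我没有其他零票了。")                 # => false
````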
data/lib/chinese/scraper.rb
ADDED
@@ -0,0 +1,143 @@
```` ruby
# encoding: utf-8
require 'cgi'
require 'open-uri'
require 'nokogiri'
require 'timeout'
require 'chinese/core_ext/array'
require 'with_validations'
require 'chinese/modules/helper_methods'

module Chinese
  class Scraper
    include WithValidations
    include HelperMethods

    attr_reader :source, :word
    attr_accessor :sentences

    Sources = {
      nciku:
      {:url         => "http://www.nciku.com/search/all/examples/",
       :parent_sel  => "div.examples_box > dl",
       :cn_sel      => "//dt/span[1]",
       :en_sel      => "//dd/span[@class='tc_sub']",
       # Only cn/en sentence pairs where the second node has a class 'tc_sub' belong together.
       :select_pair => lambda { |node1,node2| node1['class'] != "tc_sub" && node2['class'] == "tc_sub" },
       # Just return the text stored in the node. :text_sel is mainly intended for jukuu (see below).
       :text_sel    => "text()",
       # We want cn first, en second, but nciku does not return cn/en sentence pairs in a strict order.
       :reorder     => lambda { |text1,text2| if is_unicode?(text2) then [text2,text1] else [text1,text2] end }},
      jukuu:
      {:url         => "http://www.jukuu.com/search.php?q=",
       :parent_sel  => "table#Table1 table[width = '680']",
       :cn_sel      => "//tr[@class='c']",
       :en_sel      => "//tr[@class='e']",
       # Only cn/en sentence pairs where the first node has a class 'e' belong together.
       :select_pair => lambda { |node1,node2| node1['class'] == "e" && node2['class'] != "e" },
       :text_sel    => "td[2]",
       :reorder     => lambda { |text1,text2| [text2,text1] }}
    }

    OPTIONS = {:source => [:nciku,   lambda {|value| Sources.keys.include?(value) }],
               :size   => [:average, lambda {|value| [:short, :average, :long].include?(value) }]}


    # Options:
    # size => [:short, :average, :long], default = :average
    def self.sentences(word, options={})
      download_source = validate { :source }

      source = Sources[download_source]

      CGI.accept_charset = 'UTF-8'
      # Note: Use + because << changes the object on its left hand side, but + doesn't:
      # http://stackoverflow.com/questions/377768/string-concatenation-and-ruby/378258#378258
      url = source[:url] + CGI.escape(word)
      # http://ruby-doc.org/stdlib-1.9.2/libdoc/timeout/rdoc/Timeout.html#method-c-timeout
      content = Timeout.timeout(20) { open(url) }
      main_node = Nokogiri::HTML(content).css(source[:parent_sel]) # Returns a single node.
      return [] if main_node.to_a.empty?

      # CSS selector:   Returns the tags in the order they are specified.
      # XPath selector: Returns the tags in the order they appear in the document (that's what we want here).
      # Source: http://stackoverflow.com/questions/5825136/nokogiri-and-finding-element-by-name/5845985#5845985
      target_nodes = main_node.search("#{source[:cn_sel]} | #{source[:en_sel]}")
      return [] if target_nodes.to_a.empty?

      # In order to make sure we only return text that also has a translation,
      # we need to first group each target node with Array#overlap_pairs like this:
      # Input:  [cn1, cn2, en2, cn3, en3, cn4]
      # Output: [[cn1,cn2],[cn2,en2],[en2,cn3],[cn3,en3],[en3,cn4]]
      # and then select the correct pairs: [[cn2,en2],[cn3,en3]].
      # Regarding #to_a: Nokogiri::XML::NodeSet => Array
      sentence_pairs = target_nodes.to_a.overlap_pairs.select {|(node1,node2)| source[:select_pair].call(node1,node2) }
      sentence_pairs = sentence_pairs.reduce([]) do |acc,(cn_node,en_node)|
        cn   = cn_node.css(source[:text_sel]).text.strip # 'text' returns an empty string when 'css' returns an empty array.
        en   = en_node.css(source[:text_sel]).text.strip
        pair = [cn,en]
        # Ensure that both the Chinese and the English selector have text
        # (sometimes they don't).
        acc << pair unless pair_with_empty_string?(pair)
        acc
      end
      # Switch the position of each pair if the first entry is the translation,
      # as we always return an array of [cn_sentence,en_sentence] pairs.
      # The following step is necessary because:
      # 1) Jukuu returns sentences in the order English first, Chinese second.
      # 2) Nciku mostly returns sentences in the order Chinese first, English second
      #    (but sometimes it is the other way round).
      sentence_pairs = sentence_pairs.map {|node1,node2| source[:reorder].call(node1,node2) }
      # Only select Chinese sentences that don't separate words, e.g., skip all sentences like the following:
      # 北边 => 树林边的河流向北方
      sentence_pairs = sentence_pairs.select { |cn, _| include_every_char?(word, cn) }

      sentence_pairs
    end

    def self.sentence(word, options={})
      value = validate { :size }

      scraped_sentences = sentences(word, options)
      return [] if scraped_sentences.empty?

      case value
      when :short
        shortest_size(scraped_sentences)
      when :average
        average_size(scraped_sentences)
      when :long
        longest_size(scraped_sentences)
      end
    end


    # ===================
    # Helper methods
    # ===================

    def self.pair_with_empty_string?(pair)
      pair[0].empty? || pair[1].empty?
    end

    # Despite its name returns the SECOND shortest sentence,
    # as the shortest result often is not a real sentence,
    # but a definition.
    def self.shortest_size(sentence_pairs)
      sentence_pairs.sort_by {|(cn,_)| cn.length }.take(2).last
    end

    def self.longest_size(sentence_pairs)
      sentence_pairs.sort_by {|(cn,_)| cn.length }.last
    end

    def self.average_size(sentence_pairs)
      sorted = sentence_pairs.sort_by {|(cn,_)| cn.length }
      # Return the pair whose Chinese sentence has the median length.
      sorted[sorted.size / 2]
    end

  end
end
````
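Typical usage of the scraper on its own, hedged: this requires live access to nciku or jukuu, and the return value shown only illustrates the shape of the result, it is not a recorded response:

```` ruby
require 'chinese'

# All [chinese, english] pairs found for a word (network access required).
Chinese::Scraper.sentences("朋友", :source => :jukuu)
# => [["我们是好朋友。", "We are good friends."], ...] (illustrative)

# Let the scraper pick a single pair by sentence size
# (:short actually returns the second-shortest match, see above).
Chinese::Scraper.sentence("朋友", :source => :jukuu, :size => :short)
````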
data/lib/chinese/vocab.rb
ADDED
@@ -0,0 +1,595 @@
```` ruby
# encoding: utf-8
require 'thread'
require 'open-uri'
require 'nokogiri'
require 'cgi'
require 'csv'
require 'with_validations'
require 'string_to_pinyin'
require 'chinese/scraper'
require 'chinese/modules/helper_methods'
require 'chinese/core_ext/hash'
require 'chinese/core_ext/queue'

module Chinese
  class Vocab
    include WithValidations
    include HelperMethods

    # The list of Chinese words after calling {#edit_vocab}. Editing includes:
    #
    # * Removing parentheses (with the content inside each parenthesis).
    # * Removing any slash (/) and only keeping the longest part.
    # * Removing '儿' from any word longer than two characters.
    # * Removing non-word characters such as points and commas.
    # * Removing duplicate words.
    # @return [Array<String>]
    attr_reader :words
    # @return [Boolean] the value of the `:compact` option key.
    attr_reader :compact
    # @return [Array<String>] holds those Chinese words from {#words} that could not be found in any
    #  of the supported online dictionaries during a call to either {#sentences} or {#min_sentences}.
    #  Defaults to `[]`.
    attr_reader :not_found
    # @return [Boolean] the value of the `:with_pinyin` option key.
    attr_reader :with_pinyin
    # @return [Array<Hash>] holds the return value of either {#sentences} or {#min_sentences},
    #  whichever was called last. Defaults to `[]`.
    attr_reader :stored_sentences

    # Mandatory constant for the [WithValidations](http://rubydoc.info/github/bytesource/with_validations/file/README.md) module. Each key-value pair is of the following type:
    # `option_key => [default_value, validation]`
    OPTIONS = {:compact      => [false, lambda {|value| is_boolean?(value) }],
               :with_pinyin  => [true,  lambda {|value| is_boolean?(value) }],
               :thread_count => [8,     lambda {|value| value.kind_of?(Integer) }]}

    # Initializes an object.
    # @note Words that are composite expressions must be written with at least one non-word
    #  character (such as whitespace) between each sub-expression. Example: "除了 以外" or
    #  "除了。。以外" instead of "除了以外".
    # @overload initialize(word_array, options)
    #  @param [Array<String>] word_array An array of Chinese words that is stored in {#words} after
    #   all non-ascii, non-unicode characters have been stripped and double entries removed.
    #  @param [Hash] options The options to customize the following features.
    #  @option options [Boolean] :compact Whether or not to remove all single character words that
    #   also appear in at least one multi character word. Example: (["看", "看书"] => ["看书"])
    #   The reason behind this option is to remove redundancy in meaning and focus on learning distinct words.
    #   Defaults to `false`.
    # @overload initialize(word_array)
    #  @param [Array<String>] word_array An array of Chinese words that is stored in {#words} after
    #   all non-ascii, non-unicode characters have been stripped and double entries removed.
    # @example (see #sentences_unique_chars)
    def initialize(word_array, options={})
      @compact = validate { :compact }
      @words = edit_vocab(word_array)
      @words = remove_redundant_single_char_words(@words) if @compact
      @chinese = is_unicode?(@words[0])
      @not_found = []
      @stored_sentences = []
    end


    # Extracts the vocabulary column from a CSV file as an array of strings. The array is
    # normally provided as an argument to {#initialize}.
    # @note (see #initialize)
    # @overload parse_words(path_to_csv, word_col, options)
    #  @param [String] path_to_csv The relative or full path to the CSV file.
    #  @param [Integer] word_col The column number of the vocabulary column (counting starts at 1).
    #  @param [Hash] options The [supported options](http://ruby-doc.org/stdlib-1.9.3/libdoc/csv/rdoc/CSV.html#method-c-new) of Ruby's CSV library as well as the `:encoding` parameter.
    #   Exceptions: `:encoding` is always set to `utf-8` and `:skip_blanks` to `true` internally.
    # @overload parse_words(path_to_csv, word_col)
    #  @param [String] path_to_csv The relative or full path to the CSV file.
    #  @param [Integer] word_col The column number of the vocabulary column (counting starts at 1).
    # @return [Array<String>] The vocabulary (Chinese words)
    # @example (see #sentences_unique_chars)
    def self.parse_words(path_to_csv, word_col, options={})
      # Enforced options:
      # encoding:    utf-8 (necessary for parsing Chinese characters)
      # skip_blanks: true
      options.merge!({:encoding => 'utf-8', :skip_blanks => true})
      csv = CSV.read(path_to_csv, options)

      raise ArgumentError, "Column number (#{word_col}) out of range." unless within_range?(word_col, csv[0])
      # 'word_col' counting starts at 1, but CSV.read returns an array,
      # where counting starts at 0.
      col = word_col-1
      csv.reduce([]) {|words, row|
        word = row[col]
        # If word_col contains no data, CSV::read returns nil.
        # We also want to skip empty strings or strings that only contain whitespace.
        words << word unless word.nil? || word.strip.empty?
        words
      }
    end


    # For every Chinese word in {#words} fetches a Chinese sentence and its English translation
    # from an online dictionary.
    # @note Normally you only call this method directly if you really need one sentence
    #  per Chinese word (even if these words might appear in more than one of the sentences).
    # @note (see #min_sentences)
    # @overload sentences(options)
    #  @param [Hash] options The options to customize the following features.
    #  @option options [Symbol] :source The online dictionary to download the sentences from,
    #   either [:nciku](http://www.nciku.com) or [:jukuu](http://www.jukuu.com).
    #   Defaults to `:nciku`.
    #  @option options [Symbol] :size The size of the sentence to return from a possible set of
    #   several sentences. Supports the values `:short`, `:average`, and `:long`.
    #   Defaults to `:short`.
    #  @option options [Boolean] :with_pinyin Whether or not to return the pinyin representation
    #   of a sentence.
    #   Defaults to `true`.
    #  @option options [Integer] :thread_count The number of threads used to download the sentences.
    #   Defaults to `8`.
    # @return [Array<Hash>] By default each hash holds the following key-value pairs:
    #
    #  * :chinese => Chinese sentence
    #  * :english => English translation
    #  * :pinyin  => Pinyin
    #
    #  The return value is also stored in {#stored_sentences}.
    # @example
    #  require 'chinese_vocab'
    #
    #  # Extract the Chinese words from a CSV file.
    #  words = Chinese::Vocab.parse_words('path/to/file/hsk.csv', 4)
    #
    #  # Initialize Chinese::Vocab with the word array.
    #  # :compact => true means single character words that also appear in multi-character
    #  # words are removed from the word array (["看", "看书"] => ["看书"])
    #  vocabulary = Chinese::Vocab.new(words, :compact => true)
    #
    #  # Return a sentence for each word
    #  vocabulary.sentences(:size => :short)
    def sentences(options={})
      puts "Fetching sentences..."
      # Always run this method.

      # We assign all options to a variable here (also those that are passed on)
      # as we need them in order to calculate the id.
      @with_pinyin = validate { :with_pinyin }
      thread_count = validate { :thread_count }
      id           = make_hash(@words, options.to_a.sort)
      words        = @words

      from_queue = Queue.new
      to_queue   = Queue.new
      file_name  = id

      if File.exist?(file_name)
        puts "Examining file..."
        words, sentences, not_found = File.open(file_name) { |f| f.readlines }
        words = convert(words)
        convert(sentences).each { |s| to_queue << s }
        @not_found = convert(not_found)
        size_a = words.size
        size_b = to_queue.size
        # puts "Size(words)       = #{size_a}"
        # puts "Size(to_queue)    = #{size_b}"
        # puts "Size(words+queue) = #{size_a+size_b}"

        # Remove file
        File.unlink(file_name)
      end

      words.each {|word| from_queue << word }
      result = []

      Thread.abort_on_exception = false

      1.upto(thread_count).map {
        Thread.new do

          while(word = from_queue.pop!) do

            begin
              local_result = select_sentence(word, options)
              puts "Processing word: #{word}"
            # rescue SocketError, Timeout::Error, Errno::ETIMEDOUT,
            #        Errno::ECONNREFUSED, Errno::ECONNRESET, EOFError => e
            rescue Exception => e
              puts " #{e.message}."
              puts "Please DO NOT abort, but wait for either the program to continue or all threads"
              puts "to terminate (in which case the data will be saved to disk for fast retrieval on the next run.)"
              puts "Number of running threads: #{Thread.list.size - 1}."
              raise

            ensure
              from_queue << word if $!
              puts "Wrote '#{word}' back to queue" if $!
            end

            to_queue << local_result unless local_result.nil?

          end
        end
      }.each {|thread| thread.join }

      @stored_sentences = to_queue.to_a
      @stored_sentences

    ensure
      if $!
        while(Thread.list.size > 1) do # Wait for all child threads to terminate.
          sleep 5
        end

        File.open(file_name, 'w') do |f|
          p "============================="
          p "Writing data to file..."
          f.write from_queue.to_a
          f.puts
          f.write to_queue.to_a
          f.puts
          f.write @not_found
          puts "Finished writing data."
          puts "Please run the program again after solving the (connection) problem."
        end
      end
    end


    # For every Chinese word in {#words} fetches a Chinese sentence and its English translation
    # from an online dictionary, then calculates the minimum number of sentences
    # necessary to cover every word in {#words} at least once.
    # The calculation is based on the fact that many words occur in more than one sentence.
    #
    # @note In case of a network error during downloading the sentences the data fetched
    #  so far is automatically copied to a file after several retries. This data is read and
    #  processed on the next run to reduce the time spent with downloading the sentences
    #  (which is by far the most time-consuming part).
    # @note Regardless of the download source chosen (by using the default or setting the `:source` option), if a word was not found on the first site, the second site is used as an alternative.
    # @overload min_sentences(options)
    #  @param [Hash] options The options to customize the following features.
    #  @option options [Symbol] :source The online dictionary to download the sentences from,
    #   either [:nciku](http://www.nciku.com) or [:jukuu](http://www.jukuu.com).
    #   Defaults to `:nciku`.
    #  @option options [Symbol] :size The size of the sentence to return from a possible set of
    #   several sentences. Supports the values `:short`, `:average`, and `:long`.
    #   Defaults to `:short`.
    #  @option options [Boolean] :with_pinyin Whether or not to return the pinyin representation
    #   of a sentence.
    #   Defaults to `true`.
    #  @option options [Integer] :thread_count The number of threads used to download the sentences.
    #   Defaults to `8`.
    # @return [Array<Hash>, []] By default each hash holds the following key-value pairs:
    #
    #  * :chinese => Chinese sentence
    #  * :english => English translation
    #  * :pinyin  => Pinyin
    #  * :uwc     => Unique words count tag (String) of the form "x_words",
    #    where *x* denotes the number of unique words from {#words} found in the sentence.
    #  * :uws     => Unique words string tag (String) of the form "[词语1,词语2,...]",
    #    where *词语* denotes the actual word(s) from {#words} found in the sentence.
    #
    #  The return value is also stored in {#stored_sentences}.
    # @example (see #sentences_unique_chars)
    def min_sentences(options = {})
      @with_pinyin = validate { :with_pinyin }
      # Always run this method.
      thread_count = validate { :thread_count }
      sentences    = sentences(options)

      minimum_sentences = select_minimum_necessary_sentences(sentences)
      # :uwc = 'unique words count'
      with_uwc_tag = add_key(minimum_sentences, :uwc) {|row| uwc_tag(row[:target_words]) }
      # :uws = 'unique words string'
      with_uwc_uws_tags = add_key(with_uwc_tag, :uws) do |row|
        words = row[:target_words].sort.join(', ')
        "[" + words + "]"
      end
      # Remove those keys we don't need anymore
      result = remove_keys(with_uwc_uws_tags, :target_words, :word)
      @stored_sentences = result
      @stored_sentences
    end


    # Finds the unique Chinese characters from either the data in {#stored_sentences} or an
    # array of Chinese sentences passed as an argument.
    # @overload sentences_unique_chars(sentences)
    #  @param [Array<String>, Array<Hash>] sentences An array of Chinese sentences or an array of hashes with the key `:chinese`.
    # @note If no argument is passed, the data from {#stored_sentences} is used as input.
    # @return [Array<String>] The unique Chinese characters
    # @example
    #  require 'chinese_vocab'
    #
    #  # Extract the Chinese words from a CSV file.
    #  words = Chinese::Vocab.parse_words('path/to/file/hsk.csv', 4)
    #
    #  # Initialize Chinese::Vocab with the word array.
    #  # :compact => true means single character words that also appear in multi-character
    #  # words are removed from the word array (["看", "看书"] => ["看书"])
    #  vocabulary = Chinese::Vocab.new(words, :compact => true)
    #
    #  # Return the minimum necessary sentences.
    #  vocabulary.min_sentences(:size => :short)
    #
    #  # See the unique characters in all these sentences.
    #  vocabulary.sentences_unique_chars
    #  # => ["我", "们", "跟", "他", "是", "好", "朋", "友", ...]
    #
    #  # Save to file
    #  vocabulary.to_csv('path/to_file/vocab_sentences.csv')
    def sentences_unique_chars(sentences = stored_sentences)
      # If the argument is an array of hashes, then it must be the data from @stored_sentences
      sentences = sentences.map { |hash| hash[:chinese] } if sentences[0].kind_of?(Hash)

      sentences.reduce([]) do |acc, row|
        acc = acc | row.scan(/\p{Word}/) # only return characters, skip punctuation marks
        acc
      end
    end


    # Saves the data stored in {#stored_sentences} to disk.
    # @overload to_csv(path_to_file, options)
    #  @param [String] path_to_file file name and path of where to save the file.
    #  @param [Hash] options The [supported options](http://ruby-doc.org/stdlib-1.9.3/libdoc/csv/rdoc/CSV.html#method-c-new) of Ruby's CSV library.
    # @overload to_csv(path_to_file)
    #  @param [String] path_to_file file name and path of where to save the file.
    # @return [void]
    # @example (see #sentences_unique_chars)
    def to_csv(path_to_file, options = {})

      CSV.open(path_to_file, "w", options) do |csv|
        @stored_sentences.each do |row|
          csv << row.values
        end
      end
    end


    # Helper functions
    # -----------------
    def remove_parens(word)
      # 1) Remove all ASCII parens and all data in between.
      # 2) Remove all Chinese (full-width) parens and all data in between.
      word.gsub(/\(.*?\)/, '').gsub(/(.*?)/, '')
    end


    def is_boolean?(value)
      # Only true for either 'false' or 'true'
      !!value == value
    end


    # Remove all non-word characters
    def edit_vocab(word_array)

      word_array.map {|word|
        edited = remove_parens(word)
        edited = remove_slash(edited)
        edited = remove_er_character_from_end(edited)
        distinct_words(edited).join(' ')
      }.uniq
    end


    def remove_er_character_from_end(word)
      if word.size > 2
        word.gsub(/儿$/, '')
      else # Don't remove "儿" from words like 女儿
        word
      end
    end


    def remove_slash(word)
      if word.match(/\//)
        word.split(/\//).sort_by { |w| w.size }.last
      else
        word
      end
    end


    def make_hash(*data)
      require 'digest'
      data = data.reduce("") { |acc, item| acc << item.to_s }
      Digest::SHA2.hexdigest(data)[0..6]
    end


    # Input:  ["看", "书", "看书"]
    # Output: ["看书"]
    def remove_redundant_single_char_words(words)
      puts "Removing redundant single character words from the vocabulary..."

      single_char_words, multi_char_words = words.partition {|word| word.length == 1 }
      return single_char_words if multi_char_words.empty?

      non_redundant_single_char_words = single_char_words.reduce([]) do |acc, single_c|

        already_found = multi_char_words.find do |multi_c|
          multi_c.include?(single_c)
        end
        # Add single char word to array if it is not part of any of the multi char words.
        acc << single_c unless already_found
        acc
      end

      non_redundant_single_char_words + multi_char_words
    end


    # Uses options passed from #sentences
    def select_sentence(word, options)
      sentence_pair = Scraper.sentence(word, options)

      sources = Scraper::Sources.keys
      sentence_pair = try_alternate_download_sources(sources, word, options) if sentence_pair.empty?

      if sentence_pair.empty?
        @not_found << word
        return nil
      else
        chinese, english = sentence_pair

        result = Hash.new
        result.merge!(word: word)
        result.merge!(chinese: chinese)
        result.merge!(pinyin: chinese.to_pinyin) if @with_pinyin
        result.merge!(english: english)
      end
    end


    def try_alternate_download_sources(alternate_sources, word, options)
      sources = alternate_sources.dup
      sources.delete(options[:source])

      result = sources.find do |s|
        options  = options.merge(:source => s)
        sentence = Scraper.sentence(word, options)
        sentence.empty? ? nil : sentence
      end

      if result
        options = options.merge(:source => result)
        Scraper.sentence(word, options)
      else
        []
      end
    end


    def convert(text)
      eval(text.chomp)
    end


    def add_target_words(hash_array)
      from_queue = Queue.new
      to_queue   = Queue.new
      # semaphore = Mutex.new
      result = []
      words  = @words
      hash_array.each {|hash| from_queue << hash}

      10.times.map {
        Thread.new(words) do

          while(row = from_queue.pop!)
            sentence     = row[:chinese]
            target_words = target_words_per_sentence(sentence, words)

            to_queue << row.merge(:target_words => target_words)

          end
        end
      }.map {|thread| thread.join}

      to_queue.to_a

    end


    def target_words_per_sentence(sentence, words)
      words.select {|w| include_every_char?(w, sentence) }
    end


    def sort_by_target_word_count(with_target_words)

      # First sort by the size of the unique word array (from large to small).
      # If the unique word count is equal, sort by the length of the sentence (from small to large).
      with_target_words.sort_by {|row|
        [-row[:target_words].size, row[:chinese].size] }

      # The above is the same as:
      # with_target_words.sort {|a,b|
      #   first = -(a[:target_words].size <=> b[:target_words].size)
      #   first.nonzero? || (a[:chinese].size <=> b[:chinese].size) }
    end


    def select_minimum_necessary_sentences(sentences)
      with_target_words = add_target_words(sentences)
      rows = sort_by_target_word_count(with_target_words)

      selected_rows   = []
      unmatched_words = @words.dup
      matched_words   = []

      rows.each do |row|
        words = row[:target_words].dup
        # Delete all words from 'words' that have already been encountered
        # (and are included in 'matched_words').
        words = words - matched_words

        if words.size > 0 # Words that were not deleted above have to be part of 'unmatched_words'.
          selected_rows << row # Select this row.

          # When a row is selected, its 'words' are no longer unmatched but matched.
          unmatched_words = unmatched_words - words
          matched_words   = matched_words + words
        end
      end
      selected_rows
    end


    def remove_keys(hash_array, *keys)
      hash_array.map { |row| row.delete_keys(*keys) }
    end


    def add_key(hash_array, key, &block)
      hash_array.map do |row|
        if block
          row.merge({key => block.call(row)})
        else
          row
        end
      end
    end


    def uwc_tag(string)
      size = string.length
      case size
      when 1
        "1_word"
      else
        "#{size}_words"
      end
    end


    def contains_all_target_words?(selected_rows, sentence_key)

      matched_words = @words.reduce([]) do |acc, word|

        result = selected_rows.find do |row|
          sentence = row[sentence_key]
          include_every_char?(word, sentence)
        end

        if result
          acc << word
        end

        acc
      end

      matched_words.size == @words.size
    end


    # Input:
    # column: word column number (counting from 1)
    # row   : Array of the processed CSV data that contains our word column.
    def self.within_range?(column, row)
      no_of_cols = row.size
      column >= 1 && column <= no_of_cols
    end


    def alternate_source(sources, selection)
      sources = sources.dup
      sources.delete(selection)
      sources.pop
    end

  end
end
````
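`#select_minimum_necessary_sentences` is a greedy set-cover heuristic: rows are sorted by how many vocabulary words they contain (ties broken by shorter sentence), and a row is kept only if it covers at least one word no earlier row covered. A self-contained sketch of that strategy with toy data (the rows below are made up; real rows come from `#add_target_words`):

```` ruby
rows = [
  { :chinese => "我看书",   :target_words => ["我", "看书"] },
  { :chinese => "我爱看书", :target_words => ["我", "爱", "看书"] },
  { :chinese => "他爱我",   :target_words => ["他", "爱", "我"] },
]

# Most target words first; among equals, the shorter sentence first.
sorted = rows.sort_by { |r| [-r[:target_words].size, r[:chinese].size] }

matched  = []
selected = sorted.select do |row|
  new_words = row[:target_words] - matched
  matched  |= new_words
  !new_words.empty?
end

selected.map { |r| r[:chinese] }
# => ["他爱我", "我爱看书"] -- two sentences cover all four distinct words
````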
metadata
ADDED
@@ -0,0 +1,120 @@
```` yaml
--- !ruby/object:Gem::Specification
name: chinese_vocab
version: !ruby/object:Gem::Version
  version: 0.8.0
  prerelease:
platform: ruby
authors:
- Stefan Rohlfing
autorequire:
bindir: bin
cert_chain: []
date: 2012-04-13 00:00:00.000000000Z
dependencies:
- !ruby/object:Gem::Dependency
  name: with_validations
  requirement: &13638320 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ! '>='
      - !ruby/object:Gem::Version
        version: '0'
  type: :runtime
  prerelease: false
  version_requirements: *13638320
- !ruby/object:Gem::Dependency
  name: nokogiri
  requirement: &13637840 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ! '>='
      - !ruby/object:Gem::Version
        version: '0'
  type: :runtime
  prerelease: false
  version_requirements: *13637840
- !ruby/object:Gem::Dependency
  name: string_to_pinyin
  requirement: &13637400 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ! '>='
      - !ruby/object:Gem::Version
        version: '0'
  type: :runtime
  prerelease: false
  version_requirements: *13637400
- !ruby/object:Gem::Dependency
  name: rspec
  requirement: &13636980 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ! '>='
      - !ruby/object:Gem::Version
        version: '0'
  type: :development
  prerelease: false
  version_requirements: *13636980
description: ! '=== Chinese::Vocab

  This gem is meant to make life easier for any Chinese language student who:

  * Prefers to learn vocabulary from Chinese sentences.

  * Needs to memorize a lot of words on a _tight_ _time_ _schedule_.

  * Uses the spaced repetition flashcard program {Anki}[http://ankisrs.net/].


  Chinese::Vocab addresses all of the above requirements by downloading sentences
  for each word and

  selecting the *minimum* *required* *number* *of* *Chinese* *sentences* (and English
  translations)

  to *represent* *all* *words*.

  '
email: stefan.rohlfing@gmail.com
executables: []
extensions: []
extra_rdoc_files: []
files:
- lib/chinese.rb
- lib/chinese/version.rb
- lib/chinese/core_ext/hash.rb
- lib/chinese/core_ext/array.rb
- lib/chinese/core_ext/queue.rb
- lib/chinese/modules/helper_methods.rb
- lib/chinese/vocab.rb
- lib/chinese/scraper.rb
- README.md
- Rakefile
- LICENSE
- Gemfile
homepage: http://github.com/bytesource/chinese_vocab
licenses: []
post_install_message:
rdoc_options: []
require_paths:
- lib
required_ruby_version: !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ! '>='
    - !ruby/object:Gem::Version
      version: 1.9.1
required_rubygems_version: !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ! '>='
    - !ruby/object:Gem::Version
      version: '0'
requirements: []
rubyforge_project: chinese_vocab
rubygems_version: 1.8.15
signing_key:
specification_version: 3
summary: Chinese::Vocab - Downloading and selecting the minimum required number of
  sentences for your Chinese vocabulary list
test_files: []
````