chinese_vocab 0.8.6 → 0.9.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/ChangeLog.md +13 -0
- data/README.md +66 -29
- data/lib/chinese_vocab/modules/helper_methods.rb +1 -1
- data/lib/chinese_vocab/scraper.rb +22 -6
- data/lib/chinese_vocab/version.rb +1 -1
- data/lib/chinese_vocab/vocab.rb +96 -12
- metadata +2 -2
data/ChangeLog.md
CHANGED
`````diff
@@ -1,3 +1,16 @@
+## Version 0.9.0 (April 20, 2012)
+
+#### Others
+* `Vocab`:
+  * Added `#word_frequency`.
+  * Added `#find_minimum_sentences`: new and faster algorithm to calculate the minimum number of required sentences.
+* `Scraper`:
+  * Removed timeout restriction.
+
+### Bug Fixes
+* `Scraper`: Don't impose a minimum sentence length if this constraint would exclude all sentences.
+
+
 ## Version 0.8.6 (April 13, 2012)
 
 ### Other
`````
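For readers who want to try the two `Vocab` additions right away, here is a minimal usage sketch. It follows the initialization pattern from the README below; the three-word list and the printed counts are made up for illustration, and the call hits the online dictionaries, so it needs a network connection.

```` ruby
require 'chinese_vocab'

# Chinese::Vocab.new takes a word list as an array of strings (see the README).
anki = Chinese::Vocab.new(["爱人", "安静", "经常"], :compact => true)

# min_sentences downloads sentences and internally uses the new, faster
# #find_minimum_sentences to pick the smallest covering set.
anki.min_sentences(:thread_count => 10)

# New in 0.9.0: count how often each word occurs in the selected sentences.
p anki.word_frequency
# => {"爱人" => 1, "安静" => 1, "经常" => 2}   (illustrative values)
````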
data/README.md
CHANGED
`````diff
@@ -10,19 +10,44 @@
 
 You can then export the sentences as well as additional tags provided by `Chinese::Vocab` to [Anki](http://ankisrs.net/).
 
+
 ## Features
 
 * Downloads sentences for each word in a Chinese vocabulary list and selects the __minimum required number of sentences__ to represent all words.
-* With the option key `:compact` set to `true` on initialization, all single character words that also appear in at least one multi character word are removed. The reason behind this option is to __remove redundancy in meaning__ and focus on learning distinct words.
+* With the option key `:compact` set to `true` on initialization, all single character words that also appear in at least one multi character word are removed. The reason behind this option is to __remove redundancy in meaning__ and focus on learning distinct words.
+  Example: (["看", "看书"] => [看书])
 * Adds additional __tags__ to every sentence that can be used in [Anki](http://ankisrs.net/):
-  * __Pinyin__: By default the pinyin representation is added to each sentence.
-
-  *
+  * __Pinyin__: By default the pinyin representation is added to each sentence.
+    Example: "除了这张大钞以外,我没有其他零票了。" => "chú le zhè zhāng dà chāo yĭ wài ,wŏ méi yŏu qí tā líng piào le 。"
+  * __Number of target words__: The number of words from the vocabulary that are covered by a sentence.
+    Example: "除了这张大钞以外,我没有其他零票了。" => "3_words"
+  * __List of target words__: A list of the words from the vocabulary that are covered by a sentence.
+    Example: "除了这张大钞以外,我没有其他零票了。" => "[我, 他, 除了 以外]"
 * Export data to csv for easy import from [Anki](http://ankisrs.net/).
 
 
+## Installation
+
+```` bash
+$ gem install chinese_vocab
+````
+
+## The Dictionaries
+`Chinese::Vocab` uses the following online dictionaries to download the Chinese sentences:
+
+* [Nciku](http://www.nciku.com/): This is a fantastic English-Chinese dictionary with tons of useful features and a great community.
+* [Jukuu](http://jukuu.com/): This one is special. It searches the Internet for example sentences and thus is able to return results even for more esoteric technical terms. Search results are returned extremely quickly.
+
+I *highly recommend* both sites for daily use, and suggest you bookmark them right away.
+
+### __Important Note of Caution__
+In order to save precious bandwidth for these great sites, please __only use this gem when you really need the Chinese sentences for your studies__!
+
+
 ## Real World Example (Using the Traditional HSK Word List)
 
+__Note__: The number of required sentences to cover all words could be reduced by about __39%__.
+
 ```` ruby
 # Import words from source.
 # First argument: path to file
@@ -32,43 +57,52 @@ words = Chinese::Vocab.parse_words('../old_hsk_level_8828_chars_1_word_edited.cs
 p words.take(6)
 # => ["啊", "啊", "矮", "爱", "爱人", "安静"]
 
+
 # Initialize an object.
 # First argument: word list as an array of strings.
 # Options:
 # :compact (defaults to false)
 anki = Chinese::Vocab.new(words, :compact => true)
 
+
 # Options:
 # :source (defaults to :nciku)
 # :size (defaults to :short)
 # :with_pinyin (defaults to true)
 anki.min_sentences(:thread_count => 10)
 # Sample output:
-# [{:
-# :pinyin=>"
-# :english=>"
-#
-#
-# :
+# [{:chinese=>"小红经常向别人夸示自己有多贤惠。",
+#   :pinyin=>"xiăo hóng jīng cháng xiàng bié rén kuā shì zì jĭ yŏu duō xián huì 。",
+#   :english=>"Xiaohong always boasts that she is genial and prudent.",
+#   :target_words=>["别人", "经常", "自己", "贤惠"]},
+#  {:chinese=>"一年一度的圣诞节购买礼物的热潮.",
+#   :pinyin=>"yī nián yī dù de shèng dàn jié gòu măi lĭ wù de rè cháo yī",
+#   :english=>"the annual Christmas gift-buying jag",
+#   :target_words=>["礼物", "购买", "圣诞节", "热潮", "一度"]}]
 
 # Save data to csv.
 # First parameter: path to file
 # Options:
 # Any supported option of Ruby's CSV library
 anki.to_csv('in_the_wild_test.csv')
-# Sample output
-
-
-#
-#
-#
-# "
-#
-
+# Sample output: 2 sentences (csv rows) of 4431 sentences total
+# (Note that we started out with 7248 sentences):
+
+# 小红经常向别人夸示自己有多贤惠。,
+# xiăo hóng jīng cháng xiàng bié rén kuā shì zì jĭ yŏu duō xián huì 。,
+# Xiaohong always boasts that she is genial and prudent.,
+# 4_words,"[别人, 经常, 自己, 贤惠]"
+#
+# 一年一度的圣诞节购买礼物的热潮.,
+# yī nián yī dù de shèng dàn jié gòu măi lĭ wù de rè cháo yī,
+# the annual Christmas gift-buying jag,
+# 5_words,"[一度, 圣诞节, 热潮, 礼物, 购买]"
+
+
+
 
 #### Additional methods
 
-```` ruby
 # List all words
 p anki.words.take(6)
 # => ["啊", "啊", "矮", "爱", "爱人", "安静"]
@@ -77,22 +111,25 @@ p anki.words.size
 # => 7251
 
 p anki.stored_sentences.take(2)
-# [{:
-# :pinyin=>"
-# :english=>"
-#
-#
-# :
-
-#
+# [{:chinese=>"小红经常向别人夸示自己有多贤惠。",
+#   :pinyin=>"xiăo hóng jīng cháng xiàng bié rén kuā shì zì jĭ yŏu duō xián huì 。",
+#   :english=>"Xiaohong always boasts that she is genial and prudent.",
+#   :target_words=>["别人", "经常", "自己", "贤惠"]},
+#  {:chinese=>"一年一度的圣诞节购买礼物的热潮.",
+#   :pinyin=>"yī nián yī dù de shèng dàn jié gòu măi lĭ wù de rè cháo yī",
+#   :english=>"the annual Christmas gift-buying jag",
+#   :target_words=>["礼物", "购买", "圣诞节", "热潮", "一度"]}]
+
+# Words not found in either online dictionary.
 p anki.not_found
 # ["来回来去", "来看来讲", "深美"]
 
 # Number of unique characters in the selected sentences
 p anki.sentences_unique_chars.size
-# =>
+# => 3232
 ````
 
+
 ## Documentation
 * [parse_words](http://rubydoc.info/github/bytesource/chinese_vocab/master/Chinese/Vocab.parse_words) - How to read in the Chinese words and correctly set the column number. Options:
 * The [supported options](http://ruby-doc.org/stdlib-1.9.3/libdoc/csv/rdoc/CSV.html#method-c-new) of Ruby's CSV library as well as the `:encoding` parameter. __Note__: `:encoding` is always set to `utf-8` and `:skip_blanks` to `true` internally.
`````
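The `:compact` option described in the README removes single-character words that are already covered by a multi-character word. As a rough, self-contained sketch of that rule (a reader-facing approximation, not the gem's internal code):

```` ruby
# Drop every single-character word that appears inside at least one
# multi-character word of the same list.
def compact_words(words)
  multi = words.select { |word| word.size > 1 }
  words.reject { |word| word.size == 1 && multi.any? { |m| m.include?(word) } }
end

p compact_words(["看", "看书"])
# => ["看书"]   (matches the README example)
````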
data/lib/chinese_vocab/modules/helper_methods.rb
CHANGED
`````diff
@@ -25,7 +25,7 @@ module Chinese
       word.scan(/\p{Word}+/) # Returns an array of characters that belong together.
     end
 
-    # Return true if every
+    # Return true if every distinct word as defined by {#distinct_words}
     # can be found in the given sentence.
     def include_every_char?(word, sentence)
       characters = distinct_words(word)
`````
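To make the documented semantics concrete, here is a stand-alone sketch of the two helpers. `distinct_words` is copied from the context shown above; the body of `include_every_char?` is an assumption, since the diff only shows its first line.

```` ruby
# Splits a vocabulary entry into the word groups that belong together.
def distinct_words(word)
  word.scan(/\p{Word}+/)
end

# Assumed body: true when every distinct word occurs in the sentence.
def include_every_char?(word, sentence)
  distinct_words(word).all? { |w| sentence.include?(w) }
end

p distinct_words("除了 以外")
# => ["除了", "以外"]
p include_every_char?("除了 以外", "除了这张大钞以外,我没有其他零票了。")
# => true
````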
data/lib/chinese_vocab/scraper.rb
CHANGED
`````diff
@@ -54,7 +54,8 @@ module Chinese
       # http://stackoverflow.com/questions/377768/string-concatenation-and-ruby/378258#378258
       url = source[:url] + CGI.escape(word)
       # http://ruby-doc.org/stdlib-1.9.2/libdoc/timeout/rdoc/Timeout.html#method-c-timeout
-      content = Timeout.timeout(
+      # content = Timeout.timeout(30) { open(url) }
+      content = open(url)
       main_node = Nokogiri::HTML(content).css(source[:parent_sel]) # Returns a single node.
       return [] if main_node.to_a.empty?
 
@@ -91,7 +92,18 @@ module Chinese
       # 北边 => 树林边的河流向北方
       sentence_pairs = sentence_pairs.select { |cn, _| include_every_char?(word, cn) }
 
-
+      # Only select Chinese sentences that are at least x times longer than the word (counting character length),
+      # as sometimes only the word itself is listed as a sentence (or a short expression that does not really
+      # count as a sentence).
+      # Exception: If the result is an empty array (= none of the sentences fulfill the length constraint),
+      # then just return the sentences selected so far.
+      sentence_pairs_selected_by_length_factor = sentence_pairs.select { |cn, _| sentence_times_longer_than_word?(cn, word, 2.2) }
+
+      unless sentence_pairs_selected_by_length_factor.empty?
+        sentence_pairs_selected_by_length_factor
+      else
+        sentence_pairs
+      end
     end
 
     def self.sentence(word, options={})
@@ -119,11 +131,15 @@ module Chinese
         pair[0].empty? || pair[1].empty?
       end
 
-
-
-
+
+    def self.sentence_times_longer_than_word?(sentence, word, factor)
+      sentence_chars = sentence.scan(/\p{Word}/)
+      word_chars = word.scan(/\p{Word}/)
+      sentence_chars.size >= (factor * word_chars.size)
+    end
+
     def self.shortest_size(sentence_pairs)
-      sentence_pairs.sort_by {|(cn,_)| cn.length }.
+      sentence_pairs.sort_by {|(cn,_)| cn.length }.first
     end
 
     def self.longest_size(sentence_pairs)
`````
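The new length filter counts `\p{Word}` characters on both sides, so punctuation is ignored. A stand-alone copy of the predicate from the diff shows how the factor of `2.2` plays out:

```` ruby
def sentence_times_longer_than_word?(sentence, word, factor)
  sentence_chars = sentence.scan(/\p{Word}/)
  word_chars = word.scan(/\p{Word}/)
  sentence_chars.size >= (factor * word_chars.size)
end

p sentence_times_longer_than_word?("他的嗜好是收集邮票。", "嗜好", 2.2)
# => true  (9 word characters >= 2.2 * 2; the 。 is not counted)
p sentence_times_longer_than_word?("嗜好", "嗜好", 2.2)
# => false (the "sentence" is just the word itself, which is what the filter rejects)
````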
data/lib/chinese_vocab/vocab.rb
CHANGED
`````diff
@@ -4,6 +4,7 @@ require 'open-uri'
 require 'nokogiri'
 require 'cgi'
 require 'csv'
+require 'set'
 require 'with_validations'
 require 'string_to_pinyin'
 require 'chinese_vocab/scraper'
@@ -20,7 +21,7 @@ module Chinese
     #
     # * Removing parentheses (with the content inside each parenthesis).
     # * Removing any slash (/) and only keeping the longest part.
-    # * Removing '儿'
+    # * Removing trailing '儿' from any word longer than two characters.
     # * Removing non-word characters such as points and commas.
     # * Removing duplicate words.
     # @return [Array<String>]
@@ -163,9 +164,11 @@ module Chinese
       @not_found = convert(not_found)
       size_a = words.size
       size_b = to_queue.size
-
-
-
+      puts "Size(@not_found) = #{@not_found.size}"
+      puts "Size(words) = #{size_a}"
+      puts "Size(to_queue) = #{size_b}"
+      puts "Size(words+queue) = #{size_a+size_b}"
+      puts "Size(sentences) = #{to_queue.size}"
 
       # Remove file
       File.unlink(file_name)
@@ -183,7 +186,7 @@ module Chinese
 
       begin
         local_result = select_sentence(word, options)
-        puts "Processing word: #{word}"
+        puts "Processing word: #{word} (#{from_queue.size} words left)"
       # rescue SocketError, Timeout::Error, Errno::ETIMEDOUT,
       #  Errno::ECONNREFUSED, Errno::ECONNRESET, EOFError => e
       rescue Exception => e
@@ -268,21 +271,69 @@ module Chinese
       thread_count = validate { :thread_count }
       sentences = sentences(options)
 
-
+      # Remove those words that don't have a sentence.
+      words = @words - @not_found
+      puts "Determining the target words for every sentence..."
+      sentences = add_target_words(sentences, words)
+
+      minimum_sentences = find_minimum_sentences(sentences, words)
+
       # :uwc = 'unique words count'
-      with_uwc_tag
+      with_uwc_tag = add_key(minimum_sentences, :uwc) {|row| uwc_tag(row[:target_words]) }
       # :uws = 'unique words string'
       with_uwc_uws_tags = add_key(with_uwc_tag, :uws) do |row|
         words = row[:target_words].sort.join(', ')
         "[" + words + "]"
       end
       # Remove those keys we don't need anymore.
-      result
+      result = remove_keys(with_uwc_uws_tags, :target_words, :word)
       @stored_sentences = result
       @stored_sentences
     end
 
 
+    def find_minimum_sentences(sentences, words)
+      min_sentences = []
+      # At the start the variable 'remaining_words' contains all
+      # target words - minus those with no sentence found.
+      remaining_words = Set.new(words.dup)
+
+
+      # On every round:
+      # Find the sentence with the most target words ('best sentence').
+      # Add that sentence to the result array.
+      # Delete all target words from the remaining words that are part of
+      # the best sentence.
+      while(!remaining_words.empty?) do
+        puts "Number of remaining_words: #{remaining_words.size}"
+        # puts "Take five: #{remaining_words.take(5)}"
+
+        # Sort so that the sentence with the largest number of target words comes first.
+        sentences = sentences.sort_by do |row|
+          # Set#intersection returns a new set containing the elements common to
+          # the set and the given enumerable, with no duplicates.
+          words_left = remaining_words.intersection(row[:target_words])
+
+          # Sort by the number of words left first (in descending order);
+          # if equal, sort by the length of the Chinese sentence (in ascending order).
+          [-words_left.size, row[:chinese].size]
+        end
+
+        best_sentence = sentences.first
+
+        # Add the sentence with the largest number of
+        # target words to the result array.
+        min_sentences << best_sentence
+        # Remove the target words that are part of the
+        # best sentence from the remaining words.
+        remaining_words = remaining_words - best_sentence[:target_words]
+      end
+
+      # puts "Number of minimum sentences: #{min_sentences.size}"
+      min_sentences
+    end
+
+
     # Finds the unique Chinese characters from either the data in {#stored_sentences} or an
     # array of Chinese sentences passed as an argument.
     # @overload sentences_unique_chars(sentences)
@@ -458,12 +509,12 @@ module Chinese
     end
 
 
-    def add_target_words(hash_array)
+    def add_target_words(hash_array, words)
       from_queue = Queue.new
       to_queue = Queue.new
       # semaphore = Mutex.new
       result = []
-      words = @words
+      # words = @words
       hash_array.each {|hash| from_queue << hash}
 
       10.times.map {
@@ -502,9 +553,27 @@ module Chinese
       # first.nonzero? || (a[:chinese].size <=> b[:chinese].size) }
     end
 
+    # Calculates the number of occurrences of every word of {#words} in {#stored_sentences}.
+    # @return [Hash] Keys are the words in {#words}, with the values indicating the number of
+    #   occurrences in {#stored_sentences}.
+    def word_frequency
 
+      words.reduce({}) do |acc, word|
+        acc[word] = 0 # Set key with a default value of zero.
+
+        stored_sentences.each do |row|
+          sentence = row[:chinese]
+          acc[word] += 1 if include_every_char?(word, sentence)
+        end
+        acc
+      end
+    end
+
+
+    # @deprecated This method has been replaced by {#find_minimum_sentences}.
     def select_minimum_necessary_sentences(sentences)
-
+      words = @words - @not_found
+      with_target_words = add_target_words(sentences, words)
       rows = sort_by_target_word_count(with_target_words)
 
       selected_rows = []
@@ -529,6 +598,13 @@ module Chinese
     end
 
 
+    def occurrence_count(word_array, frequency)
+      word_array.reduce(0) do |acc, word|
+        acc + frequency[word]
+      end
+    end
+
+
     def remove_keys(hash_array, *keys)
       hash_array.map { |row| row.delete_keys(*keys) }
     end
@@ -572,7 +648,15 @@ module Chinese
         acc
       end
 
-      matched_words.size == @words.size
+      # matched_words.size == @words.size
+
+      if matched_words.size == @words.size
+        true
+      else
+        puts "Words not found in sentences:"
+        p @words - matched_words
+        false
+      end
     end
 
 
`````
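The new `#find_minimum_sentences` is a greedy set-cover heuristic: on each round it takes the sentence covering the most still-uncovered words. Greedy selection does not guarantee the true minimum (minimum set cover is NP-hard), but it is fast and works well on this kind of data. Here is a self-contained toy version of the same loop, on fabricated data with just the two hash keys the loop reads; the function name and sample rows are made up for the example.

```` ruby
require 'set'

# Greedy cover: pick the sentence covering the most remaining words;
# break ties by preferring the shorter Chinese sentence.
# Note: words without any covering sentence must be removed beforehand
# (the gem does this via @words - @not_found), or the loop never ends.
def greedy_minimum_sentences(sentences, words)
  min_sentences   = []
  remaining_words = Set.new(words)

  until remaining_words.empty?
    best = sentences.min_by do |row|
      covered = remaining_words.intersection(row[:target_words])
      [-covered.size, row[:chinese].size]
    end
    min_sentences << best
    remaining_words -= best[:target_words]
  end
  min_sentences
end

sentences = [
  { :chinese => "我看书", :target_words => ["我", "看书"] },
  { :chinese => "他看书", :target_words => ["他", "看书"] },
  { :chinese => "我和他", :target_words => ["我", "他"] },
]
p greedy_minimum_sentences(sentences, ["我", "他", "看书"]).map { |row| row[:chinese] }
# => ["我看书", "他看书"]   (two sentences cover all three words)
````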
metadata
CHANGED
`````diff
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: chinese_vocab
 version: !ruby/object:Gem::Version
-  version: 0.8.6
+  version: 0.9.0
 prerelease:
 platform: ruby
 authors:
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-04-13 00:00:00.000000000 Z
+date: 2012-04-20 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: with_validations
`````