chinese_vocab 0.8.6 → 0.9.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/ChangeLog.md CHANGED
@@ -1,3 +1,16 @@
+ ## Version 0.9.0 (April 20, 2012)
+
+ ### Other
+ * `Vocab`:
+   * Added `#word_frequency`.
+   * Added `#find_minimum_sentences`: a new and faster algorithm to calculate the minimum number of required sentences.
+ * `Scraper`:
+   * Removed the timeout restriction.
+
+ ### Bug Fixes
+ * `Scraper`: Don't impose a minimum sentence length if this constraint would exclude all sentences.
+
+
  ## Version 0.8.6 (April 13, 2012)

  ### Other
data/README.md CHANGED
@@ -10,19 +10,44 @@
  You can then export the sentences as well as additional tags provided by `Chinese::Vocab` to [Anki](http://ankisrs.net/).

+
  ## Features

  * Downloads sentences for each word in a Chinese vocabulary list and selects the __minimum required number of sentences__ to represent all words.
- * With the option key `:compact` set to `true` on initialization, all single character words that also appear in at least one multi character word are removed. The reason behind this option is to __remove redundancy in meaning__ and focus on learning distinct words. Example: (["看", "看书"] => [看书])
+ * With the option key `:compact` set to `true` on initialization, all single-character words that also appear in at least one multi-character word are removed. The reason behind this option is to __remove redundancy in meaning__ and focus on learning distinct words (see the sketch after this list).
+ Example: (["看", "看书"] => [看书])
  * Adds additional __tags__ to every sentence that can be used in [Anki](http://ankisrs.net/):
- * __Pinyin__: By default the pinyin representation is added to each sentence. Example: "除了这张大钞以外,我没有其他零票了。" => "chú le zhè zhāng dà chāo yĭ wài ,wŏ méi yŏu qí tā líng piào le 。"
- * __Number of target words__: The number of words from the vocabulary that are covered by a sentence. Example: "除了这张大钞以外,我没有其他零票了。" => "3_words"
- * __List of target words__: A list of the words from the vocabulary that are covered by a sentence. Example: "除了这张大钞以外,我没有其他零票了。" => "[我, 他, 除了 以外]"
+ * __Pinyin__: By default the pinyin representation is added to each sentence.
+ Example: "除了这张大钞以外,我没有其他零票了。" => "chú le zhè zhāng dà chāo yĭ wài ,wŏ méi yŏu qí tā líng piào le 。"
+ * __Number of target words__: The number of words from the vocabulary that are covered by a sentence.
+ Example: "除了这张大钞以外,我没有其他零票了。" => "3_words"
+ * __List of target words__: A list of the words from the vocabulary that are covered by a sentence.
+ Example: "除了这张大钞以外,我没有其他零票了。" => "[我, 他, 除了 以外]"
  * Export data to csv for easy import from [Anki](http://ankisrs.net/).
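
A minimal sketch of the `:compact` filtering rule described in the list above (illustrative only, not the gem's internal implementation):

```` ruby
words = ["看", "看书", "书", "爱人"]

# Reject every single-character word that is contained
# in at least one multi-character word of the list.
compact = words.reject do |word|
  word.size == 1 && words.any? { |other| other.size > 1 && other.include?(word) }
end

p compact
# => ["看书", "爱人"]
````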


+ ## Installation
+
+ ```` bash
+ $ gem install chinese_vocab
+ ````
+
+ ## The Dictionaries
+ `Chinese::Vocab` uses the following online dictionaries to download the Chinese sentences:
+
+ * [Nciku](http://www.nciku.com/): This is a fantastic English-Chinese dictionary with tons of useful features and a great community.
+ * [Jukuu](http://jukuu.com/): This one is special. It searches the Internet for example sentences and can therefore return results even for esoteric technical terms. Search results are returned extremely quickly.
+
+ I *highly recommend* both sites for daily use, and suggest you bookmark them right away.
+
+ ### Important Note of Caution
+ In order to save precious bandwidth for these great sites, please __only use this gem when you really need the Chinese sentences for your study__!
+
+
  ## Real World Example (Using the Traditional HSK Word List)

+ __Note__: The new algorithm reduced the number of sentences required to cover all words by about __39%__ (from 7248 to 4431).
+
  ```` ruby
  # Import words from source.
  # First argument: path to file
@@ -32,43 +57,52 @@ words = Chinese::Vocab.parse_words('../old_hsk_level_8828_chars_1_word_edited.cs
  p words.take(6)
  # => ["啊", "啊", "矮", "爱", "爱人", "安静"]

+
  # Initialize an object.
  # First argument: word list as an array of strings.
  # Options:
  # :compact (defaults to false)
  anki = Chinese::Vocab.new(words, :compact => true)

+
  # Options:
  # :source (defaults to :nciku)
  # :size (defaults to :short)
  # :with_pinyin (defaults to true)
  anki.min_sentences(:thread_count => 10)
  # Sample output:
- # [{:word=>"吧", :chinese=>"放心吧,他做事向来把牢。",
- # :pinyin=>"fàng xīn ba ,tā zuò shì xiàng lái bă láo 。",
- # :english=>"Take it easy. You can always count on him."},
- # {:word=>"", :chinese=>"喝酒挂红的人一般都很能喝。",
- # :pinyin=>"hē jiŭ guà hóng de rén yī bān dōu hĕn néng hē 。",
- # :english=>"Those whose face turn red after drinking are normally heavy drinkers."}]
+ # [{:chinese=>"小红经常向别人夸示自己有多贤惠。",
+ # :pinyin=>"xiăo hóng jīng cháng xiàng bié rén kuā shì zì jĭ yŏu duō xián huì 。",
+ # :english=>"Xiaohong always boasts that she is genial and prudent.",
+ # :target_words=>["别人", "经常", "自己", "贤惠"]},
+ # {:chinese=>"一年一度的圣诞节购买礼物的热潮.",
+ # :pinyin=>"yī nián yī dù de shèng dàn jié gòu măi lĭ wù de rè cháo yī",
+ # :english=>"the annual Christmas gift-buying jag",
+ # :target_words=>["礼物", "购买", "圣诞节", "热潮", "一度"]}]

  # Save data to csv.
  # First parameter: path to file
  # Options:
  # Any supported option of Ruby's CSV library
  anki.to_csv('in_the_wild_test.csv')
- # Sample output (2 sentences/lines out of 4511):
-
- # 只要我们有信心,就会战胜困难。,zhī yào wŏ men yŏu xìn xīn ,jiù huì zhàn shèng kùn nán 。,
- # "As long as we have confidence, we can overcome difficulties.",
- # 5_words,"[信心, 只要, 困难, 我们, 战胜]"
- # 至于他什么时候回来,我不知道。,zhì yú tā shén mo shí hòu huí lái ,wŏ bù zhī dào 。,
- # "As to what time he's due back, I'm just not sure.",
- # 5_words,"[什么, 回来, 时候, 知道, 至于]"
- ````
+ # Sample output: 2 sentences (CSV rows) out of 4431 sentences total
+ # (Note that we started out with 7248 sentences):
+
+ # 小红经常向别人夸示自己有多贤惠。,
+ # xiăo hóng jīng cháng xiàng bié rén kuā shì zì jĭ yŏu duō xián huì 。,
+ # Xiaohong always boasts that she is genial and prudent.,
+ # 4_words,"[别人, 经常, 自己, 贤惠]"
+ #
+ # 一年一度的圣诞节购买礼物的热潮.,
+ # yī nián yī dù de shèng dàn jié gòu măi lĭ wù de rè cháo yī,
+ # the annual Christmas gift-buying jag,
+ # 5_words,"[一度, 圣诞节, 热潮, 礼物, 购买]"
+
+
+

  #### Additional methods

- ```` ruby
  # List all words
  p anki.words.take(6)
  # => ["啊", "啊", "矮", "爱", "爱人", "安静"]
@@ -77,22 +111,25 @@ p anki.words.size
  # => 7251

  p anki.stored_sentences.take(2)
- # [{:word=>"吧", :chinese=>"放心吧,他做事向来把牢。",
- # :pinyin=>"fàng xīn ba ,tā zuò shì xiàng lái bă láo 。",
- # :english=>"Take it easy. You can always count on him."},
- # {:word=>"", :chinese=>"喝酒挂红的人一般都很能喝。",
- # :pinyin=>"hē jiŭ guà hóng de rén yī bān dōu hĕn néng hē 。",
- # :english=>"Those whose face turn red after drinking are normally heavy drinkers."}]
-
- # words not found
+ # [{:chinese=>"小红经常向别人夸示自己有多贤惠。",
+ # :pinyin=>"xiăo hóng jīng cháng xiàng bié rén kuā shì zì jĭ yŏu duō xián huì 。",
+ # :english=>"Xiaohong always boasts that she is genial and prudent.",
+ # :target_words=>["别人", "经常", "自己", "贤惠"]},
+ # {:chinese=>"一年一度的圣诞节购买礼物的热潮.",
+ # :pinyin=>"yī nián yī dù de shèng dàn jié gòu măi lĭ wù de rè cháo yī",
+ # :english=>"the annual Christmas gift-buying jag",
+ # :target_words=>["礼物", "购买", "圣诞节", "热潮", "一度"]}]
+
+ # Words not found in either online dictionary.
  p anki.not_found
  # ["来回来去", "来看来讲", "深美"]

  # Number of unique characters in the selected sentences
  p anki.sentences_unique_chars.size
- # => 3290
+ # => 3232
  ````

+
  ## Documentation
  * [parse_words](http://rubydoc.info/github/bytesource/chinese_vocab/master/Chinese/Vocab.parse_words) - How to read in the Chinese words and correctly set the column number. Options:
  * The [supported options](http://ruby-doc.org/stdlib-1.9.3/libdoc/csv/rdoc/CSV.html#method-c-new) of Ruby's CSV library as well as the `:encoding` parameter. __Note__: `:encoding` is always set to `utf-8` and `:skip_blanks` to `true` internally.
@@ -25,7 +25,7 @@ module Chinese
  word.scan(/\p{Word}+/) # Returns an array of characters that belong together.
  end

- # Return true if every distince word (as defined by #distinct_words)
+ # Return true if every distinct word as defined by {#distinct_words}
  # can be found in the given sentence.
  def include_every_char?(word, sentence)
  characters = distinct_words(word)
@@ -54,7 +54,8 @@ module Chinese
  # http://stackoverflow.com/questions/377768/string-concatenation-and-ruby/378258#378258
  url = source[:url] + CGI.escape(word)
  # http://ruby-doc.org/stdlib-1.9.2/libdoc/timeout/rdoc/Timeout.html#method-c-timeout
- content = Timeout.timeout(20) { open(url) }
+ # content = Timeout.timeout(30) { open(url) }
+ content = open(url)
  main_node = Nokogiri::HTML(content).css(source[:parent_sel]) # Returns a single node.
  return [] if main_node.to_a.empty?
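For context on the change above: `Timeout.timeout(n) { ... }` raises `Timeout::Error` once the block exceeds `n` seconds, so dropping the wrapper (the changelog's "Removed the timeout restriction") means a slow dictionary response is now simply waited out. A minimal illustration of the old behavior:

```` ruby
require 'timeout'

begin
  Timeout.timeout(2) { sleep 3 } # stands in for a slow open(url)
rescue Timeout::Error
  puts "gave up after 2 seconds"
end
````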

@@ -91,7 +92,18 @@ module Chinese
  # 北边 => 树林边的河流向北方
  sentence_pairs = sentence_pairs.select { |cn, _| include_every_char?(word, cn) }

- sentence_pairs
+ # Only select Chinese sentences that are at least x times longer than the word (counting character length),
+ # as sometimes only the word itself is listed as a sentence (or a short expression that does not really
+ # count as a sentence).
+ # Exception: If the result is an empty array (= none of the sentences fulfills the length constraint),
+ # then just return the sentences selected so far.
+ sentence_pairs_selected_by_length_factor = sentence_pairs.select { |cn, _| sentence_times_longer_than_word?(cn, word, 2.2) }
+
+ unless sentence_pairs_selected_by_length_factor.empty?
+ sentence_pairs_selected_by_length_factor
+ else
+ sentence_pairs
+ end
  end

  def self.sentence(word, options={})
@@ -119,11 +131,15 @@ module Chinese
  pair[0].empty? || pair[1].empty?
  end

- # Despite its name returns the SECOND shortest sentence,
- # as the shortest result often is not a real sentence,
- # but a definition.
+
+ def self.sentence_times_longer_than_word?(sentence, word, factor)
+ sentence_chars = sentence.scan(/\p{Word}/)
+ word_chars = word.scan(/\p{Word}/)
+ sentence_chars.size >= (factor * word_chars.size)
+ end
+
  def self.shortest_size(sentence_pairs)
- sentence_pairs.sort_by {|(cn,_)| cn.length }.take(2).last
+ sentence_pairs.sort_by {|(cn,_)| cn.length }.first
  end

  def self.longest_size(sentence_pairs)
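Note the behavioral change in `shortest_size`: the old version returned the second-shortest sentence (its removed comment explained that the shortest result was often a definition rather than a real sentence), while the new version can return the true shortest because `sentence_times_longer_than_word?` now filters out such non-sentences earlier. A quick illustration with invented pairs:

```` ruby
pairs = [["看书", "to read"], ["我爱看书。", "I love reading."], ["他看书。", "He reads."]]
sorted = pairs.sort_by { |(cn, _)| cn.length }

p sorted.take(2).last # old behavior => ["他看书。", "He reads."]
p sorted.first        # new behavior => ["看书", "to read"]
````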
@@ -1,3 +1,3 @@
  module Chinese
- VERSION = "0.8.6"
+ VERSION = "0.9.0"
  end
@@ -4,6 +4,7 @@ require 'open-uri'
  require 'nokogiri'
  require 'cgi'
  require 'csv'
+ require 'set'
  require 'with_validations'
  require 'string_to_pinyin'
  require 'chinese_vocab/scraper'
@@ -20,7 +21,7 @@ module Chinese
  #
  # * Removing parentheses (with the content inside each parenthesis).
  # * Removing any slash (/) and only keeping the longest part.
- # * Removing '儿' for any word longer than two characters.
+ # * Removing trailing '儿' from any word longer than two characters.
  # * Removing non-word characters such as points and commas.
  # * Removing duplicate words.
  # @return [Array<String>]
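As an illustration of the reworded rule, a rough sketch of stripping a trailing '儿' from words longer than two characters (not the gem's exact implementation):

```` ruby
words = ["女儿", "一会儿", "花儿"]
cleaned = words.map { |w| w.size > 2 ? w.sub(/儿\z/, "") : w }

p cleaned
# => ["女儿", "一会", "花儿"]
````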
@@ -163,9 +164,11 @@ module Chinese
  @not_found = convert(not_found)
  size_a = words.size
  size_b = to_queue.size
- # puts "Size(words) = #{size_a}"
- # puts "Size(to_queue) = #{size_b}"
- # puts "Size(words+queue) = #{size_a+size_b}"
+ puts "Size(@not_found) = #{@not_found.size}"
+ puts "Size(words) = #{size_a}"
+ puts "Size(to_queue) = #{size_b}"
+ puts "Size(words+queue) = #{size_a+size_b}"
+ puts "Size(sentences) = #{to_queue.size}"

  # Remove file
  File.unlink(file_name)
@@ -183,7 +186,7 @@ module Chinese

  begin
  local_result = select_sentence(word, options)
- puts "Processing word: #{word}"
+ puts "Processing word: #{word} (#{from_queue.size} words left)"
  # rescue SocketError, Timeout::Error, Errno::ETIMEDOUT,
  # Errno::ECONNREFUSED, Errno::ECONNRESET, EOFError => e
  rescue Exception => e
@@ -268,21 +271,69 @@ module Chinese
  thread_count = validate { :thread_count }
  sentences = sentences(options)

- minimum_sentences = select_minimum_necessary_sentences(sentences)
+ # Remove those words that don't have a sentence.
+ words = @words - @not_found
+ puts "Determining the target words for every sentence..."
+ sentences = add_target_words(sentences, words)
+
+ minimum_sentences = find_minimum_sentences(sentences, words)
+
  # :uwc = 'unique words count'
- with_uwc_tag = add_key(minimum_sentences, :uwc) {|row| uwc_tag(row[:target_words]) }
+ with_uwc_tag = add_key(minimum_sentences, :uwc) {|row| uwc_tag(row[:target_words]) }
  # :uws = 'unique words string'
  with_uwc_uws_tags = add_key(with_uwc_tag, :uws) do |row|
  words = row[:target_words].sort.join(', ')
  "[" + words + "]"
  end
  # Remove those keys we don't need anymore.
- result = remove_keys(with_uwc_uws_tags, :target_words, :word)
+ result = remove_keys(with_uwc_uws_tags, :target_words, :word)
  @stored_sentences = result
  @stored_sentences
  end


+ def find_minimum_sentences(sentences, words)
+ min_sentences = []
+ # At the start, 'remaining_words' contains all
+ # target words, minus those for which no sentence was found.
+ remaining_words = Set.new(words.dup)
+
+
+ # On every round:
+ # Find the sentence with the most target words (the 'best sentence').
+ # Add that sentence to the result array.
+ # Delete all target words that are part of the best sentence
+ # from the remaining words.
+ while(!remaining_words.empty?) do
+ puts "Number of remaining_words: #{remaining_words.size}"
+ # puts "Take five: #{remaining_words.take(5)}"
+
+ # Put the sentence with the largest number of target words first.
+ sentences = sentences.sort_by do |row|
+ # Set#intersection returns a new set containing the elements
+ # common to the set and the given array, with no duplicates.
+ words_left = remaining_words.intersection(row[:target_words])
+
+ # Sort by the number of words left first (in descending order);
+ # if equal, sort by the length of the Chinese sentence (in ascending order).
+ [-words_left.size, row[:chinese].size]
+ end
+
+ best_sentence = sentences.first
+
+ # Add the sentence with the largest number of
+ # target words to the result array.
+ min_sentences << best_sentence
+ # Remove the target words that are part of the
+ # best sentence from the remaining words.
+ remaining_words = remaining_words - best_sentence[:target_words]
+ end
+
+ # puts "Number of minimum sentences: #{min_sentences.size}"
+ min_sentences
+ end
+
+
  # Finds the unique Chinese characters from either the data in {#stored_sentences} or an
  # array of Chinese sentences passed as an argument.
  # @overload sentences_unique_chars(sentences)
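The new `find_minimum_sentences` is a greedy approximation of the classic set-cover problem: on each round it picks the sentence covering the most still-uncovered words, with sentence length as a tie-breaker. A self-contained sketch of the same loop on toy data (the sentences and word list here are invented; `min_by` replaces the gem's sort-then-take-first):

```` ruby
require 'set'

# Toy rows in the same shape as the gem's sentence hashes.
sentences = [
  { :chinese => "我看书。",     :target_words => ["我", "看书"] },
  { :chinese => "他经常看书。", :target_words => ["他", "经常", "看书"] },
  { :chinese => "我去。",       :target_words => ["我"] }
]
words = ["我", "他", "经常", "看书"]

min_sentences = []
remaining = Set.new(words)

until remaining.empty?
  # Pick the sentence covering the most remaining words;
  # break ties with the shorter Chinese sentence.
  best = sentences.min_by do |row|
    covered = remaining.intersection(row[:target_words])
    [-covered.size, row[:chinese].size]
  end
  min_sentences << best
  remaining -= best[:target_words]
end

p min_sentences.map { |row| row[:chinese] }
# => ["他经常看书。", "我去。"]
````

As in the gem, words for which no sentence exists must be removed beforehand (`@words - @not_found`), otherwise the loop would never terminate.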
@@ -458,12 +509,12 @@
  end


- def add_target_words(hash_array)
+ def add_target_words(hash_array, words)
  from_queue = Queue.new
  to_queue = Queue.new
  # semaphore = Mutex.new
  result = []
- words = @words
+ # words = @words
  hash_array.each {|hash| from_queue << hash}

  10.times.map {
@@ -502,9 +553,27 @@
  # first.nonzero? || (a[:chinese].size <=> b[:chinese].size) }
  end

+ # Calculates for every word of {#words} the number of sentences in {#stored_sentences}
+ # that contain that word.
+ # @return [Hash] Keys are the words in {#words}, with each value indicating the number of
+ # sentences in {#stored_sentences} that contain the word.
+ def word_frequency

+ words.reduce({}) do |acc, word|
+ acc[word] = 0 # Set key with a default value of zero.
+
+ stored_sentences.each do |row|
+ sentence = row[:chinese]
+ acc[word] += 1 if include_every_char?(word, sentence)
+ end
+ acc
+ end
+ end
+
+
+ # @deprecated This method has been replaced by {#find_minimum_sentences}.
  def select_minimum_necessary_sentences(sentences)
- with_target_words = add_target_words(sentences)
+ words = @words - @not_found
+ with_target_words = add_target_words(sentences, words)
  rows = sort_by_target_word_count(with_target_words)

  selected_rows = []
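A quick usage sketch for the new `#word_frequency` (the word list and the counts shown are invented):

```` ruby
anki = Chinese::Vocab.new(["我", "看书", "贤惠"])
anki.min_sentences(:thread_count => 10)

freq = anki.word_frequency
# Hypothetical result:
# => {"我"=>5, "看书"=>2, "贤惠"=>1}

# Words covered by only a single selected sentence:
p freq.select { |_, count| count == 1 }.keys
# => ["贤惠"]
````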
@@ -529,6 +598,13 @@
  end


+ def occurrence_count(word_array, frequency)
+ word_array.reduce(0) do |acc, word|
+ acc + frequency[word]
+ end
+ end
+
+
  def remove_keys(hash_array, *keys)
  hash_array.map { |row| row.delete_keys(*keys) }
  end
@@ -572,7 +648,15 @@ module Chinese
  acc
  end

- matched_words.size == @words.size
+ # matched_words.size == @words.size
+
+ if matched_words.size == @words.size
+ true
+ else
+ puts "Words not found in sentences:"
+ p @words - matched_words
+ false
+ end
  end

metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: chinese_vocab
  version: !ruby/object:Gem::Version
- version: 0.8.6
+ version: 0.9.0
  prerelease:
  platform: ruby
  authors:
@@ -9,7 +9,7 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2012-04-13 00:00:00.000000000 Z
+ date: 2012-04-20 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: with_validations