camdict 1.0.3 → 2.0.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
-   metadata.gz: 9375049d96a36f304ae7262da7047f6c4d6e0402
-   data.tar.gz: a1256a02286a23bbd204bb0cd5de8a5670c5d12b
+   metadata.gz: 65f86517bb1f1674118ec92b379e0436fe9fdbec
+   data.tar.gz: 05b6e68ba21c6d9dbee96863ed2146ceea9d12e1
  SHA512:
-   metadata.gz: 857e3d9ee01bdf43511371ec3f1b6bf6df45185f4078e2e1e245243307322a85b9e3bda3b2a77523fa3884b14dd27c62541c1d27a02d718b447e750546bdbf02
-   data.tar.gz: b69da41647e14adefa47ee049df7b7967354f212479e6b3669d3c4e19e6a8385ae6222bd2df73becb745e3758af6e62df75c1652286f3cde459d5c796964f5d4
+   metadata.gz: f93f6da6b914e84efc540f1fc753b75034402f218048cbc825e478dc6310a609d4282ffb4edb31e34bd572a43093ce66c25c82b37436a9a1b66d0970724e4acc
+   data.tar.gz: bb5cc16f332a3c6c28b66f31c4e08b59e11d2aea7086c4ee87b65a58997116b9f8bd635f504c964c5e950b957b8ade3e985f7380f63de8124e5470e59d233a94
data/README.md CHANGED
@@ -1,8 +1,10 @@
  # A ruby gem - camdict
+ ![Build Status][travis-image][travis-link]
+ ![Code Climate][climate-image][climate-link]
 
- ## Introduction
+ ## Introduction
 
- The ruby gem camdict is a [Cambridge online dictionary][1] client.
+ The ruby gem camdict is a [Cambridge online dictionary][1] client.
  You could use this excellent dictionary with a browser, but now it is possible
  to use it with this ruby API in your code.
 
@@ -10,52 +12,45 @@ to use it with this ruby API in your code.
  `gem install camdict`
 
  ## Verification
- The gem can be tested by below commands in the directory where it's installed.
- `rake` - run all the testcases which don't need internet connection.
- `rake itest` - run all the testcases that need internet connection.
+ The gem can be tested by below commands in the directory where it's installed.
+ `rake` - run all the testcases which don't need internet connection.
+ `rake itest` - run all the testcases that need internet connection.
  `rake testall` - run all above tests.
 
- One test may fail if the gem nokogiri hasn't pulled in the fix [here][2]. But
- it is safe to apply the patch to your nokogiri copy.
-
  ## Usage
 
  ```ruby
  require 'camdict'
 
  # Look up a new word
- word = Camdict::Word.new "health"
-
- # get all definitions for this word from remote dictionary and select the
- # first one. A word usually has many definitions.
- health = word.definitions.first
+ health = Camdict::Word.new 'health'
 
  # Print the part of speech
  puts health.part_of_speech #=> noun
 
- # One definition may have more than one explanations.
- # Just look at the details of the first one.
- explanation1 = health.explanations.first
-
- # What's the meaning
- puts explanation1.meaning #=>
- # the condition of the body and the degree to which it is free from
- # illness, or the state of being well:
-
- # And it may have some useful example sentences.
- explanation1.examples.each { |e|
-   puts e.sentence #=>
-   # to be in good/poor health
-   # Regular exercise is good for your health.
-   # I had to give up drinking for health reasons.
-   # He gave up work because of ill health.
- }
+ # What's the first meaning
+ puts health.meaning #=>
+ # the condition of the body and the degree to which it is free from
+ # illness, or the state of being well:
+
+ # all meanings
+ puts health.meanings #=> in addition to above meaning, it prints
+ # the condition of something that changes or develops, such as an
+ # organization or system:
+
  ```
 
- There are some useful testing examples in test directory of this gem.
+ Need more? try `health.print` to show more data in a friendly format.
+
+ ## Versioning
+ The release of this gem follows the [semantic versioning rules][2].
 
  ## Licence MIT
- Copyright (c) 2014 Pan Gaoyong
+ Copyright (c) 2014-2017 Pan Gaoyong
 
  [1]: http://dictionary.cambridge.com "Cambridge"
- [2]: https://github.com/sparklemotion/nokogiri/pull/1020 "My Nokogiri Bug Fix"
+ [2]: http://semver.org
+ [travis-image]: https://travis-ci.org/pan/camdict.svg?branch=master
+ [travis-link]: https://travis-ci.org/pan/camdict
+ [climate-image]: https://codeclimate.com/github/pan/camdict/badges/gpa.svg
+ [climate-link]: https://codeclimate.com/github/pan/camdict
@@ -0,0 +1,37 @@
+ # frozen_string_literal: true
+ require 'camdict/string_ext'
+
+ module Camdict
+   # Extention: Refine Array class.
+   module ArrayExt
+     refine Array do
+       # Iterate an array and return two elements +a+ +b+ each time for handling.
+       def each_pair
+         len = length
+         i = 0
+         while i < len
+           a = at(i)
+           b = at(i + 1)
+           yield(a, b)
+           i += 2
+         end
+       end
+
+       # Test if a phrase array includes a +word+.
+       # ['blow your nose', 'blow a kiss to/at sb'].has?("a kiss at") #=> true
+       def has?(word)
+         expand.each { |phr| return true if phr.include? word }
+         false
+       end
+
+       using Camdict::StringExt
+
+       # Expand a phrase array into a flattened one. Example,
+       # ['blow your nose', 'blow a kiss to/at sb'] #=>
+       # ['blow your nose', 'blow a kiss to sb', 'blow a kiss at sb']
+       def expand
+         map { |p| p&.flatten || p }.flatten
+       end
+     end
+   end
+ end
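Review note: the new file replaces the old `Array.class_eval` monkey patch with a refinement, so the extra methods exist only where a file or module opts in with `using`. Below is a minimal, self-contained sketch of that pattern. `PairExt` and `Demo` are hypothetical names used for illustration; only the behaviour of `each_pair` is taken from the hunk above.

```ruby
# Hypothetical stand-in for Camdict::ArrayExt, reduced to each_pair.
module PairExt
  refine Array do
    # Yield consecutive elements two at a time, as in the diff above.
    def each_pair
      i = 0
      while i < length
        yield(at(i), at(i + 1))
        i += 2
      end
    end
  end
end

module Demo
  # The refinement is active from this line to the end of this module body.
  using PairExt

  PAIRS = []
  [1, 2, 3].each_pair { |a, b| PAIRS << [a, b] }
end

p Demo::PAIRS                 # the odd trailing element is paired with nil
p [].respond_to?(:each_pair)  # false: the refinement is not active out here
```

Callers of `each_pair` on an odd-length array need to expect a `nil` second element in the last pair.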
@@ -1,117 +1,171 @@
- require "camdict/http_client"
+ # frozen_string_literal: true
+ require 'camdict/http_client'
+ require 'camdict/string_ext'
+ require 'camdict/exception'
 
  module Camdict
-
-   # The client downloads all the useful data about a word or phrase from
-   # remote Cambridge dictionaries, but not includes the extended data.
+   # The client downloads all the useful data about a word or phrase from
+   # remote Cambridge dictionaries, but not includes the extended data.
    # For example,
-   # when the word "mind" is searched, all its four exactly matched entries are
-   # downloaded. However, separated entries like "turn of mind" & "open mind"
+   # when the word "mind" is searched, the exactly matched entry is downloaded.
+   # However, other related entries like "turn of mind" & "open mind"
    # are not included.
-   class Client
-
-     # Default dictionary is english-chinese-simplified.
-     # Other possible +dict+ values:
-     # british, american-english, business-english, learner-english.
-     def initialize(dict=nil)
-       @dictionary = dict || "english-chinese-simplified"
-     end
-
-
-     # Get a word's html definition(s) by searching it from the web dictionary.
-     # The returned result could be an empty array when nothing is found, or
-     # is an array with a hash element,
-     # [{ word => html definition }],
-     # or many hash elements when it has multiple entries,
-     # [{ entry_id => html definition }, ...].
-     # Normally, when a +word+ has more than one meanings, its entry ID format is
-     # like word_nn. Otherwise it's just the word itself.
+   class Client < HTTP::Client
+     attr_reader :dictionary
+     # Default dictionary is British english.
+     # Other possible +dict+ values:
+     # english-chinese-simplified, learner-english,
+     # essential-british-english, essential-american-english, etc.
+     def initialize(dict = nil)
+       @dictionary = dict || 'english'
+     end
+
+     # Get a word's html definition from the web dictionary.
+     # The returned result could be an empty string when nothing is found, or
+     # its html definition
      def html_definition(word)
        html = fetch(word)
-       return [] if html.nil?
-       html_defs = []
-       # some words return their only definition directly, such as aluminium.
-       if definition_page? html
-         # entry id is just the word when there is only one definition
-         html_defs << { word => di_head(html) + di_body(html) }
+       if html
+         di_extracted(html)
        else
-         # returned page could be a spelling check suggestion page in case it is
-         # not found, or the found page with all matched entries and related.
-         # when entry urls are not found, they are empty and spelling suggestion
-         # pages. So mentry_links() returns an empty array. Otherwise, it returns
-         # all the exactly matched entry links.
-         matched_urls = mentry_links(word, html)
-         unless matched_urls.empty?
-           matched_urls.each { |url|
-             html_defs << { entry_id(url) => get_htmldef(url) }
-           }
-         end
+         search(word)
        end
-       html_defs
      end
 
-     # Get a word html page source by its entry +url+.
+     # Get a word html definition page by its entry +url+.
      def get_htmldef(url)
-       html = Camdict::HTTP::Client.get_html(url)
-       di_head(html) + di_body(html)
+       di_extracted get_html(url)
+     end
+
+     # search a word with this URL
+     def search_url(word)
+       "#{host}/search/#{@dictionary}/?q=#{word}"
+     end
+
+     def word_url(word)
+       "#{host}/dictionary/#{@dictionary}/#{encode(word).downcase}"
      end
 
      private
 
+     def search(word)
+       html = try_search(word)
+       return '' unless html
+       # some words return their only definition directly, such as plagiarism.
+       if single_def?(html)
+         di_extracted(html)
+       else
+         multiple_entries(word, html).join
+       end
+     end
+
+     def host
+       'http://dictionary.cambridge.org'
+     end
+
+     # returned page could be a spelling check suggestion page in case it is
+     # not found, or the found page with all matched entries and related.
+     # when entry urls are not found, they are empty and spelling suggestion
+     # pages. It returns all the exactly matched entry links otherwise, raise
+     # exception WordNotFound.
+     def multiple_entries(word, html)
+       html_defs = []
+       mentry_links(word, html).each do |url|
+         html_content = get_htmldef(url)
+         html_defs << html_content if html_content
+       end
+       raise WordNotFound, "#{word} not found" if html_defs.empty?
+       html_defs
+     end
+
+     def try_search(word)
+       get_html(search_url(word))
+     rescue OpenURI::HTTPError => e
+       # When a word does not match any definitions, it returns 404 not found.
+       return if e.message[0..2] == '404'
+     end
+
      # Fetch word searching result page.
      # Returned result is either just a single definition page if there is only
-     # one entry, or a result page listing all possible entries, or spelling
+     # one entry, or a result page listing all possible entries, or spelling
      # check result. All results are objects of Nokogiri::HTML.
      def fetch(w)
-       # search a word with this URL
-       search_url = "http://dictionary.cambridge.org/search/#{@dictionary}/?q="
-       url = search_url + w
-       begin
-         Camdict::HTTP::Client.get_html(url)
-       rescue OpenURI::HTTPError => e
-         # "404" == e.message[0..2], When a word is not found, it returns 404
-         # Not Found and spelling suggestions page.
-       end
+       ret = get_html(word_url(w))
+       ret if definition_page? ret
      end
 
-     # To determine whether or not the input object of Nokogiri::HTML is a page
+     # To determine whether or not the input object of Nokogiri::HTML is a page
      # of a word definition. Return true if it has a source structure like this,
      # <div class="di-head">
      # <div class="di-title">
      # <h1 class="hw">
      # This works for the translation page too, like English-Spanish.
      def single_def?(html)
-       node = html.css(".di-head .di-title .hw")
-       ! node.empty?
+       node = html.css('.di-head .di-title .hw')
+       !node.empty?
      end
 
      # Find out matched entry links from search result page
-     # <ul class="result-list">
-     # <li><a href="entry_link1">
-     # <li><a href="entry_link2">
-     # The search result html page should include above piece of code.
-     # The extended links are filtered out and the matched word or phrase's
+     # <ul class="prefix-block">
+     # <li><a href="entry_link">
+     # The search result html page should include above piece of code.
+     # The extended links are filtered out and the matched word or phrase's
      # links are kept. An array of them are returned.
-     # For example, when the searched word is "related", entry links are like,
-     # http://dictionary.cambridge.org/dictionary/british/related_1
-     # http://dictionary.cambridge.org/dictionary/british/related_2
-     # http://dictionary.cambridge.org/dictionary/british/stress-related
+     # For example, when the searched word is "related", entry links are like,
+     # http://dictionary.cambridge.org/dictionary/english/related
+     # http://dictionary.cambridge.org/dictionary/english/relate
      # ...
-     # Returned result should only contain the first two.
+     # Returned result should only contain the first one.
      # Input html is an object of Nokogiri::HTML.
      def mentry_links(word, html)
        # suppose the word is not found in the dictionary, so it is empty.
        links = []
-       nodes = html.css(".result-list a")
-       # when found
-       unless nodes.empty?
-         nodes.each { |a|
-           links << a['href'] if matched_word?(word, a)
-         }
-       end
+       nodes = html.css('.prefix-block a')
+       nodes.each { |a| links << a['href'] if matched_word?(word, a) }
        links
      end
 
+     # Extract definition head and body from Nokogiri::HTML, discard share links
+     def di_extracted(html)
+       body = di_body(html)
+       # searching aluminium returns an American or British english page
+       # saparately, below condition filter out American english result
+       return if body.empty?
+       body.css('.share').each { |s| body.delete s }
+       body
+     end
+
+     # Return definition body in html source
+     def di_body(html)
+       html.css("#{tab_css} .di-body")
+     end
+
+     # the css selecting a tab
+     def tab_css
+       "[#{tab}]"
+     end
+
+     # the tab attributes according to dictionary name
+     def tab
+       case @dictionary
+       when 'english'
+         'data-tab="ds-british"'
+       end
+     end
+
+     # get the last part of http://dictionary.cambridge.org/british/related_1
+     def entry_id(url)
+       url.split('/').last
+     end
+
+     # phrase with space and single quote has to be replaced with dash
+     def encode(word)
+       word.gsub(/[ ']/, '-')
+     end
+
+     alias definition_page? single_def?
+
+     using Camdict::StringExt
      # Return true if the searched word matches the one on result page.
      # Node is an object of Nokogiri::Node
      # <li>
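Review note: `word_url` and `search_url` in the hunk above are plain string interpolation. The sketch below restates them as standalone functions to show the URLs they produce. The constants `HOST` and `DICTIONARY` are names made up for this example, and `escaped_search_url` is not part of the gem; it is only one possible hardening, assuming the search endpoint accepts form-encoded queries.

```ruby
require 'uri'

HOST = 'http://dictionary.cambridge.org'
DICTIONARY = 'english' # the new default dictionary in 2.0.0

# Mirrors Client#encode: spaces and single quotes become dashes.
def encode(word)
  word.gsub(/[ ']/, '-')
end

def word_url(word)
  "#{HOST}/dictionary/#{DICTIONARY}/#{encode(word).downcase}"
end

# As in the diff: the query is interpolated without escaping.
def search_url(word)
  "#{HOST}/search/#{DICTIONARY}/?q=#{word}"
end

# Hypothetical variant (not in the gem) that escapes the query first.
def escaped_search_url(word)
  "#{HOST}/search/#{DICTIONARY}/?q=#{URI.encode_www_form_component(word)}"
end

puts word_url('Open mind')
puts search_url('turn of mind')
puts escaped_search_url('turn of mind')
```

The second line printed contains raw spaces, which is why a multi-word phrase may be worth escaping before it reaches `get_html`.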
@@ -119,36 +173,18 @@ module Camdict
      # <b class="phrase">out of mind, or
      # <b class="hw">turn of mind, or
      # <b class="w">mind-numbingly
-     # Match criterion: the queried word should equal to the result word;
+     # Match criterion: the queried word should equal to the result word;
      # the result phrase should be flattened, which should equal to the
      # queried phrase.
      def matched_word?(word, node)
-       li = node.css(".base")
+       li = node.css('.base')
+       return false if li.empty?
        resword = li.size == 1 ? li.text : li[0].text
-       if resword.include? '/' or resword.include? ';'
+       if resword.include?('/') || resword.include?(';')
          resword.flatten.include?(word)
        else
          word == resword
        end
      end
-
-     # Return definition head in html source
-     def di_head(html)
-       html.css(".cdo-section-title-hw").to_html(:save_with=>0) +
-         html.css(".di-info").to_html(:save_with=>0)
-     end
-
-     # Return definition body in html source
-     def di_body(html)
-       html.css(".di-body").to_html(:save_with=>0)
-     end
-
-     # get the last part of http://dictionary.cambridge.org/british/related_1
-     def entry_id(url)
-       url.split('/').last
-     end
-
-     alias :definition_page? :single_def?
-
    end
  end
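Review note: one behaviour of the new `try_search` is worth checking. A `return if ...` modifier inside a `rescue` clause evaluates to `nil` when its condition is false, so the method appears to return `nil` for every `OpenURI::HTTPError`, not only for a 404. The sketch below reproduces this with a hypothetical `FakeHTTPError` class (so it runs without network access) and shows a variant that re-raises anything other than a 404. Both method names are made up for this example.

```ruby
# Hypothetical error class standing in for OpenURI::HTTPError.
class FakeHTTPError < StandardError; end

# Same shape as try_search in the diff; the block plays the role of get_html.
def swallow_all
  yield
rescue FakeHTTPError => e
  return if e.message[0..2] == '404'
end

# Variant that only swallows a 404 and re-raises everything else.
def swallow_404_only
  yield
rescue FakeHTTPError => e
  return nil if e.message.start_with?('404')
  raise
end

p swallow_all { raise FakeHTTPError, '500 Internal Server Error' }
p swallow_404_only { raise FakeHTTPError, '404 Not Found' }

begin
  swallow_404_only { raise FakeHTTPError, '500 Internal Server Error' }
rescue FakeHTTPError => e
  puts "re-raised: #{e.message}"
end
```

If swallowing all HTTP errors is intended, the `404` check in `try_search` could be dropped; if not, a bare `raise` after the check restores the distinction.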
@@ -1,161 +1,43 @@
+ # frozen_string_literal: true
+ require 'camdict/string_ext'
+
  module Camdict
+   # some common private methods used to extract Nokogiri nodes
    module Common
-
-     # Extend String class.
-     String.class_eval do
-       # 'blow a kiss to/at sb'.flatten =>
-       # %q(blow a kiss to sb, blow a kiss at sb)
-       # if it doesn't include a slash, returns stripped string
-       def flatten
-         str = self.strip
-         # remove the space surrounding '/'
-         str = str.gsub /\s*\/\s*/, '/'
-         return str unless str.include? '/'
-         len = str.length
-         ret = []
-         # when two strings are passed in separated with ';', then separate them
-         if pos = str.index(';')
-           ret += str[0..pos-1].flatten
-           ret += str[pos+1..len-1].flatten
-           return ret
-         end
-         # when a string has round brackets meaning optional part
-         if str.include? '('
-           head, bracket, tail = str.partition(/\(.*\)/)
-           unless bracket.empty?
-             ret << (head.strip + tail).flatten
-             result = bracket.delete("()").flatten
-             result = [result] if result.is_a? String
-             result.each { |s|
-               ret << (head + s + tail).flatten
-             }
-           end
-           return ret.flatten
-         end
-         j=0 # count of the alternative words, 'to/at' has two.
-         b=[] # b[]/e[] index of the beginning/end of alternative words
-         e=[]
-         # set this flag when next word is expected an alternate word after slash
-         include_next = false
-         for i in 0..len-1
-           c = str[i]
-           case c
-           # valid char in a word
-           when /[[:alnum:]\-']/
-             if b[j].nil?
-               b[j] = i
-               e[j] = i
-             else
-               e[j] = i
-             end
-           # char means a word has ended
-           when " ", "!", "?", ",", "."
-             if include_next
-               break
-             else
-               b[j] = nil
-               e[j] = nil
-             end
-           # 'or' separator
-           when "/"
-             j += 1
-             include_next = true
-           else
-             raise NotImplementedError, "char '#{c}' found in '#{self}'."
-           end
-         end
-         if j > 0
-           for i in (0..j)
-             # alternative word is not the last word and not at the beginning
-             if (e[j]+1 < len) && (b[0] > 0)
-               ret << str[0..b[0]-1] + str[b[i]..e[i]] + str[e[j]+1..len-1]
-             elsif (e[j]+1 == len) && (b[0] > 0)
-               ret << str[0..b[0]-1] + str[b[i]..e[i]]
-             elsif (e[j]+1 < len) && (b[0] == 0)
-               ret << str[b[i]..e[i]] + str[e[j]+1..len-1]
-             else
-               ret << str[b[i]..e[i]]
-             end
-           end
-         end
-         ret
-       end
-
-       # Test whether a String includes the +word+. It's useful while testing
-       # a variable which might be an array of phrase or just a single phrase.
-       def has?(word)
-         self.include? word
-       end
-     end
-
-     # Extend Array class.
-     Array.class_eval do
-       # Expand a phrase array into a flattened one. Example,
-       # ['blow your nose', 'blow a kiss to/at sb'] #=>
-       # ['blow your nose', 'blow a kiss to sb', 'blow a kiss at sb']
-       def expand
-         ret = self.map { |p|
-           p.flatten if p.is_a? String
-         }
-         ret.flatten
-       end
-
-
-       # Test if a phrase array includes a +word+.
-       # ['blow your nose', 'blow a kiss to/at sb'].has?("a kiss at") #=>true
-       def has?(word)
-         self.expand.each { |phr|
-           return true if phr.include? word
-         }
-         false
-       end
-
-       # Iterate an array and return two elements +a+ +b+ each time for handling.
-       def each_pair
-         len = self.length
-         i = 0
-         while (i < len)
-           a = self.at(i)
-           b = self.at(i+1)
-           yield(a, b)
-           i += 2
-         end
-       end
-     end
-
      private
+
      # Get the text selected by the css +selector+.
-     def css_text(selector)
-       node = @html.css(selector)
+     def css_text(html, selector)
+       node = html.css(selector)
        node.text unless node.empty?
      end
 
-     # Get sth by the css +selector+ for the derived word inside its runon node
-     def derived_css(selector)
-       runon = @html.css(".runon")
-       runon.each { |r|
+     # Get sth by the css +selector+ for the derived word inside its runon node
+     def derived_css(html, selector)
+       runon = html.css('.runon')
+       runon.each do |r|
          n = r.css('[title="Derived word"]')
-         if n.text == @word
+         if n.text == @word
            node = r.css(selector)
            yield(node)
          end
-       }
+       end
      end
 
+     using Camdict::StringExt
+
      # Get sth by the css +selector+ for the phrase inside the node phrase-block
-     def phrase_css(selector)
-       phbs = @html.css(".phrase-block")
-       phbs.each { |phb|
+     def phrase_css(html, selector)
+       phbs = html.css('.phrase-block')
+       phbs.each do |phb|
          nodes = phb.css('.phrase, .v[title="Variant form"]')
-         nodes.each { |n|
-           if n.text.flatten.has? @word
-             node = phb.css(selector)
-             yield(node)
-             break
-           end
-         }
-       }
+         nodes.each do |n|
+           next unless n.text.flatten.has? @word
+           node = phb.css(selector)
+           yield(node)
+           break
+         end
+       end
      end
-
    end
  end
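Review note: the `String#flatten` removed above, which presumably now lives in `camdict/string_ext` (a file not shown in this excerpt), expands slash-separated alternatives such as `to/at` into separate phrases. The following is a deliberately simplified sketch of that idea, not the gem's implementation. `expand_alternatives` is a hypothetical helper that handles only whitespace-separated tokens, always returns an array, and ignores the `;` and bracket cases the original covered.

```ruby
# Hypothetical helper, not part of camdict.
def expand_alternatives(phrase)
  # Each token becomes a list of its alternatives: 'to/at' -> ['to', 'at'].
  choices = phrase.strip.split(/\s+/).map { |token| token.split('/') }
  first, *rest = choices
  return [] if first.nil?
  # The Cartesian product enumerates every combination of alternatives.
  first.product(*rest).map { |words| words.join(' ') }
end

p expand_alternatives('blow a kiss to/at sb')
# => ["blow a kiss to sb", "blow a kiss at sb"]
p expand_alternatives('blow your nose')
# => ["blow your nose"]
```

Using `Array#product` keeps the sketch short, at the cost of the character-level validation (the `NotImplementedError` branch) that the removed code performed.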