wiki-api 0.0.2 → 0.1.0

checksums.yaml CHANGED
@@ -1,15 +1,15 @@
  ---
  SHA1:
  metadata.gz: !binary |-
- NjgxOGUxZjQ2MWQ2MjNhMDA2ZGUwMTRhOGI4MWFlOGQ3MzI4MWFjOA==
+ NjQ3MjZkMDdmNTg2YjdhZDRmM2E3MjU4ZjA1Y2IwOGYzODEwZTFkMA==
  data.tar.gz: !binary |-
- ZmZkNDFhMzc0ZTNmZDBlYTFmMTIwMmU5ZDgzYTQ2YjM0ZTk1ZmQzYg==
+ YWE4Mzc4ZjRlYTBjNGE4MTkyYmE0OGFkOTJkMDViZTI0MjQ5MGFiMw==
  SHA512:
  metadata.gz: !binary |-
- NGM4YTU2MjQ3Njk1MzJkMDhlYjcxODYxNDFkNzRlODI5MjMwNmU5ZGEzZmJj
- MjhjZjYxYzcxMmYzYjA0YzA3NzdlYTJhMjM0ZTllNzgyMDk0MGJiNjBiZWRl
- N2Y5YzMwZWZjZmY3NWQ0YmJiMjdiOTkwOTU1ZmE4MDg5Njk4M2Y=
+ OTNhMTZkNjMwNzJiMzU5YWE0ZDZiNzRlZWU5ZDJjM2Q1NTA5ZWRiN2IzY2Mw
+ MmU1ZDk0ODZhN2U4ODYwNjY0ZjdmY2U5ZTFkMDk4ZDA2MzIyODUzNjE0YzVl
+ OGE2ZmFmOTYyOWY2MWIyNGNlNmU5NjYwOTNkMGNhNjllOWM0YzQ=
  data.tar.gz: !binary |-
- MGZlMTYzZTgzZWE3YmYzZmIyMjc0OTZhMGY0NDEwYzJmNmFiMTZkNDM3OGM2
- Mjc1MDdjMzQ3MjM1NmVlODM3Mzg5ZTViMGRmOGI2NzE1NDZjODJhZTA2MjI5
- NWE3YmI4MDYxY2I4NGM3MGUwNzAzNjQ3YjMwODU5NDBlMWYxZDM=
+ YjgzZGEzYzhhOWFmNzZhMjRlMWFiYmJiY2Q3N2EwOGQwZTBjY2Q0NzYxNWE2
+ ODc5NmMyNmYyODMyNmVmMjFmYzhhOTAzMTUzZTBmODU2OTMwY2RhYjg0Mjkz
+ Yjk3NjMzNGFlZGViYzQyOGQ5YzVjM2MzMjIyNWVlOWRhOTU0MDk=
data/README.md CHANGED
@@ -1,13 +1,17 @@
  # Wiki::Api
 
- Wiki API is a gem (Ruby on Rails) that interfaces with the MediaWiki API (https://www.mediawiki.org/wiki/API:Main_page). This gem is more than a interface, it has abstract classes like: Page on which you can request page parameters (like headlines, and text blocks within headlines).
+ Wiki API is a gem (Ruby on Rails) that interfaces with the MediaWiki API (https://www.mediawiki.org/wiki/API:Main_page). This gem is more than an interface: it has abstract classes for Page and Headline parsing. You can iterate through these headlines and access their data accordingly.
 
- NOTE: nokogiri is used for background parsing of HTML. Because I believe there is no point of wrapping internals (composing) for this purpose, nokogiri nodes elements etc. are exposed (http://nokogiri.org/Nokogiri.html) through the wiki-api.
+ NOTE: This gem has a nokogiri (http://nokogiri.org/Nokogiri.html) backend (for HTML parsing). The major components (Page, Headline, Block, ListItem, and Link) are wrappers for easy data access; it is still possible to retrieve the raw HTML within these objects.
 
  Requests to the MediaWiki API use the following URI structure:
 
  http(s)://somemediawiki.org/w/api.php?action=parse&format=json&page="anypage"
 
+ # RDoc (rdoc.info)
+
+ http://rdoc.info/github/dblommesteijn/wiki-api/frames/file/README.md
+
 
  ### Dependencies (production)
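The README hunk above quotes the request URI structure used against the MediaWiki API. As a rough illustration only, the same URI can be assembled with Ruby's standard `uri` library. The helper name `parse_uri` is hypothetical and is not part of wiki-api; the `/w/api.php` path is taken from the README line and is assumed to apply to the target wiki.

```ruby
require "uri"

# Hypothetical helper (not part of wiki-api): build the action=parse request
# URI described in the README for a given wiki base URI and page name.
def parse_uri(base, page)
  uri = URI.join(base, "/w/api.php")
  # encode_www_form escapes the page name where needed
  uri.query = URI.encode_www_form(action: "parse", format: "json", page: page)
  uri
end

puts parse_uri("https://en.wikipedia.org", "Ruby_on_Rails")
```

The gem itself keeps the path and query options in `api_path` and `api_options` on `Wiki::Api::Connect`, so this sketch only mirrors the shape of the request, not the gem's internals.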
 
@@ -15,27 +19,27 @@ Requests to the MediaWiki API use the following URI structure:
  * nokogiri
 
 
- ### Roadmap
+ ### Feature Roadmap
 
- * Version (0.0.2) (current)
+ * Version (0.1.0)
 
- Index important words per block, page, list item;
+ Major current release with several core changes.
 
- Parse objects for more elements within a Page.
+ * Version (0.1.1)
+
+ No features determined yet (please drop me a line if you're interested in additions).
 
 
  ### Changelog
 
- * Version (0.0.1) -> (0.0.2)
-
- Nested ListItems, Links (within Page)
+ * Version (0.0.2) -> (current)
 
- Search on Page headline (ignore case, and underscore)
+ PageLink URI without global config Exception resolved
 
+ Reverse (parent) object lookup
 
- ### Known Issues
+ Nested PageHeadline objects
 
- None discovered thus far.
 
 
  ## Installation
@@ -71,13 +75,16 @@ Wiki::Api::Connect.config = CONFIG
 
  ## Usage
 
- ### Query a Page
+ ### Query a Page and Headline
 
  Requesting headlines from a given page.
 
  ```ruby
  page = Wiki::Api::Page.new name: "Wiktionary:Welcome,_newcomers"
- page.headlines.each do |headline|
+ # the root headline equals the pagename
+ puts page.root_headline.name
+ # iterate next level of headlines
+ page.root_headline.headlines.each do |headline_name, headline|
  # printing headline name (PageHeadline)
  puts headline.name
  end
@@ -87,29 +94,28 @@ Getting headlines for a given name.
 
  ```ruby
  page = Wiki::Api::Page.new name: "Wiktionary:Welcome,_newcomers"
- page.headline("Wiktionary:Welcome,_newcomers").each do |headline|
- # printing headline name (PageHeadline)
- puts headline.name
- end
+ # lookup headline by name (underscore and case are ignored)
+ headline = page.root_headline.headline("editing wiktionary").first
+ # printing headline name (PageHeadline)
+ puts headline.name
+ # get the type of nested headline (html h1,2,3,4 etc.)
+ puts headline.type
  ```
 
  ### Basic Page structure
 
  ```ruby
  page = Wiki::Api::Page.new name: "Wiktionary:Welcome,_newcomers"
-
  # iterate PageHeadline objects
- page.headlines.each do |headline|
-
+ page.root_headline.headlines.each do |headline_name, headline|
  # exposing nokogiri internal elements
  elements = headline.elements.flatten
  elements.each do |element|
- # access Nokogiri::XML::*
+ # print will result in: Nokogiri::XML::Text or Nokogiri::XML::Element
+ puts element.class
  end
-
  # string representation of all nested text
  headline.block.to_texts
-
  # iterate PageListItem objects
  headline.block.list_items.each do |list_item|
  # string representation of nested text
@@ -136,7 +142,7 @@ end
  ```
 
 
- ### Example using Global config (https://en.wikipedia.org/wiki/Ruby_on_rails)
+ ### Example using Global config (https://en.wikipedia.org/wiki/Ruby_on_Rails)
 
  This is an example of querying wikipedia.org on the page: "Ruby_on_rails", and printing the References headline links for each list item.
 
@@ -146,35 +152,32 @@ CONFIG = { uri: "https://en.wikipedia.org" }
  Wiki::Api::Connect.config = CONFIG
 
  # querying the page
- page = Wiki::Api::Page.new name: "Ruby_on_rails"
+ page = Wiki::Api::Page.new name: "Ruby_on_Rails"
 
  # get headlines with name References (there can be multiple headlines with the same name!)
- headlines = page.headline "References"
+ headlines = page.root_headline.headline "References"
 
  # iterate headlines
  headlines.each do |headline|
  # iterate list items on the given headline
  headline.block.list_items.each do |list_item|
-
  # print the uri of all links
  puts list_item.links.map{ |l| l.uri }
-
  end
  end
  ```
 
 
-
- ### Example passing URI (https://en.wikipedia.org/wiki/Ruby_on_rails)
+ ### Example passing URI (https://en.wikipedia.org/wiki/Ruby_on_Rails)
 
  This is the same example as the one above, except that the URI is passed to the Page directly instead of being set in a global config.
 
  ```ruby
  # querying the page
- page = Wiki::Api::Page.new name: "Ruby_on_rails", uri: "https://en.wikipedia.org"
+ page = Wiki::Api::Page.new name: "Ruby_on_Rails", uri: "https://en.wikipedia.org"
 
  # get headlines with name References (there can be multiple headlines with the same name!)
- headlines = page.headline "References"
+ headlines = page.root_headline.headline "References"
 
  # iterate headlines
  headlines.each do |headline|
@@ -189,4 +192,32 @@ end
  ```
 
 
+ ### Example searching headlines
+
+ This example shows how the headlines can be searched. For more info check: https://github.com/dblommesteijn/wiki-api/blob/master/lib/wiki/api/page.rb#L97
+
+
+ ```ruby
+ # querying the page
+ page = Wiki::Api::Page.new name: "Ruby_on_Rails", uri: "https://en.wikipedia.org"
+
+ # NOTE: the following are all valid headline names:
+ # request headline (by literal name)
+ headlines = page.root_headline.headline "Philosophy_and_design"
+ puts headlines.map{|h| h.name}
+ # request headline (by downcase name)
+ headlines = page.root_headline.headline "philosophy_and_design"
+ puts headlines.map{|h| h.name}
+ # request headline (by human name)
+ headlines = page.root_headline.headline "philosophy and design"
+ puts headlines.map{|h| h.name}
+
+ # NOTE2: headlines are matched on headline.start_with?(requested_headline),
+ # so a prefix of the headline name works as well!
+ headlines = page.root_headline.headline "philosophy"
+ puts headlines.map{|h| h.name}
+ ```
+
+
+
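The matching rule described in the README example (case ignored, spaces treated as underscores, prefix match via `start_with?`) can be tried without a network request. The following is a minimal sketch under stated assumptions: `HEADLINES` and `find_headlines` are hypothetical names, and the plain Hash stands in for the name-to-object Hash that `PageHeadline#headlines` holds in this release.

```ruby
# Hypothetical stand-in for PageHeadline#headlines: keys are headline ids,
# values would normally be PageHeadline objects.
HEADLINES = {
  "History" => :history,
  "Philosophy_and_design" => :philosophy,
  "Philosophy_of_naming" => :naming,
}

# Normalise the query (downcase, spaces to underscores) and keep every
# headline whose downcased key starts with it.
def find_headlines(headlines, query)
  needle = query.downcase.gsub(" ", "_")
  headlines.select { |key, _| key.downcase.start_with?(needle) }.values
end

p find_headlines(HEADLINES, "philosophy and design") # a single match
p find_headlines(HEADLINES, "philosophy")            # both "Philosophy_..." entries
```

Because the match is a prefix match, a short query can return several headlines, which is why the README examples always treat the result as a list.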
 
@@ -7,12 +7,13 @@ module Wiki
 
  class Connect
 
- attr_accessor :uri, :api_path, :api_options, :http, :request, :response, :html, :parsed
+ attr_accessor :uri, :api_path, :api_options, :http, :request, :response, :html, :parsed, :file
 
  def initialize(options={})
  @@config ||= nil
  options.merge! @@config unless @@config.nil?
  self.uri = options[:uri] if options.include? :uri
+ self.file = options[:file] if options.include? :file
  self.api_path = options[:api_path] if options.include? :api_path
  self.api_options = options[:api_options] if options.include? :api_options
 
@@ -38,12 +39,25 @@ module Wiki
 
  def page page_name
  self.api_options[:page] = page_name
- self.connect
- response = self.response
- json = JSON.parse response.body, {symbolize_names: true}
- raise json[:error][:code] unless valid? json, response
- self.html = json[:parse][:text]
- self.parsed = Nokogiri::HTML self.html[:*]
+ # parse page by uri
+ if !self.uri.nil? && self.file.nil?
+ self.connect
+ response = self.response
+ json = JSON.parse response.body, {symbolize_names: true}
+ raise json[:error][:code] unless valid? json, response
+ self.html = json[:parse][:text]
+ self.parsed = Nokogiri::HTML self.html[:*]
+ # parse page by file
+ elsif !self.file.nil?
+ f = File.open(self.file)
+ # self.parsed = Nokogiri::HTML self.html[:*]
+ self.parsed = Nokogiri::HTML(f)
+ f.close
+ # invalid config, raise exception
+ else
+ raise "no :uri or :file config found!"
+ end
+ self.parsed
  end
 
  class << self
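The new `Connect#page` body above chooses between a remote request and a local HTML file. The branch order matters: when both `:uri` and `:file` are set, the file wins, and when neither is set an exception is raised. A minimal sketch of just that decision, with a hypothetical helper name `page_source` that does not exist in the gem:

```ruby
# Hypothetical helper (not part of wiki-api) mirroring the branch order of
# Connect#page: uri only, then file, otherwise an error.
def page_source(uri: nil, file: nil)
  if !uri.nil? && file.nil?
    :uri
  elsif !file.nil?
    :file
  else
    raise "no :uri or :file config found!"
  end
end

p page_source(uri: "https://en.wikipedia.org")                     # remote request
p page_source(uri: "https://en.wikipedia.org", file: "page.html")  # local file wins
```

This sketch leaves out the parsing itself; in the gem both branches end by storing a Nokogiri document in `parsed`.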
@@ -1,49 +1,34 @@
  module Wiki
  module Api
 
+ # MediaWiki Page, collection of all html information plus its page title
  class Page
 
- attr_accessor :name, :parsed_page, :uri
+ attr_accessor :name, :parsed_page, :uri, :parent
 
  def initialize(options={})
  self.name = options[:name] if options.include? :name
- uri = options[:uri] if options.include? :uri
-
- @@config ||= nil
- if @@config.nil? || !uri.nil?
- # use the connection to collect HTML pages for parsing
- @connect = Wiki::Api::Connect.new uri: uri
- else
- # using a local HTML file for parsing
- end
+ self.uri = options[:uri] if options.include? :uri
+ @connect = Wiki::Api::Connect.new uri: uri
  end
 
- def headlines
- headlines = []
- self.parse_blocks.each do |headline_name, elements|
- headline = PageHeadline.new name: headline_name
- elements.each do |element|
- # nokogiri element
- headline.block << element
- end
- headlines << headline
- end
- headlines
+ def connect
+ @connect
  end
 
- def headline headline_name
- headlines = []
- self.parse_blocks(headline_name).each do |headline_name, elements|
- headline = PageHeadline.new name: headline_name
- elements.each do |element|
- # nokogiri element
- headline.block << element
- end
- headlines << headline
- end
- headlines
+
+ # collect all headlines, keep original page formatting
+ def root_headline
+ self.parse_blocks
  end
 
+ # # collect headlines by given name, this will flatten the nested headlines
+ # def flat_headlines_by_name headline_name
+ # raise "not yet implemented!"
+ # # TODO: implement flattening of headlines within the root headline
+ # # ALT: breadth-first search option in the root of the first headline
+ # self.parse_blocks(headline_name)
+ # end
 
 
  def to_html
@@ -55,22 +40,8 @@ module Wiki
  self.parse_page = nil
  end
 
- class << self
- def config=(config = {})
- @@config = config
- end
- end
-
- protected
-
  def load_page!
- if @@config.nil?
- self.parsed_page ||= @connect.page self.name
- elsif self.parsed_page.nil?
- f = File.open(@@config[:file])
- self.parsed_page = Nokogiri::HTML(f)
- f.close
- end
+ self.parsed_page ||= @connect.page self.name
  end
 
 
@@ -81,11 +52,12 @@ module Wiki
 
  # get headline nodes by span class
  xs = self.parsed_page.xpath("//span[@class='mw-headline']")
+
  # filter single headline by name (ignore case)
  xs = self.filter_headline xs, headline_name unless headline_name.nil?
 
  # NOTE: first_part has no id attribute and thus cannot be filtered or processed within xpath (xs)
- if headline_name == self.name || headline_name.nil?
+ if headline_name.nil? || headline_name.start_with?(self.name.downcase)
  x = self.first_part
  result[self.name] ||= []
  result[self.name] << (self.collect_elements(x.parent))
@@ -95,11 +67,12 @@ module Wiki
  xs.each do |x|
  headline = x.attributes["id"].value
  elements = self.collect_elements x.parent.next
- result[headline] ||= []
+ result[headline] ||= []
  result[headline] << elements
  end
 
- result
+ # create root object
+ PageHeadline.new parent: self, name: result.first[0], headlines: result, level: 0
  end
 
  # harvest first part of the page (missing heading and class="mw-headline")
@@ -1,20 +1,22 @@
  module Wiki
  module Api
 
+ # Collection of elements segmented per headline
  class PageBlock
 
- attr_accessor :elements
+ attr_accessor :elements, :parent
 
  def initialize options={}
+ self.parent = options[:parent] if options.include? :parent
  self.elements = []
  end
 
  def << value
+ # value.first.previous.name
  self.elements << value
  end
 
  def to_texts
- # TODO: perhaps we should wrap the elements with objects??
  texts = []
  self.elements.flatten.each do |element|
  text = Wiki::Api::Util.element_to_text element if element.is_a? Nokogiri::XML::Element
@@ -28,14 +30,14 @@ module Wiki
  def list_items
  # TODO: perhaps we should wrap the elements with objects, and request a li per element??
  self.search("li").map do |list_item|
- PageListItem.new element: list_item
+ PageListItem.new parent: self, element: list_item
  end
  end
 
  def links
  # TODO: perhaps we should wrap the elements with objects, and request a li per element??
  self.search("a").map do |a|
- PageLink.new element: a
+ PageLink.new parent: self, element: a
  end
  end
 
@@ -1,20 +1,115 @@
  module Wiki
  module Api
 
+ # Headline for a page (class="mw-headline")
  class PageHeadline
 
- attr_accessor :name, :block
+ require 'json'
+
+ LEVEL = ["text", "h1", "h2", "h3", "h4", "h5", "h6"]
+
+ attr_accessor :name, :block, :parent, :headlines, :level
 
  def initialize options={}
  self.name = options[:name] if options.include? :name
- self.block = PageBlock.new
+ self.parent = options[:parent] if options.include? :parent
+ self.level = options[:level] if options.include? :level
+ options[:headlines] ||= []
+ self.headlines ||= {}
+
+ # store elements in a block
+ self.block = PageBlock.new parent: self
+ if options[:headlines].include? self.name
+ options[:headlines][self.name].each do |element|
+ self.block << element
+ end
+ end
+
+ # collect nested headlines
+ headlines = options[:headlines]
+ # remove self from list
+ headlines.delete self.name
+ nested_headlines = self.nested_headlines headlines, self.name, self.level
+
+ # iterate nested headlines, and recurse into each
+ nested_headlines.each do |headline_name, value|
+ level = LEVEL.index value.first.first.previous.name
+ self.headlines[headline_name] = (PageHeadline.new parent: self, name: headline_name, headlines: headlines, level: level)
+ end
  end
 
  def elements
  self.block.elements
  end
 
+ def type
+ self.block.elements.first.first.previous.name
+ end
+
+ # get headline by name
+ def headline name
+ name = name.downcase.gsub(" ", "_")
+ self.headlines.reject do |k,v|
+ !k.downcase.start_with?(name)
+ end.values()
+ end
+
+ # recursive headline search
+ # def headline_by_name name, depth = 1
+ # name = name.downcase.gsub(" ", "_")
+ # ret = []
+ # self.headlines.each do |k,v|
+ # ret << v if k.downcase.start_with?(name)
+ # next if v.headlines.empty?
+ # if depth > 0
+ # q = v.headline_by_name name, (depth - 1)
+ # ret.concat q
+ # end
+ # end
+ # ret
+ # end
 
+ # headline exists for current headline
+ def has_headline? name
+ name = name.downcase.gsub(" ", "_")
+ self.headlines.each do |k,v|
+ return true if k.downcase.start_with?(name)
+ end
+ false
+ end
+
+ def to_hash
+ ret = {name: self.name, headlines: [], type: self.type}
+ self.headlines.each do |headline_name, headline|
+ ret[:headlines] << headline.to_hash
+ end
+ ret
+ end
+
+ def to_pretty_json
+ JSON.pretty_generate self.to_hash
+ end
+
+ protected
+
+ # filter nested headlines (elements) from a parent headline (by name)
+ def nested_headlines headlines, name, original_level
+ ret = {}
+ init_level = nil
+ # iterate headlines, skip already done ones
+ #headlines.drop(headline_index + 1).each do |headline|
+ headlines.to_a.each do |name, value|
+ level = LEVEL.index value.first.first.previous.name
+ init_level ||= level
+ # a lower level indicates the end of the nest
+ break if level <= original_level
+ break if level < init_level
+ # higher level indicates nested items, these will be processed recursively
+ next if init_level != level
+ ret[name] = value
+ end
+ ret
+ end
 
  end
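The `nested_headlines` filter above picks the direct children of a headline out of a flat, document-ordered list by comparing heading levels. The sketch below reproduces that level comparison on plain data so it can be run without Nokogiri. It is a simplification under stated assumptions: `FLAT` and `direct_children` are hypothetical names, and each entry is a `[name, level]` pair rather than the element lists the gem stores (where the level is derived from `LEVEL.index` of the preceding heading tag).

```ruby
# Hypothetical flat list of [name, level] pairs in document order
# (level 2 stands for h2, level 3 for h3).
FLAT = [
  ["History", 2],
  ["Early_years", 3],
  ["Recent", 3],
  ["Usage", 2],
  ["Install", 3],
]

# Select the direct children of a parent at parent_level from the entries
# that follow it, using the same comparisons as nested_headlines.
def direct_children(entries, parent_level)
  children = []
  init_level = nil
  entries.each do |name, level|
    init_level ||= level
    break if level <= parent_level # sibling or ancestor of the parent: nest ends
    break if level < init_level    # shallower than the first child: nest ends
    next if level != init_level    # deeper entries belong to a child headline
    children << name
  end
  children
end

p direct_children(FLAT, 0)         # children of the root (level 0)
p direct_children(FLAT.drop(1), 2) # children of "History" (level 2)
```

In the gem the recursion happens in `PageHeadline#initialize`, which removes each processed headline from the shared Hash before descending; the `drop(1)` call here only imitates that for the first headline.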