wiki-api 0.1.0 → 0.1.2
- checksums.yaml +5 -13
- data/.rubocop.yml +24 -0
- data/.travis.yml +12 -0
- data/Gemfile +2 -0
- data/README.md +60 -62
- data/Rakefile +13 -1
- data/bin/console +8 -0
- data/lib/wiki/api/connect.rb +48 -38
- data/lib/wiki/api/page.rb +35 -42
- data/lib/wiki/api/page_block.rb +16 -17
- data/lib/wiki/api/page_headline.rb +51 -50
- data/lib/wiki/api/page_link.rb +13 -14
- data/lib/wiki/api/page_list_item.rb +10 -13
- data/lib/wiki/api/util.rb +18 -20
- data/lib/wiki/api/version.rb +3 -1
- data/lib/wiki/api.rb +9 -8
- data/test/test_helper.rb +4 -7
- data/test/unit/wiki_connect.rb +18 -25
- data/test/unit/wiki_page_offline.rb +144 -111
- data/wiki-api.gemspec +20 -17
- metadata +53 -34
checksums.yaml
CHANGED
@@ -1,15 +1,7 @@
 ---
-
-  metadata.gz:
-
-  data.tar.gz: !binary |-
-    YWE4Mzc4ZjRlYTBjNGE4MTkyYmE0OGFkOTJkMDViZTI0MjQ5MGFiMw==
+SHA256:
+  metadata.gz: cd978cd4dad89ddc8098d6abafcd6325ec6c0c4a4a5e5b8e93855bc118314b27
+  data.tar.gz: c5ead46deb2d10310823d4b639046058cf087a29cb6a0413a5e3addc64037b92
 SHA512:
-  metadata.gz:
-
-    MmU1ZDk0ODZhN2U4ODYwNjY0ZjdmY2U5ZTFkMDk4ZDA2MzIyODUzNjE0YzVl
-    OGE2ZmFmOTYyOWY2MWIyNGNlNmU5NjYwOTNkMGNhNjllOWM0YzQ=
-  data.tar.gz: !binary |-
-    YjgzZGEzYzhhOWFmNzZhMjRlMWFiYmJiY2Q3N2EwOGQwZTBjY2Q0NzYxNWE2
-    ODc5NmMyNmYyODMyNmVmMjFmYzhhOTAzMTUzZTBmODU2OTMwY2RhYjg0Mjkz
-    Yjk3NjMzNGFlZGViYzQyOGQ5YzVjM2MzMjIyNWVlOWRhOTU0MDk=
+  metadata.gz: fcb6e3991c12a415a79b4c109091a41dbe45bff7ee3040a1a4283ddc2625522cfca767c65cba45e0f29bb13d410f082b78337de25d0bfd2bd9e0bd1591a36c24
+  data.tar.gz: 3a78fa474766c4cc10c44eb3e8a90ed95c1ddac1f306afa878da2ccf7b75e4fd179fc7933499f261c408cdd2f396d3613a6d74361bdad160cb3c13727aaa135c
data/.rubocop.yml
ADDED
@@ -0,0 +1,24 @@
+AllCops:
+  SuggestExtensions: false
+Style/ClassVars:
+  Enabled: false
+Style/Documentation:
+  Enabled: false
+Style/MethodCallWithArgsParentheses:
+  Enabled: true
+Metrics/AbcSize:
+  Enabled: false
+Metrics/ClassLength:
+  Enabled: false
+Metrics/CyclomaticComplexity:
+  Enabled: false
+Metrics/PerceivedComplexity:
+  Enabled: false
+Metrics/MethodLength:
+  Enabled: false
+Naming/MethodParameterName:
+  Enabled: false
+Naming/PredicateName:
+  Enabled: false
+Lint/RescueException:
+  Enabled: false
data/.travis.yml
ADDED
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -1,47 +1,20 @@
 # Wiki::Api

-
+[![Build Status](https://travis-ci.org/dblommesteijn/wiki-api.svg?branch=master)](https://travis-ci.org/dblommesteijn/wiki-api) [![Code Climate](https://codeclimate.com/github/dblommesteijn/wiki-api.png)](https://codeclimate.com/github/dblommesteijn/wiki-api)

-
+Wiki API is a gem (Ruby on Rails) that interfaces with the MediaWiki API (https://www.mediawiki.org/wiki/API:Main_page). This gem is more than a interface, it has abstract classes for Page and Headline parsing. You're able to iterate through these headlines, and access data accordingly.
+
+NOTE: This gem has a nokogiri (http://nokogiri.org/Nokogiri.html) backend (for HTML parsing). Major components: `Page`, `Headline`, `Block`, `ListItem`, and `Link` are wrappers for easy data access, however it's still possible to retreive the raw HTML within these objects.

 Requests to the MediaWiki API use the following URI structure:

 http(s)://somemediawiki.org/w/api.php?action=parse&format=json&page="anypage"

-
-
-http://rdoc.info/github/dblommesteijn/wiki-api/frames/file/README.md
-
+### Dependencies

-### Dependencies (production)
-
-* json
 * nokogiri


-### Feature Roadmap
-
-* Version (0.1.0)
-
-Major current release with several core changes.
-
-* Version (0.1.1)
-
-No features determined yet (please drop me a line if you're interested in additions).
-
-
-### Changelog
-
-* Version (0.0.2) -> (current)
-
-PageLink URI without global config Exception resolved
-
-Reverse (parent) object lookup
-
-Nested PageHeadline objects
-
-
-
 ## Installation

 Add this line to your application's Gemfile (bundler):
@@ -56,23 +29,29 @@ Or install it yourself (RubyGems):

 $ gem install wiki-api

+Or try it from this repository (local) in a console:
+
+$ bin/console
+

 ## Setup

 Define a configuration for your connection (initialize script), this example uses wiktionary.org.
-NOTE: it can connect to both HTTP and HTTPS MediaWikis
-
-```ruby
-CONFIG = { uri: "http://en.wiktionary.org" }
-```
+NOTE: it can connect to both HTTP and HTTPS MediaWikis (however you'll get a 302 response from MediaWiki)

 Setup default configuration (initialize script)

 ```ruby
-Wiki::Api::Connect.config =
+Wiki::Api::Connect.config = { uri: 'https://en.wiktionary.org' }
 ```


+## Running tests
+
+```bash
+$ rake test
+```
+
 ## Usage

 ### Query a Page and Headline
@@ -80,7 +59,7 @@ Wiki::Api::Connect.config = CONFIG
 Requesting headlines from a given page.

 ```ruby
-page = Wiki::Api::Page.new
+page = Wiki::Api::Page.new(name: 'Wiktionary:Welcome,_newcomers')
 # the root headline equals the pagename
 puts page.root_headline.name
 # iterate next level of headlines
@@ -93,9 +72,9 @@ end
 Getting headlines for a given name.

 ```ruby
-page = Wiki::Api::Page.new
+page = Wiki::Api::Page.new(name: 'Wiktionary:Welcome,_newcomers')
 # lookup headline by name (underscore and case are ignored)
-headline = page.root_headline.headline(
+headline = page.root_headline.headline('editing wiktionary').first
 # printing headline name (PageHeadline)
 puts headline.name
 # get the type of nested headline (html h1,2,3,4 etc.)
@@ -105,7 +84,7 @@ puts headline.type
 ### Basic Page structure

 ```ruby
-page = Wiki::Api::Page.new
+page = Wiki::Api::Page.new(name: 'Wiktionary:Welcome,_newcomers')
 # iterate PageHeadline objects
 page.root_headline.headlines.each do |headline_name, headline|
   # exposing nokogiri internal elements
@@ -114,6 +93,7 @@ page.root_headline.headlines.each do |headline_name, headline|
     # print will result in: Nokogiri::XML::Text or Nokogiri::XML::Element
     puts element.class
   end
+
   # string representation of all nested text
   block.to_texts
   # iterate PageListItem objects
@@ -137,7 +117,6 @@ page.root_headline.headlines.each do |headline_name, headline|
     # string representation of nested text
     link.to_text
   end
-
 end
 ```

@@ -148,21 +127,20 @@ This is a example of querying wikipedia.org on the page: "Ruby_on_rails", and pr

 ```ruby
 # setting a target config
-
-Wiki::Api::Connect.config = CONFIG
+Wiki::Api::Connect.config = { uri: 'https://en.wikipedia.org' }

 # querying the page
-page = Wiki::Api::Page.new
+page = Wiki::Api::Page.new(name: 'Ruby_on_Rails')

 # get headlines with name Reference (there can be multiple headlines with the same name!)
-headlines = page.root_headline.headline
+headlines = page.root_headline.headline('References')

 # iterate headlines
 headlines.each do |headline|
   # iterate list items on the given headline
   headline.block.list_items.each do |list_item|
     # print the uri of all links
-    puts list_item.links.map
+    puts list_item.links.map(&:uri)
   end
 end
 ```
@@ -174,19 +152,17 @@ This is the same example as the one above, except for setting a global config to

 ```ruby
 # querying the page
-page = Wiki::Api::Page.new
+page = Wiki::Api::Page.new(name: 'Ruby_on_Rails', uri: 'https://en.wikipedia.org')

 # get headlines with name Reference (there can be multiple headlines with the same name!)
-headlines = page.root_headline.headline
+headlines = page.root_headline.headline('References')

 # iterate headlines
 headlines.each do |headline|
   # iterate list items on the given headline
   headline.block.list_items.each do |list_item|
-
     # print the uri of all links
-    puts list_item.links.map
-
+    puts list_item.links.map(&:uri)
   end
 end
 ```
@@ -199,25 +175,47 @@ This example shows how the headlines can be searched. For more info check: https

 ```ruby
 # querying the page
-page = Wiki::Api::Page.new
+page = Wiki::Api::Page.new(name: 'Ruby_on_Rails', uri: 'https://en.wikipedia.org')

 # NOTE: the following are all valid headline names:
 # request headline (by literal name)
-headlines = page.root_headline.headline
-puts headlines.map
+headlines = page.root_headline.headline('Philosophy_and_design')
+puts headlines.map(&:name)
 # request headline (by downcase name)
-headlines = page.root_headline.headline
-puts headlines.map
+headlines = page.root_headline.headline('philosophy_and_design')
+puts headlines.map(&:name)
 # request headline (by human name)
-headlines = page.root_headline.headline
-puts headlines.map
+headlines = page.root_headline.headline('philosophy and design')
+puts headlines.map(&:name)

 # NOTE2: headlines are matched on headline.start_with?(requested_headline)
 # because of start_with? compare this should work as well!
-headlines = page.root_headline.headline
-puts headlines.map
+headlines = page.root_headline.headline('philosophy')
+puts headlines.map(&:name)
 ```


+### Example searching headlines in depth

+Recursive search on all nested headlines, including in depth searches.
+
+```ruby
+# querying the page
+page = Wiki::Api::Page.new(name: 'Ruby_on_Rails', uri: 'https://en.wikipedia.org')
+# get root
+root_headline = page.root_headline
+# lookup 'ramework structure' on current level
+headline = root_headline.headline_in_depth('framework structure').first
+puts headline.name
+# NOTE: lookup of nested headlines does not work with the headline function (because 'Framework_structure' is nested within 'Technical_overview')
+headline = root_headline.headline('framework structure').first
+# depth can be limited adding the depth parameter
+# NOTE: the example below will return nil, 'Framework_structure' is nested beyond depth = 0!
+depth = 0
+headline = root_headline.headline_in_depth('framework structure', depth).first
+# increasing depth search will show the requested headline
+depth = 5
+headline = root_headline.headline_in_depth('framework structure', depth).first
+puts headline.name
+```

|
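The README's new "searching headlines in depth" section describes a depth-limited recursive lookup, but the implementation (`page_headline.rb`) is not part of the hunks shown here. The following is only a minimal sketch of that idea under stated assumptions: `Node`, `matches?`, and `search_in_depth` are hypothetical names, the tree is a plain `Struct` rather than the gem's `PageHeadline`, and the default depth of 10 is arbitrary.

```ruby
# Hypothetical stand-in for a nested headline tree (not the gem's PageHeadline).
Node = Struct.new(:name, :children)

# Matching as the README describes it: case is ignored, spaces are treated
# as underscores, and a prefix of the headline name is enough.
def matches?(node_name, requested)
  node_name.downcase.start_with?(requested.downcase.tr(' ', '_'))
end

# Collect matching nodes below `node`, descending at most `depth` extra levels.
# depth = 0 inspects only the direct children of `node`.
def search_in_depth(node, requested, depth = 10)
  node.children.flat_map do |child|
    found = matches?(child.name, requested) ? [child] : []
    found += search_in_depth(child, requested, depth - 1) if depth.positive?
    found
  end
end

root = Node.new('Ruby_on_Rails', [
  Node.new('Technical_overview', [Node.new('Framework_structure', [])]),
  Node.new('Philosophy_and_design', [])
])

p search_in_depth(root, 'framework structure', 0).first      # nested too deep for depth 0
p search_in_depth(root, 'framework structure', 5).first.name # found once depth allows it
p search_in_depth(root, 'philosophy').map(&:name)            # prefix match on the top level
```

Returning an array and letting the caller take `.first` matches how the README examples consume the result, which is why a miss shows up as `nil` rather than as an exception.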
data/Rakefile
CHANGED
@@ -1 +1,13 @@
-
+# frozen_string_literal: true
+
+require 'bundler/gem_tasks'
+require 'rake/testtask'
+
+Rake::TestTask.new do |t|
+  t.libs << 'test'
+  tfs = FileList['test/unit/*.rb']
+  t.test_files = tfs
+  t.verbose = true
+end
+
+task default: %i[build install]
data/bin/console
ADDED
data/lib/wiki/api/connect.rb
CHANGED
@@ -1,85 +1,95 @@
+# frozen_string_literal: true
+
 require 'net/http'
 require 'json'
 require 'nokogiri'

 module Wiki
   module Api
-
     class Connect
-
       attr_accessor :uri, :api_path, :api_options, :http, :request, :response, :html, :parsed, :file

-      def initialize(options={})
-        @@config ||=
-
-        self.
-        self.
-        self.
-        self.api_options = options[:api_options] if options.include? :api_options
+      def initialize(options = {})
+        @@config ||= {}
+        self.uri = options[:uri] || @@config[:uri]
+        self.file = options[:file] || @@config[:file]
+        self.api_path = options[:api_path] || @@config[:api_path]
+        self.api_options = options[:api_options] || @@config[:api_options]

         # defaults
-        self.api_path ||=
-        self.api_options ||= {action:
+        self.api_path ||= '/w/api.php'
+        self.api_options ||= { action: 'parse', format: 'json', page: '' }

         # errors
-        raise
+        raise('no uri given') if uri.nil?
       end

       def connect
         uri = URI("#{self.uri}#{self.api_path}")
-        uri.query = URI.encode_www_form
+        uri.query = URI.encode_www_form(self.api_options)
         self.http = Net::HTTP.new(uri.host, uri.port)
-        if uri.scheme ==
-
-          #self.http.verify_mode = OpenSSL::SSL::VERIFY_NONE
+        if uri.scheme == 'https'
+          http.use_ssl = true
+          # self.http.verify_mode = OpenSSL::SSL::VERIFY_NONE
         end
         self.request = Net::HTTP::Get.new(uri.request_uri)
-        self.response =
+        self.response = http.request(request)
       end

-      def page
+      def page(page_name)
         self.api_options[:page] = page_name
         # parse page by uri
-        if !
-          self.
-          response = self.response
-          json = JSON.parse response.body, {symbolize_names: true}
-          raise json[:error][:code] unless valid? json, response
-          self.html = json[:parse][:text]
-          self.parsed = Nokogiri::HTML self.html[:*]
+        if !uri.nil? && file.nil?
+          self.parsed = parse_from_uri(response)
         # parse page by file
-        elsif !
-
-          # self.parsed = Nokogiri::HTML self.html[:*]
-          self.parsed = Nokogiri::HTML(f)
-          f.close
+        elsif !file.nil?
+          self.parsed = parse_from_file(file)
         # invalid config, raise exception
         else
-          raise
+          raise('no :uri or :file config found!')
         end
-
+        parsed
+      end
+
+      def parse_from_uri(response)
+        connect
+        # rubocop:disable Lint/ShadowedArgument
+        response = self.response
+        # rubocop:enable Lint/ShadowedArgument
+        json = JSON.parse(response.body, { symbolize_names: true })
+        raise(json[:error][:code]) unless valid?(json, response)
+
+        self.html = json[:parse][:text]
+        self.parsed = Nokogiri::HTML(html[:*])
+      end
+
+      def parse_from_file(file)
+        f = File.open(file)
+        ret = Nokogiri::HTML(f)
+        f.close
+        ret
       end

       class << self
         def config=(config = {})
           @@config = config
         end
+
         def config
           @@config ||= []
         end
       end

       protected
-
+
+      def valid?(json, response)
         b = []
         # valid http response
-        b << (response.is_a?
+        b << (response.is_a?(Net::HTTPOK))
         # not an invalid api response handle
-        b << (!json.include?
+        b << (!json.include?(:error))
         !b.include?(false)
       end
-
     end
-
   end
-end
+end
|
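The new `Connect#initialize` resolves each setting from the per-call options first, then the global config, then a built-in default, and `connect` turns the result into a request URI with `URI.encode_www_form`. The sketch below isolates that resolution and URI construction so it can be tried without a network call. `build_api_uri` is a hypothetical helper and not part of the gem; passing the global config as an explicit argument instead of a class variable is also an assumption made for the example.

```ruby
require 'uri'

# Hypothetical helper (not part of the gem): resolves settings with the same
# precedence as the diff above: per-call options, then global config, then defaults.
def build_api_uri(page_name, options = {}, config = {})
  base = options[:uri] || config[:uri]
  raise('no uri given') if base.nil?

  api_path = options[:api_path] || config[:api_path] || '/w/api.php'
  api_options = options[:api_options] || config[:api_options] ||
                { action: 'parse', format: 'json', page: '' }

  uri = URI("#{base}#{api_path}")
  # merge returns a copy, so the caller's hash is left untouched
  uri.query = URI.encode_www_form(api_options.merge(page: page_name))
  uri
end

uri = build_api_uri('Ruby_on_Rails', {}, { uri: 'https://en.wikipedia.org' })
puts uri
puts uri.scheme
```

The scheme of the resulting URI is what the `connect` method in the diff checks before enabling `use_ssl`.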
data/lib/wiki/api/page.rb
CHANGED
@@ -1,25 +1,22 @@
+# frozen_string_literal: true
+
 module Wiki
   module Api
-
     # MediaWiki Page, collection of all html information plus it's page title
     class Page
-
       attr_accessor :name, :parsed_page, :uri, :parent

-      def initialize(options={})
-        self.name = options[:name] if options.include?
-        self.uri = options[:uri] if options.include?
-        @connect = Wiki::Api::Connect.new
-      end
-
-      def connect
-        @connect
+      def initialize(options = {})
+        self.name = options[:name] if options.include?(:name)
+        self.uri = options[:uri] if options.include?(:uri)
+        @connect = Wiki::Api::Connect.new(uri:)
       end

+      attr_reader :connect

       # collect all headlines, keep original page formatting
       def root_headline
-
+        parse_blocks
       end

       # # collect headlines by given name, this will flatten the nested headlines
@@ -30,10 +27,9 @@ module Wiki
       # self.parse_blocks(headline_name)
       # end

-
       def to_html
-
-
+        load_page!
+        parsed_page.to_xhtml(indent: 3, indent_text: ' ')
       end

       def reset!
@@ -41,69 +37,66 @@ module Wiki
       end

       def load_page!
-        self.parsed_page ||= @connect.page
+        self.parsed_page ||= @connect.page(name)
       end

-
       # parse blocks
-      def parse_blocks
-
+      def parse_blocks(headline_name = nil)
+        load_page!
         result = {}

         # get headline nodes by span class
-
+        headlines = self.parsed_page.xpath("//span[@class='mw-headline']")

         # filter single headline by name (ignore case)
-
+        headlines = filter_headline(headlines, headline_name) unless headline_name.nil?

         # NOTE: first_part has no id attribute and thus cannot be filtered or processed within xpath (xs)
-        if headline_name.nil? || headline_name.start_with?(
-          x =
-          result[
-          result[
+        if headline_name.nil? || headline_name.start_with?(name.downcase)
+          x = first_part
+          result[name] ||= []
+          result[name] << (collect_elements(x.parent))
         end

         # append all blocks
-
-
-          elements =
-          result[
-          result[
+        headlines.each do |headline|
+          headline_value = headline.attributes['id'].value
+          elements = collect_elements(headline.parent.next)
+          result[headline_value] ||= []
+          result[headline_value] << elements
         end

         # create root object
-        PageHeadline.new
+        PageHeadline.new(parent: self, name: result.first[0], headlines: result, level: 0)
       end

       # harvest first part of the page (missing heading and class="mw-headline")
       def first_part
-        self.parsed_page ||= @connect.page
-        self.parsed_page.search(
+        self.parsed_page ||= @connect.page(name)
+        self.parsed_page.search('p').first.children.first
       end

       # collect elements within headlines (not nested properties, but next elements)
-      def collect_elements
+      def collect_elements(element)
         # capture first element name
         elements = []
         # iterate text until next headline
-
+        loop do
           elements << element
           element = element.next
-          break if element.nil? || element.to_html.include?(
+          break if element.nil? || element.to_html.include?('class="mw-headline"')
         end
         elements
       end

-      def filter_headline
+      def filter_headline(xs, headline_name)
         # transform name to a wiki_id (downcase and space replace with underscore)
-        headline_name = headline_name.downcase.gsub(
+        headline_name = headline_name.downcase.gsub(' ', '_')
         # reject not matching id's
-        xs.
-
+        xs.select do |t|
+          t.attributes['id'].value.downcase.start_with?(headline_name)
         end
       end
-
     end
-
   end
-end
+end
|
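Two ideas carry most of `parse_blocks` in the diff above: `filter_headline` keeps only headline ids that start with the normalised request, and `collect_elements` walks forward from a headline until the next one, so each headline owns the elements that follow it. The sketch below shows both on plain Ruby data so it runs without Nokogiri. `filter_headline_ids` and `group_by_headline` are hypothetical names, and the flat hash list is an assumed stand-in for sibling nodes. Unlike the gem, the grouping here stores one flat list per headline rather than a list of element arrays.

```ruby
# Mirrors the prefix match shown in filter_headline, but on plain id strings
# instead of Nokogiri nodes (an assumption made to keep the sketch standalone).
def filter_headline_ids(ids, headline_name)
  wanted = headline_name.downcase.gsub(' ', '_')
  ids.select { |id| id.downcase.start_with?(wanted) }
end

# Hypothetical flat list standing in for sibling nodes: a :headline entry
# starts a new section, every other entry belongs to the section before it.
# Content before the first headline is filed under the page name.
def group_by_headline(page_name, elements)
  result = { page_name => [] }
  current = page_name
  elements.each do |element|
    if element[:headline]
      current = element[:headline]
      result[current] ||= []
    else
      result[current] << element[:text]
    end
  end
  result
end

ids = %w[History Philosophy_and_design References]
p filter_headline_ids(ids, 'philosophy and')

elements = [
  { text: 'Intro paragraph' },
  { headline: 'History' },
  { text: 'Some history text' },
  { headline: 'References' },
  { text: 'ref 1' },
  { text: 'ref 2' }
]
p group_by_headline('Ruby_on_Rails', elements)
```

Filing the leading content under the page name corresponds to the `first_part` handling in the diff, where the root headline takes the page's own name.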