wiki-api 0.1.0 → 0.1.2
- checksums.yaml +5 -13
- data/.rubocop.yml +24 -0
- data/.travis.yml +12 -0
- data/Gemfile +2 -0
- data/README.md +60 -62
- data/Rakefile +13 -1
- data/bin/console +8 -0
- data/lib/wiki/api/connect.rb +48 -38
- data/lib/wiki/api/page.rb +35 -42
- data/lib/wiki/api/page_block.rb +16 -17
- data/lib/wiki/api/page_headline.rb +51 -50
- data/lib/wiki/api/page_link.rb +13 -14
- data/lib/wiki/api/page_list_item.rb +10 -13
- data/lib/wiki/api/util.rb +18 -20
- data/lib/wiki/api/version.rb +3 -1
- data/lib/wiki/api.rb +9 -8
- data/test/test_helper.rb +4 -7
- data/test/unit/wiki_connect.rb +18 -25
- data/test/unit/wiki_page_offline.rb +144 -111
- data/wiki-api.gemspec +20 -17
- metadata +53 -34
checksums.yaml
CHANGED
@@ -1,15 +1,7 @@
 ---
-
-  metadata.gz:
-
-  data.tar.gz: !binary |-
-    YWE4Mzc4ZjRlYTBjNGE4MTkyYmE0OGFkOTJkMDViZTI0MjQ5MGFiMw==
+SHA256:
+  metadata.gz: cd978cd4dad89ddc8098d6abafcd6325ec6c0c4a4a5e5b8e93855bc118314b27
+  data.tar.gz: c5ead46deb2d10310823d4b639046058cf087a29cb6a0413a5e3addc64037b92
 SHA512:
-  metadata.gz:
-
-    MmU1ZDk0ODZhN2U4ODYwNjY0ZjdmY2U5ZTFkMDk4ZDA2MzIyODUzNjE0YzVl
-    OGE2ZmFmOTYyOWY2MWIyNGNlNmU5NjYwOTNkMGNhNjllOWM0YzQ=
-  data.tar.gz: !binary |-
-    YjgzZGEzYzhhOWFmNzZhMjRlMWFiYmJiY2Q3N2EwOGQwZTBjY2Q0NzYxNWE2
-    ODc5NmMyNmYyODMyNmVmMjFmYzhhOTAzMTUzZTBmODU2OTMwY2RhYjg0Mjkz
-    Yjk3NjMzNGFlZGViYzQyOGQ5YzVjM2MzMjIyNWVlOWRhOTU0MDk=
+  metadata.gz: fcb6e3991c12a415a79b4c109091a41dbe45bff7ee3040a1a4283ddc2625522cfca767c65cba45e0f29bb13d410f082b78337de25d0bfd2bd9e0bd1591a36c24
+  data.tar.gz: 3a78fa474766c4cc10c44eb3e8a90ed95c1ddac1f306afa878da2ccf7b75e4fd179fc7933499f261c408cdd2f396d3613a6d74361bdad160cb3c13727aaa135c
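The checksums hunk above swaps base64-wrapped digest strings for plain hexadecimal SHA256 and SHA512 entries. The sketch below decodes the one complete base64 value that survives on the removed side and shows how hex digests of the two new lengths can be produced with Ruby's standard `Digest` module. The `payload` string is a hypothetical stand-in, not the gem's actual `data.tar.gz`, so the digests it yields are unrelated to the values in the hunk.

```ruby
require 'digest'

# Complete base64 value copied from the removed (0.1.0) side of the hunk.
old_entry = 'YWE4Mzc4ZjRlYTBjNGE4MTkyYmE0OGFkOTJkMDViZTI0MjQ5MGFiMw=='

# String#unpack1('m') decodes base64 without needing an extra library.
decoded = old_entry.unpack1('m')
puts decoded        # a hex string
puts decoded.length # 40 hex characters, the length of a SHA-1 hex digest

# Hypothetical stand-in payload; a real check would read the archive bytes.
payload = 'example archive bytes'
sha256 = Digest::SHA256.hexdigest(payload)
sha512 = Digest::SHA512.hexdigest(payload)

puts sha256.length  # 64 hex characters, like the new SHA256 entries
puts sha512.length  # 128 hex characters, like the new SHA512 entries
```

To compare against a downloaded archive, the same calls could be given `File.binread(path)` instead of the stand-in string.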
data/.rubocop.yml
ADDED
@@ -0,0 +1,24 @@
+AllCops:
+  SuggestExtensions: false
+Style/ClassVars:
+  Enabled: false
+Style/Documentation:
+  Enabled: false
+Style/MethodCallWithArgsParentheses:
+  Enabled: true
+Metrics/AbcSize:
+  Enabled: false
+Metrics/ClassLength:
+  Enabled: false
+Metrics/CyclomaticComplexity:
+  Enabled: false
+Metrics/PerceivedComplexity:
+  Enabled: false
+Metrics/MethodLength:
+  Enabled: false
+Naming/MethodParameterName:
+  Enabled: false
+Naming/PredicateName:
+  Enabled: false
+Lint/RescueException:
+  Enabled: false
data/.travis.yml
ADDED
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -1,47 +1,20 @@
 # Wiki::Api

-
+[](https://travis-ci.org/dblommesteijn/wiki-api) [](https://codeclimate.com/github/dblommesteijn/wiki-api)

-
+Wiki API is a gem (Ruby on Rails) that interfaces with the MediaWiki API (https://www.mediawiki.org/wiki/API:Main_page). This gem is more than a interface, it has abstract classes for Page and Headline parsing. You're able to iterate through these headlines, and access data accordingly.
+
+NOTE: This gem has a nokogiri (http://nokogiri.org/Nokogiri.html) backend (for HTML parsing). Major components: `Page`, `Headline`, `Block`, `ListItem`, and `Link` are wrappers for easy data access, however it's still possible to retreive the raw HTML within these objects.

 Requests to the MediaWiki API use the following URI structure:

 http(s)://somemediawiki.org/w/api.php?action=parse&format=json&page="anypage"

-
-
-http://rdoc.info/github/dblommesteijn/wiki-api/frames/file/README.md
-
+### Dependencies

-### Dependencies (production)
-
-* json
 * nokogiri


-### Feature Roadmap
-
-* Version (0.1.0)
-
-Major current release with several core changes.
-
-* Version (0.1.1)
-
-No features determined yet (please drop me a line if you're interested in additions).
-
-
-### Changelog
-
-* Version (0.0.2) -> (current)
-
-PageLink URI without global config Exception resolved
-
-Reverse (parent) object lookup
-
-Nested PageHeadline objects
-
-
-
 ## Installation

 Add this line to your application's Gemfile (bundler):
@@ -56,23 +29,29 @@ Or install it yourself (RubyGems):

 $ gem install wiki-api

+Or try it from this repository (local) in a console:
+
+$ bin/console
+

 ## Setup

 Define a configuration for your connection (initialize script), this example uses wiktionary.org.
-NOTE: it can connect to both HTTP and HTTPS MediaWikis
-
-```ruby
-CONFIG = { uri: "http://en.wiktionary.org" }
-```
+NOTE: it can connect to both HTTP and HTTPS MediaWikis (however you'll get a 302 response from MediaWiki)

 Setup default configuration (initialize script)

 ```ruby
-Wiki::Api::Connect.config =
+Wiki::Api::Connect.config = { uri: 'https://en.wiktionary.org' }
 ```


+## Running tests
+
+```bash
+$ rake test
+```
+
 ## Usage

 ### Query a Page and Headline
@@ -80,7 +59,7 @@ Wiki::Api::Connect.config = CONFIG
 Requesting headlines from a given page.

 ```ruby
-page = Wiki::Api::Page.new
+page = Wiki::Api::Page.new(name: 'Wiktionary:Welcome,_newcomers')
 # the root headline equals the pagename
 puts page.root_headline.name
 # iterate next level of headlines
@@ -93,9 +72,9 @@ end
 Getting headlines for a given name.

 ```ruby
-page = Wiki::Api::Page.new
+page = Wiki::Api::Page.new(name: 'Wiktionary:Welcome,_newcomers')
 # lookup headline by name (underscore and case are ignored)
-headline = page.root_headline.headline(
+headline = page.root_headline.headline('editing wiktionary').first
 # printing headline name (PageHeadline)
 puts headline.name
 # get the type of nested headline (html h1,2,3,4 etc.)
@@ -105,7 +84,7 @@ puts headline.type
 ### Basic Page structure

 ```ruby
-page = Wiki::Api::Page.new
+page = Wiki::Api::Page.new(name: 'Wiktionary:Welcome,_newcomers')
 # iterate PageHeadline objects
 page.root_headline.headlines.each do |headline_name, headline|
 # exposing nokogiri internal elements
@@ -114,6 +93,7 @@ page.root_headline.headlines.each do |headline_name, headline|
 # print will result in: Nokogiri::XML::Text or Nokogiri::XML::Element
 puts element.class
 end
+
 # string representation of all nested text
 block.to_texts
 # iterate PageListItem objects
@@ -137,7 +117,6 @@ page.root_headline.headlines.each do |headline_name, headline|
 # string representation of nested text
 link.to_text
 end
-
 end
 ```

@@ -148,21 +127,20 @@ This is a example of querying wikipedia.org on the page: "Ruby_on_rails", and pr

 ```ruby
 # setting a target config
-
-Wiki::Api::Connect.config = CONFIG
+Wiki::Api::Connect.config = { uri: 'https://en.wikipedia.org' }

 # querying the page
-page = Wiki::Api::Page.new
+page = Wiki::Api::Page.new(name: 'Ruby_on_Rails')

 # get headlines with name Reference (there can be multiple headlines with the same name!)
-headlines = page.root_headline.headline
+headlines = page.root_headline.headline('References')

 # iterate headlines
 headlines.each do |headline|
   # iterate list items on the given headline
   headline.block.list_items.each do |list_item|
     # print the uri of all links
-    puts list_item.links.map
+    puts list_item.links.map(&:uri)
   end
 end
 ```
@@ -174,19 +152,17 @@ This is the same example as the one above, except for setting a global config to

 ```ruby
 # querying the page
-page = Wiki::Api::Page.new
+page = Wiki::Api::Page.new(name: 'Ruby_on_Rails', uri: 'https://en.wikipedia.org')

 # get headlines with name Reference (there can be multiple headlines with the same name!)
-headlines = page.root_headline.headline
+headlines = page.root_headline.headline('References')

 # iterate headlines
 headlines.each do |headline|
   # iterate list items on the given headline
   headline.block.list_items.each do |list_item|
-
     # print the uri of all links
-    puts list_item.links.map
-
+    puts list_item.links.map(&:uri)
   end
 end
 ```
@@ -199,25 +175,47 @@ This example shows how the headlines can be searched. For more info check: https

 ```ruby
 # querying the page
-page = Wiki::Api::Page.new
+page = Wiki::Api::Page.new(name: 'Ruby_on_Rails', uri: 'https://en.wikipedia.org')

 # NOTE: the following are all valid headline names:
 # request headline (by literal name)
-headlines = page.root_headline.headline
-puts headlines.map
+headlines = page.root_headline.headline('Philosophy_and_design')
+puts headlines.map(&:name)
 # request headline (by downcase name)
-headlines = page.root_headline.headline
-puts headlines.map
+headlines = page.root_headline.headline('philosophy_and_design')
+puts headlines.map(&:name)
 # request headline (by human name)
-headlines = page.root_headline.headline
-puts headlines.map
+headlines = page.root_headline.headline('philosophy and design')
+puts headlines.map(&:name)

 # NOTE2: headlines are matched on headline.start_with?(requested_headline)
 # because of start_with? compare this should work as well!
-headlines = page.root_headline.headline
-puts headlines.map
+headlines = page.root_headline.headline('philosophy')
+puts headlines.map(&:name)
 ```


+### Example searching headlines in depth

+Recursive search on all nested headlines, including in depth searches.
+
+```ruby
+# querying the page
+page = Wiki::Api::Page.new(name: 'Ruby_on_Rails', uri: 'https://en.wikipedia.org')
+# get root
+root_headline = page.root_headline
+# lookup 'ramework structure' on current level
+headline = root_headline.headline_in_depth('framework structure').first
+puts headline.name
+# NOTE: lookup of nested headlines does not work with the headline function (because 'Framework_structure' is nested within 'Technical_overview')
+headline = root_headline.headline('framework structure').first
+# depth can be limited adding the depth parameter
+# NOTE: the example below will return nil, 'Framework_structure' is nested beyond depth = 0!
+depth = 0
+headline = root_headline.headline_in_depth('framework structure', depth).first
+# increasing depth search will show the requested headline
+depth = 5
+headline = root_headline.headline_in_depth('framework structure', depth).first
+puts headline.name
+```

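The README comments in the hunks above state that headline lookup ignores case and underscores and matches with `start_with?`. The sketch below illustrates that rule on a plain array of headline ids. `matching_headlines` and the sample ids are hypothetical and not part of the gem. The normalization (downcase, spaces replaced by underscores) follows what the `filter_headline` method in the lib/wiki/api/page.rb diff does.

```ruby
# Hypothetical helper (not part of wiki-api): select the headline ids that
# match a requested name, ignoring case and treating spaces as underscores.
def matching_headlines(ids, requested)
  wanted = requested.downcase.gsub(' ', '_')
  ids.select { |id| id.downcase.start_with?(wanted) }
end

# Sample ids, assumed for illustration only.
ids = %w[History Technical_overview Philosophy_and_design References]

queries = ['Philosophy_and_design', 'philosophy_and_design',
           'philosophy and design', 'philosophy']
results = queries.map { |query| matching_headlines(ids, query) }

# Every query resolves to the same single headline id.
results.each { |result| p result }
```

Because the comparison is a prefix match, a short query such as `'philosophy'` would also return any other headline whose id begins with that word.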
data/Rakefile
CHANGED
@@ -1 +1,13 @@
-
+# frozen_string_literal: true
+
+require 'bundler/gem_tasks'
+require 'rake/testtask'
+
+Rake::TestTask.new do |t|
+  t.libs << 'test'
+  tfs = FileList['test/unit/*.rb']
+  t.test_files = tfs
+  t.verbose = true
+end
+
+task default: %i[build install]
data/bin/console
ADDED
data/lib/wiki/api/connect.rb
CHANGED
@@ -1,85 +1,95 @@
+# frozen_string_literal: true
+
 require 'net/http'
 require 'json'
 require 'nokogiri'

 module Wiki
   module Api
-
     class Connect
-
       attr_accessor :uri, :api_path, :api_options, :http, :request, :response, :html, :parsed, :file

-      def initialize(options={})
-        @@config ||=
-
-        self.
-        self.
-        self.
-        self.api_options = options[:api_options] if options.include? :api_options
+      def initialize(options = {})
+        @@config ||= {}
+        self.uri = options[:uri] || @@config[:uri]
+        self.file = options[:file] || @@config[:file]
+        self.api_path = options[:api_path] || @@config[:api_path]
+        self.api_options = options[:api_options] || @@config[:api_options]

         # defaults
-        self.api_path ||=
-        self.api_options ||= {action:
+        self.api_path ||= '/w/api.php'
+        self.api_options ||= { action: 'parse', format: 'json', page: '' }

         # errors
-        raise
+        raise('no uri given') if uri.nil?
       end

       def connect
         uri = URI("#{self.uri}#{self.api_path}")
-        uri.query = URI.encode_www_form
+        uri.query = URI.encode_www_form(self.api_options)
         self.http = Net::HTTP.new(uri.host, uri.port)
-        if uri.scheme ==
-
-          #self.http.verify_mode = OpenSSL::SSL::VERIFY_NONE
+        if uri.scheme == 'https'
+          http.use_ssl = true
+          # self.http.verify_mode = OpenSSL::SSL::VERIFY_NONE
         end
         self.request = Net::HTTP::Get.new(uri.request_uri)
-        self.response =
+        self.response = http.request(request)
       end

-      def page
+      def page(page_name)
         self.api_options[:page] = page_name
         # parse page by uri
-        if !
-          self.
-          response = self.response
-          json = JSON.parse response.body, {symbolize_names: true}
-          raise json[:error][:code] unless valid? json, response
-          self.html = json[:parse][:text]
-          self.parsed = Nokogiri::HTML self.html[:*]
+        if !uri.nil? && file.nil?
+          self.parsed = parse_from_uri(response)
         # parse page by file
-        elsif !
-
-          # self.parsed = Nokogiri::HTML self.html[:*]
-          self.parsed = Nokogiri::HTML(f)
-          f.close
+        elsif !file.nil?
+          self.parsed = parse_from_file(file)
         # invalid config, raise exception
         else
-          raise
+          raise('no :uri or :file config found!')
         end
-
+        parsed
+      end
+
+      def parse_from_uri(response)
+        connect
+        # rubocop:disable Lint/ShadowedArgument
+        response = self.response
+        # rubocop:enable Lint/ShadowedArgument
+        json = JSON.parse(response.body, { symbolize_names: true })
+        raise(json[:error][:code]) unless valid?(json, response)
+
+        self.html = json[:parse][:text]
+        self.parsed = Nokogiri::HTML(html[:*])
+      end
+
+      def parse_from_file(file)
+        f = File.open(file)
+        ret = Nokogiri::HTML(f)
+        f.close
+        ret
       end

       class << self
         def config=(config = {})
           @@config = config
         end
+
         def config
           @@config ||= []
         end
       end

       protected
-
+
+      def valid?(json, response)
         b = []
         # valid http response
-        b << (response.is_a?
+        b << (response.is_a?(Net::HTTPOK))
         # not an invalid api response handle
-        b << (!json.include?
+        b << (!json.include?(:error))
         !b.include?(false)
       end
-
     end
-
   end
-end
+end
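The `initialize` and `connect` methods in the diff above resolve each setting from the per-call options first, then from the class-level config, then from a built-in default, before building the request URI. The sketch below reproduces that lookup order using only the standard `uri` library and performs no network request. `build_request_uri` and `DEFAULT_API_OPTIONS` are hypothetical names introduced for illustration and are not part of the gem.

```ruby
require 'uri'

# Defaults as they appear on the added side of the diff.
DEFAULT_API_OPTIONS = { action: 'parse', format: 'json', page: '' }.freeze

# Hypothetical helper (not part of wiki-api): resolve settings in the order
# options -> config -> default, then build the MediaWiki API request URI.
def build_request_uri(page_name, options = {}, config = {})
  base = options[:uri] || config[:uri]
  raise ArgumentError, 'no uri given' if base.nil?

  api_path = options[:api_path] || config[:api_path] || '/w/api.php'
  api_options = options[:api_options] || config[:api_options] || DEFAULT_API_OPTIONS

  uri = URI("#{base}#{api_path}")
  # merge returns a new Hash, so the frozen defaults are left untouched
  uri.query = URI.encode_www_form(api_options.merge(page: page_name))
  uri
end

uri = build_request_uri('Ruby_on_Rails', {}, { uri: 'https://en.wikipedia.org' })
puts uri             # https://en.wikipedia.org/w/api.php?action=parse&format=json&page=Ruby_on_Rails
puts uri.request_uri # /w/api.php?action=parse&format=json&page=Ruby_on_Rails
```

The sketch takes the fallback config as a Hash so that a missing key such as `config[:uri]` evaluates to `nil` and falls through to the next source.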
data/lib/wiki/api/page.rb
CHANGED
@@ -1,25 +1,22 @@
+# frozen_string_literal: true
+
 module Wiki
   module Api
-
     # MediaWiki Page, collection of all html information plus it's page title
     class Page
-
       attr_accessor :name, :parsed_page, :uri, :parent

-      def initialize(options={})
-        self.name = options[:name] if options.include?
-        self.uri = options[:uri] if options.include?
-        @connect = Wiki::Api::Connect.new
-      end
-
-      def connect
-        @connect
+      def initialize(options = {})
+        self.name = options[:name] if options.include?(:name)
+        self.uri = options[:uri] if options.include?(:uri)
+        @connect = Wiki::Api::Connect.new(uri:)
       end

+      attr_reader :connect

       # collect all headlines, keep original page formatting
       def root_headline
-
+        parse_blocks
       end

       # # collect headlines by given name, this will flatten the nested headlines
@@ -30,10 +27,9 @@ module Wiki
       # self.parse_blocks(headline_name)
       # end

-
       def to_html
-
-
+        load_page!
+        parsed_page.to_xhtml(indent: 3, indent_text: ' ')
       end

       def reset!
@@ -41,69 +37,66 @@ module Wiki
       end

       def load_page!
-        self.parsed_page ||= @connect.page
+        self.parsed_page ||= @connect.page(name)
       end

-
       # parse blocks
-      def parse_blocks
-
+      def parse_blocks(headline_name = nil)
+        load_page!
         result = {}

         # get headline nodes by span class
-
+        headlines = self.parsed_page.xpath("//span[@class='mw-headline']")

         # filter single headline by name (ignore case)
-
+        headlines = filter_headline(headlines, headline_name) unless headline_name.nil?

         # NOTE: first_part has no id attribute and thus cannot be filtered or processed within xpath (xs)
-        if headline_name.nil? || headline_name.start_with?(
-          x =
-          result[
-          result[
+        if headline_name.nil? || headline_name.start_with?(name.downcase)
+          x = first_part
+          result[name] ||= []
+          result[name] << (collect_elements(x.parent))
         end

         # append all blocks
-
-
-          elements =
-          result[
-          result[
+        headlines.each do |headline|
+          headline_value = headline.attributes['id'].value
+          elements = collect_elements(headline.parent.next)
+          result[headline_value] ||= []
+          result[headline_value] << elements
         end

         # create root object
-        PageHeadline.new
+        PageHeadline.new(parent: self, name: result.first[0], headlines: result, level: 0)
       end

       # harvest first part of the page (missing heading and class="mw-headline")
       def first_part
-        self.parsed_page ||= @connect.page
-        self.parsed_page.search(
+        self.parsed_page ||= @connect.page(name)
+        self.parsed_page.search('p').first.children.first
       end

       # collect elements within headlines (not nested properties, but next elements)
-      def collect_elements
+      def collect_elements(element)
         # capture first element name
         elements = []
         # iterate text until next headline
-
+        loop do
           elements << element
           element = element.next
-          break if element.nil? || element.to_html.include?(
+          break if element.nil? || element.to_html.include?('class="mw-headline"')
         end
         elements
       end

-      def filter_headline
+      def filter_headline(xs, headline_name)
         # transform name to a wiki_id (downcase and space replace with underscore)
-        headline_name = headline_name.downcase.gsub(
+        headline_name = headline_name.downcase.gsub(' ', '_')
         # reject not matching id's
-        xs.
-
+        xs.select do |t|
+          t.attributes['id'].value.downcase.start_with?(headline_name)
         end
       end
-
     end
-
   end
-end
+end
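The `collect_elements` method in the diff above walks forward through sibling nodes, gathering each one until it reaches the end of the document or a node whose HTML contains `class="mw-headline"`. The sketch below shows the same loop on a minimal stand-in node type, so it runs without Nokogiri. `Node`, `next_sibling` and `collect_until_headline` are hypothetical names, with `next_sibling` playing the role of Nokogiri's `next` in the real code.

```ruby
# Hypothetical stand-in for a parsed HTML node: its markup plus a pointer
# to the following sibling (nil at the end of the document).
Node = Struct.new(:html, :next_sibling)

# Collect the starting node and every following sibling, stopping before
# the next node that carries a MediaWiki headline span.
def collect_until_headline(element)
  elements = []
  loop do
    elements << element
    element = element.next_sibling
    break if element.nil? || element.html.include?('class="mw-headline"')
  end
  elements
end

# Build a tiny sibling chain: two paragraphs followed by a headline.
heading = Node.new('<h2><span class="mw-headline" id="History">History</span></h2>', nil)
second = Node.new('<p>second paragraph</p>', heading)
first = Node.new('<p>first paragraph</p>', second)

collected = collect_until_headline(first)
puts collected.map(&:html)
```

The loop always keeps the node it starts from, so the stop condition only applies to the siblings that follow it.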