wiki-api 0.0.2 → 0.1.2
- checksums.yaml +5 -13
- data/.rubocop.yml +24 -0
- data/.travis.yml +12 -0
- data/Gemfile +2 -0
- data/README.md +93 -64
- data/Rakefile +13 -1
- data/bin/console +8 -0
- data/lib/wiki/api/connect.rb +52 -28
- data/lib/wiki/api/page.rb +48 -82
- data/lib/wiki/api/page_block.rb +19 -18
- data/lib/wiki/api/page_headline.rb +104 -8
- data/lib/wiki/api/page_link.rb +18 -14
- data/lib/wiki/api/page_list_item.rb +12 -13
- data/lib/wiki/api/util.rb +24 -15
- data/lib/wiki/api/version.rb +3 -1
- data/lib/wiki/api.rb +9 -8
- data/test/test_helper.rb +4 -7
- data/test/unit/files/Wiktionary_program.html +4232 -0
- data/test/unit/wiki_connect.rb +18 -25
- data/test/unit/wiki_page_offline.rb +295 -0
- data/wiki-api.gemspec +20 -17
- metadata +57 -38
- data/test/unit/wiki_page_config.rb +0 -45
- data/test/unit/wiki_page_object.rb +0 -229
checksums.yaml
CHANGED
@@ -1,15 +1,7 @@
|
|
1
1
|
---
|
2
|
-
|
3
|
-
metadata.gz:
|
4
|
-
|
5
|
-
data.tar.gz: !binary |-
|
6
|
-
ZmZkNDFhMzc0ZTNmZDBlYTFmMTIwMmU5ZDgzYTQ2YjM0ZTk1ZmQzYg==
|
2
|
+
SHA256:
|
3
|
+
metadata.gz: cd978cd4dad89ddc8098d6abafcd6325ec6c0c4a4a5e5b8e93855bc118314b27
|
4
|
+
data.tar.gz: c5ead46deb2d10310823d4b639046058cf087a29cb6a0413a5e3addc64037b92
|
7
5
|
SHA512:
|
8
|
-
metadata.gz:
|
9
|
-
|
10
|
-
MjhjZjYxYzcxMmYzYjA0YzA3NzdlYTJhMjM0ZTllNzgyMDk0MGJiNjBiZWRl
|
11
|
-
N2Y5YzMwZWZjZmY3NWQ0YmJiMjdiOTkwOTU1ZmE4MDg5Njk4M2Y=
|
12
|
-
data.tar.gz: !binary |-
|
13
|
-
MGZlMTYzZTgzZWE3YmYzZmIyMjc0OTZhMGY0NDEwYzJmNmFiMTZkNDM3OGM2
|
14
|
-
Mjc1MDdjMzQ3MjM1NmVlODM3Mzg5ZTViMGRmOGI2NzE1NDZjODJhZTA2MjI5
|
15
|
-
NWE3YmI4MDYxY2I4NGM3MGUwNzAzNjQ3YjMwODU5NDBlMWYxZDM=
|
6
|
+
metadata.gz: fcb6e3991c12a415a79b4c109091a41dbe45bff7ee3040a1a4283ddc2625522cfca767c65cba45e0f29bb13d410f082b78337de25d0bfd2bd9e0bd1591a36c24
|
7
|
+
data.tar.gz: 3a78fa474766c4cc10c44eb3e8a90ed95c1ddac1f306afa878da2ccf7b75e4fd179fc7933499f261c408cdd2f396d3613a6d74361bdad160cb3c13727aaa135c
|
data/.rubocop.yml
ADDED
@@ -0,0 +1,24 @@
+AllCops:
+  SuggestExtensions: false
+Style/ClassVars:
+  Enabled: false
+Style/Documentation:
+  Enabled: false
+Style/MethodCallWithArgsParentheses:
+  Enabled: true
+Metrics/AbcSize:
+  Enabled: false
+Metrics/ClassLength:
+  Enabled: false
+Metrics/CyclomaticComplexity:
+  Enabled: false
+Metrics/PerceivedComplexity:
+  Enabled: false
+Metrics/MethodLength:
+  Enabled: false
+Naming/MethodParameterName:
+  Enabled: false
+Naming/PredicateName:
+  Enabled: false
+Lint/RescueException:
+  Enabled: false
data/.travis.yml
ADDED
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -1,43 +1,20 @@
|
|
1
1
|
# Wiki::Api
|
2
2
|
|
3
|
-
|
3
|
+
[![Build Status](https://travis-ci.org/dblommesteijn/wiki-api.svg?branch=master)](https://travis-ci.org/dblommesteijn/wiki-api) [![Code Climate](https://codeclimate.com/github/dblommesteijn/wiki-api.png)](https://codeclimate.com/github/dblommesteijn/wiki-api)
|
4
4
|
|
5
|
-
|
5
|
+
Wiki API is a gem (Ruby on Rails) that interfaces with the MediaWiki API (https://www.mediawiki.org/wiki/API:Main_page). This gem is more than an interface: it has abstract classes for Page and Headline parsing, so you can iterate through these headlines and access their data.
|
6
|
+
|
7
|
+
NOTE: This gem has a nokogiri (http://nokogiri.org/Nokogiri.html) backend (for HTML parsing). The major components (`Page`, `Headline`, `Block`, `ListItem`, and `Link`) are wrappers for easy data access; however, it's still possible to retrieve the raw HTML within these objects.
|
6
8
|
|
7
9
|
Requests to the MediaWiki API use the following URI structure:
|
8
10
|
|
9
11
|
http(s)://somemediawiki.org/w/api.php?action=parse&format=json&page="anypage"
|
10
12
|
|
13
|
+
### Dependencies
|
11
14
|
|
12
|
-
### Dependencies (production)
|
13
|
-
|
14
|
-
* json
|
15
15
|
* nokogiri
|
16
16
|
|
17
17
|
|
18
|
-
### Roadmap
|
19
|
-
|
20
|
-
* Version (0.0.2) (current)
|
21
|
-
|
22
|
-
Index important words per block, page, list item;
|
23
|
-
|
24
|
-
Parse objects for more elements within a Page.
|
25
|
-
|
26
|
-
|
27
|
-
### Changelog
|
28
|
-
|
29
|
-
* Version (0.0.1) -> (0.0.2)
|
30
|
-
|
31
|
-
Nested ListItems, Links (within Page)
|
32
|
-
|
33
|
-
Search on Page headline (ignore case, and underscore)
|
34
|
-
|
35
|
-
|
36
|
-
### Known Issues
|
37
|
-
|
38
|
-
None discovered thus far.
|
39
|
-
|
40
|
-
|
41
18
|
## Installation
|
42
19
|
|
43
20
|
Add this line to your application's Gemfile (bundler):
|
@@ -52,32 +29,41 @@ Or install it yourself (RubyGems):
|
|
52
29
|
|
53
30
|
$ gem install wiki-api
|
54
31
|
|
32
|
+
Or try it from this repository (local) in a console:
|
33
|
+
|
34
|
+
$ bin/console
|
35
|
+
|
55
36
|
|
56
37
|
## Setup
|
57
38
|
|
58
39
|
Define a configuration for your connection (initialize script), this example uses wiktionary.org.
|
59
|
-
NOTE: it can connect to both HTTP and HTTPS MediaWikis
|
60
|
-
|
61
|
-
```ruby
|
62
|
-
CONFIG = { uri: "http://en.wiktionary.org" }
|
63
|
-
```
|
40
|
+
NOTE: it can connect to both HTTP and HTTPS MediaWikis (however you'll get a 302 response from MediaWiki)
|
64
41
|
|
65
42
|
Setup default configuration (initialize script)
|
66
43
|
|
67
44
|
```ruby
|
68
|
-
Wiki::Api::Connect.config =
|
45
|
+
Wiki::Api::Connect.config = { uri: 'https://en.wiktionary.org' }
|
69
46
|
```
|
70
47
|
|
71
48
|
|
49
|
+
## Running tests
|
50
|
+
|
51
|
+
```bash
|
52
|
+
$ rake test
|
53
|
+
```
|
54
|
+
|
72
55
|
## Usage
|
73
56
|
|
74
|
-
### Query a Page
|
57
|
+
### Query a Page and Headline
|
75
58
|
|
76
59
|
Requesting headlines from a given page.
|
77
60
|
|
78
61
|
```ruby
|
79
|
-
page = Wiki::Api::Page.new
|
80
|
-
|
62
|
+
page = Wiki::Api::Page.new(name: 'Wiktionary:Welcome,_newcomers')
|
63
|
+
# the root headline equals the pagename
|
64
|
+
puts page.root_headline.name
|
65
|
+
# iterate next level of headlines
|
66
|
+
page.root_headline.headlines.each do |headline_name, headline|
|
81
67
|
# printing headline name (PageHeadline)
|
82
68
|
puts headline.name
|
83
69
|
end
|
@@ -86,30 +72,30 @@ end
|
|
86
72
|
Getting headlines for a given name.
|
87
73
|
|
88
74
|
```ruby
|
89
|
-
page = Wiki::Api::Page.new
|
90
|
-
|
91
|
-
|
92
|
-
|
93
|
-
|
75
|
+
page = Wiki::Api::Page.new(name: 'Wiktionary:Welcome,_newcomers')
|
76
|
+
# lookup headline by name (underscore and case are ignored)
|
77
|
+
headline = page.root_headline.headline('editing wiktionary').first
|
78
|
+
# printing headline name (PageHeadline)
|
79
|
+
puts headline.name
|
80
|
+
# get the type of nested headline (html h1,2,3,4 etc.)
|
81
|
+
puts headline.type
|
94
82
|
```
|
95
83
|
|
96
84
|
### Basic Page structure
|
97
85
|
|
98
86
|
```ruby
|
99
|
-
page = Wiki::Api::Page.new
|
100
|
-
|
87
|
+
page = Wiki::Api::Page.new(name: 'Wiktionary:Welcome,_newcomers')
|
101
88
|
# iterate PageHeadline objects
|
102
|
-
page.headlines.each do |headline|
|
103
|
-
|
89
|
+
page.root_headline.headlines.each do |headline_name, headline|
|
104
90
|
# exposing nokogiri internal elements
|
105
91
|
elements = headline.elements.flatten
|
106
92
|
elements.each do |element|
|
107
|
-
#
|
93
|
+
# print will result in: Nokogiri::XML::Text or Nokogiri::XML::Element
|
94
|
+
puts element.class
|
108
95
|
end
|
109
96
|
|
110
97
|
# string representation of all nested text
|
111
98
|
block.to_texts
|
112
|
-
|
113
99
|
# iterate PageListItem objects
|
114
100
|
block.list_items.each do |list_item|
|
115
101
|
# string representation of nested text
|
@@ -131,62 +117,105 @@ page.headlines.each do |headline|
|
|
131
117
|
# string representation of nested text
|
132
118
|
link.to_text
|
133
119
|
end
|
134
|
-
|
135
120
|
end
|
136
121
|
```
|
137
122
|
|
138
123
|
|
139
|
-
### Example using Global config (https://en.wikipedia.org/wiki/
|
124
|
+
### Example using Global config (https://en.wikipedia.org/wiki/Ruby_on_Rails)
|
140
125
|
|
141
126
|
This is an example of querying wikipedia.org on the page: "Ruby_on_rails", and printing the References headline links for each list item.
|
142
127
|
|
143
128
|
```ruby
|
144
129
|
# setting a target config
|
145
|
-
|
146
|
-
Wiki::Api::Connect.config = CONFIG
|
130
|
+
Wiki::Api::Connect.config = { uri: 'https://en.wikipedia.org' }
|
147
131
|
|
148
132
|
# querying the page
|
149
|
-
page = Wiki::Api::Page.new
|
133
|
+
page = Wiki::Api::Page.new(name: 'Ruby_on_Rails')
|
150
134
|
|
151
135
|
# get headlines with name Reference (there can be multiple headlines with the same name!)
|
152
|
-
headlines = page.headline
|
136
|
+
headlines = page.root_headline.headline('References')
|
153
137
|
|
154
138
|
# iterate headlines
|
155
139
|
headlines.each do |headline|
|
156
140
|
# iterate list items on the given headline
|
157
141
|
headline.block.list_items.each do |list_item|
|
158
|
-
|
159
142
|
# print the uri of all links
|
160
|
-
puts list_item.links.map
|
161
|
-
|
143
|
+
puts list_item.links.map(&:uri)
|
162
144
|
end
|
163
145
|
end
|
164
146
|
```
|
165
147
|
|
166
148
|
|
167
|
-
|
168
|
-
### Example passing URI (https://en.wikipedia.org/wiki/Ruby_on_rails)
|
149
|
+
### Example passing URI (https://en.wikipedia.org/wiki/Ruby_on_Rails)
|
169
150
|
|
170
151
|
This is the same example as the one above, except that the URI is passed to the page directly instead of being set in a global config.
|
171
152
|
|
172
153
|
```ruby
|
173
154
|
# querying the page
|
174
|
-
page = Wiki::Api::Page.new
|
155
|
+
page = Wiki::Api::Page.new(name: 'Ruby_on_Rails', uri: 'https://en.wikipedia.org')
|
175
156
|
|
176
157
|
# get headlines with name Reference (there can be multiple headlines with the same name!)
|
177
|
-
headlines = page.headline
|
158
|
+
headlines = page.root_headline.headline('References')
|
178
159
|
|
179
160
|
# iterate headlines
|
180
161
|
headlines.each do |headline|
|
181
162
|
# iterate list items on the given headline
|
182
163
|
headline.block.list_items.each do |list_item|
|
183
|
-
|
184
164
|
# print the uri of all links
|
185
|
-
puts list_item.links.map
|
186
|
-
|
165
|
+
puts list_item.links.map(&:uri)
|
187
166
|
end
|
188
167
|
end
|
189
168
|
```
|
190
169
|
|
191
170
|
|
171
|
+
### Example searching headlines
|
172
|
+
|
173
|
+
This example shows how the headlines can be searched. For more info check: https://github.com/dblommesteijn/wiki-api/blob/master/lib/wiki/api/page.rb#L97
|
174
|
+
|
175
|
+
|
176
|
+
```ruby
|
177
|
+
# querying the page
|
178
|
+
page = Wiki::Api::Page.new(name: 'Ruby_on_Rails', uri: 'https://en.wikipedia.org')
|
179
|
+
|
180
|
+
# NOTE: the following are all valid headline names:
|
181
|
+
# request headline (by literal name)
|
182
|
+
headlines = page.root_headline.headline('Philosophy_and_design')
|
183
|
+
puts headlines.map(&:name)
|
184
|
+
# request headline (by downcase name)
|
185
|
+
headlines = page.root_headline.headline('philosophy_and_design')
|
186
|
+
puts headlines.map(&:name)
|
187
|
+
# request headline (by human name)
|
188
|
+
headlines = page.root_headline.headline('philosophy and design')
|
189
|
+
puts headlines.map(&:name)
|
190
|
+
|
191
|
+
# NOTE2: headlines are matched on headline.start_with?(requested_headline)
|
192
|
+
# because of start_with? compare this should work as well!
|
193
|
+
headlines = page.root_headline.headline('philosophy')
|
194
|
+
puts headlines.map(&:name)
|
195
|
+
```
|
196
|
+
|
197
|
+
|
198
|
+
### Example searching headlines in depth
|
199
|
+
|
200
|
+
Recursive search on all nested headlines, including in depth searches.
|
201
|
+
|
202
|
+
```ruby
|
203
|
+
# querying the page
|
204
|
+
page = Wiki::Api::Page.new(name: 'Ruby_on_Rails', uri: 'https://en.wikipedia.org')
|
205
|
+
# get root
|
206
|
+
root_headline = page.root_headline
|
207
|
+
# lookup 'framework structure' on current level
|
208
|
+
headline = root_headline.headline_in_depth('framework structure').first
|
209
|
+
puts headline.name
|
210
|
+
# NOTE: lookup of nested headlines does not work with the headline function (because 'Framework_structure' is nested within 'Technical_overview')
|
211
|
+
headline = root_headline.headline('framework structure').first
|
212
|
+
# depth can be limited adding the depth parameter
|
213
|
+
# NOTE: the example below will return nil, 'Framework_structure' is nested beyond depth = 0!
|
214
|
+
depth = 0
|
215
|
+
headline = root_headline.headline_in_depth('framework structure', depth).first
|
216
|
+
# increasing depth search will show the requested headline
|
217
|
+
depth = 5
|
218
|
+
headline = root_headline.headline_in_depth('framework structure', depth).first
|
219
|
+
puts headline.name
|
220
|
+
```
|
192
221
|
|
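The README examples above describe headline lookup that ignores case and underscores, matches on `start_with?`, and can be limited by a depth parameter. A minimal sketch of that idea follows, using only plain Ruby. `Node`, `normalize`, and `find_in_depth` are hypothetical names for illustration and are not the gem's API; the way `depth` is counted here (0 means direct children only) is an assumption based on the README comments, not on the gem's source.

```ruby
# Hypothetical stand-in for a headline tree; not the gem's PageHeadline class.
Node = Struct.new(:name, :children)

# Normalise a query as the README describes: ignore case, treat spaces as underscores.
def normalize(name)
  name.downcase.tr(' ', '_')
end

# Return every node below `node` whose normalised name starts with `query`.
# Assumption: depth = 0 inspects only the direct children, each extra level
# allows one more descent, and depth = nil means no limit.
def find_in_depth(node, query, depth = nil)
  wanted = normalize(query)
  node.children.flat_map do |child|
    hits = normalize(child.name).start_with?(wanted) ? [child] : []
    next hits if depth && depth <= 0

    hits + find_in_depth(child, query, depth && depth - 1)
  end
end

# Sample tree shaped like the headlines named in the README example.
framework = Node.new('Framework_structure', [])
overview  = Node.new('Technical_overview', [framework])
history   = Node.new('History', [])
root      = Node.new('Ruby_on_Rails', [overview, history])

p find_in_depth(root, 'framework structure', 0).map(&:name) # => []
p find_in_depth(root, 'framework structure', 5).map(&:name) # => ["Framework_structure"]
```

With `depth = 0` the nested headline is not found, which matches the README note that the lookup returns nothing when the headline is nested beyond the allowed depth.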
data/Rakefile
CHANGED
@@ -1 +1,13 @@
-
+# frozen_string_literal: true
+
+require 'bundler/gem_tasks'
+require 'rake/testtask'
+
+Rake::TestTask.new do |t|
+  t.libs << 'test'
+  tfs = FileList['test/unit/*.rb']
+  t.test_files = tfs
+  t.verbose = true
+end
+
+task default: %i[build install]
data/bin/console
ADDED
data/lib/wiki/api/connect.rb
CHANGED
@@ -1,71 +1,95 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
1
3
|
require 'net/http'
|
2
4
|
require 'json'
|
3
5
|
require 'nokogiri'
|
4
6
|
|
5
7
|
module Wiki
|
6
8
|
module Api
|
7
|
-
|
8
9
|
class Connect
|
10
|
+
attr_accessor :uri, :api_path, :api_options, :http, :request, :response, :html, :parsed, :file
|
9
11
|
|
10
|
-
|
11
|
-
|
12
|
-
|
13
|
-
@@config
|
14
|
-
|
15
|
-
self.
|
16
|
-
self.api_path = options[:api_path] if options.include? :api_path
|
17
|
-
self.api_options = options[:api_options] if options.include? :api_options
|
12
|
+
def initialize(options = {})
|
13
|
+
@@config ||= {}
|
14
|
+
self.uri = options[:uri] || @@config[:uri]
|
15
|
+
self.file = options[:file] || @@config[:file]
|
16
|
+
self.api_path = options[:api_path] || @@config[:api_path]
|
17
|
+
self.api_options = options[:api_options] || @@config[:api_options]
|
18
18
|
|
19
19
|
# defaults
|
20
|
-
self.api_path ||=
|
21
|
-
self.api_options ||= {action:
|
20
|
+
self.api_path ||= '/w/api.php'
|
21
|
+
self.api_options ||= { action: 'parse', format: 'json', page: '' }
|
22
22
|
|
23
23
|
# errors
|
24
|
-
raise
|
24
|
+
raise('no uri given') if uri.nil?
|
25
25
|
end
|
26
26
|
|
27
27
|
def connect
|
28
28
|
uri = URI("#{self.uri}#{self.api_path}")
|
29
|
-
uri.query = URI.encode_www_form
|
29
|
+
uri.query = URI.encode_www_form(self.api_options)
|
30
30
|
self.http = Net::HTTP.new(uri.host, uri.port)
|
31
|
-
if uri.scheme ==
|
32
|
-
|
33
|
-
#self.http.verify_mode = OpenSSL::SSL::VERIFY_NONE
|
31
|
+
if uri.scheme == 'https'
|
32
|
+
http.use_ssl = true
|
33
|
+
# self.http.verify_mode = OpenSSL::SSL::VERIFY_NONE
|
34
34
|
end
|
35
35
|
self.request = Net::HTTP::Get.new(uri.request_uri)
|
36
|
-
self.response =
|
36
|
+
self.response = http.request(request)
|
37
37
|
end
|
38
38
|
|
39
|
-
def page
|
39
|
+
def page(page_name)
|
40
40
|
self.api_options[:page] = page_name
|
41
|
-
|
41
|
+
# parse page by uri
|
42
|
+
if !uri.nil? && file.nil?
|
43
|
+
self.parsed = parse_from_uri(response)
|
44
|
+
# parse page by file
|
45
|
+
elsif !file.nil?
|
46
|
+
self.parsed = parse_from_file(file)
|
47
|
+
# invalid config, raise exception
|
48
|
+
else
|
49
|
+
raise('no :uri or :file config found!')
|
50
|
+
end
|
51
|
+
parsed
|
52
|
+
end
|
53
|
+
|
54
|
+
def parse_from_uri(response)
|
55
|
+
connect
|
56
|
+
# rubocop:disable Lint/ShadowedArgument
|
42
57
|
response = self.response
|
43
|
-
|
44
|
-
|
58
|
+
# rubocop:enable Lint/ShadowedArgument
|
59
|
+
json = JSON.parse(response.body, { symbolize_names: true })
|
60
|
+
raise(json[:error][:code]) unless valid?(json, response)
|
61
|
+
|
45
62
|
self.html = json[:parse][:text]
|
46
|
-
self.parsed = Nokogiri::HTML
|
63
|
+
self.parsed = Nokogiri::HTML(html[:*])
|
64
|
+
end
|
65
|
+
|
66
|
+
def parse_from_file(file)
|
67
|
+
f = File.open(file)
|
68
|
+
ret = Nokogiri::HTML(f)
|
69
|
+
f.close
|
70
|
+
ret
|
47
71
|
end
|
48
72
|
|
49
73
|
class << self
|
50
74
|
def config=(config = {})
|
51
75
|
@@config = config
|
52
76
|
end
|
77
|
+
|
53
78
|
def config
|
54
79
|
@@config ||= {}
|
55
80
|
end
|
56
81
|
end
|
57
82
|
|
58
83
|
protected
|
59
|
-
|
84
|
+
|
85
|
+
def valid?(json, response)
|
60
86
|
b = []
|
61
87
|
# valid http response
|
62
|
-
b << (response.is_a?
|
88
|
+
b << (response.is_a?(Net::HTTPOK))
|
63
89
|
# not an invalid api response handle
|
64
|
-
b << (!json.include?
|
90
|
+
b << (!json.include?(:error))
|
65
91
|
!b.include?(false)
|
66
92
|
end
|
67
|
-
|
68
93
|
end
|
69
|
-
|
70
94
|
end
|
71
|
-
end
|
95
|
+
end
|
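The `connect` and `valid?` changes above build the request URI from a base URI, an API path, and query options, then reject responses that carry an `:error` key. The sketch below reproduces those two steps with the standard library only and makes no network request. `build_api_uri` and `api_error_code` are hypothetical helper names, not part of the gem, and the JSON bodies are made-up samples.

```ruby
require 'uri'
require 'json'

# Hypothetical helper: assemble the request URI from the same parts the diff
# shows (base URI, '/w/api.php' and the action/format/page options).
def build_api_uri(base_uri, page_name, api_path: '/w/api.php')
  options = { action: 'parse', format: 'json', page: page_name }
  uri = URI("#{base_uri}#{api_path}")
  uri.query = URI.encode_www_form(options)
  uri
end

# Hypothetical helper: return the error code from a JSON body, or nil when
# the body has no :error key.
def api_error_code(body)
  json = JSON.parse(body, symbolize_names: true)
  json.dig(:error, :code)
end

uri = build_api_uri('https://en.wiktionary.org', 'Wiktionary:Welcome,_newcomers')
puts uri
# The scheme decides whether use_ssl would be switched on.
puts uri.scheme == 'https'

# Sample bodies (invented for illustration).
p api_error_code('{"error":{"code":"missingtitle"}}')
p api_error_code('{"parse":{"text":{"*":"<p>x</p>"}}}')
```

`URI.encode_www_form` percent-encodes the colon and comma in the page name, so the page title can be passed as written.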
data/lib/wiki/api/page.rb
CHANGED
@@ -1,136 +1,102 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
1
3
|
module Wiki
|
2
4
|
module Api
|
3
|
-
|
5
|
+
# MediaWiki Page, collection of all html information plus its page title
|
4
6
|
class Page
|
7
|
+
attr_accessor :name, :parsed_page, :uri, :parent
|
5
8
|
|
6
|
-
|
7
|
-
|
8
|
-
|
9
|
-
|
10
|
-
uri = options[:uri] if options.include? :uri
|
11
|
-
|
12
|
-
@@config ||= nil
|
13
|
-
if @@config.nil? || !uri.nil?
|
14
|
-
# use the connection to collect HTML pages for parsing
|
15
|
-
@connect = Wiki::Api::Connect.new uri: uri
|
16
|
-
else
|
17
|
-
# using a local HTML file for parsing
|
18
|
-
end
|
9
|
+
def initialize(options = {})
|
10
|
+
self.name = options[:name] if options.include?(:name)
|
11
|
+
self.uri = options[:uri] if options.include?(:uri)
|
12
|
+
@connect = Wiki::Api::Connect.new(uri:)
|
19
13
|
end
|
20
14
|
|
21
|
-
|
22
|
-
headlines = []
|
23
|
-
self.parse_blocks.each do |headline_name, elements|
|
24
|
-
headline = PageHeadline.new name: headline_name
|
25
|
-
elements.each do |element|
|
26
|
-
# nokogiri element
|
27
|
-
headline.block << element
|
28
|
-
end
|
29
|
-
headlines << headline
|
30
|
-
end
|
31
|
-
headlines
|
32
|
-
end
|
15
|
+
attr_reader :connect
|
33
16
|
|
34
|
-
|
35
|
-
|
36
|
-
|
37
|
-
headline = PageHeadline.new name: headline_name
|
38
|
-
elements.each do |element|
|
39
|
-
# nokogiri element
|
40
|
-
headline.block << element
|
41
|
-
end
|
42
|
-
headlines << headline
|
43
|
-
end
|
44
|
-
headlines
|
17
|
+
# collect all headlines, keep original page formatting
|
18
|
+
def root_headline
|
19
|
+
parse_blocks
|
45
20
|
end
|
46
21
|
|
47
|
-
|
22
|
+
# # collect headlines by given name, this will flatten the nested headlines
|
23
|
+
# def flat_headlines_by_name headline_name
|
24
|
+
# raise "not yet implemented!"
|
25
|
+
# # TODO: implement flattening of headlines within the root headline
|
26
|
+
# # ALT: breadth search option in the root of the first headline
|
27
|
+
# self.parse_blocks(headline_name)
|
28
|
+
# end
|
48
29
|
|
49
30
|
def to_html
|
50
|
-
|
51
|
-
|
31
|
+
load_page!
|
32
|
+
parsed_page.to_xhtml(indent: 3, indent_text: ' ')
|
52
33
|
end
|
53
34
|
|
54
35
|
def reset!
|
55
36
|
self.parsed_page = nil
|
56
37
|
end
|
57
38
|
|
58
|
-
class << self
|
59
|
-
def config=(config = {})
|
60
|
-
@@config = config
|
61
|
-
end
|
62
|
-
end
|
63
|
-
|
64
|
-
protected
|
65
|
-
|
66
39
|
def load_page!
|
67
|
-
|
68
|
-
self.parsed_page ||= @connect.page self.name
|
69
|
-
elsif self.parsed_page.nil?
|
70
|
-
f = File.open(@@config[:file])
|
71
|
-
self.parsed_page = Nokogiri::HTML(f)
|
72
|
-
f.close
|
73
|
-
end
|
40
|
+
self.parsed_page ||= @connect.page(name)
|
74
41
|
end
|
75
42
|
|
76
|
-
|
77
43
|
# parse blocks
|
78
|
-
def parse_blocks
|
79
|
-
|
44
|
+
def parse_blocks(headline_name = nil)
|
45
|
+
load_page!
|
80
46
|
result = {}
|
81
47
|
|
82
48
|
# get headline nodes by span class
|
83
|
-
|
49
|
+
headlines = self.parsed_page.xpath("//span[@class='mw-headline']")
|
50
|
+
|
84
51
|
# filter single headline by name (ignore case)
|
85
|
-
|
52
|
+
headlines = filter_headline(headlines, headline_name) unless headline_name.nil?
|
86
53
|
|
87
54
|
# NOTE: first_part has no id attribute and thus cannot be filtered or processed within xpath (xs)
|
88
|
-
if headline_name
|
89
|
-
x =
|
90
|
-
result[
|
91
|
-
result[
|
55
|
+
if headline_name.nil? || headline_name.start_with?(name.downcase)
|
56
|
+
x = first_part
|
57
|
+
result[name] ||= []
|
58
|
+
result[name] << (collect_elements(x.parent))
|
92
59
|
end
|
93
60
|
|
94
61
|
# append all blocks
|
95
|
-
|
96
|
-
|
97
|
-
elements =
|
98
|
-
result[
|
99
|
-
result[
|
62
|
+
headlines.each do |headline|
|
63
|
+
headline_value = headline.attributes['id'].value
|
64
|
+
elements = collect_elements(headline.parent.next)
|
65
|
+
result[headline_value] ||= []
|
66
|
+
result[headline_value] << elements
|
100
67
|
end
|
101
68
|
|
102
|
-
|
69
|
+
# create root object
|
70
|
+
PageHeadline.new(parent: self, name: result.first[0], headlines: result, level: 0)
|
103
71
|
end
|
104
72
|
|
105
73
|
# harvest first part of the page (missing heading and class="mw-headline")
|
106
74
|
def first_part
|
107
|
-
self.parsed_page ||= @connect.page
|
108
|
-
self.parsed_page.search(
|
75
|
+
self.parsed_page ||= @connect.page(name)
|
76
|
+
self.parsed_page.search('p').first.children.first
|
109
77
|
end
|
110
78
|
|
111
79
|
# collect elements within headlines (not nested properties, but next elements)
|
112
|
-
def collect_elements
|
80
|
+
def collect_elements(element)
|
113
81
|
# capture first element name
|
114
82
|
elements = []
|
115
83
|
# iterate text until next headline
|
116
|
-
|
84
|
+
loop do
|
117
85
|
elements << element
|
118
86
|
element = element.next
|
119
|
-
break if element.nil? || element.to_html.include?(
|
87
|
+
break if element.nil? || element.to_html.include?('class="mw-headline"')
|
120
88
|
end
|
121
89
|
elements
|
122
90
|
end
|
123
91
|
|
124
|
-
def filter_headline
|
92
|
+
def filter_headline(xs, headline_name)
|
125
93
|
# transform name to a wiki_id (downcase and space replace with underscore)
|
126
|
-
headline_name = headline_name.downcase.gsub(
|
94
|
+
headline_name = headline_name.downcase.gsub(' ', '_')
|
127
95
|
# reject not matching id's
|
128
|
-
xs.
|
129
|
-
|
96
|
+
xs.select do |t|
|
97
|
+
t.attributes['id'].value.downcase.start_with?(headline_name)
|
130
98
|
end
|
131
99
|
end
|
132
|
-
|
133
100
|
end
|
134
|
-
|
135
101
|
end
|
136
|
-
end
|
102
|
+
end
|
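The `parse_blocks` and `collect_elements` methods above group page content under the headline that precedes it, keeping a separate group for each occurrence of a repeated headline name. The sketch below models that grouping on a flat list of `[kind, value]` pairs instead of Nokogiri sibling nodes. `group_by_headline` and the sample data are hypothetical and only illustrate the idea; the gem itself walks the parsed HTML.

```ruby
# Simplified model: `nodes` is a flat list of [kind, value] pairs in document
# order, where kind is :headline or :text. Content before the first headline
# is stored under the page name, as the first_part handling above does.
def group_by_headline(page_name, nodes)
  result = { page_name => [[]] }
  current = result[page_name].last
  nodes.each do |kind, value|
    if kind == :headline
      # A repeated headline name gets a new group under the same key.
      result[value] ||= []
      result[value] << []
      current = result[value].last
    else
      current << value
    end
  end
  result
end

# Invented sample data.
nodes = [
  [:text, 'Intro paragraph'],
  [:headline, 'History'],
  [:text, 'Some history text'],
  [:headline, 'References'],
  [:text, 'First list'],
  [:headline, 'References'],
  [:text, 'Second list']
]

blocks = group_by_headline('Ruby_on_Rails', nodes)
p blocks.keys
p blocks['References']
```

Storing an array of groups per name mirrors the `result[headline_value] << elements` step, which is why a lookup by name can return more than one headline.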