wiki-api 0.0.2 → 0.1.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +5 -13
- data/.rubocop.yml +24 -0
- data/.travis.yml +12 -0
- data/Gemfile +2 -0
- data/README.md +93 -64
- data/Rakefile +13 -1
- data/bin/console +8 -0
- data/lib/wiki/api/connect.rb +52 -28
- data/lib/wiki/api/page.rb +48 -82
- data/lib/wiki/api/page_block.rb +19 -18
- data/lib/wiki/api/page_headline.rb +104 -8
- data/lib/wiki/api/page_link.rb +18 -14
- data/lib/wiki/api/page_list_item.rb +12 -13
- data/lib/wiki/api/util.rb +24 -15
- data/lib/wiki/api/version.rb +3 -1
- data/lib/wiki/api.rb +9 -8
- data/test/test_helper.rb +4 -7
- data/test/unit/files/Wiktionary_program.html +4232 -0
- data/test/unit/wiki_connect.rb +18 -25
- data/test/unit/wiki_page_offline.rb +295 -0
- data/wiki-api.gemspec +20 -17
- metadata +57 -38
- data/test/unit/wiki_page_config.rb +0 -45
- data/test/unit/wiki_page_object.rb +0 -229
checksums.yaml
CHANGED
@@ -1,15 +1,7 @@
 ---
-
-  metadata.gz:
-
-  data.tar.gz: !binary |-
-    ZmZkNDFhMzc0ZTNmZDBlYTFmMTIwMmU5ZDgzYTQ2YjM0ZTk1ZmQzYg==
+SHA256:
+  metadata.gz: cd978cd4dad89ddc8098d6abafcd6325ec6c0c4a4a5e5b8e93855bc118314b27
+  data.tar.gz: c5ead46deb2d10310823d4b639046058cf087a29cb6a0413a5e3addc64037b92
 SHA512:
-  metadata.gz:
-
-    MjhjZjYxYzcxMmYzYjA0YzA3NzdlYTJhMjM0ZTllNzgyMDk0MGJiNjBiZWRl
-    N2Y5YzMwZWZjZmY3NWQ0YmJiMjdiOTkwOTU1ZmE4MDg5Njk4M2Y=
-  data.tar.gz: !binary |-
-    MGZlMTYzZTgzZWE3YmYzZmIyMjc0OTZhMGY0NDEwYzJmNmFiMTZkNDM3OGM2
-    Mjc1MDdjMzQ3MjM1NmVlODM3Mzg5ZTViMGRmOGI2NzE1NDZjODJhZTA2MjI5
-    NWE3YmI4MDYxY2I4NGM3MGUwNzAzNjQ3YjMwODU5NDBlMWYxZDM=
+  metadata.gz: fcb6e3991c12a415a79b4c109091a41dbe45bff7ee3040a1a4283ddc2625522cfca767c65cba45e0f29bb13d410f082b78337de25d0bfd2bd9e0bd1591a36c24
+  data.tar.gz: 3a78fa474766c4cc10c44eb3e8a90ed95c1ddac1f306afa878da2ccf7b75e4fd179fc7933499f261c408cdd2f396d3613a6d74361bdad160cb3c13727aaa135c
|
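The removed digest on old line 6 is stored as Base64 text (`!binary |-`), while the added lines store plain hexadecimal. A minimal Ruby sketch of how the two encodings relate, assuming the old value is a Base64-wrapped hex string; the 40-character result would be consistent with a SHA-1 digest, though the section key for that entry is not visible in this diff:

```ruby
# Base64 value copied from the removed data.tar.gz entry (old line 6).
encoded = 'ZmZkNDFhMzc0ZTNmZDBlYTFmMTIwMmU5ZDgzYTQ2YjM0ZTk1ZmQzYg=='

# String#unpack1('m') decodes Base64 without requiring any library.
decoded = encoded.unpack1('m')

puts decoded
puts decoded.length
puts decoded.match?(/\A\h+\z/) # true when the decoded text is pure hex
```

The new-format values need no such step: they are already hex strings of 64 (SHA256) and 128 (SHA512) characters.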
data/.rubocop.yml
ADDED
@@ -0,0 +1,24 @@
+AllCops:
+  SuggestExtensions: false
+Style/ClassVars:
+  Enabled: false
+Style/Documentation:
+  Enabled: false
+Style/MethodCallWithArgsParentheses:
+  Enabled: true
+Metrics/AbcSize:
+  Enabled: false
+Metrics/ClassLength:
+  Enabled: false
+Metrics/CyclomaticComplexity:
+  Enabled: false
+Metrics/PerceivedComplexity:
+  Enabled: false
+Metrics/MethodLength:
+  Enabled: false
+Naming/MethodParameterName:
+  Enabled: false
+Naming/PredicateName:
+  Enabled: false
+Lint/RescueException:
+  Enabled: false
data/.travis.yml
ADDED
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -1,43 +1,20 @@
 # Wiki::Api
 
-
+[](https://travis-ci.org/dblommesteijn/wiki-api) [](https://codeclimate.com/github/dblommesteijn/wiki-api)
 
-
+Wiki API is a gem (Ruby on Rails) that interfaces with the MediaWiki API (https://www.mediawiki.org/wiki/API:Main_page). This gem is more than a interface, it has abstract classes for Page and Headline parsing. You're able to iterate through these headlines, and access data accordingly.
+
+NOTE: This gem has a nokogiri (http://nokogiri.org/Nokogiri.html) backend (for HTML parsing). Major components: `Page`, `Headline`, `Block`, `ListItem`, and `Link` are wrappers for easy data access, however it's still possible to retreive the raw HTML within these objects.
 
 Requests to the MediaWiki API use the following URI structure:
 
 http(s)://somemediawiki.org/w/api.php?action=parse&format=json&page="anypage"
 
+### Dependencies
 
-### Dependencies (production)
-
-* json
 * nokogiri
 
 
-### Roadmap
-
-* Version (0.0.2) (current)
-
-Index important words per block, page, list item;
-
-Parse objects for more elements within a Page.
-
-
-### Changelog
-
-* Version (0.0.1) -> (0.0.2)
-
-Nested ListItems, Links (within Page)
-
-Search on Page headline (ignore case, and underscore)
-
-
-### Known Issues
-
-None discovered thus far.
-
-
 ## Installation
 
 Add this line to your application's Gemfile (bundler):
@@ -52,32 +29,41 @@ Or install it yourself (RubyGems):
 
 $ gem install wiki-api
 
+Or try it from this repository (local) in a console:
+
+$ bin/console
+
 
 ## Setup
 
 Define a configuration for your connection (initialize script), this example uses wiktionary.org.
-NOTE: it can connect to both HTTP and HTTPS MediaWikis
-
-```ruby
-CONFIG = { uri: "http://en.wiktionary.org" }
-```
+NOTE: it can connect to both HTTP and HTTPS MediaWikis (however you'll get a 302 response from MediaWiki)
 
 Setup default configuration (initialize script)
 
 ```ruby
-Wiki::Api::Connect.config =
+Wiki::Api::Connect.config = { uri: 'https://en.wiktionary.org' }
 ```
 
 
+## Running tests
+
+```bash
+$ rake test
+```
+
 ## Usage
 
-### Query a Page
+### Query a Page and Headline
 
 Requesting headlines from a given page.
 
 ```ruby
-page = Wiki::Api::Page.new
-
+page = Wiki::Api::Page.new(name: 'Wiktionary:Welcome,_newcomers')
+# the root headline equals the pagename
+puts page.root_headline.name
+# iterate next level of headlines
+page.root_headline.headlines.each do |headline_name, headline|
 # printing headline name (PageHeadline)
 puts headline.name
 end
@@ -86,30 +72,30 @@ end
 Getting headlines for a given name.
 
 ```ruby
-page = Wiki::Api::Page.new
-
-
-
-
+page = Wiki::Api::Page.new(name: 'Wiktionary:Welcome,_newcomers')
+# lookup headline by name (underscore and case are ignored)
+headline = page.root_headline.headline('editing wiktionary').first
+# printing headline name (PageHeadline)
+puts headline.name
+# get the type of nested headline (html h1,2,3,4 etc.)
+puts headline.type
 ```
 
 ### Basic Page structure
 
 ```ruby
-page = Wiki::Api::Page.new
-
+page = Wiki::Api::Page.new(name: 'Wiktionary:Welcome,_newcomers')
 # iterate PageHeadline objects
-page.headlines.each do |headline|
-
+page.root_headline.headlines.each do |headline_name, headline|
 # exposing nokogiri internal elements
 elements = headline.elements.flatten
 elements.each do |element|
-#
+# print will result in: Nokogiri::XML::Text or Nokogiri::XML::Element
+puts element.class
 end
 
 # string representation of all nested text
 block.to_texts
-
 # iterate PageListItem objects
 block.list_items.each do |list_item|
 # string representation of nested text
@@ -131,62 +117,105 @@ page.headlines.each do |headline|
 # string representation of nested text
 link.to_text
 end
-
 end
 ```
 
 
-### Example using Global config (https://en.wikipedia.org/wiki/
+### Example using Global config (https://en.wikipedia.org/wiki/Ruby_on_Rails)
 
 This is a example of querying wikipedia.org on the page: "Ruby_on_rails", and printing the References headline links for each list item.
 
 ```ruby
 # setting a target config
-
-Wiki::Api::Connect.config = CONFIG
+Wiki::Api::Connect.config = { uri: 'https://en.wikipedia.org' }
 
 # querying the page
-page = Wiki::Api::Page.new
+page = Wiki::Api::Page.new(name: 'Ruby_on_Rails')
 
 # get headlines with name Reference (there can be multiple headlines with the same name!)
-headlines = page.headline
+headlines = page.root_headline.headline('References')
 
 # iterate headlines
 headlines.each do |headline|
 # iterate list items on the given headline
 headline.block.list_items.each do |list_item|
-
 # print the uri of all links
-puts list_item.links.map
-
+puts list_item.links.map(&:uri)
 end
 end
 ```
 
 
-
-### Example passing URI (https://en.wikipedia.org/wiki/Ruby_on_rails)
+### Example passing URI (https://en.wikipedia.org/wiki/Ruby_on_Rails)
 
 This is the same example as the one above, except for setting a global config to direct the requests to a given URI.
 
 ```ruby
 # querying the page
-page = Wiki::Api::Page.new
+page = Wiki::Api::Page.new(name: 'Ruby_on_Rails', uri: 'https://en.wikipedia.org')
 
 # get headlines with name Reference (there can be multiple headlines with the same name!)
-headlines = page.headline
+headlines = page.root_headline.headline('References')
 
 # iterate headlines
 headlines.each do |headline|
 # iterate list items on the given headline
 headline.block.list_items.each do |list_item|
-
 # print the uri of all links
-puts list_item.links.map
-
+puts list_item.links.map(&:uri)
 end
 end
 ```
 
 
+### Example searching headlines
+
+This example shows how the headlines can be searched. For more info check: https://github.com/dblommesteijn/wiki-api/blob/master/lib/wiki/api/page.rb#L97
+
+
+```ruby
+# querying the page
+page = Wiki::Api::Page.new(name: 'Ruby_on_Rails', uri: 'https://en.wikipedia.org')
+
+# NOTE: the following are all valid headline names:
+# request headline (by literal name)
+headlines = page.root_headline.headline('Philosophy_and_design')
+puts headlines.map(&:name)
+# request headline (by downcase name)
+headlines = page.root_headline.headline('philosophy_and_design')
+puts headlines.map(&:name)
+# request headline (by human name)
+headlines = page.root_headline.headline('philosophy and design')
+puts headlines.map(&:name)
+
+# NOTE2: headlines are matched on headline.start_with?(requested_headline)
+# because of start_with? compare this should work as well!
+headlines = page.root_headline.headline('philosophy')
+puts headlines.map(&:name)
+```
+
+
+### Example searching headlines in depth
+
+Recursive search on all nested headlines, including in depth searches.
+
+```ruby
+# querying the page
+page = Wiki::Api::Page.new(name: 'Ruby_on_Rails', uri: 'https://en.wikipedia.org')
+# get root
+root_headline = page.root_headline
+# lookup 'ramework structure' on current level
+headline = root_headline.headline_in_depth('framework structure').first
+puts headline.name
+# NOTE: lookup of nested headlines does not work with the headline function (because 'Framework_structure' is nested within 'Technical_overview')
+headline = root_headline.headline('framework structure').first
+# depth can be limited adding the depth parameter
+# NOTE: the example below will return nil, 'Framework_structure' is nested beyond depth = 0!
+depth = 0
+headline = root_headline.headline_in_depth('framework structure', depth).first
+# increasing depth search will show the requested headline
+depth = 5
+headline = root_headline.headline_in_depth('framework structure', depth).first
+puts headline.name
+```
 
|
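The README changes above describe headline lookup as ignoring case and underscores and comparing with `start_with?`. That rule can be sketched in plain Ruby as follows; `headline_matches?` and the `History` id are hypothetical illustrations rather than part of the gem, and the sketch only mirrors the behaviour the README describes:

```ruby
# Hypothetical helper: true when a requested name matches a headline id.
def headline_matches?(headline_id, requested)
  wanted = requested.downcase.gsub(' ', '_')
  headline_id.downcase.start_with?(wanted)
end

# Sample ids; 'History' is made up, the other two appear in the README.
ids = %w[History Philosophy_and_design Framework_structure]

['Philosophy_and_design', 'philosophy and design', 'philosophy'].each do |name|
  matches = ids.select { |id| headline_matches?(id, name) }
  puts "#{name.inspect} -> #{matches.inspect}"
end
```

Because the comparison is a prefix match, a short request such as `'philosophy'` can return more than one headline on a real page, which is why the README examples call `.first` or iterate the result.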
data/Rakefile
CHANGED
@@ -1 +1,13 @@
-
+# frozen_string_literal: true
+
+require 'bundler/gem_tasks'
+require 'rake/testtask'
+
+Rake::TestTask.new do |t|
+  t.libs << 'test'
+  tfs = FileList['test/unit/*.rb']
+  t.test_files = tfs
+  t.verbose = true
+end
+
+task default: %i[build install]
data/bin/console
ADDED
data/lib/wiki/api/connect.rb
CHANGED
@@ -1,71 +1,95 @@
+# frozen_string_literal: true
+
 require 'net/http'
 require 'json'
 require 'nokogiri'
 
 module Wiki
   module Api
-
     class Connect
+      attr_accessor :uri, :api_path, :api_options, :http, :request, :response, :html, :parsed, :file
 
-
-
-
-        @@config
-
-        self.
-        self.api_path = options[:api_path] if options.include? :api_path
-        self.api_options = options[:api_options] if options.include? :api_options
+      def initialize(options = {})
+        @@config ||= {}
+        self.uri = options[:uri] || @@config[:uri]
+        self.file = options[:file] || @@config[:file]
+        self.api_path = options[:api_path] || @@config[:api_path]
+        self.api_options = options[:api_options] || @@config[:api_options]
 
         # defaults
-        self.api_path ||=
-        self.api_options ||= {action:
+        self.api_path ||= '/w/api.php'
+        self.api_options ||= { action: 'parse', format: 'json', page: '' }
 
         # errors
-        raise
+        raise('no uri given') if uri.nil?
       end
 
       def connect
         uri = URI("#{self.uri}#{self.api_path}")
-        uri.query = URI.encode_www_form
+        uri.query = URI.encode_www_form(self.api_options)
         self.http = Net::HTTP.new(uri.host, uri.port)
-        if uri.scheme ==
-
-          #self.http.verify_mode = OpenSSL::SSL::VERIFY_NONE
+        if uri.scheme == 'https'
+          http.use_ssl = true
+          # self.http.verify_mode = OpenSSL::SSL::VERIFY_NONE
         end
         self.request = Net::HTTP::Get.new(uri.request_uri)
-        self.response =
+        self.response = http.request(request)
       end
 
-      def page
+      def page(page_name)
         self.api_options[:page] = page_name
-
+        # parse page by uri
+        if !uri.nil? && file.nil?
+          self.parsed = parse_from_uri(response)
+        # parse page by file
+        elsif !file.nil?
+          self.parsed = parse_from_file(file)
+        # invalid config, raise exception
+        else
+          raise('no :uri or :file config found!')
+        end
+        parsed
+      end
+
+      def parse_from_uri(response)
+        connect
+        # rubocop:disable Lint/ShadowedArgument
         response = self.response
-
-
+        # rubocop:enable Lint/ShadowedArgument
+        json = JSON.parse(response.body, { symbolize_names: true })
+        raise(json[:error][:code]) unless valid?(json, response)
+
         self.html = json[:parse][:text]
-        self.parsed = Nokogiri::HTML
+        self.parsed = Nokogiri::HTML(html[:*])
+      end
+
+      def parse_from_file(file)
+        f = File.open(file)
+        ret = Nokogiri::HTML(f)
+        f.close
+        ret
       end
 
       class << self
         def config=(config = {})
           @@config = config
         end
+
         def config
           @@config ||= []
         end
       end
 
       protected
-
+
+      def valid?(json, response)
         b = []
         # valid http response
-        b << (response.is_a?
+        b << (response.is_a?(Net::HTTPOK))
         # not an invalid api response handle
-        b << (!json.include?
+        b << (!json.include?(:error))
         !b.include?(false)
       end
-
     end
-
   end
-end
+end
|
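`Connect#connect` in the new version builds the request URI from the base `uri`, `api_path`, and `api_options`. A standalone sketch of that step using only Ruby's standard `uri` library; the page name is borrowed from the README, the local variable names are illustrative, and no request is sent:

```ruby
require 'uri'

base_uri    = 'https://en.wiktionary.org'
api_path    = '/w/api.php'                                   # default in the new version
api_options = { action: 'parse', format: 'json', page: '' }  # default in the new version

# Connect#page sets the page name before the request is made.
api_options[:page] = 'Wiktionary:Welcome,_newcomers'

uri = URI("#{base_uri}#{api_path}")
uri.query = URI.encode_www_form(api_options)

puts uri.scheme       # 'https' is what switches on use_ssl in Connect#connect
puts uri.request_uri  # path plus query, as passed to Net::HTTP::Get.new
```

`URI.encode_www_form` percent-encodes the colon and comma in the page name, so the name can be passed in its human-readable wiki form.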
data/lib/wiki/api/page.rb
CHANGED
@@ -1,136 +1,102 @@
+# frozen_string_literal: true
+
 module Wiki
   module Api
-
+    # MediaWiki Page, collection of all html information plus it's page title
     class Page
+      attr_accessor :name, :parsed_page, :uri, :parent
 
-
-
-
-
-        uri = options[:uri] if options.include? :uri
-
-        @@config ||= nil
-        if @@config.nil? || !uri.nil?
-          # use the connection to collect HTML pages for parsing
-          @connect = Wiki::Api::Connect.new uri: uri
-        else
-          # using a local HTML file for parsing
-        end
+      def initialize(options = {})
+        self.name = options[:name] if options.include?(:name)
+        self.uri = options[:uri] if options.include?(:uri)
+        @connect = Wiki::Api::Connect.new(uri:)
       end
 
-
-        headlines = []
-        self.parse_blocks.each do |headline_name, elements|
-          headline = PageHeadline.new name: headline_name
-          elements.each do |element|
-            # nokogiri element
-            headline.block << element
-          end
-          headlines << headline
-        end
-        headlines
-      end
+      attr_reader :connect
 
-
-
-
-          headline = PageHeadline.new name: headline_name
-          elements.each do |element|
-            # nokogiri element
-            headline.block << element
-          end
-          headlines << headline
-        end
-        headlines
+      # collect all headlines, keep original page formatting
+      def root_headline
+        parse_blocks
       end
 
-
+      # # collect headlines by given name, this will flatten the nested headlines
+      # def flat_headlines_by_name headline_name
+      #   raise "not yet implemented!"
+      #   # TODO: implement flattening of headlines within the root headline
+      #   # ALT: breath search option in the root of the first headline
+      #   self.parse_blocks(headline_name)
+      # end
 
       def to_html
-
-
+        load_page!
+        parsed_page.to_xhtml(indent: 3, indent_text: ' ')
       end
 
       def reset!
         self.parse_page = nil
       end
 
-      class << self
-        def config=(config = {})
-          @@config = config
-        end
-      end
-
-      protected
-
       def load_page!
-
-          self.parsed_page ||= @connect.page self.name
-        elsif self.parsed_page.nil?
-          f = File.open(@@config[:file])
-          self.parsed_page = Nokogiri::HTML(f)
-          f.close
-        end
+        self.parsed_page ||= @connect.page(name)
       end
 
-
       # parse blocks
-      def parse_blocks
-
+      def parse_blocks(headline_name = nil)
+        load_page!
         result = {}
 
         # get headline nodes by span class
-
+        headlines = self.parsed_page.xpath("//span[@class='mw-headline']")
+
         # filter single headline by name (ignore case)
-
+        headlines = filter_headline(headlines, headline_name) unless headline_name.nil?
 
         # NOTE: first_part has no id attribute and thus cannot be filtered or processed within xpath (xs)
-        if headline_name
-          x =
-          result[
-          result[
+        if headline_name.nil? || headline_name.start_with?(name.downcase)
+          x = first_part
+          result[name] ||= []
+          result[name] << (collect_elements(x.parent))
         end
 
         # append all blocks
-
-
-          elements =
-          result[
-          result[
+        headlines.each do |headline|
+          headline_value = headline.attributes['id'].value
+          elements = collect_elements(headline.parent.next)
+          result[headline_value] ||= []
+          result[headline_value] << elements
         end
 
-
+        # create root object
+        PageHeadline.new(parent: self, name: result.first[0], headlines: result, level: 0)
       end
 
       # harvest first part of the page (missing heading and class="mw-headline")
       def first_part
-        self.parsed_page ||= @connect.page
-        self.parsed_page.search(
+        self.parsed_page ||= @connect.page(name)
+        self.parsed_page.search('p').first.children.first
       end
 
       # collect elements within headlines (not nested properties, but next elements)
-      def collect_elements
+      def collect_elements(element)
         # capture first element name
         elements = []
         # iterate text until next headline
-
+        loop do
           elements << element
           element = element.next
-          break if element.nil? || element.to_html.include?(
+          break if element.nil? || element.to_html.include?('class="mw-headline"')
         end
         elements
       end
 
-      def filter_headline
+      def filter_headline(xs, headline_name)
         # transform name to a wiki_id (downcase and space replace with underscore)
-        headline_name = headline_name.downcase.gsub(
+        headline_name = headline_name.downcase.gsub(' ', '_')
         # reject not matching id's
-        xs.
-
+        xs.select do |t|
+          t.attributes['id'].value.downcase.start_with?(headline_name)
         end
       end
-
     end
-
   end
-end
+end
|
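In the new `parse_blocks`, collected elements are grouped under their headline id, so repeated ids accumulate in one array, and the first key is passed as the root headline name. A plain-Ruby sketch of that grouping, with made-up strings standing in for the Nokogiri nodes the gem actually collects:

```ruby
# Hypothetical input: [headline id, collected elements] pairs in page order.
blocks = [
  ['Ruby_on_Rails', ['intro paragraph']],
  ['History',       ['first history block']],
  ['References',    ['list A']],
  ['References',    ['list B']]
]

result = {}
blocks.each do |headline_id, elements|
  result[headline_id] ||= []
  result[headline_id] << elements
end

root_name = result.first[0] # Ruby hashes keep insertion order

puts root_name
puts result['References'].length
```

This is why the README notes that a lookup by name can return several headlines: two sections sharing an id end up as two entries under the same key.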