infoboxer 0.1.2.1 → 0.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +13 -1
- data/README.md +26 -5
- data/bin/infoboxer +45 -0
- data/infoboxer.gemspec +1 -1
- data/lib/infoboxer/definitions/en.wikipedia.org.rb +0 -1
- data/lib/infoboxer/media_wiki/page.rb +13 -5
- data/lib/infoboxer/media_wiki/traits.rb +11 -5
- data/lib/infoboxer/media_wiki.rb +115 -67
- data/lib/infoboxer/navigation/shortcuts.rb +1 -1
- data/lib/infoboxer/parser/context.rb +16 -2
- data/lib/infoboxer/parser/image.rb +1 -1
- data/lib/infoboxer/parser/inline.rb +12 -3
- data/lib/infoboxer/parser/template.rb +3 -4
- data/lib/infoboxer/parser/util.rb +14 -3
- data/lib/infoboxer/tree/image.rb +1 -1
- data/lib/infoboxer/tree/nodes.rb +2 -2
- data/lib/infoboxer/tree/paragraphs.rb +1 -0
- data/lib/infoboxer/tree/table.rb +1 -1
- data/lib/infoboxer/tree/template.rb +9 -0
- data/lib/infoboxer/version.rb +4 -1
- data/lib/infoboxer.rb +87 -35
- data/regression/pages/list_of_countries.wiki +1493 -0
- data/regression/pages/ukrainian_galician_army.wiki +76 -0
- metadata +8 -5
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: d3081274989109208504796d1357e7ab78dd8981
+  data.tar.gz: 255f2ffa01c283fd11cbe1a1b308223d276c3b22
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 47ff1c7ac1f6e34ba4e5491cd7f5a6e180f18c02c4bf6061d08c6589ca3b66cd8ac1c600cc6e03dda244c3dd37a1986356e47d86f79602416e0eba021182fe00
+  data.tar.gz: c40d2bb3f4b2d336830d56e8b8cc2b126807022f409fe41fd63f0d229f139030b4ef9c18116f802016c3e33af6f4aba1f1766c03619d3e185cdce2b949d63bf6
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,17 @@
 # Infoboxer's change log
 
+## 0.2.0 (2015-12-21)
+
+* MediaWiki backend changed to (our own handcrafted)
+  [mediawiktory](https://github.com/molybdenum-99/mediawiktory);
+* Added page lists fetching like `MediaWiki#category(categoryname)`,
+  `MediaWiki#search(search_phrase)`;
+* `MediaWiki#get` now can fetch any number of pages at once (it was only
+  50 in previous versions);
+* `bin/infoboxer` console added for quick experimenting;
+* `Template#to_h` added for quick information extraction;
+* many small bugfixes and echancements.
+
 ## 0.1.2.1 (2015-12-04)
 
 * Small bug with newlines in templates fixed.
@@ -22,6 +34,6 @@ Basically, preparing for wider release!
 
 ## 0.1.0 (2015-08-07)
 
-Initial (ok, I know it's typically called 0.
+Initial (ok, I know it's typically called 0.0.1, but here's work of
 three monthes, numerous documentations and examples and so on... so, let
 it be 0.1.0).
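The `Template#to_h` entry above can be illustrated with a self-contained sketch. This is not Infoboxer's actual implementation (which converts parsed template-variable nodes); the `Variable` struct and `template_to_h` helper below are hypothetical stand-ins showing only the shape of the conversion.

```ruby
# Hypothetical stand-in: models a template variable as a name/content pair
Variable = Struct.new(:name, :text)

# Turn a list of template variables into a plain Hash,
# stringifying names so numbered and named parameters mix cleanly
def template_to_h(variables)
  variables.map { |var| [var.name.to_s, var.text] }.to_h
end

vars = [Variable.new(:conventional_long_name, 'Argentine Republic'),
        Variable.new(:capital, 'Buenos Aires')]
p template_to_h(vars)
# => {"conventional_long_name"=>"Argentine Republic", "capital"=>"Buenos Aires"}
```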
data/README.md
CHANGED
@@ -4,6 +4,7 @@
 [](https://travis-ci.org/molybdenum-99/infoboxer)
 [](https://coveralls.io/github/molybdenum-99/infoboxer?branch=master)
 [](https://codeclimate.com/github/molybdenum-99/infoboxer)
+[](https://gitter.im/molybdenum-99)
 
 **Infoboxer** is pure-Ruby Wikipedia (and generic MediaWiki) client and
 parser, targeting information extraction (hence the name).
@@ -97,6 +98,25 @@ See [Navigation shortcuts](https://github.com/molybdenum-99/infoboxer/wiki/Navigation-shortcuts)
 
 To put it all in one piece, also take a look at [Data extraction tips and tricks](https://github.com/molybdenum-99/infoboxer/wiki/Tips-and-tricks).
 
+### infoboxer executable
+
+Just try `infoboxer` command.
+
+Without any options, it starts IRB session with infoboxer required and
+included into main namespace.
+
+With `-w` option, it provides a shortcut to MediaWiki instance you want.
+Like this:
+
+```
+$ infoboxer -w https://en.wikipedia.org/w/api.php
+> get('Argentina')
+=> #<Page(title: "Argentina", url: "https://en.wikipedia.org/wiki/Argentina"): ....
+```
+
+You can also use shortcuts like `infoboxer -w wikipedia` for common
+wikies (and, just for fun, `infoboxer -wikipedia` also).
+
 ## Advanced topics
 
 * [Reasons](https://github.com/molybdenum-99/infoboxer/wiki/Reasons) for
@@ -114,9 +134,10 @@ To put it all in one piece, also take a look at [Data extraction tips and tricks
 
 ## Compatibility
 
-As of now, Infoboxer reported to be compatible with any MRI Ruby since
-
-
+As of now, Infoboxer reported to be compatible with any MRI Ruby since 2.0.0
+(1.9.3 previously, dropped since Infoboxer 0.2.0). In Travis-CI tests,
+JRuby is failing due to bug in old Java 7/Java 8 SSL certificate support
+([see here](https://github.com/jruby/jruby/issues/2599)),
 and Rubinius failing 3 specs of 500 by mystery, which is uninvestigated yet.
 
 Therefore, those Ruby versions are excluded from Travis config, though,
@@ -129,10 +150,10 @@ they may still work for you.
 * **NB**: ↑ this is "current version" link, but RubyDoc.info unfortunately
   sometimes fails to update it to really _current_; in case you feel
   something seriously underdocumented, please-please look at
-  [0.
+  [0.2.0 docs](http://www.rubydoc.info/gems/infoboxer/0.2.0).
 * [Contributing](https://github.com/molybdenum-99/infoboxer/wiki/Contributing)
 * [Roadmap](https://github.com/molybdenum-99/infoboxer/wiki/Roadmap)
 
 ## License
 
-MIT.
+[MIT](https://github.com/molybdenum-99/infoboxer/blob/master/LICENSE.txt).
data/bin/infoboxer
ADDED
@@ -0,0 +1,45 @@
+#!/usr/bin/env ruby
+require 'rubygems'
+require 'bundler/setup'
+require 'infoboxer'
+
+include Infoboxer
+
+require 'optparse'
+
+wiki_url = nil
+
+OptionParser.new do |opts|
+  opts.banner = "Usage: bin/infoboxer [-w wiki_api_url]"
+
+  opts.on("-w", "--wiki WIKI_API_URL",
+    "Make wiki by WIKI_API_URL a default wiki, and use it with just get('Pagename')") do |w|
+    wiki_url = w
+  end
+end.parse!
+
+if wiki_url
+  if wiki_url =~ /^[a-z]+$/
+    wiki_url = case
+    when domain = Infoboxer::WIKIMEDIA_PROJECTS[wiki_url.to_sym]
+      "https://en.#{domain}/w/api.php"
+    when domain = Infoboxer::WIKIMEDIA_PROJECTS[('w' + wiki_url).to_sym]
+      "https://en.#{domain}/w/api.php"
+    else
+      fail("Unidentified wiki: #{wiki_url}")
+    end
+  end
+
+  DEFAULT_WIKI = Infoboxer.wiki(wiki_url)
+  puts "Default Wiki selected: #{wiki_url}.\nNow you can use `get('Pagename')`, `category('Categoryname')` and so on.\n\n"
+  [:raw, :get, :category, :search, :prefixsearch].each do |m|
+    define_method(m){|*arg|
+      DEFAULT_WIKI.send(m, *arg)
+    }
+  end
+end
+
+require 'irb'
+ARGV.shift until ARGV.empty?
+IRB.start
+
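The shortcut resolution in the script above can be sketched in isolation. The `WIKIMEDIA_PROJECTS` hash below is a stub standing in for `Infoboxer::WIKIMEDIA_PROJECTS`, and `resolve_wiki` is a hypothetical helper mirroring the script's `case`; the second lookup with `'w' + word` is what makes `infoboxer -wikipedia` work, since OptionParser consumes the leading `w` as the `-w` flag.

```ruby
# Stub standing in for the real Infoboxer::WIKIMEDIA_PROJECTS mapping
WIKIMEDIA_PROJECTS = {wikipedia: 'wikipedia.org', wiktionary: 'wiktionary.org'}

def resolve_wiki(word)
  if domain = WIKIMEDIA_PROJECTS[word.to_sym]
    "https://en.#{domain}/w/api.php"
  elsif domain = WIKIMEDIA_PROJECTS[('w' + word).to_sym]
    # `infoboxer -wikipedia` is parsed by OptionParser as `-w ikipedia`,
    # so prepending the consumed "w" recovers the project name
    "https://en.#{domain}/w/api.php"
  else
    fail "Unidentified wiki: #{word}"
  end
end

p resolve_wiki('wikipedia') # => "https://en.wikipedia.org/w/api.php"
p resolve_wiki('ikipedia')  # => "https://en.wikipedia.org/w/api.php"
```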
data/infoboxer.gemspec
CHANGED
@@ -29,7 +29,7 @@ Gem::Specification.new do |s|
 
   s.add_dependency 'htmlentities'
   s.add_dependency 'procme'
-  s.add_dependency '
+  s.add_dependency 'mediawiktory', '>= 0.0.2'
   s.add_dependency 'addressable'
   s.add_dependency 'terminal-table'
   s.add_dependency 'backports'
data/lib/infoboxer/media_wiki/page.rb
CHANGED
@@ -7,15 +7,19 @@ module Infoboxer
   # Alongside with document tree structure, knows document's title as
   # represented by MediaWiki and human (non-API) URL.
   class Page < Tree::Document
-    def initialize(client, children,
-      @client = client
-      super(children,
+    def initialize(client, children, source)
+      @client, @source = client, source
+      super(children, title: source.title, url: source.fullurl)
     end
 
     # Instance of {MediaWiki} which this page was received from
     # @return {MediaWiki}
     attr_reader :client
 
+    # Instance of MediaWiktory::Page class with source data
+    # @return {MediaWiktory::Page}
+    attr_reader :source
+
     # @!attribute [r] title
     #   Page title.
     #   @return [String]
@@ -24,11 +28,15 @@ module Infoboxer
     #   Page friendly URL.
     #   @return [String]
 
-    def_readers :title, :url
+    def_readers :title, :url
+
+    def traits
+      client.traits
+    end
 
     private
 
-    PARAMS_TO_INSPECT = [:url, :title
+    PARAMS_TO_INSPECT = [:url, :title] #, :domain]
 
     def show_params
       super(params.select{|k, v| PARAMS_TO_INSPECT.include?(k)})
data/lib/infoboxer/media_wiki/traits.rb
CHANGED
@@ -68,14 +68,14 @@ module Infoboxer
 
     def initialize(options = {})
       @options = options
-      @
+      @file_namespace = [DEFAULTS[:file_namespace], namespace_aliases(options, 'File')].
         flatten.compact.uniq
-      @
+      @category_namespace = [DEFAULTS[:category_namespace], namespace_aliases(options, 'Category')].
         flatten.compact.uniq
     end
 
     # @private
-    attr_reader :
+    attr_reader :file_namespace, :category_namespace
 
     # @private
     def templates
@@ -84,9 +84,15 @@ module Infoboxer
 
     private
 
+    def namespace_aliases(options, canonical)
+      namespace = (options[:namespaces] || []).detect{|v| v.canonical == canonical}
+      return nil unless namespace
+      [namespace['*'], *namespace.aliases]
+    end
+
     DEFAULTS = {
-
-
+      file_namespace: 'File',
+      category_namespace: 'Category'
     }
 
   end
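The namespace lists built in `Traits#initialize` above boil down to merging a canonical default with whatever aliases the site reports, then flattening and deduplicating. A minimal sketch, with an illustrative helper name:

```ruby
# Illustrative helper mirroring the
# [DEFAULTS[...], namespace_aliases(...)].flatten.compact.uniq chain above
def build_namespace(default, aliases)
  [default, aliases].flatten.compact.uniq
end

p build_namespace('Category', ['Catégorie', 'Category'])
# => ["Category", "Catégorie"]
p build_namespace('File', nil) # no aliases reported by the wiki
# => ["File"]
```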
data/lib/infoboxer/media_wiki.rb
CHANGED
@@ -1,6 +1,7 @@
 # encoding: utf-8
-require 'rest-client'
-require 'json'
+#require 'rest-client'
+#require 'json'
+require 'mediawiktory'
 require 'addressable/uri'
 
 require_relative 'media_wiki/traits'
@@ -36,7 +37,7 @@ module Infoboxer
       attr_accessor :user_agent
     end
 
-    attr_reader :api_base_url
+    attr_reader :api_base_url, :traits
 
     # Creating new MediaWiki client. {Infoboxer.wiki} provides shortcut
     # for it, as well as shortcuts for some well-known wikis, like
@@ -49,7 +50,8 @@ module Infoboxer
     # * `:user_agent` (also aliased as `:ua`) -- custom User-Agent header.
     def initialize(api_base_url, options = {})
       @api_base_url = Addressable::URI.parse(api_base_url)
-      @
+      @client = MediaWiktory::Client.new(api_base_url, user_agent: user_agent(options))
+      @traits = Traits.get(@api_base_url.host, namespaces: extract_namespaces)
     end
 
     # Receive "raw" data from Wikipedia (without parsing or wrapping in
@@ -57,18 +59,22 @@ module Infoboxer
     #
     # @return [Array<Hash>]
     def raw(*titles)
-
-
-
+      titles.each_slice(50).map{|part|
+        @client.query.
+          titles(*part).
+          prop(revisions: {prop: :content}, info: {prop: :url}).
+          redirects(true). # FIXME: should be done transparently by MediaWiktory?
+          perform.pages
+      }.inject(:concat) # somehow flatten(1) fails!
     end
 
-    # Receive list of parsed
+    # Receive list of parsed MediaWiki pages for list of titles provided.
     # All pages are received with single query to MediaWiki API.
     #
-    # **NB**:
-    #
-    #
-    #
+    # **NB**: if you are requesting more than 50 titles at once
+    # (MediaWiki limitation for single request), Infoboxer will do as
+    # many queries as necessary to extract them all (it will be like
+    # `(titles.count / 50.0).ceil` requests)
     #
     # @return [Tree::Nodes<Page>] array of parsed pages. Notes:
     # * if you call `get` with only one title, one page will be
@@ -87,76 +93,118 @@ module Infoboxer
     #   NotFound.
     #
     def get(*titles)
-      pages = raw(*titles).
+      pages = raw(*titles).
+        tap{|pages| pages.detect(&:invalid?).tap{|i| i && fail(i.raw.invalidreason)}}.
+        select(&:exists?).
         map{|raw|
-          traits = Traits.get(@api_base_url.host, extract_traits(raw))
-
           Page.new(self,
-            Parser.paragraphs(raw
-            raw
+            Parser.paragraphs(raw.content, traits),
+            raw)
        }
       titles.count == 1 ? pages.first : Tree::Nodes[*pages]
     end
 
-
+    # Receive list of parsed MediaWiki pages from specified category.
+    #
+    # **NB**: currently, this API **always** fetches all pages from
+    # category, there is no option to "take first 20 pages". Pages are
+    # fetched in 50-page batches, then parsed. So, for large category
+    # it can really take a while to fetch all pages.
+    #
+    # @param title Category title. You can use namespaceless title (like
+    #   `"Countries in South America"`), title with namespace (like
+    #   `"Category:Countries in South America"`) or title with local
+    #   namespace (like `"Catégorie:Argentine"` for French Wikipedia)
+    #
+    # @return [Tree::Nodes<Page>] array of parsed pages.
+    #
+    def category(title)
+      title = normalize_category_title(title)
+
+      list(categorymembers: {title: title, limit: 50})
+    end
 
-    #
-
-
-
-
-
-
-
-    #
-
-
-
-
-
-
-
-
-
-
-
+    # Receive list of parsed MediaWiki pages for provided search query.
+    # See [MediaWiki API docs](https://www.mediawiki.org/w/api.php?action=help&modules=query%2Bsearch)
+    # for details.
+    #
+    # **NB**: currently, this API **always** fetches all pages from
+    # category, there is no option to "take first 20 pages". Pages are
+    # fetched in 50-page batches, then parsed. So, for large category
+    # it can really take a while to fetch all pages.
+    #
+    # @param query Search query. For old installations, look at
+    #   https://www.mediawiki.org/wiki/Help:Searching
+    #   for search syntax. For new ones (including Wikipedia), see at
+    #   https://www.mediawiki.org/wiki/Help:CirrusSearch.
+    #
+    # @return [Tree::Nodes<Page>] array of parsed pages.
+    #
+    def search(query)
+      list(search: {search: query, limit: 50})
+    end
+
+    # Receive list of parsed MediaWiki pages with titles startin from prefix.
+    # See [MediaWiki API docs](https://www.mediawiki.org/w/api.php?action=help&modules=query%2Bprefixsearch)
+    # for details.
+    #
+    # **NB**: currently, this API **always** fetches all pages from
+    # category, there is no option to "take first 20 pages". Pages are
+    # fetched in 50-page batches, then parsed. So, for large category
+    # it can really take a while to fetch all pages.
+    #
+    # @param prefix page title prefix.
+    #
+    # @return [Tree::Nodes<Page>] array of parsed pages.
+    #
+    def prefixsearch(prefix)
+      list(prefixsearch: {search: prefix, limit: 100})
     end
 
-    def
-
+    def inspect
+      "#<#{self.class}(#{@api_base_url.host})>"
     end
 
-
-
-
-
-
-
-
+    private
+
+    def list(query)
+      response = @client.query.
+        generator(query).
+        prop(revisions: {prop: :content}, info: {prop: :url}).
+        redirects(true). # FIXME: should be done transparently by MediaWiktory?
+        perform
+
+      response.continue! while response.continue?
+
+      pages = response.pages.select(&:exists?).
+        map{|raw|
+          Page.new(self,
+            Parser.paragraphs(raw.content, traits),
+            raw)
+        }
+
+      Tree::Nodes[*pages]
     end
 
-    def
-
-
+    def normalize_category_title(title)
+      # FIXME: shouldn't it go to MediaWiktory?..
+      namespace, titl = title.include?(':') ? title.split(':', 2) : [nil, title]
+      namespace, titl = nil, title unless traits.category_namespace.include?(namespace)
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-    end
+      namespace ||= traits.category_namespace.first
+      [namespace, titl].join(':')
+    end
+
+    def user_agent(options)
+      options[:user_agent] || options[:ua] || self.class.user_agent || UA
+    end
+
+    def extract_namespaces
+      siteinfo = @client.query.meta(siteinfo: {prop: [:namespaces, :namespacealiases]}).perform
+      siteinfo.raw.query.namespaces.map{|_, namespace|
+        aliases = siteinfo.raw.query.namespacealiases.select{|a| a.id == namespace.id}.map{|a| a['*']}
+        namespace.merge(aliases: aliases)
      }
-    rescue JSON::ParserError
-      fail RuntimeError, "Not a JSON response, seems there's not a MediaWiki API: #{@api_base_url}"
     end
   end
 end
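The 50-title batching in `#raw` above can be sketched without any network access. `fetch_batch` below is a stub standing in for the real MediaWiktory query; the point is the `each_slice(50)`/`inject(:concat)` shape and the resulting request count.

```ruby
# fetch_batch is a stub; the real code issues one MediaWiktory query
# per 50-title slice and concatenates the per-batch page arrays
def fetch_batch(titles)
  titles.map { |t| {title: t} } # one result hash per title
end

def raw_pages(titles)
  titles.each_slice(50).map { |part| fetch_batch(part) }.inject(:concat)
end

titles = (1..120).map { |i| "Page #{i}" }
p raw_pages(titles).count # => 120

requests = (titles.count / 50.0).ceil
p requests # => 3 underlying API requests needed
```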
data/lib/infoboxer/navigation/shortcuts.rb
CHANGED
@@ -118,7 +118,7 @@ module Infoboxer
     #
     # @return {Tree::Nodes}
     def categories
-      lookup(Tree::Wikilink, namespace: /^#{ensure_traits.
+      lookup(Tree::Wikilink, namespace: /^#{ensure_traits.category_namespace.join('|')}$/)
     end
 
     # As users accustomed to have only one infobox on a page
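The regexp interpolation in `#categories` above joins all category-namespace names known from the wiki's traits into one anchored alternation. A standalone sketch with sample names (the list here is illustrative, not taken from a real Traits instance):

```ruby
# Sample names; a real Traits instance supplies these per wiki
category_namespace = ['Category', 'Catégorie']

namespace_re = /^#{category_namespace.join('|')}$/

p namespace_re.match?('Category')  # => true
p namespace_re.match?('Catégorie') # => true
p namespace_re.match?('File')      # => false
```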
data/lib/infoboxer/parser/context.rb
CHANGED
@@ -1,4 +1,6 @@
 # encoding: utf-8
+require 'strscan'
+
 module Infoboxer
   class Parser
     class Context
@@ -86,11 +88,23 @@ module Infoboxer
         res
       end
 
+      def push_eol_sign(re)
+        @inline_eol_sign = re
+      end
+
+      def pop_eol_sign
+        @inline_eol_sign = nil
+      end
+
+      attr_reader :inline_eol_sign
+
       def inline_eol?(exclude = nil)
         # not using StringScanner#check, as it will change #matched value
         eol? ||
-          (
-          (
+          (
+            (current =~ %r[^(</ref>|}})] || @inline_eol_sign && current =~ @inline_eol_sign) &&
+            (!exclude || $1 !~ exclude)
+          ) # FIXME: ugly, but no idea of prettier solution
       end
 
       def scan_continued_until(re, leave_pattern = false)
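A loosely condensed sketch of the `inline_eol?` logic above: end-of-inline is a real end of input, one of the default lookahead markers (`</ref>` or `}}`), or a temporarily pushed marker such as `/^\]/`. This works on plain Strings for illustration, an assumption made to keep it self-contained; the original operates on a StringScanner's unscanned rest.

```ruby
# Condensed, plain-String version of Context#inline_eol? — not the
# StringScanner-based original, just the same decision shape
def inline_eol?(rest, pushed_sign = nil)
  (rest.empty? ||
    rest =~ %r[^(</ref>|}})] ||
    (pushed_sign && rest =~ pushed_sign)) ? true : false
end

p inline_eol?('}} tail')       # => true  (template closing ahead)
p inline_eol?('] tail')        # => false (no sign pushed)
p inline_eol?('] tail', /^\]/) # => true  (sign pushed while in ext. link)
```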
data/lib/infoboxer/parser/inline.rb
CHANGED
@@ -32,7 +32,12 @@ module Infoboxer
     def short_inline(until_pattern = nil)
       nodes = Nodes[]
       guarded_loop do
-
+        # FIXME: quick and UGLY IS HELL JUST TRYING TO MAKE THE SHIT WORK
+        if @context.inline_eol_sign
+          chunk = @context.scan_until(re.short_inline_until_cache_brackets[until_pattern])
+        else
+          chunk = @context.scan_until(re.short_inline_until_cache[until_pattern])
+        end
         nodes << chunk
 
         break if @context.matched_inline?(until_pattern)
@@ -82,7 +87,7 @@ module Infoboxer
       when "''"
         Italic.new(short_inline(/''/))
       when '[['
-        if @context.check(re.
+        if @context.check(re.file_namespace)
           image
         else
           wikilink
@@ -118,7 +123,11 @@ module Infoboxer
     # [http://www.example.org link name]
     def external_link(protocol)
       link = @context.scan_continued_until(/\s+|\]/)
-
+      if @context.matched =~ /\s+/
+        @context.push_eol_sign(/^\]/)
+        caption = short_inline(/\]/)
+        @context.pop_eol_sign
+      end
       ExternalLink.new(protocol + link, caption)
     end
 
data/lib/infoboxer/parser/template.rb
CHANGED
@@ -4,8 +4,8 @@ module Infoboxer
   module Template
     include Tree
 
-    # NB: here we are not distingish templates like {{Infobox|variable}}
-    # and "magic words" like {{formatnum:123}}
+    # NB: here we are not distingish templates like `{{Infobox|variable}}`
+    # and "magic words" like `{{formatnum:123}}`
     # Just calling all of them "templates". This behaviour will change
     # in future, I presume
     # More about magic words: https://www.mediawiki.org/wiki/Help:Magic_words
@@ -29,6 +29,7 @@ module Infoboxer
         @context.skip(/\s*=\s*/)
       else
         name = num
+        num += 1
       end
 
       value = long_inline(/\||}}/)
@@ -38,8 +39,6 @@ module Infoboxer
 
       break if @context.eat_matched?('}}')
       @context.eof? and @context.fail!("Unexpected break of template variables: #{res}")
-
-      num += 1
     end
     res
   end
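The `num += 1` move above fixes positional-parameter numbering: the counter must advance only for unnamed variables, otherwise a named variable in the middle would shift the numbers of the positional ones after it. A simplified token-level sketch of the fixed behavior (hypothetical helper, not the real parser):

```ruby
# Simplified model: tokens are either "name=value" or bare positional values
def number_variables(tokens)
  num = 1
  tokens.map do |tok|
    if tok.include?('=')
      name, value = tok.split('=', 2)
    else
      name, value = num, tok
      num += 1 # advance only for positional (unnamed) variables
    end
    [name, value]
  end
end

p number_variables(%w[a name=x b])
# => [[1, "a"], ["name", "x"], [2, "b"]]
```

With the counter incremented on every iteration, as before the fix, `b` would have been numbered 3 instead of 2.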
data/lib/infoboxer/parser/util.rb
CHANGED
@@ -16,20 +16,31 @@ module Infoboxer
 
     INLINE_EOL = %r[(?=   # if we have ahead... (not scanned, just checked
       </ref> |            # <ref> closed
-      }}
+      }}
+    )]x
+
+    INLINE_EOL_BR = %r[(?=  # if we have ahead... (not scanned, just checked
+      </ref> |              # <ref> closed
+      }} |                  # or template closed
+      (?<!\])\](?!\])       # or ext.link closed,
+      # the madness with look-ahead/behind means "match single bracket but not double"
     )]x
 
 
     def make_regexps
       {
-
+        file_namespace: /(#{@context.traits.file_namespace.join('|')}):/,
         formatting: FORMATTING,
         inline_until_cache: Hash.new{|h, r|
           h[r] = Regexp.union(*[r, FORMATTING, /$/].compact.uniq)
         },
         short_inline_until_cache: Hash.new{|h, r|
           h[r] = Regexp.union(*[r, INLINE_EOL, FORMATTING, /$/].compact.uniq)
+        },
+        short_inline_until_cache_brackets: Hash.new{|h, r|
+          h[r] = Regexp.union(*[r, INLINE_EOL_BR, FORMATTING, /$/].compact.uniq)
        }
+
      }
    end
 
@@ -46,7 +57,7 @@ module Infoboxer
       scan.skip(/=\s*/)
       q = scan.scan(/['"]/)
       if q
-        value = scan.scan_until(/#{q}
+        value = scan.scan_until(/#{q}|$/).sub(q, '')
       else
         value = scan.scan_until(/\s|$/)
       end
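The lookaround added in `INLINE_EOL_BR` above, `(?<!\])\](?!\])`, matches a single closing bracket (end of an external link) while refusing to match inside `]]` (end of a wikilink), because both the lookbehind and the lookahead forbid an adjacent bracket:

```ruby
# Negative lookbehind + negative lookahead: a ] with no ] on either side
single_bracket = /(?<!\])\](?!\])/

p 'caption] tail'.match?(single_bracket) # => true  (external link closing)
p 'link]] tail'.match?(single_bracket)   # => false (wikilink closing)
```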
data/lib/infoboxer/tree/image.rb
CHANGED