infoboxer 0.1.2.1 → 0.2.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +13 -1
- data/README.md +26 -5
- data/bin/infoboxer +45 -0
- data/infoboxer.gemspec +1 -1
- data/lib/infoboxer/definitions/en.wikipedia.org.rb +0 -1
- data/lib/infoboxer/media_wiki/page.rb +13 -5
- data/lib/infoboxer/media_wiki/traits.rb +11 -5
- data/lib/infoboxer/media_wiki.rb +115 -67
- data/lib/infoboxer/navigation/shortcuts.rb +1 -1
- data/lib/infoboxer/parser/context.rb +16 -2
- data/lib/infoboxer/parser/image.rb +1 -1
- data/lib/infoboxer/parser/inline.rb +12 -3
- data/lib/infoboxer/parser/template.rb +3 -4
- data/lib/infoboxer/parser/util.rb +14 -3
- data/lib/infoboxer/tree/image.rb +1 -1
- data/lib/infoboxer/tree/nodes.rb +2 -2
- data/lib/infoboxer/tree/paragraphs.rb +1 -0
- data/lib/infoboxer/tree/table.rb +1 -1
- data/lib/infoboxer/tree/template.rb +9 -0
- data/lib/infoboxer/version.rb +4 -1
- data/lib/infoboxer.rb +87 -35
- data/regression/pages/list_of_countries.wiki +1493 -0
- data/regression/pages/ukrainian_galician_army.wiki +76 -0
- metadata +8 -5
checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: …
-  data.tar.gz: …
+  metadata.gz: d3081274989109208504796d1357e7ab78dd8981
+  data.tar.gz: 255f2ffa01c283fd11cbe1a1b308223d276c3b22
 SHA512:
-  metadata.gz: …
-  data.tar.gz: …
+  metadata.gz: 47ff1c7ac1f6e34ba4e5491cd7f5a6e180f18c02c4bf6061d08c6589ca3b66cd8ac1c600cc6e03dda244c3dd37a1986356e47d86f79602416e0eba021182fe00
+  data.tar.gz: c40d2bb3f4b2d336830d56e8b8cc2b126807022f409fe41fd63f0d229f139030b4ef9c18116f802016c3e33af6f4aba1f1766c03619d3e185cdce2b949d63bf6
data/CHANGELOG.md CHANGED

@@ -1,5 +1,17 @@
 # Infoboxer's change log
 
+## 0.2.0 (2015-12-21)
+
+* MediaWiki backend changed to (our own handcrafted)
+  [mediawiktory](https://github.com/molybdenum-99/mediawiktory);
+* Added page lists fetching like `MediaWiki#category(categoryname)`,
+  `MediaWiki#search(search_phrase)`;
+* `MediaWiki#get` now can fetch any number of pages at once (it was only
+  50 in previous versions);
+* `bin/infoboxer` console added for quick experimenting;
+* `Template#to_h` added for quick information extraction;
+* many small bugfixes and echancements.
+
 ## 0.1.2.1 (2015-12-04)
 
 * Small bug with newlines in templates fixed.
@@ -22,6 +34,6 @@ Basically, preparing for wider release!
 
 ## 0.1.0 (2015-08-07)
 
-Initial (ok, I know it's typically called 0.…
+Initial (ok, I know it's typically called 0.0.1, but here's work of
 three monthes, numerous documentations and examples and so on... so, let
 it be 0.1.0).
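The changelog's claim that `MediaWiki#get` can now fetch any number of pages follows directly from slicing titles into 50-title batches (the MediaWiki per-request limit). A standalone sketch in plain Ruby, assuming nothing from Infoboxer itself; `fetch_batch` is a hypothetical stand-in for one API request:

```ruby
# Hypothetical stand-in for a single MediaWiki API request.
def fetch_batch(titles)
  titles.map { |t| { title: t } }  # pretend each title resolved to a page
end

# Split any number of titles into 50-title slices, one request per slice.
def batched_get(titles, batch_size = 50)
  titles.each_slice(batch_size).map { |part| fetch_batch(part) }.inject([], :concat)
end

# The request-count formula the docs quote: (titles.count / 50.0).ceil
def request_count(titles, batch_size = 50)
  (titles.count / batch_size.to_f).ceil
end
```

So fetching 120 titles costs three requests, not one per title.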
data/README.md CHANGED

@@ -4,6 +4,7 @@
 [![Build Status](https://travis-ci.org/molybdenum-99/infoboxer.svg?branch=master)](https://travis-ci.org/molybdenum-99/infoboxer)
 [![Coverage Status](https://coveralls.io/repos/molybdenum-99/infoboxer/badge.svg?branch=master&service=github)](https://coveralls.io/github/molybdenum-99/infoboxer?branch=master)
 [![Code Climate](https://codeclimate.com/github/molybdenum-99/infoboxer/badges/gpa.svg)](https://codeclimate.com/github/molybdenum-99/infoboxer)
+[![Molybdenum-99 Gitter](https://badges.gitter.im/molybdenum-99.png)](https://gitter.im/molybdenum-99)
 
 **Infoboxer** is pure-Ruby Wikipedia (and generic MediaWiki) client and
 parser, targeting information extraction (hence the name).
@@ -97,6 +98,25 @@ See [Navigation shortcuts](https://github.com/molybdenum-99/infoboxer/wiki/Navig
 
 To put it all in one piece, also take a look at [Data extraction tips and tricks](https://github.com/molybdenum-99/infoboxer/wiki/Tips-and-tricks).
 
+### infoboxer executable
+
+Just try `infoboxer` command.
+
+Without any options, it starts IRB session with infoboxer required and
+included into main namespace.
+
+With `-w` option, it provides a shortcut to MediaWiki instance you want.
+Like this:
+
+```
+$ infoboxer -w https://en.wikipedia.org/w/api.php
+> get('Argentina')
+=> #<Page(title: "Argentina", url: "https://en.wikipedia.org/wiki/Argentina"): ....
+```
+
+You can also use shortcuts like `infoboxer -w wikipedia` for common
+wikies (and, just for fun, `infoboxer -wikipedia` also).
+
 ## Advanced topics
 
 * [Reasons](https://github.com/molybdenum-99/infoboxer/wiki/Reasons) for
@@ -114,9 +134,10 @@ To put it all in one piece, also take a look at [Data extraction tips and tricks
 
 ## Compatibility
 
-As of now, Infoboxer reported to be compatible with any MRI Ruby since …
-…
-…
+As of now, Infoboxer reported to be compatible with any MRI Ruby since 2.0.0
+(1.9.3 previously, dropped since Infoboxer 0.2.0). In Travis-CI tests,
+JRuby is failing due to bug in old Java 7/Java 8 SSL certificate support
+([see here](https://github.com/jruby/jruby/issues/2599)),
 and Rubinius failing 3 specs of 500 by mystery, which is uninvestigated yet.
 
 Therefore, those Ruby versions are excluded from Travis config, though,
@@ -129,10 +150,10 @@ they may still work for you.
 * **NB**: ↑ this is "current version" link, but RubyDoc.info unfortunately
   sometimes fails to update it to really _current_; in case you feel
   something seriously underdocumented, please-please look at
-  [0.…
+  [0.2.0 docs](http://www.rubydoc.info/gems/infoboxer/0.2.0).
 * [Contributing](https://github.com/molybdenum-99/infoboxer/wiki/Contributing)
 * [Roadmap](https://github.com/molybdenum-99/infoboxer/wiki/Roadmap)
 
 ## License
 
-MIT.
+[MIT](https://github.com/molybdenum-99/infoboxer/blob/master/LICENSE.txt).
data/bin/infoboxer ADDED

@@ -0,0 +1,45 @@
+#!/usr/bin/env ruby
+require 'rubygems'
+require 'bundler/setup'
+require 'infoboxer'
+
+include Infoboxer
+
+require 'optparse'
+
+wiki_url = nil
+
+OptionParser.new do |opts|
+  opts.banner = "Usage: bin/infoboxer [-w wiki_api_url]"
+
+  opts.on("-w", "--wiki WIKI_API_URL",
+    "Make wiki by WIKI_API_URL a default wiki, and use it with just get('Pagename')") do |w|
+    wiki_url = w
+  end
+end.parse!
+
+if wiki_url
+  if wiki_url =~ /^[a-z]+$/
+    wiki_url = case
+      when domain = Infoboxer::WIKIMEDIA_PROJECTS[wiki_url.to_sym]
+        "https://en.#{domain}/w/api.php"
+      when domain = Infoboxer::WIKIMEDIA_PROJECTS[('w' + wiki_url).to_sym]
+        "https://en.#{domain}/w/api.php"
+      else
+        fail("Unidentified wiki: #{wiki_url}")
+      end
+  end
+
+  DEFAULT_WIKI = Infoboxer.wiki(wiki_url)
+  puts "Default Wiki selected: #{wiki_url}.\nNow you can use `get('Pagename')`, `category('Categoryname')` and so on.\n\n"
+  [:raw, :get, :category, :search, :prefixsearch].each do |m|
+    define_method(m){|*arg|
+      DEFAULT_WIKI.send(m, *arg)
+    }
+  end
+end
+
+require 'irb'
+ARGV.shift until ARGV.empty?
+IRB.start
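The shortcut resolution in `bin/infoboxer` above maps a bare word like `wikipedia` through `WIKIMEDIA_PROJECTS`, retrying with a prepended `w` so that `infoboxer -wikipedia` (where optparse consumes the leading `w` as the flag) still resolves. A standalone sketch; the hash below is a hypothetical two-entry subset of the real `Infoboxer::WIKIMEDIA_PROJECTS`:

```ruby
# Hypothetical subset of Infoboxer::WIKIMEDIA_PROJECTS.
WIKIMEDIA_PROJECTS = { wikipedia: 'wikipedia.org', wikibooks: 'wikibooks.org' }

def resolve_wiki(word)
  case
  when domain = WIKIMEDIA_PROJECTS[word.to_sym]
    "https://en.#{domain}/w/api.php"
  when domain = WIKIMEDIA_PROJECTS[('w' + word).to_sym]
    # "-wikipedia" arrives here as "ikipedia": optparse ate the leading "w"
    "https://en.#{domain}/w/api.php"
  else
    fail("Unidentified wiki: #{word}")
  end
end
```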
data/infoboxer.gemspec CHANGED

@@ -29,7 +29,7 @@ Gem::Specification.new do |s|
 
   s.add_dependency 'htmlentities'
   s.add_dependency 'procme'
-  s.add_dependency '…
+  s.add_dependency 'mediawiktory', '>= 0.0.2'
   s.add_dependency 'addressable'
   s.add_dependency 'terminal-table'
   s.add_dependency 'backports'
data/lib/infoboxer/media_wiki/page.rb CHANGED

@@ -7,15 +7,19 @@ module Infoboxer
   # Alongside with document tree structure, knows document's title as
   # represented by MediaWiki and human (non-API) URL.
   class Page < Tree::Document
-    def initialize(client, children, …)
-      @client = client
-      super(children, …)
+    def initialize(client, children, source)
+      @client, @source = client, source
+      super(children, title: source.title, url: source.fullurl)
     end
 
     # Instance of {MediaWiki} which this page was received from
     # @return {MediaWiki}
     attr_reader :client
 
+    # Instance of MediaWiktory::Page class with source data
+    # @return {MediaWiktory::Page}
+    attr_reader :source
+
     # @!attribute [r] title
     #   Page title.
     #   @return [String]
@@ -24,11 +28,15 @@ module Infoboxer
   #   Page friendly URL.
   #   @return [String]
 
-    def_readers :title, :url
+    def_readers :title, :url
+
+    def traits
+      client.traits
+    end
 
     private
 
-    PARAMS_TO_INSPECT = [:url, :title…
+    PARAMS_TO_INSPECT = [:url, :title] #, :domain]
 
     def show_params
       super(params.select{|k, v| PARAMS_TO_INSPECT.include?(k)})
data/lib/infoboxer/media_wiki/traits.rb CHANGED

@@ -68,14 +68,14 @@ module Infoboxer
 
     def initialize(options = {})
       @options = options
-      @…
+      @file_namespace = [DEFAULTS[:file_namespace], namespace_aliases(options, 'File')].
         flatten.compact.uniq
-      @…
+      @category_namespace = [DEFAULTS[:category_namespace], namespace_aliases(options, 'Category')].
         flatten.compact.uniq
     end
 
     # @private
-    attr_reader :…
+    attr_reader :file_namespace, :category_namespace
 
     # @private
     def templates
@@ -84,9 +84,15 @@ module Infoboxer
 
     private
 
+    def namespace_aliases(options, canonical)
+      namespace = (options[:namespaces] || []).detect{|v| v.canonical == canonical}
+      return nil unless namespace
+      [namespace['*'], *namespace.aliases]
+    end
+
     DEFAULTS = {
-      …
-      …
+      file_namespace: 'File',
+      category_namespace: 'Category'
     }
 
   end
data/lib/infoboxer/media_wiki.rb CHANGED

@@ -1,6 +1,7 @@
 # encoding: utf-8
-require 'rest-client'
-require 'json'
+#require 'rest-client'
+#require 'json'
+require 'mediawiktory'
 require 'addressable/uri'
 
 require_relative 'media_wiki/traits'
@@ -36,7 +37,7 @@ module Infoboxer
     attr_accessor :user_agent
   end
 
-  attr_reader :api_base_url
+  attr_reader :api_base_url, :traits
 
   # Creating new MediaWiki client. {Infoboxer.wiki} provides shortcut
   # for it, as well as shortcuts for some well-known wikis, like
@@ -49,7 +50,8 @@ module Infoboxer
   # * `:user_agent` (also aliased as `:ua`) -- custom User-Agent header.
   def initialize(api_base_url, options = {})
     @api_base_url = Addressable::URI.parse(api_base_url)
-    @…
+    @client = MediaWiktory::Client.new(api_base_url, user_agent: user_agent(options))
+    @traits = Traits.get(@api_base_url.host, namespaces: extract_namespaces)
   end
 
   # Receive "raw" data from Wikipedia (without parsing or wrapping in
@@ -57,18 +59,22 @@ module Infoboxer
   #
   # @return [Array<Hash>]
   def raw(*titles)
-    …
-    …
-    …
+    titles.each_slice(50).map{|part|
+      @client.query.
+        titles(*part).
+        prop(revisions: {prop: :content}, info: {prop: :url}).
+        redirects(true). # FIXME: should be done transparently by MediaWiktory?
+        perform.pages
+    }.inject(:concat) # somehow flatten(1) fails!
   end
 
-  # Receive list of parsed …
+  # Receive list of parsed MediaWiki pages for list of titles provided.
   # All pages are received with single query to MediaWiki API.
   #
-  # **NB**: …
-  # …
-  # …
-  # …
+  # **NB**: if you are requesting more than 50 titles at once
+  # (MediaWiki limitation for single request), Infoboxer will do as
+  # many queries as necessary to extract them all (it will be like
+  # `(titles.count / 50.0).ceil` requests)
   #
   # @return [Tree::Nodes<Page>] array of parsed pages. Notes:
   # * if you call `get` with only one title, one page will be
@@ -87,76 +93,118 @@ module Infoboxer
   #   NotFound.
   #
   def get(*titles)
-    pages = raw(*titles).
+    pages = raw(*titles).
+      tap{|pages| pages.detect(&:invalid?).tap{|i| i && fail(i.raw.invalidreason)}}.
+      select(&:exists?).
       map{|raw|
-        traits = Traits.get(@api_base_url.host, extract_traits(raw))
-
         Page.new(self,
-          Parser.paragraphs(raw…),
-          raw…)
+          Parser.paragraphs(raw.content, traits),
+          raw)
       }
     titles.count == 1 ? pages.first : Tree::Nodes[*pages]
   end
 
-  # … (old implementation removed; its lines are truncated in the source)
+  # Receive list of parsed MediaWiki pages from specified category.
+  #
+  # **NB**: currently, this API **always** fetches all pages from
+  # category, there is no option to "take first 20 pages". Pages are
+  # fetched in 50-page batches, then parsed. So, for large category
+  # it can really take a while to fetch all pages.
+  #
+  # @param title Category title. You can use namespaceless title (like
+  #   `"Countries in South America"`), title with namespace (like
+  #   `"Category:Countries in South America"`) or title with local
+  #   namespace (like `"Catégorie:Argentine"` for French Wikipedia)
+  #
+  # @return [Tree::Nodes<Page>] array of parsed pages.
+  #
+  def category(title)
+    title = normalize_category_title(title)
+
+    list(categorymembers: {title: title, limit: 50})
+  end
 
+  # Receive list of parsed MediaWiki pages for provided search query.
+  # See [MediaWiki API docs](https://www.mediawiki.org/w/api.php?action=help&modules=query%2Bsearch)
+  # for details.
+  #
+  # **NB**: currently, this API **always** fetches all pages from
+  # category, there is no option to "take first 20 pages". Pages are
+  # fetched in 50-page batches, then parsed. So, for large category
+  # it can really take a while to fetch all pages.
+  #
+  # @param query Search query. For old installations, look at
+  #   https://www.mediawiki.org/wiki/Help:Searching
+  #   for search syntax. For new ones (including Wikipedia), see at
+  #   https://www.mediawiki.org/wiki/Help:CirrusSearch.
+  #
+  # @return [Tree::Nodes<Page>] array of parsed pages.
+  #
+  def search(query)
+    list(search: {search: query, limit: 50})
+  end
+
+  # Receive list of parsed MediaWiki pages with titles startin from prefix.
+  # See [MediaWiki API docs](https://www.mediawiki.org/w/api.php?action=help&modules=query%2Bprefixsearch)
+  # for details.
+  #
+  # **NB**: currently, this API **always** fetches all pages from
+  # category, there is no option to "take first 20 pages". Pages are
+  # fetched in 50-page batches, then parsed. So, for large category
+  # it can really take a while to fetch all pages.
+  #
+  # @param prefix page title prefix.
+  #
+  # @return [Tree::Nodes<Page>] array of parsed pages.
+  #
+  def prefixsearch(prefix)
+    list(prefixsearch: {search: prefix, limit: 100})
   end
 
-  def …
-    …
+  def inspect
+    "#<#{self.class}(#{@api_base_url.host})>"
   end
 
-  # … (old private helpers removed; their lines are truncated in the source)
+  private
+
+  def list(query)
+    response = @client.query.
+      generator(query).
+      prop(revisions: {prop: :content}, info: {prop: :url}).
+      redirects(true). # FIXME: should be done transparently by MediaWiktory?
+      perform
+
+    response.continue! while response.continue?
+
+    pages = response.pages.select(&:exists?).
+      map{|raw|
+        Page.new(self,
+          Parser.paragraphs(raw.content, traits),
+          raw)
+      }
+
+    Tree::Nodes[*pages]
   end
 
-  def …
-    …
-    … (more removed lines truncated in source)
-  end
+  def normalize_category_title(title)
+    # FIXME: shouldn't it go to MediaWiktory?..
+    namespace, titl = title.include?(':') ? title.split(':', 2) : [nil, title]
+    namespace, titl = nil, title unless traits.category_namespace.include?(namespace)
 
+    namespace ||= traits.category_namespace.first
+    [namespace, titl].join(':')
+  end
+
+  def user_agent(options)
+    options[:user_agent] || options[:ua] || self.class.user_agent || UA
+  end
+
+  def extract_namespaces
+    siteinfo = @client.query.meta(siteinfo: {prop: [:namespaces, :namespacealiases]}).perform
+    siteinfo.raw.query.namespaces.map{|_, namespace|
+      aliases = siteinfo.raw.query.namespacealiases.select{|a| a.id == namespace.id}.map{|a| a['*']}
+      namespace.merge(aliases: aliases)
    }
-  rescue JSON::ParserError
-    fail RuntimeError, "Not a JSON response, seems there's not a MediaWiki API: #{@api_base_url}"
   end
 end
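The `normalize_category_title` helper introduced in this diff accepts a bare title, a canonical `Category:` prefix, or a localized one, and always emits a namespaced title. A self-contained port of that logic (the `category_namespace` argument stands in for `Traits#category_namespace`, the wiki's list of accepted namespace names):

```ruby
# Accepts "X", "Category:X", or a localized prefix; returns a
# namespace-prefixed title. An unrecognized prefix is treated as part
# of the title itself, mirroring the diff's behaviour.
def normalize_category_title(title, category_namespace = ['Category'])
  namespace, titl = title.include?(':') ? title.split(':', 2) : [nil, title]
  namespace, titl = nil, title unless category_namespace.include?(namespace)
  namespace ||= category_namespace.first
  [namespace, titl].join(':')
end
```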
data/lib/infoboxer/navigation/shortcuts.rb CHANGED

@@ -118,7 +118,7 @@ module Infoboxer
   #
   # @return {Tree::Nodes}
   def categories
-    lookup(Tree::Wikilink, namespace: /^#{ensure_traits.…
+    lookup(Tree::Wikilink, namespace: /^#{ensure_traits.category_namespace.join('|')}$/)
   end
 
   # As users accustomed to have only one infobox on a page
data/lib/infoboxer/parser/context.rb CHANGED

@@ -1,4 +1,6 @@
 # encoding: utf-8
+require 'strscan'
+
 module Infoboxer
   class Parser
     class Context
@@ -86,11 +88,23 @@ module Infoboxer
         res
       end
 
+      def push_eol_sign(re)
+        @inline_eol_sign = re
+      end
+
+      def pop_eol_sign
+        @inline_eol_sign = nil
+      end
+
+      attr_reader :inline_eol_sign
+
       def inline_eol?(exclude = nil)
         # not using StringScanner#check, as it will change #matched value
         eol? ||
-          (…
-          (…
+          (
+            (current =~ %r[^(</ref>|}})] || @inline_eol_sign && current =~ @inline_eol_sign) &&
+            (!exclude || $1 !~ exclude)
+          ) # FIXME: ugly, but no idea of prettier solution
       end
 
       def scan_continued_until(re, leave_pattern = false)
data/lib/infoboxer/parser/inline.rb CHANGED

@@ -32,7 +32,12 @@ module Infoboxer
     def short_inline(until_pattern = nil)
       nodes = Nodes[]
       guarded_loop do
-        …
+        # FIXME: quick and UGLY IS HELL JUST TRYING TO MAKE THE SHIT WORK
+        if @context.inline_eol_sign
+          chunk = @context.scan_until(re.short_inline_until_cache_brackets[until_pattern])
+        else
+          chunk = @context.scan_until(re.short_inline_until_cache[until_pattern])
+        end
         nodes << chunk
 
         break if @context.matched_inline?(until_pattern)
@@ -82,7 +87,7 @@ module Infoboxer
       when "''"
         Italic.new(short_inline(/''/))
       when '[['
-        if @context.check(re.…
+        if @context.check(re.file_namespace)
           image
         else
           wikilink
@@ -118,7 +123,11 @@ module Infoboxer
     # [http://www.example.org link name]
     def external_link(protocol)
       link = @context.scan_continued_until(/\s+|\]/)
-      …
+      if @context.matched =~ /\s+/
+        @context.push_eol_sign(/^\]/)
+        caption = short_inline(/\]/)
+        @context.pop_eol_sign
+      end
       ExternalLink.new(protocol + link, caption)
     end
data/lib/infoboxer/parser/template.rb CHANGED

@@ -4,8 +4,8 @@ module Infoboxer
   module Template
     include Tree
 
-    # NB: here we are not distingish templates like {{Infobox|variable}}
-    # and "magic words" like {{formatnum:123}}
+    # NB: here we are not distingish templates like `{{Infobox|variable}}`
+    # and "magic words" like `{{formatnum:123}}`
     # Just calling all of them "templates". This behaviour will change
     # in future, I presume
     # More about magic words: https://www.mediawiki.org/wiki/Help:Magic_words
@@ -29,6 +29,7 @@ module Infoboxer
         @context.skip(/\s*=\s*/)
       else
         name = num
+        num += 1
       end
 
       value = long_inline(/\||}}/)
@@ -38,8 +39,6 @@ module Infoboxer
 
       break if @context.eat_matched?('}}')
       @context.eof? and @context.fail!("Unexpected break of template variables: #{res}")
-
-      num += 1
     end
     res
   end
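The `num += 1` move in the hunk above matters for templates mixing named and positional parameters: the implicit counter must advance only when an unnamed value is consumed, not after every variable, or a named parameter in the middle shifts later positional numbers. A simplified standalone model of that numbering (`template_params` is a hypothetical helper, not Infoboxer's actual parser):

```ruby
# Number template variables the way the fixed parser does: named params
# keep their name, positional params get consecutive numbers, and the
# counter advances only for positional ones.
def template_params(vars)
  num = 1
  vars.each_with_object({}) do |var, res|
    if var.include?('=')
      name, value = var.split('=', 2)
    else
      name = num
      num += 1  # advance only here -- the fix this hunk makes
      value = var
    end
    res[name] = value
  end
end
```

With the counter bumped unconditionally (the old code), `{{tpl|a|x=1|b}}` would have numbered `b` as 3 instead of 2.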
data/lib/infoboxer/parser/util.rb CHANGED

@@ -16,20 +16,31 @@ module Infoboxer
 
     INLINE_EOL = %r[(?= # if we have ahead... (not scanned, just checked
       </ref> | # <ref> closed
-      }}…
+      }}
     )]x
 
+    INLINE_EOL_BR = %r[(?= # if we have ahead... (not scanned, just checked
+      </ref> | # <ref> closed
+      }} | # or template closed
+      (?<!\])\](?!\]) # or ext.link closed,
+      # the madness with look-ahead/behind means "match single bracket but not double"
+    )]x
+
     def make_regexps
       {
-        …
+        file_namespace: /(#{@context.traits.file_namespace.join('|')}):/,
         formatting: FORMATTING,
         inline_until_cache: Hash.new{|h, r|
          h[r] = Regexp.union(*[r, FORMATTING, /$/].compact.uniq)
         },
         short_inline_until_cache: Hash.new{|h, r|
           h[r] = Regexp.union(*[r, INLINE_EOL, FORMATTING, /$/].compact.uniq)
+        },
+        short_inline_until_cache_brackets: Hash.new{|h, r|
+          h[r] = Regexp.union(*[r, INLINE_EOL_BR, FORMATTING, /$/].compact.uniq)
         }
       }
     end
@@ -46,7 +57,7 @@ module Infoboxer
     scan.skip(/=\s*/)
     q = scan.scan(/['"]/)
     if q
-      value = scan.scan_until(/#{q}…
+      value = scan.scan_until(/#{q}|$/).sub(q, '')
     else
       value = scan.scan_until(/\s|$/)
     end
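The `(?<!\])\](?!\])` fragment added in `INLINE_EOL_BR` is the interesting part of this hunk: a look-behind plus look-ahead pair that matches a lone `]` (the end of an external link's caption) while refusing to fire inside the `]]` that closes a wikilink. Isolated for demonstration:

```ruby
# Match a single closing bracket, but not either bracket of "]]".
SINGLE_BRACKET = /(?<!\])\](?!\])/

def ends_external_link?(str)
  !(str =~ SINGLE_BRACKET).nil?
end
```

In `[[Argentina]]` neither `]` matches: the first is followed by `]` (look-ahead fails), the second is preceded by `]` (look-behind fails).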
data/lib/infoboxer/tree/image.rb
CHANGED