infoboxer 0.1.0 → 0.1.1
- checksums.yaml +4 -4
- data/.yardopts +1 -0
- data/CHANGELOG.md +14 -0
- data/CONTRIBUTING.md +62 -0
- data/README.md +1 -1
- data/infoboxer.gemspec +2 -1
- data/lib/infoboxer/media_wiki/traits.rb +42 -8
- data/lib/infoboxer/templates.rb +37 -2
- data/lib/infoboxer/templates/base.rb +9 -24
- data/lib/infoboxer/templates/set.rb +107 -3
- data/lib/infoboxer/tree/compound.rb +5 -0
- data/lib/infoboxer/tree/html.rb +1 -0
- data/lib/infoboxer/tree/linkable.rb +42 -0
- data/lib/infoboxer/tree/list.rb +4 -0
- data/lib/infoboxer/tree/node.rb +2 -0
- data/lib/infoboxer/tree/nodes.rb +3 -0
- data/lib/infoboxer/tree/paragraphs.rb +6 -3
- data/lib/infoboxer/tree/ref.rb +1 -0
- data/lib/infoboxer/tree/template.rb +163 -6
- data/lib/infoboxer/tree/text.rb +3 -0
- data/lib/infoboxer/tree/wikilink.rb +5 -14
- data/lib/infoboxer/version.rb +1 -1
- metadata +20 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 37db410a52fa9f9755b70ff3a1284c6a427509b8
|
4
|
+
data.tar.gz: cd9952485ffffb6b3eb3d3cb69cb771aa2c9610a
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 2800426e4a055f838a89b50bd9388364f7a2499d2736816c686a5d9bb2cc727d2bbdfe485565835d844cefe75c8d56f510f2fd2823411e4b7a37c4d0fd3122bf
|
7
|
+
data.tar.gz: 6dc878f4957c468051ecff57eee57ca66ac2ed19e157cb6f62f993113ba2c8bd220f87926eda4f51aa5917d097ca2e8c13d94e2acd9dff8b8f82036412845245
|
data/.yardopts
CHANGED
data/CHANGELOG.md
ADDED
@@ -0,0 +1,14 @@
|
|
1
|
+
# Infoboxer's change log
|
2
|
+
|
3
|
+
## 0.1.1 (2015-08-11)
|
4
|
+
|
5
|
+
Basically, preparing for wider release!
|
6
|
+
|
7
|
+
* Small refactorings;
|
8
|
+
* Documentation fixes;
|
9
|
+
|
10
|
+
## 0.1.0 (2015-08-07)
|
11
|
+
|
12
|
+
Initial (ok, I know it's typically called 0.1.1, but here's work of
|
13
|
+
three months, numerous documentation pages and examples and so on... so, let
|
14
|
+
it be 0.1.0).
|
data/CONTRIBUTING.md
ADDED
@@ -0,0 +1,62 @@
|
|
1
|
+
# Contributing to Infoboxer
|
2
|
+
|
3
|
+
_(Also duplicated in [wiki](https://github.com/molybdenum-99/infoboxer/wiki/Contributing).)_
|
4
|
+
|
5
|
+
## Contributing via test cases
|
6
|
+
|
7
|
+
If you are sure that Infoboxer handles some page incorrectly, please create an
|
8
|
+
[issue](https://github.com/molybdenum-99/infoboxer/issues) with link
|
9
|
+
to page (or raw wikitext) and description of a problem.
|
10
|
+
|
11
|
+
## Contributing via localizations and template descriptions
|
12
|
+
|
13
|
+
Look at [en.wikipedia.org](https://github.com/molybdenum-99/infoboxer/blob/master/lib/infoboxer/definitions/en.wikipedia.org.rb)
|
14
|
+
template definitions. It can be extended. Also, similar definitions
|
15
|
+
can/should be created for other language wikipedias and other popular
|
16
|
+
wikis.
|
17
|
+
|
18
|
+
You can do pull requests with your own definitions, or create an
|
19
|
+
[issue](https://github.com/molybdenum-99/infoboxer/issues) describing
|
20
|
+
which template definitions should be added to Infoboxer.
|
21
|
+
|
22
|
+
## Contributing via code
|
23
|
+
|
24
|
+
If you want to fix some bug or implement some feature, please just
|
25
|
+
follow the standard process for github opensource: fork, fix, push,
|
26
|
+
make pull request.
|
27
|
+
|
28
|
+
Some (scanty) information below.
|
29
|
+
|
30
|
+
### Understanding the code
|
31
|
+
|
32
|
+
* Infoboxer is split into several modules (which are clearly visible in
|
33
|
+
API docs and folders structure).
|
34
|
+
* Most of "easy features"
|
35
|
+
can be added to [Navigation](http://www.rubydoc.info/gems/infoboxer/Infoboxer/Navigation)
|
36
|
+
module and its submodules: enhancing the navigational experience and
|
37
|
+
implementing clever shortcuts (like "converting table to dataframe/list of
|
38
|
+
hashes", for ex.).
|
39
|
+
* Most potential bugs are likely to sit in
|
40
|
+
[Parser](http://www.rubydoc.info/gems/infoboxer/Infoboxer/Parser) class
|
41
|
+
and its modules; MediaWiki markup IS tricky and tightly coupled and
|
42
|
+
ambiguous; there are also some unimplemented features, like `<source>`
|
43
|
+
tag parsing and template definition pages (which, possibly, is not
|
44
|
+
target of Infoboxer anyways).
|
45
|
+
* The most underfeatured area is
|
46
|
+
[MediaWiki](http://www.rubydoc.info/gems/infoboxer/Infoboxer/MediaWiki)
|
47
|
+
-- seems reasonable for information extraction purposes to have more
|
48
|
+
features from MediaWiki API, like "page list generators", search,
|
49
|
+
"what links here" and similar functionality.
|
50
|
+
* Most of clarification and documentation is required for
|
51
|
+
[Templates](http://www.rubydoc.info/gems/infoboxer/Infoboxer/Templates)
|
52
|
+
module, which is still underloved heart of Infoboxer.
|
53
|
+
|
54
|
+
### Parser: quick, not clever
|
55
|
+
|
56
|
+
If you want to get your hands on Parser, please remember that
|
57
|
+
it's hand-crafted and thoroughly optimized. The first thought you may
|
58
|
+
have is that it needs more OO decomposition, a class for each case; or more
|
59
|
+
idiomatic Ruby, or ... Trust me, I've tried it all. But when you are
|
60
|
+
dealing with hundreds of thousands of parsing operations and tens of
|
61
|
+
thousands of resulting nodes, it turns out even simplest things like
|
62
|
+
`Object#tap` have a performance penalty over a large number of calls.
|
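The remark about `Object#tap` can be checked with a small micro-benchmark. The sketch below is not taken from Infoboxer's parser: the method names `plain` and `tapped`, the hash-based "node", and the iteration count are all assumptions made for illustration, and the measured difference will depend on the Ruby version and machine.

```ruby
require 'benchmark'

# Hypothetical workload size; adjust to taste.
ITERATIONS = 200_000

# Plain version: build a node and set a field with ordinary statements.
def plain(n)
  nodes = []
  n.times do |i|
    node = { id: i }
    node[:closed] = true
    nodes << node
  end
  nodes
end

# Same result, but every node is configured inside an Object#tap block,
# which costs one extra block call per node.
def tapped(n)
  nodes = []
  n.times do |i|
    nodes << { id: i }.tap { |node| node[:closed] = true }
  end
  nodes
end

Benchmark.bm(8) do |bm|
  bm.report('plain:') { plain(ITERATIONS) }
  bm.report('tap:')   { tapped(ITERATIONS) }
end
```

Both methods return the same array, so any timing difference comes from the extra block invocation alone.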
data/README.md
CHANGED
@@ -16,7 +16,7 @@ obvious structure, you can navigate that tree easily, and you have a
|
|
16
16
|
bunch of high-level helper methods, so typical information extraction
|
17
17
|
tasks should be super-easy, one-liners in best cases.
|
18
18
|
|
19
|
-
_(For those already
|
19
|
+
_(For those already thinking "Why should you do this, we already have
|
20
20
|
DBPedia?" -- please, read "[Reasons](https://github.com/molybdenum-99/infoboxer/wiki/Reasons)"
|
21
21
|
page in our wiki.)_
|
22
22
|
|
data/infoboxer.gemspec
CHANGED
@@ -5,7 +5,7 @@ Gem::Specification.new do |s|
|
|
5
5
|
s.version = Infoboxer::VERSION
|
6
6
|
s.authors = ['Victor Shepelev']
|
7
7
|
s.email = 'zverok.offline@gmail.com'
|
8
|
-
s.homepage = 'https://github.com/
|
8
|
+
s.homepage = 'https://github.com/molybdenum-99/infoboxer'
|
9
9
|
|
10
10
|
s.summary = 'MediaWiki client and parser, targeting information extraction.'
|
11
11
|
s.description = <<-EOF
|
@@ -40,4 +40,5 @@ Gem::Specification.new do |s|
|
|
40
40
|
s.add_development_dependency 'ruby-prof'
|
41
41
|
s.add_development_dependency 'vcr'
|
42
42
|
s.add_development_dependency 'webmock'
|
43
|
+
s.add_development_dependency 'inch'
|
43
44
|
end
|
data/lib/infoboxer/media_wiki/traits.rb
CHANGED
@@ -1,8 +1,20 @@
|
|
1
1
|
# encoding: utf-8
|
2
2
|
module Infoboxer
|
3
3
|
class MediaWiki
|
4
|
+
# DSL for defining "traits" for some site.
|
5
|
+
#
|
6
|
+
# More docs (and possible refactoring) to follow.
|
7
|
+
#
|
8
|
+
# You can look at current
|
9
|
+
# [English Wikipedia traits](https://github.com/molybdenum-99/infoboxer/blob/master/lib/infoboxer/definitions/en.wikipedia.org.rb)
|
10
|
+
# definitions in Infoboxer's repo.
|
4
11
|
class Traits
|
5
12
|
class << self
|
13
|
+
# Define set of templates for current site's traits.
|
14
|
+
#
|
15
|
+
# See {Templates::Set} for longer (yet insufficient) explanation.
|
16
|
+
#
|
17
|
+
# Expected to be used inside Traits definition block.
|
6
18
|
def templates(&definition)
|
7
19
|
@templates ||= Templates::Set.new
|
8
20
|
|
@@ -11,35 +23,49 @@ module Infoboxer
|
|
11
23
|
@templates.define(&definition)
|
12
24
|
end
|
13
25
|
|
14
|
-
#
|
26
|
+
# @private
|
15
27
|
def domain(d)
|
28
|
+
# NB: explicitly store all domains in base Traits class
|
16
29
|
Traits.domains.key?(d) and
|
17
30
|
fail(ArgumentError, "Domain binding redefinition: #{Traits.domains[d]}")
|
18
31
|
|
19
32
|
Traits.domains[d] = self
|
20
33
|
end
|
21
34
|
|
35
|
+
# @private
|
22
36
|
def get(domain, options = {})
|
23
37
|
cls = Traits.domains[domain]
|
24
38
|
cls ? cls.new(options) : Traits.new(options)
|
25
39
|
end
|
26
40
|
|
41
|
+
# @private
|
27
42
|
def domains
|
28
43
|
@domains ||= {}
|
29
44
|
end
|
30
45
|
|
46
|
+
# Define traits for some domain. Use it like:
|
47
|
+
#
|
48
|
+
# ```ruby
|
49
|
+
# MediaWiki::Traits.for 'ru.wikipedia.org' do
|
50
|
+
# templates do
|
51
|
+
# template '...' do
|
52
|
+
# # some template definition
|
53
|
+
# end
|
54
|
+
# end
|
55
|
+
# end
|
56
|
+
# ```
|
57
|
+
#
|
58
|
+
# Again, you can look at current
|
59
|
+
# [English Wikipedia traits](https://github.com/molybdenum-99/infoboxer/blob/master/lib/infoboxer/definitions/en.wikipedia.org.rb)
|
60
|
+
# for example implementation.
|
31
61
|
def for(domain, &block)
|
32
62
|
Class.new(self, &block).domain(domain)
|
33
63
|
end
|
34
64
|
|
65
|
+
# @private
|
35
66
|
alias_method :default, :new
|
36
67
|
end
|
37
68
|
|
38
|
-
DEFAULTS = {
|
39
|
-
file_prefix: 'File',
|
40
|
-
category_prefix: 'Category'
|
41
|
-
}
|
42
|
-
|
43
69
|
def initialize(options = {})
|
44
70
|
@options = options
|
45
71
|
@file_prefix = [DEFAULTS[:file_prefix], options.delete(:file_prefix)].
|
@@ -48,13 +74,21 @@ module Infoboxer
|
|
48
74
|
flatten.compact.uniq
|
49
75
|
end
|
50
76
|
|
77
|
+
# @private
|
51
78
|
attr_reader :file_prefix, :category_prefix
|
52
79
|
|
53
|
-
#
|
54
|
-
|
80
|
+
# @private
|
55
81
|
def templates
|
56
82
|
self.class.templates
|
57
83
|
end
|
84
|
+
|
85
|
+
private
|
86
|
+
|
87
|
+
DEFAULTS = {
|
88
|
+
file_prefix: 'File',
|
89
|
+
category_prefix: 'Category'
|
90
|
+
}
|
91
|
+
|
58
92
|
end
|
59
93
|
end
|
60
94
|
end
|
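The registry pattern behind `Traits.for`, `Traits.domain` and `Traits.get` in the hunk above can be sketched in isolation. This is a simplified stand-in, not the gem's class: `SiteTraits` is a hypothetical name, the options handling is left out, and `file_prefix` is reduced to a plain method.

```ruby
# Simplified, self-contained sketch; SiteTraits is a hypothetical stand-in
# for MediaWiki::Traits and omits its options and template handling.
class SiteTraits
  class << self
    # The registry lives on the base class, so all subclasses share it.
    def domains
      @domains ||= {}
    end

    # Bind the receiving class to a domain; rebinding is an error.
    def domain(d)
      SiteTraits.domains.key?(d) and
        fail(ArgumentError, "Domain binding redefinition: #{d}")
      SiteTraits.domains[d] = self
    end

    # Build an anonymous subclass from the block, then bind it.
    def for(domain, &block)
      Class.new(self, &block).domain(domain)
    end

    # Unknown domains fall back to the base class.
    def get(domain)
      (SiteTraits.domains[domain] || SiteTraits).new
    end
  end

  def file_prefix
    'File'
  end
end

SiteTraits.for('fr.wikipedia.org') do
  def file_prefix
    'Fichier'
  end
end
```

With this in place, `SiteTraits.get('fr.wikipedia.org')` returns an instance of the anonymous subclass, while any unregistered domain gets the defaults.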
data/lib/infoboxer/templates.rb
CHANGED
@@ -1,7 +1,42 @@
|
|
1
1
|
module Infoboxer
|
2
|
-
#
|
2
|
+
# This module covers advanced MediaWiki templates usage.
|
3
|
+
#
|
4
|
+
# It is strongly advised to read [Wikipedia docs](https://en.wikipedia.org/wiki/Help:Template)
|
5
|
+
# or at least look through it (and have it opened while reading further).
|
6
|
+
#
|
7
|
+
# If you just have a page with templates and want some variable value
|
8
|
+
# (like "page about country - infobox - total population"), you should
|
9
|
+
# be totally happy with {Tree::Template} and its features.
|
10
|
+
#
|
11
|
+
# What this module does is, basically, two things:
|
12
|
+
# * allow you to define for arbitrary templates how they are converted
|
13
|
+
# to text; by default, templates are totally excluded from text, which
|
14
|
+
# is not most reasonable behavior for many formatting templates;
|
15
|
+
# * allow you to define additional functionality for arbitrary templates;
|
16
|
+
# many of them containing pretty complicated logic (see, for ex.,
|
17
|
+
# [Template:Convert](https://en.wikipedia.org/wiki/Template:Convert)),
|
18
|
+
# and it seems reasonable to extend instances of such a template.
|
19
|
+
#
|
20
|
+
# Infoboxer allows you to define {Templates::Set} of template-specific
|
21
|
+
# classes for some site/domain.
|
22
|
+
# There is already defined set of most commonly used templates at
|
23
|
+
# en.wikipedia.org (so, most of English Wikipedia texts will be rendered
|
24
|
+
# correctly, and also some advanced functionality is provided).
|
25
|
+
# You can take a look at
|
26
|
+
# [lib/infoboxer/definitions/en.wikipedia.org.rb](https://github.com/molybdenum-99/infoboxer/blob/master/lib/infoboxer/definitions/en.wikipedia.org.rb)
|
27
|
+
# to feel it (and also see a couple of TODOs and FIXMEs and other
|
28
|
+
# considerations).
|
29
|
+
#
|
30
|
+
# From Infoboxer's point-of-view, templates are the most complex part
|
31
|
+
# of Wikipedia, and we are currently trying hard to do the most reasonable
|
32
|
+
# things about them.
|
33
|
+
#
|
34
|
+
# Future versions also should:
|
35
|
+
# * define more of common English Wikipedia templates;
|
36
|
+
# * define templates for other popular wikis;
|
37
|
+
# * allow to add template definitions on-the-fly, while loading some
|
38
|
+
# page.
|
3
39
|
#
|
4
|
-
# I do my best.
|
5
40
|
module Templates
|
6
41
|
%w[base set].each do |tmpl|
|
7
42
|
require_relative "templates/#{tmpl}"
|
data/lib/infoboxer/templates/base.rb
CHANGED
@@ -15,29 +15,6 @@ module Infoboxer
|
|
15
15
|
end
|
16
16
|
end
|
17
17
|
|
18
|
-
def unnamed_variables
|
19
|
-
variables.select{|v| v.name =~ /^\d+$/}
|
20
|
-
end
|
21
|
-
|
22
|
-
def fetch(*patterns)
|
23
|
-
Nodes[*patterns.map{|p| variables.find(name: p)}.flatten]
|
24
|
-
end
|
25
|
-
|
26
|
-
def fetch_hash(*patterns)
|
27
|
-
fetch(*patterns).map{|v| [v.name, v]}.to_h
|
28
|
-
end
|
29
|
-
|
30
|
-
def fetch_date(*patterns)
|
31
|
-
components = fetch(*patterns)
|
32
|
-
components.pop while components.last.nil? && !components.empty?
|
33
|
-
|
34
|
-
if components.empty?
|
35
|
-
nil
|
36
|
-
else
|
37
|
-
Date.new(*components.map{|v| v.to_s.to_i})
|
38
|
-
end
|
39
|
-
end
|
40
|
-
|
41
18
|
def ==(other)
|
42
19
|
other.kind_of?(Tree::Template) && _eq(other)
|
43
20
|
end
|
@@ -54,7 +31,9 @@ module Infoboxer
|
|
54
31
|
end
|
55
32
|
|
56
33
|
# Renders all of its unnamed variables as space-separated text
|
57
|
-
# Also allows in-template navigation
|
34
|
+
# Also allows in-template navigation.
|
35
|
+
#
|
36
|
+
# Used for {Set} definitions.
|
58
37
|
class Show < Base
|
59
38
|
alias_method :children, :unnamed_variables
|
60
39
|
|
@@ -65,6 +44,9 @@ module Infoboxer
|
|
65
44
|
end
|
66
45
|
end
|
67
46
|
|
47
|
+
# Replaces template with replacement, while rendering.
|
48
|
+
#
|
49
|
+
# Used for {Set} definitions.
|
68
50
|
class Replace < Base
|
69
51
|
def replace
|
70
52
|
fail(NotImplementedError, "Descendants should define :replace")
|
@@ -75,6 +57,9 @@ module Infoboxer
|
|
75
57
|
end
|
76
58
|
end
|
77
59
|
|
60
|
+
# Replaces template with its name, while rendering.
|
61
|
+
#
|
62
|
+
# Used for {Set} definitions.
|
78
63
|
class Literal < Base
|
79
64
|
alias_method :text, :name
|
80
65
|
end
|
data/lib/infoboxer/templates/set.rb
CHANGED
@@ -1,31 +1,103 @@
|
|
1
1
|
# encoding: utf-8
|
2
2
|
module Infoboxer
|
3
3
|
module Templates
|
4
|
+
# Base class for defining set of templates, used for some site/domain.
|
5
|
+
#
|
6
|
+
# Currently only can be plugged in via {MediaWiki::Traits.templates}.
|
7
|
+
#
|
8
|
+
# Template set defines a DSL for creating new template definitions --
|
9
|
+
# both the simplest ones and very complicated ones.
|
10
|
+
#
|
11
|
+
# You can look at implementation of English Wikipedia
|
12
|
+
# [common templates set](https://github.com/molybdenum-99/infoboxer/blob/master/lib/infoboxer/definitions/en.wikipedia.org.rb)
|
13
|
+
# in Infoboxer's repo.
|
14
|
+
#
|
4
15
|
class Set
|
5
16
|
def initialize(&definitions)
|
6
17
|
@templates = []
|
7
18
|
define(&definitions) if definitions
|
8
19
|
end
|
9
|
-
|
20
|
+
|
21
|
+
# @private
|
10
22
|
def find(name)
|
11
23
|
_, template = @templates.detect{|m, t| m === name.downcase}
|
12
24
|
template || Base
|
13
25
|
end
|
14
26
|
|
27
|
+
# @private
|
15
28
|
def define(&definitions)
|
16
29
|
instance_eval(&definitions)
|
17
30
|
end
|
18
31
|
|
32
|
+
# @private
|
19
33
|
def clear
|
20
34
|
@templates.clear
|
21
35
|
end
|
22
36
|
|
23
|
-
|
24
|
-
|
37
|
+
# Most common form of template definition.
|
38
|
+
#
|
39
|
+
# Can be used like:
|
40
|
+
#
|
41
|
+
# ```ruby
|
42
|
+
# template 'Age' do
|
43
|
+
# def from
|
44
|
+
# fetch_date('1', '2', '3')
|
45
|
+
# end
|
46
|
+
#
|
47
|
+
# def to
|
48
|
+
# fetch_date('4', '5', '6') || Date.today
|
49
|
+
# end
|
50
|
+
#
|
51
|
+
# def value
|
52
|
+
# (to - from).to_i / 365 # FIXME: obviously
|
53
|
+
# end
|
54
|
+
#
|
55
|
+
# def text
|
56
|
+
# "#{value} years"
|
57
|
+
# end
|
58
|
+
# end
|
59
|
+
# ```
|
60
|
+
#
|
61
|
+
# @param name Definition name.
|
62
|
+
# @param options Definition options.
|
63
|
+
# Currently recognized options are:
|
64
|
+
# * `:match` -- regexp or string, which matches template name to
|
65
|
+
# add this definition to (if not provided, `name` param used
|
66
|
+
# to match relevant templates);
|
67
|
+
# * `:base` -- name of template definition to use as a base class;
|
68
|
+
# for example you can do things like:
|
69
|
+
#
|
70
|
+
# ```ruby
|
71
|
+
# # ...inside template set definition...
|
72
|
+
# template 'Infobox', match: /^Infobox/ do
|
73
|
+
# # implementation
|
74
|
+
# end
|
75
|
+
#
|
76
|
+
# template 'Infobox cheese', base: 'Infobox' do
|
77
|
+
# end
|
78
|
+
# ```
|
79
|
+
#
|
80
|
+
# Expected to be used inside Set definition block.
|
25
81
|
def template(name, options = {}, &definition)
|
26
82
|
setup_class(name, Base, options, &definition)
|
27
83
|
end
|
28
84
|
|
85
|
+
# Define list of "replacements": templates, which text should be replaced
|
86
|
+
# with arbitrary value.
|
87
|
+
#
|
88
|
+
# Example:
|
89
|
+
#
|
90
|
+
# ```ruby
|
91
|
+
# # ...inside template set definition...
|
92
|
+
# replace(
|
93
|
+
# '!!' => '||',
|
94
|
+
# '!(' => '['
|
95
|
+
# )
|
96
|
+
# ```
|
97
|
+
# Now, all templates with name `!!` will render as `||` when you
|
98
|
+
# call their (or their parents') {Tree::Node#text}.
|
99
|
+
#
|
100
|
+
# Expected to be used inside Set definition block.
|
29
101
|
def replace(*replacements)
|
30
102
|
case
|
31
103
|
when replacements.count == 2 && replacements.all?{|r| r.is_a?(String)}
|
@@ -44,18 +116,50 @@ module Infoboxer
|
|
44
116
|
end
|
45
117
|
end
|
46
118
|
|
119
|
+
# Define list of "show children" templates. Those ones, when rendered
|
120
|
+
# as text, just provide join of their children text (space-separated).
|
121
|
+
#
|
122
|
+
# Example:
|
123
|
+
#
|
124
|
+
# ```ruby
|
125
|
+
# #...in template set definition...
|
126
|
+
# show 'Small'
|
127
|
+
# ```
|
128
|
+
# Now, wikitext paragraph looking like...
|
129
|
+
#
|
130
|
+
# ```
|
131
|
+
# This is {{small|text}} in template
|
132
|
+
# ```
|
133
|
+
# ...before this template definition had rendered like
|
134
|
+
# `"This is in template"` (template contents omitted), and after
|
135
|
+
# this definition it will render like `"This is text in template"`
|
136
|
+
# (template contents rendered as is).
|
137
|
+
#
|
138
|
+
# Expected to be used inside Set definition block.
|
47
139
|
def show(*names)
|
48
140
|
names.each do |name|
|
49
141
|
setup_class(name, Show)
|
50
142
|
end
|
51
143
|
end
|
52
144
|
|
145
|
+
# Define list of "literally rendered templates". It means, when
|
146
|
+
# rendering text, template is replaced with just its name.
|
147
|
+
#
|
148
|
+
# Explanation: in
|
149
|
+
# MediaWiki, there are contexts (deeply in other templates and
|
150
|
+
# tables), when you can't just type something like `","` and not
|
151
|
+
# have it interpreted. So, wikis often define wrappers around
|
152
|
+
# those templates, looking like `{{,}}` -- so, while rendering texts,
|
153
|
+
# such templates can be replaced with their names.
|
154
|
+
#
|
155
|
+
# Expected to be used inside Set definition block.
|
53
156
|
def literal(*names)
|
54
157
|
names.each do |name|
|
55
158
|
setup_class(name, Literal)
|
56
159
|
end
|
57
160
|
end
|
58
161
|
|
162
|
+
# @private
|
59
163
|
def setup_class(name, base_class, options = {}, &definition)
|
60
164
|
match = options.fetch(:match, name.downcase)
|
61
165
|
base = options.fetch(:base, base_class)
|
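The lookup in `Set#find` (first matcher for which `matcher === name.downcase` holds, falling back to the base class) can be illustrated with a stripped-down sketch. `TinySet` is a hypothetical stand-in: it stores names instead of generated classes and returns `:base` instead of `Base`.

```ruby
# Hypothetical, stripped-down stand-in for Templates::Set; it keeps only
# the ordered matcher list and the lookup, not the class-building DSL.
class TinySet
  def initialize
    @templates = [] # ordered pairs of [matcher, definition name]
  end

  # The matcher may be a String or a Regexp; it defaults to the
  # downcased name, as in setup_class.
  def template(name, match: name.downcase)
    @templates << [match, name]
    self
  end

  # The first definition whose matcher === the downcased name wins.
  def find(name)
    _, definition = @templates.detect { |matcher, _| matcher === name.downcase }
    definition || :base
  end
end

set = TinySet.new
set.template('Infobox cheese')             # exact (downcased) match
set.template('Infobox', match: /^infobox/) # pattern match for the rest

puts set.find('Infobox cheese')
puts set.find('Infobox country')
puts set.find('Small')
```

Two details are worth noting in this sketch: definitions are tried in registration order, so the more specific one has to come first, and because the name is downcased before matching, a case-sensitive pattern containing capitals (such as `/^Infobox/`) would not match here. Whether the real `setup_class` normalizes patterns is not visible in this diff.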
data/lib/infoboxer/tree/compound.rb
CHANGED
@@ -21,6 +21,7 @@ module Infoboxer
|
|
21
21
|
children.index(child)
|
22
22
|
end
|
23
23
|
|
24
|
+
# @private
|
24
25
|
# Internal, used by {Parser}
|
25
26
|
def push_children(*nodes)
|
26
27
|
nodes.each{|c| c.parent = self}.each do |n|
|
@@ -45,21 +46,25 @@ module Infoboxer
|
|
45
46
|
|
46
47
|
# Kinda "private" methods, used by Parser only -------------------
|
47
48
|
|
49
|
+
# @private
|
48
50
|
# Internal, used by {Parser}
|
49
51
|
def can_merge?(other)
|
50
52
|
false
|
51
53
|
end
|
52
54
|
|
55
|
+
# @private
|
53
56
|
# Internal, used by {Parser}
|
54
57
|
def closed!
|
55
58
|
@closed = true
|
56
59
|
end
|
57
60
|
|
61
|
+
# @private
|
58
62
|
# Internal, used by {Parser}
|
59
63
|
def closed?
|
60
64
|
@closed
|
61
65
|
end
|
62
66
|
|
67
|
+
# @private
|
63
68
|
# Internal, used by {Parser}
|
64
69
|
def empty?
|
65
70
|
children.empty?
|
data/lib/infoboxer/tree/html.rb
CHANGED
data/lib/infoboxer/tree/linkable.rb
ADDED
@@ -0,0 +1,42 @@
|
|
1
|
+
module Infoboxer
|
2
|
+
module Tree
|
3
|
+
# Module included into everything, that can be treated as
|
4
|
+
# link to some MediaWiki page, regardless of behavior. Namely,
|
5
|
+
# {Wikilink} and {Template}.
|
6
|
+
module Linkable
|
7
|
+
# Extracts wiki page by this link and returns it parsed (or nil,
|
8
|
+
# if page not found).
|
9
|
+
#
|
10
|
+
# About template "following" see also {Template#follow} docs.
|
11
|
+
#
|
12
|
+
# @return {MediaWiki::Page}
|
13
|
+
#
|
14
|
+
# **See also**:
|
15
|
+
# * {Tree::Nodes#follow} for extracting multiple links at once;
|
16
|
+
# * {MediaWiki#get} for basic information on page extraction.
|
17
|
+
def follow
|
18
|
+
client.get(link)
|
19
|
+
end
|
20
|
+
|
21
|
+
# Human-readable page URL
|
22
|
+
#
|
23
|
+
# @return [String]
|
24
|
+
def url
|
25
|
+
# FIXME: fragile as hell.
|
26
|
+
page.url.sub(/[^\/]+$/, link.gsub(' ', '_'))
|
27
|
+
end
|
28
|
+
|
29
|
+
protected
|
30
|
+
|
31
|
+
def page
|
32
|
+
page = lookup_parents(MediaWiki::Page).first or
|
33
|
+
fail("Not in a page from real source")
|
34
|
+
end
|
35
|
+
|
36
|
+
def client
|
37
|
+
page.client or fail("MediaWiki client not set")
|
38
|
+
end
|
39
|
+
|
40
|
+
end
|
41
|
+
end
|
42
|
+
end
|
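The substitution used by `Linkable#url`, and the reason its FIXME calls it fragile, can be shown on plain strings. `sibling_url` is a hypothetical helper written for this illustration; it takes strings rather than Infoboxer page and link objects.

```ruby
# Hypothetical helper mirroring the one-line substitution in Linkable#url;
# page_url and link are plain strings here, not Infoboxer objects.
def sibling_url(page_url, link)
  # Replace everything after the last slash with the link,
  # turning spaces into underscores.
  page_url.sub(%r{[^/]+$}, link.gsub(' ', '_'))
end

puts sibling_url('https://en.wikipedia.org/wiki/Argentina',
                 'Template:Infobox country')

# A page title that itself contains a slash exposes the fragility:
# only the part after the last slash ("DC") is replaced.
puts sibling_url('https://en.wikipedia.org/wiki/AC/DC', 'Angus Young')
```

The first call yields the expected template URL; the second leaves a stray `AC/` segment in the path, which is the kind of case the FIXME presumably has in mind.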
data/lib/infoboxer/tree/list.rb
CHANGED
@@ -3,12 +3,14 @@ module Infoboxer
|
|
3
3
|
module Tree
|
4
4
|
# Represents item of ordered or unordered list.
|
5
5
|
class ListItem < BaseParagraph
|
6
|
+
# @private
|
6
7
|
# Internal, used by {Parser}
|
7
8
|
def can_merge?(other)
|
8
9
|
other.class == self.class &&
|
9
10
|
other.children.first.kind_of?(List)
|
10
11
|
end
|
11
12
|
|
13
|
+
# @private
|
12
14
|
# Internal, used by {Parser}
|
13
15
|
def merge!(other)
|
14
16
|
ochildren = other.children.dup
|
@@ -111,6 +113,7 @@ module Infoboxer
|
|
111
113
|
class List < Compound
|
112
114
|
include Mergeable
|
113
115
|
|
116
|
+
# @private
|
114
117
|
# Internal, used by {Parser}
|
115
118
|
def merge!(other)
|
116
119
|
ochildren = other.children.dup
|
@@ -123,6 +126,7 @@ module Infoboxer
|
|
123
126
|
push_children(*ochildren)
|
124
127
|
end
|
125
128
|
|
129
|
+
# @private
|
126
130
|
# Internal, used by {Parser}
|
127
131
|
def self.construct(marker, nodes)
|
128
132
|
m = marker.shift
|
data/lib/infoboxer/tree/node.rb
CHANGED
@@ -61,11 +61,13 @@ module Infoboxer
|
|
61
61
|
Nodes[] # redefined in descendants
|
62
62
|
end
|
63
63
|
|
64
|
+
# @private
|
64
65
|
# Used only during tree construction in {Parser}.
|
65
66
|
def can_merge?(other)
|
66
67
|
false
|
67
68
|
end
|
68
69
|
|
70
|
+
# @private
|
69
71
|
# Whether node is empty (definition of "empty" varies for different
|
70
72
|
# kinds of nodes). Used mainly in {Parser}.
|
71
73
|
def empty?
|
data/lib/infoboxer/tree/nodes.rb
CHANGED
@@ -146,6 +146,7 @@ module Infoboxer
|
|
146
146
|
page.client.get(*links)
|
147
147
|
end
|
148
148
|
|
149
|
+
# @private
|
149
150
|
# Internal, used by {Parser}
|
150
151
|
def <<(node)
|
151
152
|
if node.kind_of?(Array)
|
@@ -159,6 +160,7 @@ module Infoboxer
|
|
159
160
|
end
|
160
161
|
end
|
161
162
|
|
163
|
+
# @private
|
162
164
|
# Internal, used by {Parser}
|
163
165
|
def strip
|
164
166
|
res = dup
|
@@ -167,6 +169,7 @@ module Infoboxer
|
|
167
169
|
res
|
168
170
|
end
|
169
171
|
|
172
|
+
# @private
|
170
173
|
# Internal, used by {Parser}
|
171
174
|
def flow_templates
|
172
175
|
make_nodes map{|n| n.is_a?(Paragraph) ? n.to_templates? : n}
|
data/lib/infoboxer/tree/paragraphs.rb
CHANGED
@@ -15,7 +15,6 @@ module Infoboxer
|
|
15
15
|
end
|
16
16
|
|
17
17
|
# @private
|
18
|
-
# Internal! Nothing to see here! Just YARD `@private` tag not working at class level
|
19
18
|
class EmptyParagraph < Node
|
20
19
|
def initialize(text)
|
21
20
|
@text = text
|
@@ -30,7 +29,6 @@ module Infoboxer
|
|
30
29
|
end
|
31
30
|
|
32
31
|
# @private
|
33
|
-
# Internal! Nothing to see here! Just YARD `@private` tag not working at class level
|
34
32
|
module Mergeable
|
35
33
|
def can_merge?(other)
|
36
34
|
!closed? && self.class == other.class
|
@@ -49,7 +47,6 @@ module Infoboxer
|
|
49
47
|
end
|
50
48
|
|
51
49
|
# @private
|
52
|
-
# Internal! Nothing to see here! Just YARD `@private` tag not working at class level
|
53
50
|
class MergeableParagraph < BaseParagraph
|
54
51
|
include Mergeable
|
55
52
|
|
@@ -61,21 +58,25 @@ module Infoboxer
|
|
61
58
|
|
62
59
|
# Represents plain text paragraph.
|
63
60
|
class Paragraph < MergeableParagraph
|
61
|
+
# @private
|
64
62
|
# Internal, used by {Parser} for merging
|
65
63
|
def splitter
|
66
64
|
Text.new(' ')
|
67
65
|
end
|
68
66
|
|
67
|
+
# @private
|
69
68
|
# Internal, used by {Parser}
|
70
69
|
def templates_only?
|
71
70
|
children.all?{|c| c.is_a?(Template) || c.is_a?(Text) && c.raw_text.strip.empty?}
|
72
71
|
end
|
73
72
|
|
73
|
+
# @private
|
74
74
|
# Internal, used by {Parser}
|
75
75
|
def to_templates
|
76
76
|
children.select(&filter(itself: Template))
|
77
77
|
end
|
78
78
|
|
79
|
+
# @private
|
79
80
|
# Internal, used by {Parser}
|
80
81
|
def to_templates?
|
81
82
|
templates_only? ? to_templates : self
|
@@ -104,6 +105,7 @@ module Infoboxer
|
|
104
105
|
#
|
105
106
|
# Paragraph-level thing, can contain many lines of text.
|
106
107
|
class Pre < MergeableParagraph
|
108
|
+
# @private
|
107
109
|
# Internal, used by {Parser}
|
108
110
|
def merge!(other)
|
109
111
|
if other.is_a?(EmptyParagraph) && !other.text.empty?
|
@@ -113,6 +115,7 @@ module Infoboxer
|
|
113
115
|
end
|
114
116
|
end
|
115
117
|
|
118
|
+
# @private
|
116
119
|
# Internal, used by {Parser} for merging
|
117
120
|
def splitter
|
118
121
|
Text.new("\n")
|
data/lib/infoboxer/tree/ref.rb
CHANGED
data/lib/infoboxer/tree/template.rb
CHANGED
@@ -1,4 +1,6 @@
|
|
1
1
|
# encoding: utf-8
|
2
|
+
require_relative 'linkable'
|
3
|
+
|
2
4
|
module Infoboxer
|
3
5
|
module Tree
|
4
6
|
# Template variable.
|
@@ -29,15 +31,79 @@ module Infoboxer
|
|
29
31
|
end
|
30
32
|
end
|
31
33
|
|
32
|
-
#
|
34
|
+
# Represents MediaWiki **template**.
|
35
|
+
#
|
36
|
+
# [**Template**](https://en.wikipedia.org/wiki/Wikipedia:Templates)
|
37
|
+
# is basically a thing with name, some variables and their
|
38
|
+
# values. When pages are displayed in browser, templates are rendered in
|
39
|
+
# something different by wiki engine; yet, when extracting information
|
40
|
+
# with Infoboxer, you are working with original templates.
|
41
|
+
#
|
42
|
+
# It requires some mastering and understanding, yet allows to do
|
43
|
+
# very powerful things. There are many kinds of them, from pure
|
44
|
+
# formatting-related (which are typically not more than small bells
|
45
|
+
# and whistles for page outlook, and should be rendered as a text)
|
46
|
+
# to very information-heavy ones, like
|
47
|
+
# [**infoboxes**](https://en.wikipedia.org/wiki/Help:Infobox), from
|
48
|
+
# which Infoboxer borrows its name!
|
49
|
+
#
|
50
|
+
# Basically, for information extraction from template you'll list
|
51
|
+
# its {#variables}, and then use {#fetch} method
|
52
|
+
# (and its variants: {#fetch_hash}/{#fetch_date}) to extract their
|
53
|
+
# values.
|
54
|
+
#
|
55
|
+
# ### On variables naming
|
56
|
+
# MediaWiki templates can contain _named_ and _unnamed_ variables.
|
57
|
+
# Example:
|
58
|
+
#
|
59
|
+
# ```
|
60
|
+
# {{birth date and age|1953|2|19|df=y}}
|
61
|
+
# ```
|
33
62
|
#
|
34
|
-
#
|
63
|
+
# This is template with name "birth date and age", three unnamed
|
64
|
+
# variables with values "1953", "2" and "19", and one named variable
|
65
|
+
# with name "df" and value "y".
|
35
66
|
#
|
36
|
-
#
|
37
|
-
#
|
38
|
-
#
|
67
|
+
# For consistency, Infoboxer treats unnamed variables _exactly_ the
|
68
|
+
# same way MediaWiki does: they are considered to have numeric names,
|
69
|
+
# which _start from 1_ and are _stored as strings_. So, for
|
70
|
+
# template shown above, the following is correct:
|
71
|
+
#
|
72
|
+
# ```ruby
|
73
|
+
# template.fetch('1').text == '1953'
|
74
|
+
# template.fetch('2').text == '2'
|
75
|
+
# template.fetch('3').text == '19'
|
76
|
+
# template.fetch('df').text == 'y'
|
77
|
+
# ```
|
78
|
+
#
|
79
|
+
# Note also, that _named variables with simple text values_ are
|
80
|
+
# duplicated as a template node {Node#params}, so, the following is
|
81
|
+
# correct also:
|
82
|
+
#
|
83
|
+
# ```ruby
|
84
|
+
# template.params['df'] == 'y'
|
85
|
+
# template.params.has_key?('1') == false
|
86
|
+
# ```
|
87
|
+
#
|
88
|
+
# For more advanced topics, like subclassing templates by names and
|
89
|
+
# converting them to inline text, please read {Templates} module's
|
90
|
+
# documentation.
|
39
91
|
class Template < Compound
|
40
|
-
|
92
|
+
# Template name, designating its contents structure.
|
93
|
+
#
|
94
|
+
# See also {Linkable#url #url}, which you can navigate to read template's
|
95
|
+
# definition (and, in Wikipedia and many other projects, its
|
96
|
+
# documentation).
|
97
|
+
#
|
98
|
+
# @return [String]
|
99
|
+
attr_reader :name
|
100
|
+
|
101
|
+
# Template variables list.
|
102
|
+
#
|
103
|
+
# See {Var} class to understand what you can do with them.
|
104
|
+
#
|
105
|
+
# @return [Nodes<Var>]
|
106
|
+
attr_reader :variables
|
41
107
|
|
42
108
|
def initialize(name, variables = Nodes[])
|
43
109
|
super(Nodes[], extract_params(variables))
|
@@ -51,6 +117,97 @@ module Infoboxer
|
|
51
117
|
variables.map{|var| var.to_tree(level+1)}.join
|
52
118
|
end
|
53
119
|
|
120
|
+
# Returns list of template variables with numeric names (which
|
121
|
+
# are treated as "unnamed" variables by MediaWiki templates, see
|
122
|
+
# {Template class docs} for explanation).
|
123
|
+
#
|
124
|
+
# @return [Nodes<Var>]
|
125
|
+
def unnamed_variables
|
126
|
+
variables.find(name: /^\d+$/)
|
127
|
+
end
|
128
|
+
|
129
|
+
# Fetches template variable(s) by name(s) or patterns.
|
130
|
+
#
|
131
|
+
# Usage:
|
132
|
+
#
|
133
|
+
# ```ruby
|
134
|
+
# argentina.infobox.fetch('leader_title_1') # => one Var node
|
135
|
+
# argentina.infobox.fetch('leader_title_1',
|
136
|
+
# 'leader_name_1') # => two Var nodes
|
137
|
+
# argentina.infobox.fetch(/leader_title_\d+/) # => several Var nodes
|
138
|
+
# ```
|
139
|
+
#
|
140
|
+
# @return [Nodes<Var>]
|
141
|
+
def fetch(*patterns)
|
142
|
+
Nodes[*patterns.map{|p| variables.find(name: p)}.flatten]
|
143
|
+
end
|
144
|
+
|
145
|
+
# Fetches hash `{name => variable}`, by same patterns as {#fetch}.
|
146
|
+
#
|
147
|
+
# @return [Hash<String => Var>]
|
148
|
+
def fetch_hash(*patterns)
|
149
|
+
fetch(*patterns).map{|v| [v.name, v]}.to_h
|
150
|
+
end
|
151
|
+
|
152
|
+
# Fetches date by list of variable names containing date components.
|
153
|
+
#
|
154
|
+
# _(Experimental, subject to change or enhancement.)_
|
155
|
+
#
|
156
|
+
# Explanation: if you have template like
|
157
|
+
# ```
|
158
|
+
# {{birth date and age|1953|2|19|df=y}}
|
159
|
+
# ```
|
160
|
+
# ...there is a short way to obtain date from it:
|
161
|
+
# ```ruby
|
162
|
+
# template.fetch_date('1', '2', '3') # => Date.new(1953,2,19)
|
163
|
+
# ```
|
164
|
+
#
|
165
|
+
# @return [Date]
|
166
|
+
def fetch_date(*patterns)
|
167
|
+
components = fetch(*patterns)
|
168
|
+
components.pop while components.last.nil? && !components.empty?
|
169
|
+
|
170
|
+
if components.empty?
|
171
|
+
nil
|
172
|
+
else
|
173
|
+
Date.new(*components.map{|v| v.to_s.to_i})
|
174
|
+
end
|
175
|
+
end
|
176
|
+
|
177
|
+
include Linkable
|
178
|
+
|
179
|
+
# @!method follow
|
180
|
+
# Extracts template source and returns it parsed (or nil,
|
181
|
+
# if template not found).
|
182
|
+
#
|
183
|
+
# **NB**: Infoboxer does NO variable substitution or other template
|
184
|
+
# evaluation actions. Moreover, it will almost certainly NOT parse
|
185
|
+
# template definitions correctly. You should use this method ONLY
|
186
|
+
# for "transclusion" templates (parts of content, which are
|
187
|
+
# included into other pages "as is").
|
188
|
+
#
|
189
|
+
# Look for example at [this page's](https://en.wikipedia.org/wiki/Tropical_and_subtropical_coniferous_forests)
|
190
|
+
# [source](https://en.wikipedia.org/w/index.php?title=Tropical_and_subtropical_coniferous_forests&action=edit):
|
191
|
+
# each subtable about some region is just a transclusion of
|
192
|
+
# template. This can be processed like:
|
193
|
+
#
|
194
|
+
# ```ruby
|
195
|
+
# Infoboxer.wp.get('Tropical and subtropical coniferous forests').
|
196
|
+
# templates(name: /forests$/).
|
197
|
+
# follow.tables #.and_so_on
|
198
|
+
# ```
|
199
|
+
#
|
200
|
+
# @return {MediaWiki::Page}
|
201
|
+
#
|
202
|
+
# **See also** {Linkable#follow} for general notes on following links.
|
203
|
+
|
204
|
+
# Wikilink name of this template's source.
|
205
|
+
def link
|
206
|
+
# FIXME: super-naive for now, doesn't account for subpages and the like.
|
207
|
+
"Template:#{name}"
|
208
|
+
end
|
209
|
+
|
210
|
+
# @private
|
54
211
|
# Internal, used by {Parser}.
|
55
212
|
def empty?
|
56
213
|
false
|
data/lib/infoboxer/tree/text.rb
CHANGED
@@ -29,11 +29,13 @@ module Infoboxer
|
|
29
29
|
"#{indent(level)}#{text} <#{descr}>\n"
|
30
30
|
end
|
31
31
|
|
32
|
+
# @private
|
32
33
|
# Internal, used by {Parser}
|
33
34
|
def can_merge?(other)
|
34
35
|
other.is_a?(String) || other.is_a?(Text)
|
35
36
|
end
|
36
37
|
|
38
|
+
# @private
|
37
39
|
# Internal, used by {Parser}
|
38
40
|
def merge!(other)
|
39
41
|
if other.is_a?(String)
|
@@ -45,6 +47,7 @@ module Infoboxer
|
|
45
47
|
end
|
46
48
|
end
|
47
49
|
|
50
|
+
# @private
|
48
51
|
# Internal, used by {Parser}
|
49
52
|
def empty?
|
50
53
|
raw_text.empty?
|
data/lib/infoboxer/tree/wikilink.rb
CHANGED
@@ -1,4 +1,6 @@
|
|
1
1
|
# encoding: utf-8
|
2
|
+
require_relative 'linkable'
|
3
|
+
|
2
4
|
module Infoboxer
|
3
5
|
module Tree
|
4
6
|
# Internal MediaWiki link class.
|
@@ -6,6 +8,8 @@ module Infoboxer
|
|
6
8
|
# See [Wikipedia docs](https://en.wikipedia.org/wiki/Help:Link#Wikilinks)
|
7
9
|
# for extensive explanation of Wikilink concept.
|
8
10
|
#
|
11
|
+
# Note that Wikilink is {Linkable}, so you can {Linkable#follow #follow}
|
12
|
+
# it to obtain linked pages.
|
9
13
|
class Wikilink < Link
|
10
14
|
def initialize(*)
|
11
15
|
super
|
@@ -37,20 +41,7 @@ module Infoboxer
|
|
37
41
|
# See {#topic} for explanation.
|
38
42
|
attr_reader :refinement
|
39
43
|
|
40
|
-
|
41
|
-
# if page not found).
|
42
|
-
#
|
43
|
-
# @return {MediaWiki::Page}
|
44
|
-
#
|
45
|
-
# **See also**:
|
46
|
-
# * {Tree::Nodes#follow} for extracting multiple links at once;
|
47
|
-
# * {MediaWiki#get} for basic information on page extraction.
|
48
|
-
def follow
|
49
|
-
page = lookup_parents(MediaWiki::Page).first or
|
50
|
-
fail("Not in a page from real source")
|
51
|
-
page.client or fail("MediaWiki client not set")
|
52
|
-
page.client.get(link)
|
53
|
-
end
|
44
|
+
include Linkable
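
This hunk replaces Wikilink's own `follow` with `include Linkable`, and the template class gains the same mixin plus a `link` method. A rough sketch of that shared-mixin pattern follows. It assumes the mixin only needs the includer to provide `link` and a client; `FollowableSketch`, `FakeClient`, `WikilinkSketch` and `TemplateSketch` are hypothetical names, not Infoboxer's actual classes:

```ruby
# Hypothetical sketch of the mixin pattern; not Infoboxer's actual Linkable.
module FollowableSketch
  # Resolves the includer's #link through its #client.
  def follow
    client or fail('MediaWiki client not set')
    client.get(link)
  end
end

# Fake client: returns a string instead of fetching a real page.
class FakeClient
  def get(title)
    "page:#{title}"
  end
end

# A link-like node that stores its target directly.
class WikilinkSketch
  include FollowableSketch
  attr_reader :link, :client

  def initialize(link, client)
    @link = link
    @client = client
  end
end

# A template-like node that derives its target from its name.
class TemplateSketch
  include FollowableSketch
  attr_reader :name, :client

  def initialize(name, client)
    @name = name
    @client = client
  end

  # Mirrors the naive "Template:<name>" construction shown in the diff.
  def link
    "Template:#{name}"
  end
end

client = FakeClient.new
puts WikilinkSketch.new('Argentina', client).follow
puts TemplateSketch.new('Infobox country', client).follow
```

With this shape, each node class only has to say what it links to, and the fetching logic lives in one place.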
|
54
45
|
|
55
46
|
private
|
56
47
|
|
data/lib/infoboxer/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: infoboxer
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.1.
|
4
|
+
version: 0.1.1
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Victor Shepelev
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date: 2015-08-
|
11
|
+
date: 2015-08-11 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: htmlentities
|
@@ -178,6 +178,20 @@ dependencies:
|
|
178
178
|
- - ">="
|
179
179
|
- !ruby/object:Gem::Version
|
180
180
|
version: '0'
|
181
|
+
- !ruby/object:Gem::Dependency
|
182
|
+
name: inch
|
183
|
+
requirement: !ruby/object:Gem::Requirement
|
184
|
+
requirements:
|
185
|
+
- - ">="
|
186
|
+
- !ruby/object:Gem::Version
|
187
|
+
version: '0'
|
188
|
+
type: :development
|
189
|
+
prerelease: false
|
190
|
+
version_requirements: !ruby/object:Gem::Requirement
|
191
|
+
requirements:
|
192
|
+
- - ">="
|
193
|
+
- !ruby/object:Gem::Version
|
194
|
+
version: '0'
|
181
195
|
description: |2
|
182
196
|
Infoboxer is library targeting use of Wikipedia (or any other
|
183
197
|
MediaWiki-based wiki) as a rich powerful data source.
|
@@ -188,6 +202,8 @@ extra_rdoc_files: []
|
|
188
202
|
files:
|
189
203
|
- ".dokaz"
|
190
204
|
- ".yardopts"
|
205
|
+
- CHANGELOG.md
|
206
|
+
- CONTRIBUTING.md
|
191
207
|
- LICENSE.txt
|
192
208
|
- Parsing.md
|
193
209
|
- README.md
|
@@ -225,6 +241,7 @@ files:
|
|
225
241
|
- lib/infoboxer/tree/html.rb
|
226
242
|
- lib/infoboxer/tree/image.rb
|
227
243
|
- lib/infoboxer/tree/inline.rb
|
244
|
+
- lib/infoboxer/tree/linkable.rb
|
228
245
|
- lib/infoboxer/tree/list.rb
|
229
246
|
- lib/infoboxer/tree/node.rb
|
230
247
|
- lib/infoboxer/tree/nodes.rb
|
@@ -245,7 +262,7 @@ files:
|
|
245
262
|
- regression/pages/south_america.wiki
|
246
263
|
- regression/pages/ukraine.wiki
|
247
264
|
- regression/pages/usa.wiki
|
248
|
-
homepage: https://github.com/
|
265
|
+
homepage: https://github.com/molybdenum-99/infoboxer
|
249
266
|
licenses:
|
250
267
|
- MIT
|
251
268
|
metadata: {}
|