infoboxer 0.1.0 → 0.1.1
- checksums.yaml +4 -4
- data/.yardopts +1 -0
- data/CHANGELOG.md +14 -0
- data/CONTRIBUTING.md +62 -0
- data/README.md +1 -1
- data/infoboxer.gemspec +2 -1
- data/lib/infoboxer/media_wiki/traits.rb +42 -8
- data/lib/infoboxer/templates.rb +37 -2
- data/lib/infoboxer/templates/base.rb +9 -24
- data/lib/infoboxer/templates/set.rb +107 -3
- data/lib/infoboxer/tree/compound.rb +5 -0
- data/lib/infoboxer/tree/html.rb +1 -0
- data/lib/infoboxer/tree/linkable.rb +42 -0
- data/lib/infoboxer/tree/list.rb +4 -0
- data/lib/infoboxer/tree/node.rb +2 -0
- data/lib/infoboxer/tree/nodes.rb +3 -0
- data/lib/infoboxer/tree/paragraphs.rb +6 -3
- data/lib/infoboxer/tree/ref.rb +1 -0
- data/lib/infoboxer/tree/template.rb +163 -6
- data/lib/infoboxer/tree/text.rb +3 -0
- data/lib/infoboxer/tree/wikilink.rb +5 -14
- data/lib/infoboxer/version.rb +1 -1
- metadata +20 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 37db410a52fa9f9755b70ff3a1284c6a427509b8
|
4
|
+
data.tar.gz: cd9952485ffffb6b3eb3d3cb69cb771aa2c9610a
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 2800426e4a055f838a89b50bd9388364f7a2499d2736816c686a5d9bb2cc727d2bbdfe485565835d844cefe75c8d56f510f2fd2823411e4b7a37c4d0fd3122bf
|
7
|
+
data.tar.gz: 6dc878f4957c468051ecff57eee57ca66ac2ed19e157cb6f62f993113ba2c8bd220f87926eda4f51aa5917d097ca2e8c13d94e2acd9dff8b8f82036412845245
|
data/.yardopts
CHANGED
data/CHANGELOG.md
ADDED
@@ -0,0 +1,14 @@
|
|
1
|
+
# Infoboxer's change log
|
2
|
+
|
3
|
+
## 0.1.1 (2015-08-11)
|
4
|
+
|
5
|
+
Basically, preparing for wider release!
|
6
|
+
|
7
|
+
* Small refactorings;
|
8
|
+
* Documentation fixes;
|
9
|
+
|
10
|
+
## 0.1.0 (2015-08-07)
|
11
|
+
|
12
|
+
Initial (ok, I know it's typically called 0.1.1, but here's work of
|
13
|
+
three monthes, numerous documentations and examples and so on... so, let
|
14
|
+
it be 0.1.0).
|
data/CONTRIBUTING.md
ADDED
@@ -0,0 +1,62 @@
|
|
1
|
+
# Contributing to Infoboxer
|
2
|
+
|
3
|
+
_(Also duplicated in [wiki](https://github.com/molybdenum-99/infoboxer/wiki/Contributing).)_
|
4
|
+
|
5
|
+
## Contributing via test cases
|
6
|
+
|
7
|
+
If you are assured that Infoboxer takes some page wrong, please create an
|
8
|
+
[issue](https://github.com/molybdenum-99/infoboxer/issues) with link
|
9
|
+
to page (or raw wikitext) and description of a problem.
|
10
|
+
|
11
|
+
## Contributing via localizations and templates describing
|
12
|
+
|
13
|
+
Look at [en.wikipedia.org](https://github.com/molybdenum-99/infoboxer/blob/master/lib/infoboxer/definitions/en.wikipedia.org.rb)
|
14
|
+
template definitions. It can be extended. Also, similar definitions
|
15
|
+
can/should be created for other language wikipedias and other popular
|
16
|
+
wikis.
|
17
|
+
|
18
|
+
You can do pull requests with your own definitions, or create an
|
19
|
+
[issue](https://github.com/molybdenum-99/infoboxer/issues) describing
|
20
|
+
which template definitions should be added to Infoboxer.
|
21
|
+
|
22
|
+
## Contributing via code
|
23
|
+
|
24
|
+
If you want to fix some bug or implement some feature, please just
|
25
|
+
follow the standard process for github opensource: fork, fix, push,
|
26
|
+
make pull request.
|
27
|
+
|
28
|
+
Some (scanty) information below.
|
29
|
+
|
30
|
+
### Understanding the code
|
31
|
+
|
32
|
+
* Infoboxer is splitted in several modules (which are clearly visible in
|
33
|
+
API docs and folders structure).
|
34
|
+
* Most of "easy features"
|
35
|
+
can be added to [Navigation](http://www.rubydoc.info/gems/infoboxer/Infoboxer/Navigation)
|
36
|
+
module and its submodules: enchancing of navigational experience and
|
37
|
+
implement clever shortcuts (like "converting table to dataframe/list of
|
38
|
+
hashes", for ex.).
|
39
|
+
* Most of potential bugs can seat in
|
40
|
+
[Parser](http://www.rubydoc.info/gems/infoboxer/Infoboxer/Parser) class
|
41
|
+
and its modules; MediaWiki markup IS tricky and tightly coupled and
|
42
|
+
ambigous; there's also some non-implemented features, like `<source>`
|
43
|
+
tag parsing and template definition pages (which, possibly, is not
|
44
|
+
target of Infoboxer anyways).
|
45
|
+
* Most of underfeatured area is in
|
46
|
+
[MediaWiki](http://www.rubydoc.info/gems/infoboxer/Infoboxer/MediaWiki)
|
47
|
+
-- seems reasonable for information extraction purposes to have more
|
48
|
+
features from MediaWiki API, like "page list generators", search,
|
49
|
+
"what links here" and similar functionality.
|
50
|
+
* Most of clarification and documentation is required for
|
51
|
+
[Templates](http://www.rubydoc.info/gems/infoboxer/Infoboxer/Templates)
|
52
|
+
module, which is still underloved heart of Infoboxer.
|
53
|
+
|
54
|
+
### Parser: quick, not clever
|
55
|
+
|
56
|
+
Whether you'd want to put your hands on Parser: please remember, that
|
57
|
+
it's hand-crafted and thoroughly optimized. The first thought you may
|
58
|
+
have that it needs more OO decompozition, a class for each case; or more
|
59
|
+
ideomatic Ruby, or ... Trust me, I've tried it all. But when you are
|
60
|
+
dealing with hundreds of thousands of parsing operations and tens of
|
61
|
+
thousands of resulting nodes, it turns out even simplest things like
|
62
|
+
`Object#tap` have performance penalty on large number of calls.
|
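The performance remark about `Object#tap` in the CONTRIBUTING text above can be checked with a rough micro-benchmark. This is only a sketch: the helper names `build_plain` and `build_with_tap` and the iteration count are hypothetical, not taken from Infoboxer's parser, and the measured difference depends heavily on the Ruby version (it may well be negligible on recent interpreters).

```ruby
require 'benchmark'

# Hypothetical iteration count, roughly the "hundreds of thousands
# of operations" scale mentioned in the text.
N = 200_000

# Builds small arrays with plain local-variable code.
def build_plain(n)
  result = []
  n.times do |i|
    node = [i]
    node << :child
    result << node
  end
  result
end

# Builds the same arrays, but goes through Object#tap (one extra block call each).
def build_with_tap(n)
  result = []
  n.times { |i| result << [i].tap { |node| node << :child } }
  result
end

Benchmark.bm(10) do |x|
  x.report('plain') { build_plain(N) }
  x.report('tap')   { build_with_tap(N) }
end
```

Both helpers return identical data, so any timing gap comes from the extra method and block invocation only.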
data/README.md
CHANGED
@@ -16,7 +16,7 @@ obvious structure, you can navigate that tree easily, and you have a
|
|
16
16
|
bunch of hi-level helpers method, so typical information extraction
|
17
17
|
tasks should be super-easy, one-liners in best cases.
|
18
18
|
|
19
|
-
_(For those already
|
19
|
+
_(For those already thinking "Why should you do this, we already have
|
20
20
|
DBPedia?" -- please, read "[Reasons](https://github.com/molybdenum-99/infoboxer/wiki/Reasons)"
|
21
21
|
page in our wiki.)_
|
22
22
|
|
data/infoboxer.gemspec
CHANGED
@@ -5,7 +5,7 @@ Gem::Specification.new do |s|
|
|
5
5
|
s.version = Infoboxer::VERSION
|
6
6
|
s.authors = ['Victor Shepelev']
|
7
7
|
s.email = 'zverok.offline@gmail.com'
|
8
|
-
s.homepage = 'https://github.com/
|
8
|
+
s.homepage = 'https://github.com/molybdenum-99/infoboxer'
|
9
9
|
|
10
10
|
s.summary = 'MediaWiki client and parser, targeting information extraction.'
|
11
11
|
s.description = <<-EOF
|
@@ -40,4 +40,5 @@ Gem::Specification.new do |s|
|
|
40
40
|
s.add_development_dependency 'ruby-prof'
|
41
41
|
s.add_development_dependency 'vcr'
|
42
42
|
s.add_development_dependency 'webmock'
|
43
|
+
s.add_development_dependency 'inch'
|
43
44
|
end
|
data/lib/infoboxer/media_wiki/traits.rb
CHANGED
@@ -1,8 +1,20 @@
|
|
1
1
|
# encoding: utf-8
|
2
2
|
module Infoboxer
|
3
3
|
class MediaWiki
|
4
|
+
# DSL for defining "traits" for some site.
|
5
|
+
#
|
6
|
+
# More docs (and possible refactoring) to follow.
|
7
|
+
#
|
8
|
+
# You can look at current
|
9
|
+
# [English Wikipedia traits](https://github.com/molybdenum-99/infoboxer/blob/master/lib/infoboxer/definitions/en.wikipedia.org.rb)
|
10
|
+
# definitions in Infoboxer's repo.
|
4
11
|
class Traits
|
5
12
|
class << self
|
13
|
+
# Define set of templates for current site's traits.
|
14
|
+
#
|
15
|
+
# See {Templates::Set} for longer (yet insufficient) explanation.
|
16
|
+
#
|
17
|
+
# Expected to be used inside Traits definition block.
|
6
18
|
def templates(&definition)
|
7
19
|
@templates ||= Templates::Set.new
|
8
20
|
|
@@ -11,35 +23,49 @@ module Infoboxer
|
|
11
23
|
@templates.define(&definition)
|
12
24
|
end
|
13
25
|
|
14
|
-
#
|
26
|
+
# @private
|
15
27
|
def domain(d)
|
28
|
+
# NB: explicitly store all domains in base Traits class
|
16
29
|
Traits.domains.key?(d) and
|
17
30
|
fail(ArgumentError, "Domain binding redefinition: #{Traits.domains[d]}")
|
18
31
|
|
19
32
|
Traits.domains[d] = self
|
20
33
|
end
|
21
34
|
|
35
|
+
# @private
|
22
36
|
def get(domain, options = {})
|
23
37
|
cls = Traits.domains[domain]
|
24
38
|
cls ? cls.new(options) : Traits.new(options)
|
25
39
|
end
|
26
40
|
|
41
|
+
# @private
|
27
42
|
def domains
|
28
43
|
@domains ||= {}
|
29
44
|
end
|
30
45
|
|
46
|
+
# Define traits for some domain. Use it like:
|
47
|
+
#
|
48
|
+
# ```ruby
|
49
|
+
# MediaWiki::Traits.for 'ru.wikipedia.org' do
|
50
|
+
# templates do
|
51
|
+
# template '...' do
|
52
|
+
# # some template definition
|
53
|
+
# end
|
54
|
+
# end
|
55
|
+
# end
|
56
|
+
# ```
|
57
|
+
#
|
58
|
+
# Again, you can look at current
|
59
|
+
# [English Wikipedia traits](https://github.com/molybdenum-99/infoboxer/blob/master/lib/infoboxer/definitions/en.wikipedia.org.rb)
|
60
|
+
# for example implementation.
|
31
61
|
def for(domain, &block)
|
32
62
|
Class.new(self, &block).domain(domain)
|
33
63
|
end
|
34
64
|
|
65
|
+
# @private
|
35
66
|
alias_method :default, :new
|
36
67
|
end
|
37
68
|
|
38
|
-
DEFAULTS = {
|
39
|
-
file_prefix: 'File',
|
40
|
-
category_prefix: 'Category'
|
41
|
-
}
|
42
|
-
|
43
69
|
def initialize(options = {})
|
44
70
|
@options = options
|
45
71
|
@file_prefix = [DEFAULTS[:file_prefix], options.delete(:file_prefix)].
|
@@ -48,13 +74,21 @@ module Infoboxer
|
|
48
74
|
flatten.compact.uniq
|
49
75
|
end
|
50
76
|
|
77
|
+
# @private
|
51
78
|
attr_reader :file_prefix, :category_prefix
|
52
79
|
|
53
|
-
#
|
54
|
-
|
80
|
+
# @private
|
55
81
|
def templates
|
56
82
|
self.class.templates
|
57
83
|
end
|
84
|
+
|
85
|
+
private
|
86
|
+
|
87
|
+
DEFAULTS = {
|
88
|
+
file_prefix: 'File',
|
89
|
+
category_prefix: 'Category'
|
90
|
+
}
|
91
|
+
|
58
92
|
end
|
59
93
|
end
|
60
94
|
end
|
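The domain registry visible in the `Traits` hunks above (`for` creating an anonymous subclass, `domain` recording it in the base class, `get` falling back to the base class) can be sketched in isolation. This is a simplified stand-in, not the gem's code: `SketchTraits` and the `greeting` method are hypothetical names, and only the `file_prefix` option is modelled.

```ruby
# Simplified stand-in for the registry shown in the diff (hypothetical class name).
class SketchTraits
  DEFAULTS = { file_prefix: 'File', category_prefix: 'Category' }.freeze

  class << self
    # All bindings are stored on the base class, whichever subclass registers them.
    def domains
      @domains ||= {}
    end

    # Binds the receiving class to a domain; rebinding a domain is an error.
    def domain(d)
      SketchTraits.domains.key?(d) and
        fail(ArgumentError, "Domain binding redefinition: #{SketchTraits.domains[d]}")

      SketchTraits.domains[d] = self
    end

    # Creates an anonymous subclass from the block and registers it.
    def for(domain, &block)
      Class.new(self, &block).domain(domain)
    end

    # Returns site-specific traits if registered, otherwise default traits.
    def get(domain, options = {})
      cls = SketchTraits.domains[domain]
      cls ? cls.new(options) : SketchTraits.new(options)
    end
  end

  attr_reader :file_prefix

  def initialize(options = {})
    @file_prefix = [DEFAULTS[:file_prefix], options[:file_prefix]].flatten.compact.uniq
  end
end

SketchTraits.for('ru.wikipedia.org') do
  def greeting
    'privet'
  end
end
```

With this in place, `SketchTraits.get('ru.wikipedia.org')` returns an instance of the anonymous subclass, while an unregistered domain yields a plain `SketchTraits` instance.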
data/lib/infoboxer/templates.rb
CHANGED
@@ -1,7 +1,42 @@
|
|
1
1
|
module Infoboxer
|
2
|
-
#
|
2
|
+
# This module covers advanced MediaWiki templates usage.
|
3
|
+
#
|
4
|
+
# It is seriously adviced to read [Wikipedia docs](https://en.wikipedia.org/wiki/Help:Template)
|
5
|
+
# or at least look through it (and have it opened while reading further).
|
6
|
+
#
|
7
|
+
# If you just have a page with templates and want some variable value
|
8
|
+
# (like "page about country - infobox - total population"), you should
|
9
|
+
# be totally happy with {Tree::Template} and its features.
|
10
|
+
#
|
11
|
+
# What this module does is, basically, two things:
|
12
|
+
# * allow you to define for arbitrary templates how they are converted
|
13
|
+
# to text; by default, templates are totally excluded from text, which
|
14
|
+
# is not most reasonable behavior for many formatting templates;
|
15
|
+
# * allow you to define additional functionality for arbitrary templates;
|
16
|
+
# many of them containing pretty complicated logic (see, for ex.,
|
17
|
+
# [Template:Convert](https://en.wikipedia.org/wiki/Template:Convert),
|
18
|
+
# and it seems reasonable to extend instances of such a template.
|
19
|
+
#
|
20
|
+
# Infoboxer allows you to define {Templates::Set} of template-specific
|
21
|
+
# classes for some site/domain.
|
22
|
+
# There is already defined set of most commonly used templates at
|
23
|
+
# en.wikipedia.org (so, most of English Wikipedia texts will be rendered
|
24
|
+
# correctly, and also some advanced functionality is provided).
|
25
|
+
# You can take a look at
|
26
|
+
# [lib/infoboxer/definitions/en.wikipedia.org.rb](https://github.com/molybdenum-99/infoboxer/blob/master/lib/infoboxer/definitions/en.wikipedia.org.rb)
|
27
|
+
# to feel it (and also see a couple of TODOs and FIXMEs and other
|
28
|
+
# considerations).
|
29
|
+
#
|
30
|
+
# From Infoboxer's point-of-view, templates are the most complex part
|
31
|
+
# of Wikipedia, and we are currently trying hard to do the most reasonable
|
32
|
+
# things about them.
|
33
|
+
#
|
34
|
+
# Future versions also should:
|
35
|
+
# * define more of common English Wikipedia templates;
|
36
|
+
# * define templates for other popular wikis;
|
37
|
+
# * allow to add template definitions on-the-fly, while loading some
|
38
|
+
# page.
|
3
39
|
#
|
4
|
-
# I do my best.
|
5
40
|
module Templates
|
6
41
|
%w[base set].each do |tmpl|
|
7
42
|
require_relative "templates/#{tmpl}"
|
data/lib/infoboxer/templates/base.rb
CHANGED
@@ -15,29 +15,6 @@ module Infoboxer
|
|
15
15
|
end
|
16
16
|
end
|
17
17
|
|
18
|
-
def unnamed_variables
|
19
|
-
variables.select{|v| v.name =~ /^\d+$/}
|
20
|
-
end
|
21
|
-
|
22
|
-
def fetch(*patterns)
|
23
|
-
Nodes[*patterns.map{|p| variables.find(name: p)}.flatten]
|
24
|
-
end
|
25
|
-
|
26
|
-
def fetch_hash(*patterns)
|
27
|
-
fetch(*patterns).map{|v| [v.name, v]}.to_h
|
28
|
-
end
|
29
|
-
|
30
|
-
def fetch_date(*patterns)
|
31
|
-
components = fetch(*patterns)
|
32
|
-
components.pop while components.last.nil? && !components.empty?
|
33
|
-
|
34
|
-
if components.empty?
|
35
|
-
nil
|
36
|
-
else
|
37
|
-
Date.new(*components.map{|v| v.to_s.to_i})
|
38
|
-
end
|
39
|
-
end
|
40
|
-
|
41
18
|
def ==(other)
|
42
19
|
other.kind_of?(Tree::Template) && _eq(other)
|
43
20
|
end
|
@@ -54,7 +31,9 @@ module Infoboxer
|
|
54
31
|
end
|
55
32
|
|
56
33
|
# Renders all of its unnamed variables as space-separated text
|
57
|
-
# Also allows in-template navigation
|
34
|
+
# Also allows in-template navigation.
|
35
|
+
#
|
36
|
+
# Used for {Set} definitions.
|
58
37
|
class Show < Base
|
59
38
|
alias_method :children, :unnamed_variables
|
60
39
|
|
@@ -65,6 +44,9 @@ module Infoboxer
|
|
65
44
|
end
|
66
45
|
end
|
67
46
|
|
47
|
+
# Replaces template with replacement, while rendering.
|
48
|
+
#
|
49
|
+
# Used for {Set} definitions.
|
68
50
|
class Replace < Base
|
69
51
|
def replace
|
70
52
|
fail(NotImplementedError, "Descendants should define :replace")
|
@@ -75,6 +57,9 @@ module Infoboxer
|
|
75
57
|
end
|
76
58
|
end
|
77
59
|
|
60
|
+
# Replaces template with its name, while rendering.
|
61
|
+
#
|
62
|
+
# Used for {Set} definitions.
|
78
63
|
class Literal < Base
|
79
64
|
alias_method :text, :name
|
80
65
|
end
|
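The three rendering strategies documented in the hunks above (`Show`, `Replace`, `Literal`) can be illustrated with a minimal model. This is a sketch under assumptions: the `Sketch*` classes and `DoubleBang` are hypothetical, unnamed variables are modelled as plain strings, and the empty default `text` follows the statement in the module docs that templates are excluded from text by default.

```ruby
# Hypothetical minimal model of a template node with unnamed variables.
class SketchBase
  attr_reader :name, :unnamed

  def initialize(name, *unnamed)
    @name = name
    @unnamed = unnamed
  end

  # Default: the template contributes nothing to the rendered text.
  def text
    ''
  end
end

# "Show": renders unnamed variables as space-separated text.
class SketchShow < SketchBase
  def text
    unnamed.join(' ')
  end
end

# "Replace": descendants supply a fixed replacement string.
class SketchReplace < SketchBase
  def replace
    fail(NotImplementedError, 'Descendants should define :replace')
  end

  def text
    replace
  end
end

# "Literal": the template renders as its own name.
class SketchLiteral < SketchBase
  alias_method :text, :name
end

# A concrete replacement, in the spirit of `'!!' => '||'`.
class DoubleBang < SketchReplace
  def replace
    '||'
  end
end
```

For example, `SketchShow.new('small', 'text').text` gives `"text"`, and `SketchLiteral.new(',').text` gives `","`.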
data/lib/infoboxer/templates/set.rb
CHANGED
@@ -1,31 +1,103 @@
|
|
1
1
|
# encoding: utf-8
|
2
2
|
module Infoboxer
|
3
3
|
module Templates
|
4
|
+
# Base class for defining set of templates, used for some site/domain.
|
5
|
+
#
|
6
|
+
# Currently only can be plugged in via {MediaWiki::Traits.templates}.
|
7
|
+
#
|
8
|
+
# Template set defines a DSL for creating new template definitions --
|
9
|
+
# also simplest ones and very complicated.
|
10
|
+
#
|
11
|
+
# You can look at implementation of English Wikipedia
|
12
|
+
# [common templates set](https://github.com/molybdenum-99/infoboxer/blob/master/lib/infoboxer/definitions/en.wikipedia.org.rb)
|
13
|
+
# in Infoboxer's repo.
|
14
|
+
#
|
4
15
|
class Set
|
5
16
|
def initialize(&definitions)
|
6
17
|
@templates = []
|
7
18
|
define(&definitions) if definitions
|
8
19
|
end
|
9
|
-
|
20
|
+
|
21
|
+
# @private
|
10
22
|
def find(name)
|
11
23
|
_, template = @templates.detect{|m, t| m === name.downcase}
|
12
24
|
template || Base
|
13
25
|
end
|
14
26
|
|
27
|
+
# @private
|
15
28
|
def define(&definitions)
|
16
29
|
instance_eval(&definitions)
|
17
30
|
end
|
18
31
|
|
32
|
+
# @private
|
19
33
|
def clear
|
20
34
|
@templates.clear
|
21
35
|
end
|
22
36
|
|
23
|
-
|
24
|
-
|
37
|
+
# Most common form of template definition.
|
38
|
+
#
|
39
|
+
# Can be used like:
|
40
|
+
#
|
41
|
+
# ```ruby
|
42
|
+
# template 'Age' do
|
43
|
+
# def from
|
44
|
+
# fetch_date('1', '2', '3')
|
45
|
+
# end
|
46
|
+
#
|
47
|
+
# def to
|
48
|
+
# fetch_date('4', '5', '6') || Date.today
|
49
|
+
# end
|
50
|
+
#
|
51
|
+
# def value
|
52
|
+
# (to - from).to_i / 365 # FIXME: obviously
|
53
|
+
# end
|
54
|
+
#
|
55
|
+
# def text
|
56
|
+
# "#{value} years"
|
57
|
+
# end
|
58
|
+
# end
|
59
|
+
# ```
|
60
|
+
#
|
61
|
+
# @param name Definition name.
|
62
|
+
# @param options Definition options.
|
63
|
+
# Currently recognized options are:
|
64
|
+
# * `:match` -- regexp or string, which matches template name to
|
65
|
+
# add this definition to (if not provided, `name` param used
|
66
|
+
# to match relevant templates);
|
67
|
+
# * `:base` -- name of template definition to use as a base class;
|
68
|
+
# for example you can do things like:
|
69
|
+
#
|
70
|
+
# ```ruby
|
71
|
+
# # ...inside template set definition...
|
72
|
+
# template 'Infobox', match: /^Infobox/ do
|
73
|
+
# # implementation
|
74
|
+
# end
|
75
|
+
#
|
76
|
+
# template 'Infobox cheese', base: 'Infobox' do
|
77
|
+
# end
|
78
|
+
# ```
|
79
|
+
#
|
80
|
+
# Expected to be used inside Set definition block.
|
25
81
|
def template(name, options = {}, &definition)
|
26
82
|
setup_class(name, Base, options, &definition)
|
27
83
|
end
|
28
84
|
|
85
|
+
# Define list of "replacements": templates, which text should be replaced
|
86
|
+
# with arbitrary value.
|
87
|
+
#
|
88
|
+
# Example:
|
89
|
+
#
|
90
|
+
# ```ruby
|
91
|
+
# # ...inside template set definition...
|
92
|
+
# replace(
|
93
|
+
# '!!' => '||',
|
94
|
+
# '!(' => '['
|
95
|
+
# )
|
96
|
+
# ```
|
97
|
+
# Now, all templates with name `!!` will render as `||` when you
|
98
|
+
# call their (or their parents') {Tree::Node#text}.
|
99
|
+
#
|
100
|
+
# Expected to be used inside Set definition block.
|
29
101
|
def replace(*replacements)
|
30
102
|
case
|
31
103
|
when replacements.count == 2 && replacements.all?{|r| r.is_a?(String)}
|
@@ -44,18 +116,50 @@ module Infoboxer
|
|
44
116
|
end
|
45
117
|
end
|
46
118
|
|
119
|
+
# Define list of "show children" templates. Those ones, when rendered
|
120
|
+
# as text, just provide join of their children text (space-separated).
|
121
|
+
#
|
122
|
+
# Example:
|
123
|
+
#
|
124
|
+
# ```ruby
|
125
|
+
# #...in template set definition...
|
126
|
+
# show 'Small'
|
127
|
+
# ```
|
128
|
+
# Now, wikitext paragraph looking like...
|
129
|
+
#
|
130
|
+
# ```
|
131
|
+
# This is {{small|text}} in template
|
132
|
+
# ```
|
133
|
+
# ...before this template definition had rendered like
|
134
|
+
# `"This is in template"` (template contents ommitted), and after
|
135
|
+
# this definition it will render like `"This is text in template"`
|
136
|
+
# (template contents rendered as is).
|
137
|
+
#
|
138
|
+
# Expected to be used inside Set definition block.
|
47
139
|
def show(*names)
|
48
140
|
names.each do |name|
|
49
141
|
setup_class(name, Show)
|
50
142
|
end
|
51
143
|
end
|
52
144
|
|
145
|
+
# Define list of "literally rendered templates". It means, when
|
146
|
+
# rendering text, template is replaced with just its name.
|
147
|
+
#
|
148
|
+
# Explanation: in
|
149
|
+
# MediaWiki, there are contexts (deeply in other templates and
|
150
|
+
# tables), when you can't just type something like `","` and not
|
151
|
+
# have it interpreted. So, wikis oftenly define wrappers around
|
152
|
+
# those templates, looking like `{{,}}` -- so, while rendering texts,
|
153
|
+
# such templates can be replaced with their names.
|
154
|
+
#
|
155
|
+
# Expected to be used inside Set definition block.
|
53
156
|
def literal(*names)
|
54
157
|
names.each do |name|
|
55
158
|
setup_class(name, Literal)
|
56
159
|
end
|
57
160
|
end
|
58
161
|
|
162
|
+
# @private
|
59
163
|
def setup_class(name, base_class, options = {}, &definition)
|
60
164
|
match = options.fetch(:match, name.downcase)
|
61
165
|
base = options.fetch(:base, base_class)
|
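The lookup logic visible in `Set#find` above (first registered matcher that satisfies `matcher === name.downcase` wins, with `Base` as fallback) can be sketched as follows. This is an illustration only: `SketchSet` is a hypothetical class, the `:base` option is left out because the rest of `setup_class` is not shown in this diff, and the regexp is written case-insensitively because the looked-up name is downcased before matching.

```ruby
# Hypothetical, reduced model of a template set.
class SketchSet
  Base = Class.new

  def initialize
    @templates = []
  end

  # Registers a definition; the matcher defaults to the downcased name.
  def template(name, options = {}, &definition)
    match = options.fetch(:match, name.downcase)
    @templates << [match, Class.new(Base, &definition)]
  end

  # Returns the first class whose matcher accepts the downcased name.
  def find(name)
    _, template = @templates.detect { |m, _t| m === name.downcase }
    template || Base
  end
end

set = SketchSet.new
set.template('Age') do
  def text
    'some years'
  end
end
set.template('Infobox', match: /^infobox/i)
```

Because `===` works for both strings (equality) and regexps (match), a single `detect` call covers exact names and patterns alike. Registration order matters: an earlier, broader matcher shadows later, more specific ones.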
data/lib/infoboxer/tree/compound.rb
CHANGED
@@ -21,6 +21,7 @@ module Infoboxer
|
|
21
21
|
children.index(child)
|
22
22
|
end
|
23
23
|
|
24
|
+
# @private
|
24
25
|
# Internal, used by {Parser}
|
25
26
|
def push_children(*nodes)
|
26
27
|
nodes.each{|c| c.parent = self}.each do |n|
|
@@ -45,21 +46,25 @@ module Infoboxer
|
|
45
46
|
|
46
47
|
# Kinda "private" methods, used by Parser only -------------------
|
47
48
|
|
49
|
+
# @private
|
48
50
|
# Internal, used by {Parser}
|
49
51
|
def can_merge?(other)
|
50
52
|
false
|
51
53
|
end
|
52
54
|
|
55
|
+
# @private
|
53
56
|
# Internal, used by {Parser}
|
54
57
|
def closed!
|
55
58
|
@closed = true
|
56
59
|
end
|
57
60
|
|
61
|
+
# @private
|
58
62
|
# Internal, used by {Parser}
|
59
63
|
def closed?
|
60
64
|
@closed
|
61
65
|
end
|
62
66
|
|
67
|
+
# @private
|
63
68
|
# Internal, used by {Parser}
|
64
69
|
def empty?
|
65
70
|
children.empty?
|
data/lib/infoboxer/tree/html.rb
CHANGED
data/lib/infoboxer/tree/linkable.rb
ADDED
@@ -0,0 +1,42 @@
|
|
1
|
+
module Infoboxer
|
2
|
+
module Tree
|
3
|
+
# Module included into everything, that can be treated as
|
4
|
+
# link to some MediaWiki page, despite of behavior. Namely,
|
5
|
+
# {Wikilink} and {Template}.
|
6
|
+
module Linkable
|
7
|
+
# Extracts wiki page by this link and returns it parsed (or nil,
|
8
|
+
# if page not found).
|
9
|
+
#
|
10
|
+
# About template "following" see also {Template#follow} docs.
|
11
|
+
#
|
12
|
+
# @return {MediaWiki::Page}
|
13
|
+
#
|
14
|
+
# **See also**:
|
15
|
+
# * {Tree::Nodes#follow} for extracting multiple links at once;
|
16
|
+
# * {MediaWiki#get} for basic information on page extraction.
|
17
|
+
def follow
|
18
|
+
client.get(link)
|
19
|
+
end
|
20
|
+
|
21
|
+
# Human-readable page URL
|
22
|
+
#
|
23
|
+
# @return [String]
|
24
|
+
def url
|
25
|
+
# FIXME: fragile as hell.
|
26
|
+
page.url.sub(/[^\/]+$/, link.gsub(' ', '_'))
|
27
|
+
end
|
28
|
+
|
29
|
+
protected
|
30
|
+
|
31
|
+
def page
|
32
|
+
page = lookup_parents(MediaWiki::Page).first or
|
33
|
+
fail("Not in a page from real source")
|
34
|
+
end
|
35
|
+
|
36
|
+
def client
|
37
|
+
page.client or fail("MediaWiki client not set")
|
38
|
+
end
|
39
|
+
|
40
|
+
end
|
41
|
+
end
|
42
|
+
end
|
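The substitution used by `Linkable#url` above, which the code itself flags as fragile, can be tried on its own. The helper name `sketch_link_url` and the sample URLs are hypothetical; only the `sub`/`gsub` expression is copied from the diff.

```ruby
# Replaces the last path segment of the page URL with the link target,
# turning spaces into underscores (same expression as in the diff).
def sketch_link_url(page_url, link)
  page_url.sub(/[^\/]+$/, link.gsub(' ', '_'))
end

# Works for "pretty" URLs that end in the page title.
pretty = sketch_link_url('https://en.wikipedia.org/wiki/Argentina', 'Buenos Aires')

# Breaks for index.php-style URLs: the whole "index.php?title=..." tail is replaced.
ugly = sketch_link_url('https://en.wikipedia.org/w/index.php?title=Argentina', 'Buenos Aires')

puts pretty
puts ugly
```

The second call shows one reason for the FIXME: everything after the last slash is treated as the title, query string included.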
data/lib/infoboxer/tree/list.rb
CHANGED
@@ -3,12 +3,14 @@ module Infoboxer
|
|
3
3
|
module Tree
|
4
4
|
# Represents item of ordered or unordered list.
|
5
5
|
class ListItem < BaseParagraph
|
6
|
+
# @private
|
6
7
|
# Internal, used by {Parser}
|
7
8
|
def can_merge?(other)
|
8
9
|
other.class == self.class &&
|
9
10
|
other.children.first.kind_of?(List)
|
10
11
|
end
|
11
12
|
|
13
|
+
# @private
|
12
14
|
# Internal, used by {Parser}
|
13
15
|
def merge!(other)
|
14
16
|
ochildren = other.children.dup
|
@@ -111,6 +113,7 @@ module Infoboxer
|
|
111
113
|
class List < Compound
|
112
114
|
include Mergeable
|
113
115
|
|
116
|
+
# @private
|
114
117
|
# Internal, used by {Parser}
|
115
118
|
def merge!(other)
|
116
119
|
ochildren = other.children.dup
|
@@ -123,6 +126,7 @@ module Infoboxer
|
|
123
126
|
push_children(*ochildren)
|
124
127
|
end
|
125
128
|
|
129
|
+
# @private
|
126
130
|
# Internal, used by {Parser}
|
127
131
|
def self.construct(marker, nodes)
|
128
132
|
m = marker.shift
|
data/lib/infoboxer/tree/node.rb
CHANGED
@@ -61,11 +61,13 @@ module Infoboxer
|
|
61
61
|
Nodes[] # redefined in descendants
|
62
62
|
end
|
63
63
|
|
64
|
+
# @private
|
64
65
|
# Used only during tree construction in {Parser}.
|
65
66
|
def can_merge?(other)
|
66
67
|
false
|
67
68
|
end
|
68
69
|
|
70
|
+
# @private
|
69
71
|
# Whether node is empty (definition of "empty" varies for different
|
70
72
|
# kinds of nodes). Used mainly in {Parser}.
|
71
73
|
def empty?
|
data/lib/infoboxer/tree/nodes.rb
CHANGED
@@ -146,6 +146,7 @@ module Infoboxer
|
|
146
146
|
page.client.get(*links)
|
147
147
|
end
|
148
148
|
|
149
|
+
# @private
|
149
150
|
# Internal, used by {Parser}
|
150
151
|
def <<(node)
|
151
152
|
if node.kind_of?(Array)
|
@@ -159,6 +160,7 @@ module Infoboxer
|
|
159
160
|
end
|
160
161
|
end
|
161
162
|
|
163
|
+
# @private
|
162
164
|
# Internal, used by {Parser}
|
163
165
|
def strip
|
164
166
|
res = dup
|
@@ -167,6 +169,7 @@ module Infoboxer
|
|
167
169
|
res
|
168
170
|
end
|
169
171
|
|
172
|
+
# @private
|
170
173
|
# Internal, used by {Parser}
|
171
174
|
def flow_templates
|
172
175
|
make_nodes map{|n| n.is_a?(Paragraph) ? n.to_templates? : n}
|
data/lib/infoboxer/tree/paragraphs.rb
CHANGED
@@ -15,7 +15,6 @@ module Infoboxer
|
|
15
15
|
end
|
16
16
|
|
17
17
|
# @private
|
18
|
-
# Internal! Nothing to see here! Just YARD `@private` tag not working at class level
|
19
18
|
class EmptyParagraph < Node
|
20
19
|
def initialize(text)
|
21
20
|
@text = text
|
@@ -30,7 +29,6 @@ module Infoboxer
|
|
30
29
|
end
|
31
30
|
|
32
31
|
# @private
|
33
|
-
# Internal! Nothing to see here! Just YARD `@private` tag not working at class level
|
34
32
|
module Mergeable
|
35
33
|
def can_merge?(other)
|
36
34
|
!closed? && self.class == other.class
|
@@ -49,7 +47,6 @@ module Infoboxer
|
|
49
47
|
end
|
50
48
|
|
51
49
|
# @private
|
52
|
-
# Internal! Nothing to see here! Just YARD `@private` tag not working at class level
|
53
50
|
class MergeableParagraph < BaseParagraph
|
54
51
|
include Mergeable
|
55
52
|
|
@@ -61,21 +58,25 @@ module Infoboxer
|
|
61
58
|
|
62
59
|
# Represents plain text paragraph.
|
63
60
|
class Paragraph < MergeableParagraph
|
61
|
+
# @private
|
64
62
|
# Internal, used by {Parser} for merging
|
65
63
|
def splitter
|
66
64
|
Text.new(' ')
|
67
65
|
end
|
68
66
|
|
67
|
+
# @private
|
69
68
|
# Internal, used by {Parser}
|
70
69
|
def templates_only?
|
71
70
|
children.all?{|c| c.is_a?(Template) || c.is_a?(Text) && c.raw_text.strip.empty?}
|
72
71
|
end
|
73
72
|
|
73
|
+
# @private
|
74
74
|
# Internal, used by {Parser}
|
75
75
|
def to_templates
|
76
76
|
children.select(&filter(itself: Template))
|
77
77
|
end
|
78
78
|
|
79
|
+
# @private
|
79
80
|
# Internal, used by {Parser}
|
80
81
|
def to_templates?
|
81
82
|
templates_only? ? to_templates : self
|
@@ -104,6 +105,7 @@ module Infoboxer
|
|
104
105
|
#
|
105
106
|
# Paragraph-level thing, can contain many lines of text.
|
106
107
|
class Pre < MergeableParagraph
|
108
|
+
# @private
|
107
109
|
# Internal, used by {Parser}
|
108
110
|
def merge!(other)
|
109
111
|
if other.is_a?(EmptyParagraph) && !other.text.empty?
|
@@ -113,6 +115,7 @@ module Infoboxer
|
|
113
115
|
end
|
114
116
|
end
|
115
117
|
|
118
|
+
# @private
|
116
119
|
# Internal, used by {Parser} for merging
|
117
120
|
def splitter
|
118
121
|
Text.new("\n")
|
data/lib/infoboxer/tree/ref.rb
CHANGED
data/lib/infoboxer/tree/template.rb
CHANGED
@@ -1,4 +1,6 @@
|
|
1
1
|
# encoding: utf-8
|
2
|
+
require_relative 'linkable'
|
3
|
+
|
2
4
|
module Infoboxer
|
3
5
|
module Tree
|
4
6
|
# Template variable.
|
@@ -29,15 +31,79 @@ module Infoboxer
|
|
29
31
|
end
|
30
32
|
end
|
31
33
|
|
32
|
-
#
|
34
|
+
# Represents MediaWiki **template**.
|
35
|
+
#
|
36
|
+
# [**Template**](https://en.wikipedia.org/wiki/Wikipedia:Templates)
|
37
|
+
# is basically a thing with name, some variables and their
|
38
|
+
# values. When pages are displayed in browser, templates are rendered in
|
39
|
+
# something different by wiki engine; yet, when extracting information
|
40
|
+
# with Infoboxer, you are working with original templates.
|
41
|
+
#
|
42
|
+
# It requires some mastering and understanding, yet allows to do
|
43
|
+
# very poweful things. There are many kinds of them, from pure
|
44
|
+
# formatting-related (which are typically not more than small bells
|
45
|
+
# and whistles for page outlook, and should be rendered as a text)
|
46
|
+
# to very information-heavy ones, like
|
47
|
+
# [**infoboxes**](https://en.wikipedia.org/wiki/Help:Infobox), from
|
48
|
+
# which Infoboxer borrows its name!
|
49
|
+
#
|
50
|
+
# Basically, for information extraction from template you'll list
|
51
|
+
# its {#variables}, and then use {#fetch} method
|
52
|
+
# (and its variants: {#fetch_hash}/#{fetch_date}) to extract their
|
53
|
+
# values.
|
54
|
+
#
|
55
|
+
# ### On variables naming
|
56
|
+
# MediaWiki templates can contain _named_ and _unnamed_ variables.
|
57
|
+
# Example:
|
58
|
+
#
|
59
|
+
# ```
|
60
|
+
# {{birth date and age|1953|2|19|df=y}}
|
61
|
+
# ```
|
33
62
|
#
|
34
|
-
#
|
63
|
+
# This is template with name "birth date and age", three unnamed
|
64
|
+
# variables with values "1953", "2" and "19", and one named variable
|
65
|
+
# with name "df" and value "y".
|
35
66
|
#
|
36
|
-
#
|
37
|
-
#
|
38
|
-
#
|
67
|
+
# For consistency, Infoboxer treats unnamed variables _exactly_ the
|
68
|
+
# same way MediaWiki does: they considered to have numeric names,
|
69
|
+
# which are _started from 1_ and _stored as a strings_. So, for
|
70
|
+
# template shown above, the following is correct:
|
71
|
+
#
|
72
|
+
# ```ruby
|
73
|
+
# template.fetch('1').text == '1953'
|
74
|
+
# template.fetch('2').text == '2'
|
75
|
+
# template.fetch('3').text == '19'
|
76
|
+
# template.fetch('df').text == 'y'
|
77
|
+
# ```
|
78
|
+
#
|
79
|
+
# Note also, that _named variables with simple text values_ are
|
80
|
+
# duplicated as a template node {Node#params}, so, the following is
|
81
|
+
# correct also:
|
82
|
+
#
|
83
|
+
# ```ruby
|
84
|
+
# template.params['df'] == 'y'
|
85
|
+
# template.params.has_key?('1') == false
|
86
|
+
# ```
|
87
|
+
#
|
88
|
+
# For more advanced topics, like subclassing templates by names and
|
89
|
+
# converting them to inline text, please read {Templates} module's
|
90
|
+
# documentation.
|
39
91
|
class Template < Compound
|
40
|
-
|
92
|
+
# Template name, designating its contents structure.
|
93
|
+
#
|
94
|
+
# See also {Linkable#url #url}, which you can navigate to read template's
|
95
|
+
# definition (and, in Wikipedia and many other projects, its
|
96
|
+
# documentation).
|
97
|
+
#
|
98
|
+
# @return [String]
|
99
|
+
attr_reader :name
|
100
|
+
|
101
|
+
# Template variables list.
|
102
|
+
#
|
103
|
+
# See {Var} class to understand what you can do with them.
|
104
|
+
#
|
105
|
+
# @return [Nodes<Var>]
|
106
|
+
attr_reader :variables
|
41
107
|
|
42
108
|
def initialize(name, variables = Nodes[])
|
43
109
|
super(Nodes[], extract_params(variables))
|
@@ -51,6 +117,97 @@ module Infoboxer
|
|
51
117
|
variables.map{|var| var.to_tree(level+1)}.join
|
52
118
|
end
|
53
119
|
|
120
|
+
# Returns list of template variables with numeric names (which
|
121
|
+
# are treated as "unnamed" variables by MediaWiki templates, see
|
122
|
+
# {Template class docs} for explanation).
|
123
|
+
#
|
124
|
+
# @return [Nodes<Var>]
|
125
|
+
def unnamed_variables
|
126
|
+
variables.find(name: /^\d+$/)
|
127
|
+
end
|
128
|
+
|
129
|
+
# Fetches template variable(s) by name(s) or patterns.
|
130
|
+
#
|
131
|
+
# Usage:
|
132
|
+
#
|
133
|
+
# ```ruby
|
134
|
+
# argentina.infobox.fetch('leader_title_1') # => one Var node
|
135
|
+
# argentina.infobox.fetch('leader_title_1',
|
136
|
+
# 'leader_name_1') # => two Var nodes
|
137
|
+
# argentina.infobox.fetch(/leader_title_\d+/) # => several Var nodes
|
138
|
+
# ```
|
139
|
+
#
|
140
|
+
# @return [Nodes<Var>]
|
141
|
+
def fetch(*patterns)
|
142
|
+
Nodes[*patterns.map{|p| variables.find(name: p)}.flatten]
|
143
|
+
end
|
144
|
+
|
145
|
+
# Fetches hash `{name => variable}`, by same patterns as {#fetch}.
|
146
|
+
#
|
147
|
+
# @return [Hash<String => Var>]
|
148
|
+
def fetch_hash(*patterns)
|
149
|
+
fetch(*patterns).map{|v| [v.name, v]}.to_h
|
150
|
+
end
|
151
|
+
|
152
|
+
# Fetches date by list of variable names containing date components.
|
153
|
+
#
|
154
|
+
# _(Experimental, subject to change or enchance.)_
|
155
|
+
#
|
156
|
+
# Explanation: if you have template like
|
157
|
+
# ```
|
158
|
+
# {{birth date and age|1953|2|19|df=y}}
|
159
|
+
# ```
|
160
|
+
# ...there is a short way to obtain date from it:
|
161
|
+
# ```ruby
|
162
|
+
# template.fetch_date('1', '2', '3') # => Date.new(1953,2,19)
|
163
|
+
# ```
|
164
|
+
#
|
165
|
+
# @return [Date]
|
166
|
+
def fetch_date(*patterns)
|
167
|
+
components = fetch(*patterns)
|
168
|
+
components.pop while components.last.nil? && !components.empty?
|
169
|
+
|
170
|
+
if components.empty?
|
171
|
+
nil
|
172
|
+
else
|
173
|
+
Date.new(*components.map{|v| v.to_s.to_i})
|
174
|
+
end
|
175
|
+
end
|
176
|
+
|
177
|
+
include Linkable
|
178
|
+
|
179
|
+
# @!method follow
|
180
|
+
# Extracts template source and returns it parsed (or nil,
|
181
|
+
# if template not found).
|
182
|
+
#
|
183
|
+
# **NB**: Infoboxer does NO variable substitution or other template
|
184
|
+
# evaluation actions. Moreover, it will almost certainly NOT parse
|
185
|
+
# template definitions correctly. You should use this method ONLY
|
186
|
+
# for "transclusion" templates (parts of content, which are
|
187
|
+
# included into other pages "as is").
|
188
|
+
#
|
189
|
+
# Look for example at [this page's](https://en.wikipedia.org/wiki/Tropical_and_subtropical_coniferous_forests)
|
190
|
+
# [source](https://en.wikipedia.org/w/index.php?title=Tropical_and_subtropical_coniferous_forests&action=edit):
|
191
|
+
# each subtable about some region is just a transclusion of
|
192
|
+
# template. This can be processed like:
|
193
|
+
#
|
194
|
+
# ```ruby
|
195
|
+
# Infoboxer.wp.get('Tropical and subtropical coniferous forests').
|
196
|
+
# templates(name: /forests$/).
|
197
|
+
# follow.tables #.and_so_on
|
198
|
+
# ```
|
199
|
+
#
|
200
|
+
# @return {MediaWiki::Page}
|
201
|
+
#
|
202
|
+
# **See also** {Linkable#follow} for general notes on the following links.
|
203
|
+
|
204
|
+
# Wikilink name of this template's source.
|
205
|
+
def link
|
206
|
+
# FIXME: super-naive for now, doesn't think about subpages and stuff.
|
207
|
+
"Template:#{name}"
|
208
|
+
end
|
209
|
+
|
210
|
+
# @private
|
54
211
|
# Internal, used by {Parser}.
|
55
212
|
def empty?
|
56
213
|
false
|
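The `fetch`, `fetch_hash` and `fetch_date` methods added to template.rb above can be illustrated with a minimal, standalone sketch. It does not use Infoboxer itself: `SketchVar` and `SketchTemplate` are hypothetical stand-ins for the library's `Var` and `Template` nodes, and the matching rule (`===` against the variable name) is an assumption about how `variables.find(name: p)` behaves for strings and regexps. Unlike the original, it returns a plain Array rather than `Nodes`.

```ruby
require 'date'

# Hypothetical stand-in for a template variable: a name plus raw text.
SketchVar = Struct.new(:name, :value) do
  def to_s
    value.to_s
  end
end

# Hypothetical stand-in for a template node holding named variables.
class SketchTemplate
  def initialize(vars)
    @variables = vars.map { |name, value| SketchVar.new(name.to_s, value) }
  end

  # Collects variables for every pattern, in pattern order.
  # Assumption: a String matches a name exactly, a Regexp matches by pattern.
  def fetch(*patterns)
    patterns.flat_map { |pat| @variables.select { |v| pat === v.name } }
  end

  # Same lookup as #fetch, returned as {name => variable}.
  def fetch_hash(*patterns)
    fetch(*patterns).map { |v| [v.name, v] }.to_h
  end

  # Builds a Date from the fetched components (year, month, day),
  # or returns nil when nothing matched.
  def fetch_date(*patterns)
    components = fetch(*patterns).map { |v| v.to_s.to_i }
    components.empty? ? nil : Date.new(*components)
  end
end

# Mirrors {{birth date and age|1953|2|19|df=y}} from the doc comment.
birth = SketchTemplate.new('1' => '1953', '2' => '2', '3' => '19', 'df' => 'y')
birth.fetch_date('1', '2', '3') # => Date.new(1953, 2, 19)
```

Because `Date.new` accepts one to three components, fetching only `'1'` in this sketch yields January 1 of that year.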
data/lib/infoboxer/tree/text.rb
CHANGED
@@ -29,11 +29,13 @@ module Infoboxer
|
|
29
29
|
"#{indent(level)}#{text} <#{descr}>\n"
|
30
30
|
end
|
31
31
|
|
32
|
+
# @private
|
32
33
|
# Internal, used by {Parser}
|
33
34
|
def can_merge?(other)
|
34
35
|
other.is_a?(String) || other.is_a?(Text)
|
35
36
|
end
|
36
37
|
|
38
|
+
# @private
|
37
39
|
# Internal, used by {Parser}
|
38
40
|
def merge!(other)
|
39
41
|
if other.is_a?(String)
|
@@ -45,6 +47,7 @@ module Infoboxer
|
|
45
47
|
end
|
46
48
|
end
|
47
49
|
|
50
|
+
# @private
|
48
51
|
# Internal, used by {Parser}
|
49
52
|
def empty?
|
50
53
|
raw_text.empty?
|
@@ -1,4 +1,6 @@
|
|
1
1
|
# encoding: utf-8
|
2
|
+
require_relative 'linkable'
|
3
|
+
|
2
4
|
module Infoboxer
|
3
5
|
module Tree
|
4
6
|
# Internal MediaWiki link class.
|
@@ -6,6 +8,8 @@ module Infoboxer
|
|
6
8
|
# See [Wikipedia docs](https://en.wikipedia.org/wiki/Help:Link#Wikilinks)
|
7
9
|
# for extensive explanation of Wikilink concept.
|
8
10
|
#
|
11
|
+
# Note that Wikilink is {Linkable}, so you can {Linkable#follow #follow}
|
12
|
+
# it to obtain linked pages.
|
9
13
|
class Wikilink < Link
|
10
14
|
def initialize(*)
|
11
15
|
super
|
@@ -37,20 +41,7 @@ module Infoboxer
|
|
37
41
|
# See {#topic} for explanation.
|
38
42
|
attr_reader :refinement
|
39
43
|
|
40
|
-
|
41
|
-
# if page not found).
|
42
|
-
#
|
43
|
-
# @return {MediaWiki::Page}
|
44
|
-
#
|
45
|
-
# **See also**:
|
46
|
-
# * {Tree::Nodes#follow} for extracting multiple links at once;
|
47
|
-
# * {MediaWiki#get} for basic information on page extraction.
|
48
|
-
def follow
|
49
|
-
page = lookup_parents(MediaWiki::Page).first or
|
50
|
-
fail("Not in a page from real source")
|
51
|
-
page.client or fail("MediaWiki client not set")
|
52
|
-
page.client.get(link)
|
53
|
-
end
|
44
|
+
include Linkable
|
54
45
|
|
55
46
|
private
|
56
47
|
|
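The change above replaces `Wikilink#follow` with `include Linkable`, and template.rb gains the same mixin plus a `link` method. The contents of linkable.rb are not shown in this part of the diff, so the following is only a sketch of the general pattern, assuming the mixin needs nothing more than a `link` and a client with a `get` method. `SketchLinkable`, `FakeClient`, `SketchWikilink` and `SketchTemplateRef` are hypothetical names, and the real code finds its client through the parent page rather than holding it directly.

```ruby
# Hypothetical mixin: any object that responds to #link and #client
# gains #follow, which asks the client for the linked page.
module SketchLinkable
  def follow
    raise 'MediaWiki client not set' unless client
    client.get(link)
  end
end

# Stand-in for a MediaWiki client; it only echoes the requested title.
class FakeClient
  def get(title)
    "page:#{title}"
  end
end

# A wikilink already knows its target title.
SketchWikilink = Struct.new(:link, :client) do
  include SketchLinkable
end

# A template derives its target from its name, as in Template#link above.
SketchTemplateRef = Struct.new(:name, :client) do
  include SketchLinkable

  def link
    "Template:#{name}"
  end
end

client = FakeClient.new
SketchWikilink.new('Argentina', client).follow          # => "page:Argentina"
SketchTemplateRef.new('Infobox country', client).follow # => "page:Template:Infobox country"
```

Sharing `follow` through a mixin means each class only has to say what it links to, which is why the template side needs nothing more than the one-line `link` method.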
data/lib/infoboxer/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: infoboxer
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.1.
|
4
|
+
version: 0.1.1
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Victor Shepelev
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date: 2015-08-
|
11
|
+
date: 2015-08-11 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: htmlentities
|
@@ -178,6 +178,20 @@ dependencies:
|
|
178
178
|
- - ">="
|
179
179
|
- !ruby/object:Gem::Version
|
180
180
|
version: '0'
|
181
|
+
- !ruby/object:Gem::Dependency
|
182
|
+
name: inch
|
183
|
+
requirement: !ruby/object:Gem::Requirement
|
184
|
+
requirements:
|
185
|
+
- - ">="
|
186
|
+
- !ruby/object:Gem::Version
|
187
|
+
version: '0'
|
188
|
+
type: :development
|
189
|
+
prerelease: false
|
190
|
+
version_requirements: !ruby/object:Gem::Requirement
|
191
|
+
requirements:
|
192
|
+
- - ">="
|
193
|
+
- !ruby/object:Gem::Version
|
194
|
+
version: '0'
|
181
195
|
description: |2
|
182
196
|
Infoboxer is library targeting use of Wikipedia (or any other
|
183
197
|
MediaWiki-based wiki) as a rich powerful data source.
|
@@ -188,6 +202,8 @@ extra_rdoc_files: []
|
|
188
202
|
files:
|
189
203
|
- ".dokaz"
|
190
204
|
- ".yardopts"
|
205
|
+
- CHANGELOG.md
|
206
|
+
- CONTRIBUTING.md
|
191
207
|
- LICENSE.txt
|
192
208
|
- Parsing.md
|
193
209
|
- README.md
|
@@ -225,6 +241,7 @@ files:
|
|
225
241
|
- lib/infoboxer/tree/html.rb
|
226
242
|
- lib/infoboxer/tree/image.rb
|
227
243
|
- lib/infoboxer/tree/inline.rb
|
244
|
+
- lib/infoboxer/tree/linkable.rb
|
228
245
|
- lib/infoboxer/tree/list.rb
|
229
246
|
- lib/infoboxer/tree/node.rb
|
230
247
|
- lib/infoboxer/tree/nodes.rb
|
@@ -245,7 +262,7 @@ files:
|
|
245
262
|
- regression/pages/south_america.wiki
|
246
263
|
- regression/pages/ukraine.wiki
|
247
264
|
- regression/pages/usa.wiki
|
248
|
-
homepage: https://github.com/
|
265
|
+
homepage: https://github.com/molybdenum-99/infoboxer
|
249
266
|
licenses:
|
250
267
|
- MIT
|
251
268
|
metadata: {}
|