infoboxer 0.1.0 → 0.1.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 23b2c1e2b3e0953dcab1ee64ea145cefc02f2174
4
- data.tar.gz: e8747e93e3bafcddff783bc08efb6df6ed32703b
3
+ metadata.gz: 37db410a52fa9f9755b70ff3a1284c6a427509b8
4
+ data.tar.gz: cd9952485ffffb6b3eb3d3cb69cb771aa2c9610a
5
5
  SHA512:
6
- metadata.gz: 28c1f2fb24a4831d4ca5af047e8ca3c7b573e1747aed0ec152468d64e5710c2c51e342de39178b2938f4af49d0f31a2d5afe0c8a37def863040a3803d9aa4aa0
7
- data.tar.gz: 1296e07dd28bd64435856dbbf8bdf5cc6bb65be135dd51483aa9a1637e5dc78403b221780b4cbe2ad85d8ac89dac0907309c96f0a921634b88636f9e0473bee2
6
+ metadata.gz: 2800426e4a055f838a89b50bd9388364f7a2499d2736816c686a5d9bb2cc727d2bbdfe485565835d844cefe75c8d56f510f2fd2823411e4b7a37c4d0fd3122bf
7
+ data.tar.gz: 6dc878f4957c468051ecff57eee57ca66ac2ed19e157cb6f62f993113ba2c8bd220f87926eda4f51aa5917d097ca2e8c13d94e2acd9dff8b8f82036412845245
data/.yardopts CHANGED
@@ -1 +1,2 @@
1
1
  --markup=markdown
2
+ --no-private
data/CHANGELOG.md ADDED
@@ -0,0 +1,14 @@
1
+ # Infoboxer's change log
2
+
3
+ ## 0.1.1 (2015-08-11)
4
+
5
+ Basically, preparing for wider release!
6
+
7
+ * Small refactorings;
8
+ * Documentation fixes;
9
+
10
+ ## 0.1.0 (2015-08-07)
11
+
12
+ Initial (ok, I know it's typically called 0.1.1, but here's work of
13
+ three months, extensive documentation and examples and so on... so, let
14
+ it be 0.1.0).
data/CONTRIBUTING.md ADDED
@@ -0,0 +1,62 @@
1
+ # Contributing to Infoboxer
2
+
3
+ _(Also duplicated in [wiki](https://github.com/molybdenum-99/infoboxer/wiki/Contributing).)_
4
+
5
+ ## Contributing via test cases
6
+
7
+ If you are sure that Infoboxer parses some page incorrectly, please create an
8
+ [issue](https://github.com/molybdenum-99/infoboxer/issues) with link
9
+ to the page (or raw wikitext) and a description of the problem.
10
+
11
+ ## Contributing via localizations and template definitions
12
+
13
+ Look at [en.wikipedia.org](https://github.com/molybdenum-99/infoboxer/blob/master/lib/infoboxer/definitions/en.wikipedia.org.rb)
14
+ template definitions. It can be extended. Also, similar definitions
15
+ can/should be created for other language wikipedias and other popular
16
+ wikis.
17
+
18
+ You can do pull requests with your own definitions, or create an
19
+ [issue](https://github.com/molybdenum-99/infoboxer/issues) describing
20
+ which template definitions should be added to Infoboxer.
21
+
22
+ ## Contributing via code
23
+
24
+ If you want to fix some bug or implement some feature, please just
25
+ follow the standard process for GitHub open source: fork, fix, push,
26
+ make pull request.
27
+
28
+ Some (scanty) information below.
29
+
30
+ ### Understanding the code
31
+
32
+ * Infoboxer is split into several modules (which are clearly visible in
33
+ API docs and folder structure).
34
+ * Most of "easy features"
35
+ can be added to [Navigation](http://www.rubydoc.info/gems/infoboxer/Infoboxer/Navigation)
36
+ module and its submodules: enhancing the navigation experience and
37
+ implementing clever shortcuts (like "converting table to dataframe/list of
38
+ hashes", for ex.).
39
+ * Most potential bugs are likely to sit in
40
+ [Parser](http://www.rubydoc.info/gems/infoboxer/Infoboxer/Parser) class
41
+ and its modules; MediaWiki markup IS tricky and tightly coupled and
42
+ ambiguous; there are also some unimplemented features, like `<source>`
43
+ tag parsing and template definition pages (which, possibly, is not
44
+ target of Infoboxer anyways).
45
+ * The most underfeatured area is
46
+ [MediaWiki](http://www.rubydoc.info/gems/infoboxer/Infoboxer/MediaWiki)
47
+ -- seems reasonable for information extraction purposes to have more
48
+ features from MediaWiki API, like "page list generators", search,
49
+ "what links here" and similar functionality.
50
+ * The most clarification and documentation is required for
51
+ [Templates](http://www.rubydoc.info/gems/infoboxer/Infoboxer/Templates)
52
+ module, which is still the underloved heart of Infoboxer.
53
+
54
+ ### Parser: quick, not clever
55
+
56
+ If you want to get your hands on Parser, please remember that
57
+ it's hand-crafted and thoroughly optimized. The first thought you may
58
+ have is that it needs more OO decomposition, a class for each case; or more
59
+ idiomatic Ruby, or ... Trust me, I've tried it all. But when you are
60
+ dealing with hundreds of thousands of parsing operations and tens of
61
+ thousands of resulting nodes, it turns out even the simplest things like
62
+ `Object#tap` have a performance penalty on a large number of calls.
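The claim about `Object#tap` can be checked locally. The following is a minimal sketch, not taken from Infoboxer's parser: `build_plain`, `build_with_tap` and the iteration count are hypothetical names chosen for illustration. It builds the same small hash with and without `tap` and reports timings via the standard `Benchmark` module; the numbers depend on the Ruby version and machine, so none are quoted here.

```ruby
require 'benchmark'

# Hypothetical node builder: plain local variable, no extra block call.
def build_plain(i)
  node = { id: i }
  node[:children] = []
  node
end

# Same result, but every call goes through Object#tap and its block.
def build_with_tap(i)
  { id: i }.tap { |node| node[:children] = [] }
end

iterations = 200_000 # assumed workload size, adjust to taste

Benchmark.bm(8) do |x|
  x.report('plain:') { iterations.times { |i| build_plain(i) } }
  x.report('tap:')   { iterations.times { |i| build_with_tap(i) } }
end
```

Both builders return equal hashes, so any difference in the report comes from the extra method and block invocation per node.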
data/README.md CHANGED
@@ -16,7 +16,7 @@ obvious structure, you can navigate that tree easily, and you have a
16
16
  bunch of hi-level helpers method, so typical information extraction
17
17
  tasks should be super-easy, one-liners in best cases.
18
18
 
19
- _(For those already thinkind "Why should you do this, we already have
19
+ _(For those already thinking "Why should you do this, we already have
20
20
  DBPedia?" -- please, read "[Reasons](https://github.com/molybdenum-99/infoboxer/wiki/Reasons)"
21
21
  page in our wiki.)_
22
22
 
data/infoboxer.gemspec CHANGED
@@ -5,7 +5,7 @@ Gem::Specification.new do |s|
5
5
  s.version = Infoboxer::VERSION
6
6
  s.authors = ['Victor Shepelev']
7
7
  s.email = 'zverok.offline@gmail.com'
8
- s.homepage = 'https://github.com/zverok/infoboxer'
8
+ s.homepage = 'https://github.com/molybdenum-99/infoboxer'
9
9
 
10
10
  s.summary = 'MediaWiki client and parser, targeting information extraction.'
11
11
  s.description = <<-EOF
@@ -40,4 +40,5 @@ Gem::Specification.new do |s|
40
40
  s.add_development_dependency 'ruby-prof'
41
41
  s.add_development_dependency 'vcr'
42
42
  s.add_development_dependency 'webmock'
43
+ s.add_development_dependency 'inch'
43
44
  end
@@ -1,8 +1,20 @@
1
1
  # encoding: utf-8
2
2
  module Infoboxer
3
3
  class MediaWiki
4
+ # DSL for defining "traits" for some site.
5
+ #
6
+ # More docs (and possible refactoring) to follow.
7
+ #
8
+ # You can look at current
9
+ # [English Wikipedia traits](https://github.com/molybdenum-99/infoboxer/blob/master/lib/infoboxer/definitions/en.wikipedia.org.rb)
10
+ # definitions in Infoboxer's repo.
4
11
  class Traits
5
12
  class << self
13
+ # Define set of templates for current site's traits.
14
+ #
15
+ # See {Templates::Set} for longer (yet insufficient) explanation.
16
+ #
17
+ # Expected to be used inside Traits definition block.
6
18
  def templates(&definition)
7
19
  @templates ||= Templates::Set.new
8
20
 
@@ -11,35 +23,49 @@ module Infoboxer
11
23
  @templates.define(&definition)
12
24
  end
13
25
 
14
- # NB: explicitly store all domains in base Traits class
26
+ # @private
15
27
  def domain(d)
28
+ # NB: explicitly store all domains in base Traits class
16
29
  Traits.domains.key?(d) and
17
30
  fail(ArgumentError, "Domain binding redefinition: #{Traits.domains[d]}")
18
31
 
19
32
  Traits.domains[d] = self
20
33
  end
21
34
 
35
+ # @private
22
36
  def get(domain, options = {})
23
37
  cls = Traits.domains[domain]
24
38
  cls ? cls.new(options) : Traits.new(options)
25
39
  end
26
40
 
41
+ # @private
27
42
  def domains
28
43
  @domains ||= {}
29
44
  end
30
45
 
46
+ # Define traits for some domain. Use it like:
47
+ #
48
+ # ```ruby
49
+ # MediaWiki::Traits.for 'ru.wikipedia.org' do
50
+ # templates do
51
+ # template '...' do
52
+ # # some template definition
53
+ # end
54
+ # end
55
+ # end
56
+ # ```
57
+ #
58
+ # Again, you can look at current
59
+ # [English Wikipedia traits](https://github.com/molybdenum-99/infoboxer/blob/master/lib/infoboxer/definitions/en.wikipedia.org.rb)
60
+ # for example implementation.
31
61
  def for(domain, &block)
32
62
  Class.new(self, &block).domain(domain)
33
63
  end
34
64
 
65
+ # @private
35
66
  alias_method :default, :new
36
67
  end
37
68
 
38
- DEFAULTS = {
39
- file_prefix: 'File',
40
- category_prefix: 'Category'
41
- }
42
-
43
69
  def initialize(options = {})
44
70
  @options = options
45
71
  @file_prefix = [DEFAULTS[:file_prefix], options.delete(:file_prefix)].
@@ -48,13 +74,21 @@ module Infoboxer
48
74
  flatten.compact.uniq
49
75
  end
50
76
 
77
+ # @private
51
78
  attr_reader :file_prefix, :category_prefix
52
79
 
53
- #attr_accessor :re
54
-
80
+ # @private
55
81
  def templates
56
82
  self.class.templates
57
83
  end
84
+
85
+ private
86
+
87
+ DEFAULTS = {
88
+ file_prefix: 'File',
89
+ category_prefix: 'Category'
90
+ }
91
+
58
92
  end
59
93
  end
60
94
  end
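The domain binding shown in this hunk (`for`, `domain`, `get`, `domains`) can be tried in isolation. This is a reduced sketch under stated assumptions: `SketchTraits` and the `language` method are hypothetical stand-ins, and options handling, prefixes and template sets are left out.

```ruby
# Hypothetical stand-in for MediaWiki::Traits, reduced to the domain registry.
class SketchTraits
  class << self
    # All bindings are stored on the base class, whichever subclass registers.
    def domains
      @domains ||= {}
    end

    # Binds the receiving class to a domain; rebinding is an error.
    def domain(d)
      SketchTraits.domains.key?(d) and
        fail(ArgumentError, "Domain binding redefinition: #{SketchTraits.domains[d]}")

      SketchTraits.domains[d] = self
    end

    # Creates an anonymous subclass from the block and binds it to the domain.
    def for(domain, &block)
      Class.new(self, &block).domain(domain)
    end

    # Returns an instance of the bound class, or of the base class as fallback.
    def get(domain)
      cls = SketchTraits.domains[domain]
      cls ? cls.new : SketchTraits.new
    end
  end
end

SketchTraits.for('ru.wikipedia.org') do
  def language
    'ru'
  end
end

SketchTraits.get('ru.wikipedia.org').language      # => "ru"
SketchTraits.get('unknown.example').class          # => SketchTraits
```

Keeping the hash on the base class means a lookup never depends on which subclass performed the registration, which is what the `NB` comment in the hunk points at.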
@@ -1,7 +1,42 @@
1
1
  module Infoboxer
2
- # Templates are cool, powerful and undocumented. Sorry :(
2
+ # This module covers advanced MediaWiki templates usage.
3
+ #
4
+ # It is strongly advised to read [Wikipedia docs](https://en.wikipedia.org/wiki/Help:Template)
5
+ # or at least look through it (and have it opened while reading further).
6
+ #
7
+ # If you just have a page with templates and want some variable value
8
+ # (like "page about country - infobox - total population"), you should
9
+ # be totally happy with {Tree::Template} and its features.
10
+ #
11
+ # What this module does is, basically, two things:
12
+ # * allow you to define for arbitrary templates how they are converted
13
+ # to text; by default, templates are totally excluded from text, which
14
+ # is not most reasonable behavior for many formatting templates;
15
+ # * allow you to define additional functionality for arbitrary templates;
16
+ # many of them containing pretty complicated logic (see, for ex.,
17
+ # [Template:Convert](https://en.wikipedia.org/wiki/Template:Convert)),
18
+ # and it seems reasonable to extend instances of such a template.
19
+ #
20
+ # Infoboxer allows you to define {Templates::Set} of template-specific
21
+ # classes for some site/domain.
22
+ # There is already defined set of most commonly used templates at
23
+ # en.wikipedia.org (so, most of English Wikipedia texts will be rendered
24
+ # correctly, and also some advanced functionality is provided).
25
+ # You can take a look at
26
+ # [lib/infoboxer/definitions/en.wikipedia.org.rb](https://github.com/molybdenum-99/infoboxer/blob/master/lib/infoboxer/definitions/en.wikipedia.org.rb)
27
+ # to feel it (and also see a couple of TODOs and FIXMEs and other
28
+ # considerations).
29
+ #
30
+ # From Infoboxer's point-of-view, templates are the most complex part
31
+ # of Wikipedia, and we are currently trying hard to do the most reasonable
32
+ # things about them.
33
+ #
34
+ # Future versions also should:
35
+ # * define more of common English Wikipedia templates;
36
+ # * define templates for other popular wikis;
37
+ # * allow to add template definitions on-the-fly, while loading some
38
+ # page.
3
39
  #
4
- # I do my best.
5
40
  module Templates
6
41
  %w[base set].each do |tmpl|
7
42
  require_relative "templates/#{tmpl}"
@@ -15,29 +15,6 @@ module Infoboxer
15
15
  end
16
16
  end
17
17
 
18
- def unnamed_variables
19
- variables.select{|v| v.name =~ /^\d+$/}
20
- end
21
-
22
- def fetch(*patterns)
23
- Nodes[*patterns.map{|p| variables.find(name: p)}.flatten]
24
- end
25
-
26
- def fetch_hash(*patterns)
27
- fetch(*patterns).map{|v| [v.name, v]}.to_h
28
- end
29
-
30
- def fetch_date(*patterns)
31
- components = fetch(*patterns)
32
- components.pop while components.last.nil? && !components.empty?
33
-
34
- if components.empty?
35
- nil
36
- else
37
- Date.new(*components.map{|v| v.to_s.to_i})
38
- end
39
- end
40
-
41
18
  def ==(other)
42
19
  other.kind_of?(Tree::Template) && _eq(other)
43
20
  end
@@ -54,7 +31,9 @@ module Infoboxer
54
31
  end
55
32
 
56
33
  # Renders all of its unnamed variables as space-separated text
57
- # Also allows in-template navigation
34
+ # Also allows in-template navigation.
35
+ #
36
+ # Used for {Set} definitions.
58
37
  class Show < Base
59
38
  alias_method :children, :unnamed_variables
60
39
 
@@ -65,6 +44,9 @@ module Infoboxer
65
44
  end
66
45
  end
67
46
 
47
+ # Replaces template with replacement, while rendering.
48
+ #
49
+ # Used for {Set} definitions.
68
50
  class Replace < Base
69
51
  def replace
70
52
  fail(NotImplementedError, "Descendants should define :replace")
@@ -75,6 +57,9 @@ module Infoboxer
75
57
  end
76
58
  end
77
59
 
60
+ # Replaces template with its name, while rendering.
61
+ #
62
+ # Used for {Set} definitions.
78
63
  class Literal < Base
79
64
  alias_method :text, :name
80
65
  end
@@ -1,31 +1,103 @@
1
1
  # encoding: utf-8
2
2
  module Infoboxer
3
3
  module Templates
4
+ # Base class for defining set of templates, used for some site/domain.
5
+ #
6
+ # Currently only can be plugged in via {MediaWiki::Traits.templates}.
7
+ #
8
+ # Template set defines a DSL for creating new template definitions --
9
+ # both the simplest ones and very complicated ones.
10
+ #
11
+ # You can look at implementation of English Wikipedia
12
+ # [common templates set](https://github.com/molybdenum-99/infoboxer/blob/master/lib/infoboxer/definitions/en.wikipedia.org.rb)
13
+ # in Infoboxer's repo.
14
+ #
4
15
  class Set
5
16
  def initialize(&definitions)
6
17
  @templates = []
7
18
  define(&definitions) if definitions
8
19
  end
9
-
20
+
21
+ # @private
10
22
  def find(name)
11
23
  _, template = @templates.detect{|m, t| m === name.downcase}
12
24
  template || Base
13
25
  end
14
26
 
27
+ # @private
15
28
  def define(&definitions)
16
29
  instance_eval(&definitions)
17
30
  end
18
31
 
32
+ # @private
19
33
  def clear
20
34
  @templates.clear
21
35
  end
22
36
 
23
- private
24
-
37
+ # Most common form of template definition.
38
+ #
39
+ # Can be used like:
40
+ #
41
+ # ```ruby
42
+ # template 'Age' do
43
+ # def from
44
+ # fetch_date('1', '2', '3')
45
+ # end
46
+ #
47
+ # def to
48
+ # fetch_date('4', '5', '6') || Date.today
49
+ # end
50
+ #
51
+ # def value
52
+ # (to - from).to_i / 365 # FIXME: obviously
53
+ # end
54
+ #
55
+ # def text
56
+ # "#{value} years"
57
+ # end
58
+ # end
59
+ # ```
60
+ #
61
+ # @param name Definition name.
62
+ # @param options Definition options.
63
+ # Currently recognized options are:
64
+ # * `:match` -- regexp or string, which matches template name to
65
+ # add this definition to (if not provided, `name` param used
66
+ # to match relevant templates);
67
+ # * `:base` -- name of template definition to use as a base class;
68
+ # for example you can do things like:
69
+ #
70
+ # ```ruby
71
+ # # ...inside template set definition...
72
+ # template 'Infobox', match: /^Infobox/ do
73
+ # # implementation
74
+ # end
75
+ #
76
+ # template 'Infobox cheese', base: 'Infobox' do
77
+ # end
78
+ # ```
79
+ #
80
+ # Expected to be used inside Set definition block.
25
81
  def template(name, options = {}, &definition)
26
82
  setup_class(name, Base, options, &definition)
27
83
  end
28
84
 
85
+ # Define list of "replacements": templates, which text should be replaced
86
+ # with arbitrary value.
87
+ #
88
+ # Example:
89
+ #
90
+ # ```ruby
91
+ # # ...inside template set definition...
92
+ # replace(
93
+ # '!!' => '||',
94
+ # '!(' => '['
95
+ # )
96
+ # ```
97
+ # Now, all templates with name `!!` will render as `||` when you
98
+ # call their (or their parents') {Tree::Node#text}.
99
+ #
100
+ # Expected to be used inside Set definition block.
29
101
  def replace(*replacements)
30
102
  case
31
103
  when replacements.count == 2 && replacements.all?{|r| r.is_a?(String)}
@@ -44,18 +116,50 @@ module Infoboxer
44
116
  end
45
117
  end
46
118
 
119
+ # Define list of "show children" templates. Those ones, when rendered
120
+ # as text, just provide join of their children text (space-separated).
121
+ #
122
+ # Example:
123
+ #
124
+ # ```ruby
125
+ # #...in template set definition...
126
+ # show 'Small'
127
+ # ```
128
+ # Now, wikitext paragraph looking like...
129
+ #
130
+ # ```
131
+ # This is {{small|text}} in template
132
+ # ```
133
+ # ...before this template definition had rendered like
134
+ # `"This is in template"` (template contents omitted), and after
135
+ # this definition it will render like `"This is text in template"`
136
+ # (template contents rendered as is).
137
+ #
138
+ # Expected to be used inside Set definition block.
47
139
  def show(*names)
48
140
  names.each do |name|
49
141
  setup_class(name, Show)
50
142
  end
51
143
  end
52
144
 
145
+ # Define list of "literally rendered templates". It means, when
146
+ # rendering text, template is replaced with just its name.
147
+ #
148
+ # Explanation: in
149
+ # MediaWiki, there are contexts (deeply in other templates and
150
+ # tables), when you can't just type something like `","` and not
151
+ # have it interpreted. So, wikis often define wrappers around
152
+ # those templates, looking like `{{,}}` -- so, while rendering texts,
153
+ # such templates can be replaced with their names.
154
+ #
155
+ # Expected to be used inside Set definition block.
53
156
  def literal(*names)
54
157
  names.each do |name|
55
158
  setup_class(name, Literal)
56
159
  end
57
160
  end
58
161
 
162
+ # @private
59
163
  def setup_class(name, base_class, options = {}, &definition)
60
164
  match = options.fetch(:match, name.downcase)
61
165
  base = options.fetch(:base, base_class)
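The lookup in `Set#find` relies on `===`, so one list can hold both exact names and patterns. Below is a simplified sketch of that idea only: `SketchSet`, `register` and the symbols used as values are hypothetical, and the real class builds template classes rather than storing symbols. Because names are downcased before matching, the matchers in this sketch are written in lowercase.

```ruby
# Hypothetical, simplified stand-in for Templates::Set: stores
# [matcher, value] pairs and finds the first matcher accepting the name.
class SketchSet
  def initialize
    @templates = []
  end

  # matcher may be a String (exact match) or a Regexp (pattern match).
  def register(matcher, value)
    @templates << [matcher, value]
    self
  end

  # String#=== compares for equality, Regexp#=== tests for a match.
  def find(name)
    _, value = @templates.detect { |m, _v| m === name.downcase }
    value || :base
  end
end

set = SketchSet.new
set.register('age', :age).register(/^infobox/, :infobox)

set.find('Age')             # => :age
set.find('Infobox country') # => :infobox
set.find('Unknown')         # => :base
```

The first matching entry wins, so in a sketch like this more specific definitions would need to be registered before broader patterns.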
@@ -21,6 +21,7 @@ module Infoboxer
21
21
  children.index(child)
22
22
  end
23
23
 
24
+ # @private
24
25
  # Internal, used by {Parser}
25
26
  def push_children(*nodes)
26
27
  nodes.each{|c| c.parent = self}.each do |n|
@@ -45,21 +46,25 @@ module Infoboxer
45
46
 
46
47
  # Kinda "private" methods, used by Parser only -------------------
47
48
 
49
+ # @private
48
50
  # Internal, used by {Parser}
49
51
  def can_merge?(other)
50
52
  false
51
53
  end
52
54
 
55
+ # @private
53
56
  # Internal, used by {Parser}
54
57
  def closed!
55
58
  @closed = true
56
59
  end
57
60
 
61
+ # @private
58
62
  # Internal, used by {Parser}
59
63
  def closed?
60
64
  @closed
61
65
  end
62
66
 
67
+ # @private
63
68
  # Internal, used by {Parser}
64
69
  def empty?
65
70
  children.empty?
@@ -21,6 +21,7 @@ module Infoboxer
21
21
 
22
22
  include HTMLTagCommons
23
23
 
24
+ # @private
24
25
  # Internal, used by {Parser}.
25
26
  def empty?
26
27
  # even empty tag, for ex., <br>, should not be dropped!
@@ -0,0 +1,42 @@
1
+ module Infoboxer
2
+ module Tree
3
+ # Module included into everything, that can be treated as
4
+ # link to some MediaWiki page, regardless of behavior. Namely,
5
+ # {Wikilink} and {Template}.
6
+ module Linkable
7
+ # Extracts wiki page by this link and returns it parsed (or nil,
8
+ # if page not found).
9
+ #
10
+ # About template "following" see also {Template#follow} docs.
11
+ #
12
+ # @return {MediaWiki::Page}
13
+ #
14
+ # **See also**:
15
+ # * {Tree::Nodes#follow} for extracting multiple links at once;
16
+ # * {MediaWiki#get} for basic information on page extraction.
17
+ def follow
18
+ client.get(link)
19
+ end
20
+
21
+ # Human-readable page URL
22
+ #
23
+ # @return [String]
24
+ def url
25
+ # FIXME: fragile as hell.
26
+ page.url.sub(/[^\/]+$/, link.gsub(' ', '_'))
27
+ end
28
+
29
+ protected
30
+
31
+ def page
32
+ page = lookup_parents(MediaWiki::Page).first or
33
+ fail("Not in a page from real source")
34
+ end
35
+
36
+ def client
37
+ page.client or fail("MediaWiki client not set")
38
+ end
39
+
40
+ end
41
+ end
42
+ end
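The substitution in `Linkable#url` (marked `FIXME: fragile` above) can be examined on its own. This sketch copies only that one expression into a hypothetical helper, `sketch_link_url`; the page URLs are example values, not output from Infoboxer.

```ruby
# Hypothetical helper: replaces the last path segment of the current page's
# URL with the link target, using underscores instead of spaces.
def sketch_link_url(page_url, link)
  page_url.sub(/[^\/]+$/, link.gsub(' ', '_'))
end

sketch_link_url('https://en.wikipedia.org/wiki/Argentina', 'Buenos Aires')
# => "https://en.wikipedia.org/wiki/Buenos_Aires"

# One source of fragility: if the current page title itself contains a
# slash, only the part after the last slash is replaced.
sketch_link_url('https://en.wikipedia.org/wiki/Talk/Archive', 'Foo')
# => "https://en.wikipedia.org/wiki/Talk/Foo"
```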
@@ -3,12 +3,14 @@ module Infoboxer
3
3
  module Tree
4
4
  # Represents item of ordered or unordered list.
5
5
  class ListItem < BaseParagraph
6
+ # @private
6
7
  # Internal, used by {Parser}
7
8
  def can_merge?(other)
8
9
  other.class == self.class &&
9
10
  other.children.first.kind_of?(List)
10
11
  end
11
12
 
13
+ # @private
12
14
  # Internal, used by {Parser}
13
15
  def merge!(other)
14
16
  ochildren = other.children.dup
@@ -111,6 +113,7 @@ module Infoboxer
111
113
  class List < Compound
112
114
  include Mergeable
113
115
 
116
+ # @private
114
117
  # Internal, used by {Parser}
115
118
  def merge!(other)
116
119
  ochildren = other.children.dup
@@ -123,6 +126,7 @@ module Infoboxer
123
126
  push_children(*ochildren)
124
127
  end
125
128
 
129
+ # @private
126
130
  # Internal, used by {Parser}
127
131
  def self.construct(marker, nodes)
128
132
  m = marker.shift
@@ -61,11 +61,13 @@ module Infoboxer
61
61
  Nodes[] # redefined in descendants
62
62
  end
63
63
 
64
+ # @private
64
65
  # Used only during tree construction in {Parser}.
65
66
  def can_merge?(other)
66
67
  false
67
68
  end
68
69
 
70
+ # @private
69
71
  # Whether node is empty (definition of "empty" varies for different
70
72
  # kinds of nodes). Used mainly in {Parser}.
71
73
  def empty?
@@ -146,6 +146,7 @@ module Infoboxer
146
146
  page.client.get(*links)
147
147
  end
148
148
 
149
+ # @private
149
150
  # Internal, used by {Parser}
150
151
  def <<(node)
151
152
  if node.kind_of?(Array)
@@ -159,6 +160,7 @@ module Infoboxer
159
160
  end
160
161
  end
161
162
 
163
+ # @private
162
164
  # Internal, used by {Parser}
163
165
  def strip
164
166
  res = dup
@@ -167,6 +169,7 @@ module Infoboxer
167
169
  res
168
170
  end
169
171
 
172
+ # @private
170
173
  # Internal, used by {Parser}
171
174
  def flow_templates
172
175
  make_nodes map{|n| n.is_a?(Paragraph) ? n.to_templates? : n}
@@ -15,7 +15,6 @@ module Infoboxer
15
15
  end
16
16
 
17
17
  # @private
18
- # Internal! Nothing to see here! Just YARD `@private` tag not working at class level
19
18
  class EmptyParagraph < Node
20
19
  def initialize(text)
21
20
  @text = text
@@ -30,7 +29,6 @@ module Infoboxer
30
29
  end
31
30
 
32
31
  # @private
33
- # Internal! Nothing to see here! Just YARD `@private` tag not working at class level
34
32
  module Mergeable
35
33
  def can_merge?(other)
36
34
  !closed? && self.class == other.class
@@ -49,7 +47,6 @@ module Infoboxer
49
47
  end
50
48
 
51
49
  # @private
52
- # Internal! Nothing to see here! Just YARD `@private` tag not working at class level
53
50
  class MergeableParagraph < BaseParagraph
54
51
  include Mergeable
55
52
 
@@ -61,21 +58,25 @@ module Infoboxer
61
58
 
62
59
  # Represents plain text paragraph.
63
60
  class Paragraph < MergeableParagraph
61
+ # @private
64
62
  # Internal, used by {Parser} for merging
65
63
  def splitter
66
64
  Text.new(' ')
67
65
  end
68
66
 
67
+ # @private
69
68
  # Internal, used by {Parser}
70
69
  def templates_only?
71
70
  children.all?{|c| c.is_a?(Template) || c.is_a?(Text) && c.raw_text.strip.empty?}
72
71
  end
73
72
 
73
+ # @private
74
74
  # Internal, used by {Parser}
75
75
  def to_templates
76
76
  children.select(&filter(itself: Template))
77
77
  end
78
78
 
79
+ # @private
79
80
  # Internal, used by {Parser}
80
81
  def to_templates?
81
82
  templates_only? ? to_templates : self
@@ -104,6 +105,7 @@ module Infoboxer
104
105
  #
105
106
  # Paragraph-level thing, can contain many lines of text.
106
107
  class Pre < MergeableParagraph
108
+ # @private
107
109
  # Internal, used by {Parser}
108
110
  def merge!(other)
109
111
  if other.is_a?(EmptyParagraph) && !other.text.empty?
@@ -113,6 +115,7 @@ module Infoboxer
113
115
  end
114
116
  end
115
117
 
118
+ # @private
116
119
  # Internal, used by {Parser} for merging
117
120
  def splitter
118
121
  Text.new("\n")
@@ -18,6 +18,7 @@ module Infoboxer
18
18
  # @!attribute [r] name
19
19
  def_readers :name
20
20
 
21
+ # @private
21
22
  # Internal, used by {Parser}
22
23
  def empty?
23
24
  # even empty tag should not be dropped!
@@ -1,4 +1,6 @@
1
1
  # encoding: utf-8
2
+ require_relative 'linkable'
3
+
2
4
  module Infoboxer
3
5
  module Tree
4
6
  # Template variable.
@@ -29,15 +31,79 @@ module Infoboxer
29
31
  end
30
32
  end
31
33
 
32
- # Wikipedia template.
34
+ # Represents MediaWiki **template**.
35
+ #
36
+ # [**Template**](https://en.wikipedia.org/wiki/Wikipedia:Templates)
37
+ # is basically a thing with name, some variables and their
38
+ # values. When pages are displayed in browser, templates are rendered in
39
+ # something different by wiki engine; yet, when extracting information
40
+ # with Infoboxer, you are working with original templates.
41
+ #
42
+ # It requires some mastering and understanding, yet allows to do
43
+ # very powerful things. There are many kinds of them, from pure
44
+ # formatting-related (which are typically not more than small bells
45
+ # and whistles for page outlook, and should be rendered as a text)
46
+ # to very information-heavy ones, like
47
+ # [**infoboxes**](https://en.wikipedia.org/wiki/Help:Infobox), from
48
+ # which Infoboxer borrows its name!
49
+ #
50
+ # Basically, for information extraction from template you'll list
51
+ # its {#variables}, and then use {#fetch} method
52
+ # (and its variants: {#fetch_hash}/{#fetch_date}) to extract their
53
+ # values.
54
+ #
55
+ # ### On variables naming
56
+ # MediaWiki templates can contain _named_ and _unnamed_ variables.
57
+ # Example:
58
+ #
59
+ # ```
60
+ # {{birth date and age|1953|2|19|df=y}}
61
+ # ```
33
62
  #
34
- # Templates are complicated! Also, they are useful.
63
+ # This is template with name "birth date and age", three unnamed
64
+ # variables with values "1953", "2" and "19", and one named variable
65
+ # with name "df" and value "y".
35
66
  #
36
- # You'd need to understand them from [Wikipedia docs](https://en.wikipedia.org/wiki/Wikipedia:Templates)
37
- # and then use much of Infoboxer's goodness provided with {Templates}
38
- # separate module.
67
+ # For consistency, Infoboxer treats unnamed variables _exactly_ the
68
+ # same way MediaWiki does: they are considered to have numeric names,
69
+ # which _start from 1_ and are _stored as strings_. So, for
70
+ # template shown above, the following is correct:
71
+ #
72
+ # ```ruby
73
+ # template.fetch('1').text == '1953'
74
+ # template.fetch('2').text == '2'
75
+ # template.fetch('3').text == '19'
76
+ # template.fetch('df').text == 'y'
77
+ # ```
78
+ #
79
+ # Note also, that _named variables with simple text values_ are
80
+ # duplicated as a template node {Node#params}, so, the following is
81
+ # correct also:
82
+ #
83
+ # ```ruby
84
+ # template.params['df'] == 'y'
85
+ # template.params.has_key?('1') == false
86
+ # ```
87
+ #
88
+ # For more advanced topics, like subclassing templates by names and
89
+ # converting them to inline text, please read {Templates} module's
90
+ # documentation.
39
91
  class Template < Compound
40
- attr_reader :name, :variables
92
+ # Template name, designating its contents structure.
93
+ #
94
+ # See also {Linkable#url #url}, which you can navigate to read template's
95
+ # definition (and, in Wikipedia and many other projects, its
96
+ # documentation).
97
+ #
98
+ # @return [String]
99
+ attr_reader :name
100
+
101
+ # Template variables list.
102
+ #
103
+ # See {Var} class to understand what you can do with them.
104
+ #
105
+ # @return [Nodes<Var>]
106
+ attr_reader :variables
41
107
 
42
108
  def initialize(name, variables = Nodes[])
43
109
  super(Nodes[], extract_params(variables))
@@ -51,6 +117,97 @@ module Infoboxer
51
117
  variables.map{|var| var.to_tree(level+1)}.join
52
118
  end
53
119
 
120
+ # Returns list of template variables with numeric names (which
121
+ # are treated as "unnamed" variables by MediaWiki templates, see
122
+ # {Template class docs} for explanation).
123
+ #
124
+ # @return [Nodes<Var>]
125
+ def unnamed_variables
126
+ variables.find(name: /^\d+$/)
127
+ end
128
+
129
+ # Fetches template variable(s) by name(s) or patterns.
130
+ #
131
+ # Usage:
132
+ #
133
+ # ```ruby
134
+ # argentina.infobox.fetch('leader_title_1') # => one Var node
135
+ # argentina.infobox.fetch('leader_title_1',
136
+ # 'leader_name_1') # => two Var nodes
137
+ # argentina.infobox.fetch(/leader_title_\d+/) # => several Var nodes
138
+ # ```
139
+ #
140
+ # @return [Nodes<Var>]
141
+ def fetch(*patterns)
142
+ Nodes[*patterns.map{|p| variables.find(name: p)}.flatten]
143
+ end
144
+
145
+ # Fetches hash `{name => variable}`, by same patterns as {#fetch}.
146
+ #
147
+ # @return [Hash<String => Var>]
148
+ def fetch_hash(*patterns)
149
+ fetch(*patterns).map{|v| [v.name, v]}.to_h
150
+ end
151
+
152
+ # Fetches date by list of variable names containing date components.
153
+ #
154
+ # _(Experimental, subject to change or enhancement.)_
155
+ #
156
+ # Explanation: if you have template like
157
+ # ```
158
+ # {{birth date and age|1953|2|19|df=y}}
159
+ # ```
160
+ # ...there is a short way to obtain date from it:
161
+ # ```ruby
162
+ # template.fetch_date('1', '2', '3') # => Date.new(1953,2,19)
163
+ # ```
164
+ #
165
+ # @return [Date]
166
+ def fetch_date(*patterns)
167
+ components = fetch(*patterns)
168
+ components.pop while components.last.nil? && !components.empty?
169
+
170
+ if components.empty?
171
+ nil
172
+ else
173
+ Date.new(*components.map{|v| v.to_s.to_i})
174
+ end
175
+ end
176
+
177
+ include Linkable
178
+
179
+ # @!method follow
180
+ # Extracts template source and returns it parsed (or nil,
181
+ # if template not found).
182
+ #
183
+ # **NB**: Infoboxer does NO variable substitution or other template
184
+ # evaluation actions. Moreover, it will almost certainly NOT parse
185
+ # template definitions correctly. You should use this method ONLY
186
+ # for "transclusion" templates (parts of content, which are
187
+ # included into other pages "as is").
188
+ #
189
+ # Look for example at [this page's](https://en.wikipedia.org/wiki/Tropical_and_subtropical_coniferous_forests)
190
+ # [source](https://en.wikipedia.org/w/index.php?title=Tropical_and_subtropical_coniferous_forests&action=edit):
191
+ # each subtable about some region is just a transclusion of
192
+ # template. This can be processed like:
193
+ #
194
+ # ```ruby
195
+ # Infoboxer.wp.get('Tropical and subtropical coniferous forests').
196
+ # templates(name: /forests$/).
197
+ # follow.tables #.and_so_on
198
+ # ```
199
+ #
200
+ # @return {MediaWiki::Page}
201
+ #
202
+ # **See also** {Linkable#follow} for general notes on the following links.
203
+
204
+ # Wikilink name of this template's source.
205
+ def link
206
+ # FIXME: super-naive for now, doesn't think about subpages and stuff.
207
+ "Template:#{name}"
208
+ end
209
+
210
+ # @private
54
211
  # Internal, used by {Parser}.
55
212
  def empty?
56
213
  false
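The naming rule for unnamed variables and the idea behind `fetch_date` can be illustrated without the library. This is a sketch under stated assumptions: `sketch_name_variables` and `sketch_fetch_date` are hypothetical helpers working on plain strings and a hash, not on `Var` nodes, and the real methods accept patterns as well as names.

```ruby
require 'date'

# Hypothetical helper: positional arguments get the names '1', '2', ...
# (strings, starting from 1); named ones ('df=y') keep their own names.
def sketch_name_variables(args)
  position = 0
  args.each_with_object({}) do |arg, vars|
    if arg.include?('=')
      name, value = arg.split('=', 2)
    else
      name = (position += 1).to_s
      value = arg
    end
    vars[name] = value
  end
end

# Hypothetical helper: collects the named components that exist and builds
# a Date from them, or returns nil when none are present.
def sketch_fetch_date(vars, *names)
  components = names.map { |n| vars[n] }.compact
  components.empty? ? nil : Date.new(*components.map(&:to_i))
end

# Arguments of {{birth date and age|1953|2|19|df=y}}
vars = sketch_name_variables(%w[1953 2 19 df=y])
vars['1']  # => "1953"
vars['df'] # => "y"

sketch_fetch_date(vars, '1', '2', '3') # => Date.new(1953, 2, 19)
sketch_fetch_date(vars, '7')           # => nil
```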
@@ -29,11 +29,13 @@ module Infoboxer
29
29
  "#{indent(level)}#{text} <#{descr}>\n"
30
30
  end
31
31
 
32
+ # @private
32
33
  # Internal, used by {Parser}
33
34
  def can_merge?(other)
34
35
  other.is_a?(String) || other.is_a?(Text)
35
36
  end
36
37
 
38
+ # @private
37
39
  # Internal, used by {Parser}
38
40
  def merge!(other)
39
41
  if other.is_a?(String)
@@ -45,6 +47,7 @@ module Infoboxer
45
47
  end
46
48
  end
47
49
 
50
+ # @private
48
51
  # Internal, used by {Parser}
49
52
  def empty?
50
53
  raw_text.empty?
@@ -1,4 +1,6 @@
1
1
  # encoding: utf-8
2
+ require_relative 'linkable'
3
+
2
4
  module Infoboxer
3
5
  module Tree
4
6
  # Internal MediaWiki link class.
@@ -6,6 +8,8 @@ module Infoboxer
6
8
  # See [Wikipedia docs](https://en.wikipedia.org/wiki/Help:Link#Wikilinks)
7
9
  # for extensive explanation of Wikilink concept.
8
10
  #
11
+ # Note, that Wikilink is {Linkable}, so you can {Linkable#follow #follow}
12
+ # it to obtain linked pages.
9
13
  class Wikilink < Link
10
14
  def initialize(*)
11
15
  super
@@ -37,20 +41,7 @@ module Infoboxer
37
41
  # See {#topic} for explanation.
38
42
  attr_reader :refinement
39
43
 
40
- # Extracts wiki page by this link and returns it parsed (or nil,
41
- # if page not found).
42
- #
43
- # @return {MediaWiki::Page}
44
- #
45
- # **See also**:
46
- # * {Tree::Nodes#follow} for extracting multiple links at once;
47
- # * {MediaWiki#get} for basic information on page extraction.
48
- def follow
49
- page = lookup_parents(MediaWiki::Page).first or
50
- fail("Not in a page from real source")
51
- page.client or fail("MediaWiki client not set")
52
- page.client.get(link)
53
- end
44
+ include Linkable
54
45
 
55
46
  private
56
47
 
@@ -1,4 +1,4 @@
1
1
  # encoding: utf-8
2
2
  module Infoboxer
3
- VERSION = '0.1.0'
3
+ VERSION = '0.1.1'
4
4
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: infoboxer
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.0
4
+ version: 0.1.1
5
5
  platform: ruby
6
6
  authors:
7
7
  - Victor Shepelev
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2015-08-07 00:00:00.000000000 Z
11
+ date: 2015-08-11 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: htmlentities
@@ -178,6 +178,20 @@ dependencies:
178
178
  - - ">="
179
179
  - !ruby/object:Gem::Version
180
180
  version: '0'
181
+ - !ruby/object:Gem::Dependency
182
+ name: inch
183
+ requirement: !ruby/object:Gem::Requirement
184
+ requirements:
185
+ - - ">="
186
+ - !ruby/object:Gem::Version
187
+ version: '0'
188
+ type: :development
189
+ prerelease: false
190
+ version_requirements: !ruby/object:Gem::Requirement
191
+ requirements:
192
+ - - ">="
193
+ - !ruby/object:Gem::Version
194
+ version: '0'
181
195
  description: |2
182
196
  Infoboxer is library targeting use of Wikipedia (or any other
183
197
  MediaWiki-based wiki) as a rich powerful data source.
@@ -188,6 +202,8 @@ extra_rdoc_files: []
188
202
  files:
189
203
  - ".dokaz"
190
204
  - ".yardopts"
205
+ - CHANGELOG.md
206
+ - CONTRIBUTING.md
191
207
  - LICENSE.txt
192
208
  - Parsing.md
193
209
  - README.md
@@ -225,6 +241,7 @@ files:
225
241
  - lib/infoboxer/tree/html.rb
226
242
  - lib/infoboxer/tree/image.rb
227
243
  - lib/infoboxer/tree/inline.rb
244
+ - lib/infoboxer/tree/linkable.rb
228
245
  - lib/infoboxer/tree/list.rb
229
246
  - lib/infoboxer/tree/node.rb
230
247
  - lib/infoboxer/tree/nodes.rb
@@ -245,7 +262,7 @@ files:
245
262
  - regression/pages/south_america.wiki
246
263
  - regression/pages/ukraine.wiki
247
264
  - regression/pages/usa.wiki
248
- homepage: https://github.com/zverok/infoboxer
265
+ homepage: https://github.com/molybdenum-99/infoboxer
249
266
  licenses:
250
267
  - MIT
251
268
  metadata: {}