infoboxer 0.1.0 → 0.1.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 23b2c1e2b3e0953dcab1ee64ea145cefc02f2174
4
- data.tar.gz: e8747e93e3bafcddff783bc08efb6df6ed32703b
3
+ metadata.gz: 37db410a52fa9f9755b70ff3a1284c6a427509b8
4
+ data.tar.gz: cd9952485ffffb6b3eb3d3cb69cb771aa2c9610a
5
5
  SHA512:
6
- metadata.gz: 28c1f2fb24a4831d4ca5af047e8ca3c7b573e1747aed0ec152468d64e5710c2c51e342de39178b2938f4af49d0f31a2d5afe0c8a37def863040a3803d9aa4aa0
7
- data.tar.gz: 1296e07dd28bd64435856dbbf8bdf5cc6bb65be135dd51483aa9a1637e5dc78403b221780b4cbe2ad85d8ac89dac0907309c96f0a921634b88636f9e0473bee2
6
+ metadata.gz: 2800426e4a055f838a89b50bd9388364f7a2499d2736816c686a5d9bb2cc727d2bbdfe485565835d844cefe75c8d56f510f2fd2823411e4b7a37c4d0fd3122bf
7
+ data.tar.gz: 6dc878f4957c468051ecff57eee57ca66ac2ed19e157cb6f62f993113ba2c8bd220f87926eda4f51aa5917d097ca2e8c13d94e2acd9dff8b8f82036412845245
data/.yardopts CHANGED
@@ -1 +1,2 @@
1
1
  --markup=markdown
2
+ --no-private
data/CHANGELOG.md ADDED
@@ -0,0 +1,14 @@
1
+ # Infoboxer's change log
2
+
3
+ ## 0.1.1 (2015-08-11)
4
+
5
+ Basically, preparing for wider release!
6
+
7
+ * Small refactorings;
8
+ * Documentation fixes;
9
+
10
+ ## 0.1.0 (2015-08-07)
11
+
12
+ Initial (ok, I know it's typically called 0.1.1, but here's work of
13
+ three months, numerous documentations and examples and so on... so, let
14
+ it be 0.1.0).
data/CONTRIBUTING.md ADDED
@@ -0,0 +1,62 @@
1
+ # Contributing to Infoboxer
2
+
3
+ _(Also duplicated in [wiki](https://github.com/molybdenum-99/infoboxer/wiki/Contributing).)_
4
+
5
+ ## Contributing via test cases
6
+
7
+ If you are sure that Infoboxer parses some page incorrectly, please create an
8
+ [issue](https://github.com/molybdenum-99/infoboxer/issues) with link
9
+ to page (or raw wikitext) and description of a problem.
10
+
11
+ ## Contributing via localizations and templates describing
12
+
13
+ Look at [en.wikipedia.org](https://github.com/molybdenum-99/infoboxer/blob/master/lib/infoboxer/definitions/en.wikipedia.org.rb)
14
+ template definitions. It can be extended. Also, similar definitions
15
+ can/should be created for other language wikipedias and other popular
16
+ wikis.
17
+
18
+ You can do pull requests with your own definitions, or create an
19
+ [issue](https://github.com/molybdenum-99/infoboxer/issues) describing
20
+ which template definitions should be added to Infoboxer.
21
+
22
+ ## Contributing via code
23
+
24
+ If you want to fix some bug or implement some feature, please just
25
+ follow the standard process for github opensource: fork, fix, push,
26
+ make pull request.
27
+
28
+ Some (scanty) information below.
29
+
30
+ ### Understanding the code
31
+
32
+ * Infoboxer is split into several modules (which are clearly visible in
33
+ API docs and folders structure).
34
+ * Most of "easy features"
35
+ can be added to [Navigation](http://www.rubydoc.info/gems/infoboxer/Infoboxer/Navigation)
36
+ module and its submodules: enhancing the navigational experience and
37
+ implement clever shortcuts (like "converting table to dataframe/list of
38
+ hashes", for ex.).
39
+ * Most potential bugs can sit in
40
+ [Parser](http://www.rubydoc.info/gems/infoboxer/Infoboxer/Parser) class
41
+ and its modules; MediaWiki markup IS tricky and tightly coupled and
42
+ ambiguous; there's also some non-implemented features, like `<source>`
43
+ tag parsing and template definition pages (which, possibly, is not
44
+ target of Infoboxer anyways).
45
+ * Most of underfeatured area is in
46
+ [MediaWiki](http://www.rubydoc.info/gems/infoboxer/Infoboxer/MediaWiki)
47
+ -- seems reasonable for information extraction purposes to have more
48
+ features from MediaWiki API, like "page list generators", search,
49
+ "what links here" and similar functionality.
50
+ * Most of clarification and documentation is required for
51
+ [Templates](http://www.rubydoc.info/gems/infoboxer/Infoboxer/Templates)
52
+ module, which is still underloved heart of Infoboxer.
53
+
54
+ ### Parser: quick, not clever
55
+
56
+ If you want to get your hands on Parser, please remember that
57
+ it's hand-crafted and thoroughly optimized. The first thought you may
58
+ have is that it needs more OO decomposition, a class for each case; or more
59
+ idiomatic Ruby, or ... Trust me, I've tried it all. But when you are
60
+ dealing with hundreds of thousands of parsing operations and tens of
61
+ thousands of resulting nodes, it turns out even simplest things like
62
+ `Object#tap` have a performance penalty over a large number of calls.
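The performance remark that closes CONTRIBUTING.md can be checked locally with a small micro-benchmark. This is only a sketch: the two builder methods and the iteration count `N` are made up for illustration and are not part of Infoboxer, and the timings depend on the Ruby version and machine.

```ruby
require 'benchmark'

# Hypothetical iteration count; large enough to make per-call overhead visible.
N = 200_000

# Builds a tiny "node" with explicit statements.
def build_plain
  node = []
  node << :child
  node
end

# Builds the same "node" through Object#tap and a block.
def build_with_tap
  [].tap { |node| node << :child }
end

Benchmark.bm(10) do |x|
  x.report('plain:')    { N.times { build_plain } }
  x.report('with tap:') { N.times { build_with_tap } }
end
```

Both methods return the same value, so any difference in the report comes from the extra method call and block invocation.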
data/README.md CHANGED
@@ -16,7 +16,7 @@ obvious structure, you can navigate that tree easily, and you have a
16
16
  bunch of hi-level helpers method, so typical information extraction
17
17
  tasks should be super-easy, one-liners in best cases.
18
18
 
19
- _(For those already thinkind "Why should you do this, we already have
19
+ _(For those already thinking "Why should you do this, we already have
20
20
  DBPedia?" -- please, read "[Reasons](https://github.com/molybdenum-99/infoboxer/wiki/Reasons)"
21
21
  page in our wiki.)_
22
22
 
data/infoboxer.gemspec CHANGED
@@ -5,7 +5,7 @@ Gem::Specification.new do |s|
5
5
  s.version = Infoboxer::VERSION
6
6
  s.authors = ['Victor Shepelev']
7
7
  s.email = 'zverok.offline@gmail.com'
8
- s.homepage = 'https://github.com/zverok/infoboxer'
8
+ s.homepage = 'https://github.com/molybdenum-99/infoboxer'
9
9
 
10
10
  s.summary = 'MediaWiki client and parser, targeting information extraction.'
11
11
  s.description = <<-EOF
@@ -40,4 +40,5 @@ Gem::Specification.new do |s|
40
40
  s.add_development_dependency 'ruby-prof'
41
41
  s.add_development_dependency 'vcr'
42
42
  s.add_development_dependency 'webmock'
43
+ s.add_development_dependency 'inch'
43
44
  end
@@ -1,8 +1,20 @@
1
1
  # encoding: utf-8
2
2
  module Infoboxer
3
3
  class MediaWiki
4
+ # DSL for defining "traits" for some site.
5
+ #
6
+ # More docs (and possible refactoring) to follow.
7
+ #
8
+ # You can look at current
9
+ # [English Wikipedia traits](https://github.com/molybdenum-99/infoboxer/blob/master/lib/infoboxer/definitions/en.wikipedia.org.rb)
10
+ # definitions in Infoboxer's repo.
4
11
  class Traits
5
12
  class << self
13
+ # Define set of templates for current site's traits.
14
+ #
15
+ # See {Templates::Set} for longer (yet insufficient) explanation.
16
+ #
17
+ # Expected to be used inside Traits definition block.
6
18
  def templates(&definition)
7
19
  @templates ||= Templates::Set.new
8
20
 
@@ -11,35 +23,49 @@ module Infoboxer
11
23
  @templates.define(&definition)
12
24
  end
13
25
 
14
- # NB: explicitly store all domains in base Traits class
26
+ # @private
15
27
  def domain(d)
28
+ # NB: explicitly store all domains in base Traits class
16
29
  Traits.domains.key?(d) and
17
30
  fail(ArgumentError, "Domain binding redefinition: #{Traits.domains[d]}")
18
31
 
19
32
  Traits.domains[d] = self
20
33
  end
21
34
 
35
+ # @private
22
36
  def get(domain, options = {})
23
37
  cls = Traits.domains[domain]
24
38
  cls ? cls.new(options) : Traits.new(options)
25
39
  end
26
40
 
41
+ # @private
27
42
  def domains
28
43
  @domains ||= {}
29
44
  end
30
45
 
46
+ # Define traits for some domain. Use it like:
47
+ #
48
+ # ```ruby
49
+ # MediaWiki::Traits.for 'ru.wikipedia.org' do
50
+ # templates do
51
+ # template '...' do
52
+ # # some template definition
53
+ # end
54
+ # end
55
+ # end
56
+ # ```
57
+ #
58
+ # Again, you can look at current
59
+ # [English Wikipedia traits](https://github.com/molybdenum-99/infoboxer/blob/master/lib/infoboxer/definitions/en.wikipedia.org.rb)
60
+ # for example implementation.
31
61
  def for(domain, &block)
32
62
  Class.new(self, &block).domain(domain)
33
63
  end
34
64
 
65
+ # @private
35
66
  alias_method :default, :new
36
67
  end
37
68
 
38
- DEFAULTS = {
39
- file_prefix: 'File',
40
- category_prefix: 'Category'
41
- }
42
-
43
69
  def initialize(options = {})
44
70
  @options = options
45
71
  @file_prefix = [DEFAULTS[:file_prefix], options.delete(:file_prefix)].
@@ -48,13 +74,21 @@ module Infoboxer
48
74
  flatten.compact.uniq
49
75
  end
50
76
 
77
+ # @private
51
78
  attr_reader :file_prefix, :category_prefix
52
79
 
53
- #attr_accessor :re
54
-
80
+ # @private
55
81
  def templates
56
82
  self.class.templates
57
83
  end
84
+
85
+ private
86
+
87
+ DEFAULTS = {
88
+ file_prefix: 'File',
89
+ category_prefix: 'Category'
90
+ }
91
+
58
92
  end
59
93
  end
60
94
  end
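The domain registry shown in this hunk (`for`, `domain`, `get`, `domains`) can be illustrated in isolation. The following is a minimal stand-in that mirrors only the lines visible in the diff; the class name `SiteTraits`, the `greeting` method and the `ru.example.org` domain are hypothetical and exist only for the example.

```ruby
# Hypothetical, simplified stand-in for MediaWiki::Traits (not the gem's code).
class SiteTraits
  class << self
    # The registry lives on the base class only, so subclasses share it.
    def domains
      @domains ||= {}
    end

    # Binds the receiving class to a domain; rebinding is an error.
    def domain(d)
      SiteTraits.domains.key?(d) and
        fail(ArgumentError, "Domain binding redefinition: #{SiteTraits.domains[d]}")

      SiteTraits.domains[d] = self
    end

    # Returns an instance of the bound subclass, or of the base class.
    def get(domain, options = {})
      cls = SiteTraits.domains[domain]
      cls ? cls.new(options) : SiteTraits.new(options)
    end

    # Creates an anonymous subclass from the block and registers it.
    def for(domain, &block)
      Class.new(self, &block).domain(domain)
    end
  end

  attr_reader :options

  def initialize(options = {})
    @options = options
  end
end

SiteTraits.for('ru.example.org') do
  def greeting
    'privet'
  end
end

puts SiteTraits.get('ru.example.org').greeting
puts SiteTraits.get('unknown.example.org').class
```

Calling `SiteTraits.domains` explicitly, rather than a bare `domains`, is what keeps every binding in the base class, as the "NB" comment in the diff points out.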
@@ -1,7 +1,42 @@
1
1
  module Infoboxer
2
- # Templates are cool, powerful and undocumented. Sorry :(
2
+ # This module covers advanced MediaWiki templates usage.
3
+ #
4
+ # It is seriously advised to read [Wikipedia docs](https://en.wikipedia.org/wiki/Help:Template)
5
+ # or at least look through it (and have it opened while reading further).
6
+ #
7
+ # If you just have a page with templates and want some variable value
8
+ # (like "page about country - infobox - total population"), you should
9
+ # be totally happy with {Tree::Template} and its features.
10
+ #
11
+ # What this module does is, basically, two things:
12
+ # * allow you to define for arbitrary templates how they are converted
13
+ # to text; by default, templates are totally excluded from text, which
14
+ # is not most reasonable behavior for many formatting templates;
15
+ # * allow you to define additional functionality for arbitrary templates;
16
+ # many of them containing pretty complicated logic (see, for ex.,
17
+ # [Template:Convert](https://en.wikipedia.org/wiki/Template:Convert),
18
+ # and it seems reasonable to extend instances of such a template.
19
+ #
20
+ # Infoboxer allows you to define {Templates::Set} of template-specific
21
+ # classes for some site/domain.
22
+ # There is already defined set of most commonly used templates at
23
+ # en.wikipedia.org (so, most of English Wikipedia texts will be rendered
24
+ # correctly, and also some advanced functionality is provided).
25
+ # You can take a look at
26
+ # [lib/infoboxer/definitions/en.wikipedia.org.rb](https://github.com/molybdenum-99/infoboxer/blob/master/lib/infoboxer/definitions/en.wikipedia.org.rb)
27
+ # to feel it (and also see a couple of TODOs and FIXMEs and other
28
+ # considerations).
29
+ #
30
+ # From Infoboxer's point-of-view, templates are the most complex part
31
+ # of Wikipedia, and we are currently trying hard to do the most reasonable
32
+ # things about them.
33
+ #
34
+ # Future versions also should:
35
+ # * define more of common English Wikipedia templates;
36
+ # * define templates for other popular wikis;
37
+ # * allow to add template definitions on-the-fly, while loading some
38
+ # page.
3
39
  #
4
- # I do my best.
5
40
  module Templates
6
41
  %w[base set].each do |tmpl|
7
42
  require_relative "templates/#{tmpl}"
@@ -15,29 +15,6 @@ module Infoboxer
15
15
  end
16
16
  end
17
17
 
18
- def unnamed_variables
19
- variables.select{|v| v.name =~ /^\d+$/}
20
- end
21
-
22
- def fetch(*patterns)
23
- Nodes[*patterns.map{|p| variables.find(name: p)}.flatten]
24
- end
25
-
26
- def fetch_hash(*patterns)
27
- fetch(*patterns).map{|v| [v.name, v]}.to_h
28
- end
29
-
30
- def fetch_date(*patterns)
31
- components = fetch(*patterns)
32
- components.pop while components.last.nil? && !components.empty?
33
-
34
- if components.empty?
35
- nil
36
- else
37
- Date.new(*components.map{|v| v.to_s.to_i})
38
- end
39
- end
40
-
41
18
  def ==(other)
42
19
  other.kind_of?(Tree::Template) && _eq(other)
43
20
  end
@@ -54,7 +31,9 @@ module Infoboxer
54
31
  end
55
32
 
56
33
  # Renders all of its unnamed variables as space-separated text
57
- # Also allows in-template navigation
34
+ # Also allows in-template navigation.
35
+ #
36
+ # Used for {Set} definitions.
58
37
  class Show < Base
59
38
  alias_method :children, :unnamed_variables
60
39
 
@@ -65,6 +44,9 @@ module Infoboxer
65
44
  end
66
45
  end
67
46
 
47
+ # Replaces template with replacement, while rendering.
48
+ #
49
+ # Used for {Set} definitions.
68
50
  class Replace < Base
69
51
  def replace
70
52
  fail(NotImplementedError, "Descendants should define :replace")
@@ -75,6 +57,9 @@ module Infoboxer
75
57
  end
76
58
  end
77
59
 
60
+ # Replaces template with its name, while rendering.
61
+ #
62
+ # Used for {Set} definitions.
78
63
  class Literal < Base
79
64
  alias_method :text, :name
80
65
  end
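The three rendering behaviours documented in this hunk (default, `Show`, `Literal`) can be sketched with plain classes. The class names and the constructor below are assumptions made for the illustration; the real classes derive from `Tree::Template` and are not reproduced here.

```ruby
# Hypothetical stand-ins for Templates::Base, Show and Literal.
class BaseTemplate
  attr_reader :name, :unnamed

  def initialize(name, unnamed = [])
    @name = name
    @unnamed = unnamed
  end

  # Assumed default: a template contributes nothing to the rendered text.
  def text
    ''
  end
end

# Renders its unnamed variables as space-separated text.
class ShowTemplate < BaseTemplate
  def text
    unnamed.join(' ')
  end
end

# Renders as its own name (useful for wrappers like {{,}}).
class LiteralTemplate < BaseTemplate
  alias_method :text, :name
end

puts ShowTemplate.new('small', ['text']).text
puts LiteralTemplate.new(',').text
```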
@@ -1,31 +1,103 @@
1
1
  # encoding: utf-8
2
2
  module Infoboxer
3
3
  module Templates
4
+ # Base class for defining set of templates, used for some site/domain.
5
+ #
6
+ # Currently only can be plugged in via {MediaWiki::Traits.templates}.
7
+ #
8
+ # Template set defines a DSL for creating new template definitions --
9
+ # also simplest ones and very complicated.
10
+ #
11
+ # You can look at implementation of English Wikipedia
12
+ # [common templates set](https://github.com/molybdenum-99/infoboxer/blob/master/lib/infoboxer/definitions/en.wikipedia.org.rb)
13
+ # in Infoboxer's repo.
14
+ #
4
15
  class Set
5
16
  def initialize(&definitions)
6
17
  @templates = []
7
18
  define(&definitions) if definitions
8
19
  end
9
-
20
+
21
+ # @private
10
22
  def find(name)
11
23
  _, template = @templates.detect{|m, t| m === name.downcase}
12
24
  template || Base
13
25
  end
14
26
 
27
+ # @private
15
28
  def define(&definitions)
16
29
  instance_eval(&definitions)
17
30
  end
18
31
 
32
+ # @private
19
33
  def clear
20
34
  @templates.clear
21
35
  end
22
36
 
23
- private
24
-
37
+ # Most common form of template definition.
38
+ #
39
+ # Can be used like:
40
+ #
41
+ # ```ruby
42
+ # template 'Age' do
43
+ # def from
44
+ # fetch_date('1', '2', '3')
45
+ # end
46
+ #
47
+ # def to
48
+ # fetch_date('4', '5', '6') || Date.today
49
+ # end
50
+ #
51
+ # def value
52
+ # (to - from).to_i / 365 # FIXME: obviously
53
+ # end
54
+ #
55
+ # def text
56
+ # "#{value} years"
57
+ # end
58
+ # end
59
+ # ```
60
+ #
61
+ # @param name Definition name.
62
+ # @param options Definition options.
63
+ # Currently recognized options are:
64
+ # * `:match` -- regexp or string, which matches template name to
65
+ # add this definition to (if not provided, `name` param used
66
+ # to match relevant templates);
67
+ # * `:base` -- name of template definition to use as a base class;
68
+ # for example you can do things like:
69
+ #
70
+ # ```ruby
71
+ # # ...inside template set definition...
72
+ # template 'Infobox', match: /^Infobox/ do
73
+ # # implementation
74
+ # end
75
+ #
76
+ # template 'Infobox cheese', base: 'Infobox' do
77
+ # end
78
+ # ```
79
+ #
80
+ # Expected to be used inside Set definition block.
25
81
  def template(name, options = {}, &definition)
26
82
  setup_class(name, Base, options, &definition)
27
83
  end
28
84
 
85
+ # Define list of "replacements": templates, which text should be replaced
86
+ # with arbitrary value.
87
+ #
88
+ # Example:
89
+ #
90
+ # ```ruby
91
+ # # ...inside template set definition...
92
+ # replace(
93
+ # '!!' => '||',
94
+ # '!(' => '['
95
+ # )
96
+ # ```
97
+ # Now, all templates with name `!!` will render as `||` when you
98
+ # call their (or their parents') {Tree::Node#text}.
99
+ #
100
+ # Expected to be used inside Set definition block.
29
101
  def replace(*replacements)
30
102
  case
31
103
  when replacements.count == 2 && replacements.all?{|r| r.is_a?(String)}
@@ -44,18 +116,50 @@ module Infoboxer
44
116
  end
45
117
  end
46
118
 
119
+ # Define list of "show children" templates. Those ones, when rendered
120
+ # as text, just provide join of their children text (space-separated).
121
+ #
122
+ # Example:
123
+ #
124
+ # ```ruby
125
+ # #...in template set definition...
126
+ # show 'Small'
127
+ # ```
128
+ # Now, wikitext paragraph looking like...
129
+ #
130
+ # ```
131
+ # This is {{small|text}} in template
132
+ # ```
133
+ # ...before this template definition had rendered like
134
+ # `"This is in template"` (template contents omitted), and after
135
+ # this definition it will render like `"This is text in template"`
136
+ # (template contents rendered as is).
137
+ #
138
+ # Expected to be used inside Set definition block.
47
139
  def show(*names)
48
140
  names.each do |name|
49
141
  setup_class(name, Show)
50
142
  end
51
143
  end
52
144
 
145
+ # Define list of "literally rendered templates". It means, when
146
+ # rendering text, template is replaced with just its name.
147
+ #
148
+ # Explanation: in
149
+ # MediaWiki, there are contexts (deeply in other templates and
150
+ # tables), when you can't just type something like `","` and not
151
+ # have it interpreted. So, wikis often define wrappers around
152
+ # those templates, looking like `{{,}}` -- so, while rendering texts,
153
+ # such templates can be replaced with their names.
154
+ #
155
+ # Expected to be used inside Set definition block.
53
156
  def literal(*names)
54
157
  names.each do |name|
55
158
  setup_class(name, Literal)
56
159
  end
57
160
  end
58
161
 
162
+ # @private
59
163
  def setup_class(name, base_class, options = {}, &definition)
60
164
  match = options.fetch(:match, name.downcase)
61
165
  base = options.fetch(:base, base_class)
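The lookup in `Set#find` relies on `===`, so a matcher may be either a string or a regexp. A reduced sketch of that idea follows; `MiniSet` and `register` are hypothetical names, and the sketch assumes matchers are compared against the downcased template name, as the `find` line in the diff suggests (hence the lowercase regexp).

```ruby
# Hypothetical miniature of a template set; not the gem's implementation.
class MiniSet
  Base = Class.new

  def initialize
    @templates = []
  end

  # Registers a new class under a matcher (string or regexp).
  def register(name, match: name.downcase, base: Base)
    klass = Class.new(base)
    @templates << [match, klass]
    klass
  end

  # First matcher that accepts the downcased name wins; Base is the fallback.
  def find(name)
    _, template = @templates.detect { |matcher, _| matcher === name.downcase }
    template || Base
  end
end

set = MiniSet.new
infobox = set.register('Infobox', match: /^infobox/)
puts set.find('Infobox country') == infobox
puts set.find('Unknown') == MiniSet::Base
```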
@@ -21,6 +21,7 @@ module Infoboxer
21
21
  children.index(child)
22
22
  end
23
23
 
24
+ # @private
24
25
  # Internal, used by {Parser}
25
26
  def push_children(*nodes)
26
27
  nodes.each{|c| c.parent = self}.each do |n|
@@ -45,21 +46,25 @@ module Infoboxer
45
46
 
46
47
  # Kinda "private" methods, used by Parser only -------------------
47
48
 
49
+ # @private
48
50
  # Internal, used by {Parser}
49
51
  def can_merge?(other)
50
52
  false
51
53
  end
52
54
 
55
+ # @private
53
56
  # Internal, used by {Parser}
54
57
  def closed!
55
58
  @closed = true
56
59
  end
57
60
 
61
+ # @private
58
62
  # Internal, used by {Parser}
59
63
  def closed?
60
64
  @closed
61
65
  end
62
66
 
67
+ # @private
63
68
  # Internal, used by {Parser}
64
69
  def empty?
65
70
  children.empty?
@@ -21,6 +21,7 @@ module Infoboxer
21
21
 
22
22
  include HTMLTagCommons
23
23
 
24
+ # @private
24
25
  # Internal, used by {Parser}.
25
26
  def empty?
26
27
  # even empty tag, for ex., <br>, should not be dropped!
@@ -0,0 +1,42 @@
1
+ module Infoboxer
2
+ module Tree
3
+ # Module included into everything, that can be treated as
4
+ # link to some MediaWiki page, despite of behavior. Namely,
5
+ # {Wikilink} and {Template}.
6
+ module Linkable
7
+ # Extracts wiki page by this link and returns it parsed (or nil,
8
+ # if page not found).
9
+ #
10
+ # About template "following" see also {Template#follow} docs.
11
+ #
12
+ # @return {MediaWiki::Page}
13
+ #
14
+ # **See also**:
15
+ # * {Tree::Nodes#follow} for extracting multiple links at once;
16
+ # * {MediaWiki#get} for basic information on page extraction.
17
+ def follow
18
+ client.get(link)
19
+ end
20
+
21
+ # Human-readable page URL
22
+ #
23
+ # @return [String]
24
+ def url
25
+ # FIXME: fragile as hell.
26
+ page.url.sub(/[^\/]+$/, link.gsub(' ', '_'))
27
+ end
28
+
29
+ protected
30
+
31
+ def page
32
+ page = lookup_parents(MediaWiki::Page).first or
33
+ fail("Not in a page from real source")
34
+ end
35
+
36
+ def client
37
+ page.client or fail("MediaWiki client not set")
38
+ end
39
+
40
+ end
41
+ end
42
+ end
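The URL construction in `Linkable#url` replaces the last path segment of the containing page's URL with the link target. A standalone sketch, assuming a page URL of the usual `/wiki/Title` shape; the helper name `link_url` is made up for the example.

```ruby
# Hypothetical helper mirroring the substitution used in Linkable#url.
def link_url(page_url, link)
  # The block form avoids backslash sequences in the replacement being interpreted.
  page_url.sub(/[^\/]+$/) { link.gsub(' ', '_') }
end

puts link_url('https://en.wikipedia.org/wiki/Argentina', 'Template:Infobox country')
```

As the FIXME in the diff admits, this is fragile: it only works when the page title is the final path segment and contains no slash.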
@@ -3,12 +3,14 @@ module Infoboxer
3
3
  module Tree
4
4
  # Represents item of ordered or unordered list.
5
5
  class ListItem < BaseParagraph
6
+ # @private
6
7
  # Internal, used by {Parser}
7
8
  def can_merge?(other)
8
9
  other.class == self.class &&
9
10
  other.children.first.kind_of?(List)
10
11
  end
11
12
 
13
+ # @private
12
14
  # Internal, used by {Parser}
13
15
  def merge!(other)
14
16
  ochildren = other.children.dup
@@ -111,6 +113,7 @@ module Infoboxer
111
113
  class List < Compound
112
114
  include Mergeable
113
115
 
116
+ # @private
114
117
  # Internal, used by {Parser}
115
118
  def merge!(other)
116
119
  ochildren = other.children.dup
@@ -123,6 +126,7 @@ module Infoboxer
123
126
  push_children(*ochildren)
124
127
  end
125
128
 
129
+ # @private
126
130
  # Internal, used by {Parser}
127
131
  def self.construct(marker, nodes)
128
132
  m = marker.shift
@@ -61,11 +61,13 @@ module Infoboxer
61
61
  Nodes[] # redefined in descendants
62
62
  end
63
63
 
64
+ # @private
64
65
  # Used only during tree construction in {Parser}.
65
66
  def can_merge?(other)
66
67
  false
67
68
  end
68
69
 
70
+ # @private
69
71
  # Whether node is empty (definition of "empty" varies for different
70
72
  # kinds of nodes). Used mainly in {Parser}.
71
73
  def empty?
@@ -146,6 +146,7 @@ module Infoboxer
146
146
  page.client.get(*links)
147
147
  end
148
148
 
149
+ # @private
149
150
  # Internal, used by {Parser}
150
151
  def <<(node)
151
152
  if node.kind_of?(Array)
@@ -159,6 +160,7 @@ module Infoboxer
159
160
  end
160
161
  end
161
162
 
163
+ # @private
162
164
  # Internal, used by {Parser}
163
165
  def strip
164
166
  res = dup
@@ -167,6 +169,7 @@ module Infoboxer
167
169
  res
168
170
  end
169
171
 
172
+ # @private
170
173
  # Internal, used by {Parser}
171
174
  def flow_templates
172
175
  make_nodes map{|n| n.is_a?(Paragraph) ? n.to_templates? : n}
@@ -15,7 +15,6 @@ module Infoboxer
15
15
  end
16
16
 
17
17
  # @private
18
- # Internal! Nothing to see here! Just YARD `@private` tag not working at class level
19
18
  class EmptyParagraph < Node
20
19
  def initialize(text)
21
20
  @text = text
@@ -30,7 +29,6 @@ module Infoboxer
30
29
  end
31
30
 
32
31
  # @private
33
- # Internal! Nothing to see here! Just YARD `@private` tag not working at class level
34
32
  module Mergeable
35
33
  def can_merge?(other)
36
34
  !closed? && self.class == other.class
@@ -49,7 +47,6 @@ module Infoboxer
49
47
  end
50
48
 
51
49
  # @private
52
- # Internal! Nothing to see here! Just YARD `@private` tag not working at class level
53
50
  class MergeableParagraph < BaseParagraph
54
51
  include Mergeable
55
52
 
@@ -61,21 +58,25 @@ module Infoboxer
61
58
 
62
59
  # Represents plain text paragraph.
63
60
  class Paragraph < MergeableParagraph
61
+ # @private
64
62
  # Internal, used by {Parser} for merging
65
63
  def splitter
66
64
  Text.new(' ')
67
65
  end
68
66
 
67
+ # @private
69
68
  # Internal, used by {Parser}
70
69
  def templates_only?
71
70
  children.all?{|c| c.is_a?(Template) || c.is_a?(Text) && c.raw_text.strip.empty?}
72
71
  end
73
72
 
73
+ # @private
74
74
  # Internal, used by {Parser}
75
75
  def to_templates
76
76
  children.select(&filter(itself: Template))
77
77
  end
78
78
 
79
+ # @private
79
80
  # Internal, used by {Parser}
80
81
  def to_templates?
81
82
  templates_only? ? to_templates : self
@@ -104,6 +105,7 @@ module Infoboxer
104
105
  #
105
106
  # Paragraph-level thing, can contain many lines of text.
106
107
  class Pre < MergeableParagraph
108
+ # @private
107
109
  # Internal, used by {Parser}
108
110
  def merge!(other)
109
111
  if other.is_a?(EmptyParagraph) && !other.text.empty?
@@ -113,6 +115,7 @@ module Infoboxer
113
115
  end
114
116
  end
115
117
 
118
+ # @private
116
119
  # Internal, used by {Parser} for merging
117
120
  def splitter
118
121
  Text.new("\n")
@@ -18,6 +18,7 @@ module Infoboxer
18
18
  # @!attribute [r] name
19
19
  def_readers :name
20
20
 
21
+ # @private
21
22
  # Internal, used by {Parser}
22
23
  def empty?
23
24
  # even empty tag should not be dropped!
@@ -1,4 +1,6 @@
1
1
  # encoding: utf-8
2
+ require_relative 'linkable'
3
+
2
4
  module Infoboxer
3
5
  module Tree
4
6
  # Template variable.
@@ -29,15 +31,79 @@ module Infoboxer
29
31
  end
30
32
  end
31
33
 
32
- # Wikipedia template.
34
+ # Represents MediaWiki **template**.
35
+ #
36
+ # [**Template**](https://en.wikipedia.org/wiki/Wikipedia:Templates)
37
+ # is basically a thing with name, some variables and their
38
+ # values. When pages are displayed in browser, templates are rendered in
39
+ # something different by wiki engine; yet, when extracting information
40
+ # with Infoboxer, you are working with original templates.
41
+ #
42
+ # It requires some mastering and understanding, yet allows to do
43
+ # very powerful things. There are many kinds of them, from pure
44
+ # formatting-related (which are typically not more than small bells
45
+ # and whistles for page outlook, and should be rendered as a text)
46
+ # to very information-heavy ones, like
47
+ # [**infoboxes**](https://en.wikipedia.org/wiki/Help:Infobox), from
48
+ # which Infoboxer borrows its name!
49
+ #
50
+ # Basically, for information extraction from template you'll list
51
+ # its {#variables}, and then use {#fetch} method
52
+ # (and its variants: {#fetch_hash}/{#fetch_date}) to extract their
53
+ # values.
54
+ #
55
+ # ### On variables naming
56
+ # MediaWiki templates can contain _named_ and _unnamed_ variables.
57
+ # Example:
58
+ #
59
+ # ```
60
+ # {{birth date and age|1953|2|19|df=y}}
61
+ # ```
33
62
  #
34
- # Templates are complicated! Also, they are useful.
63
+ # This is template with name "birth date and age", three unnamed
64
+ # variables with values "1953", "2" and "19", and one named variable
65
+ # with name "df" and value "y".
35
66
  #
36
- # You'd need to understand them from [Wikipedia docs](https://en.wikipedia.org/wiki/Wikipedia:Templates)
37
- # and then use much of Infoboxer's goodness provided with {Templates}
38
- # separate module.
67
+ # For consistency, Infoboxer treats unnamed variables _exactly_ the
68
+ # same way MediaWiki does: they are considered to have numeric names,
69
+ # which _start from 1_ and are _stored as strings_. So, for
70
+ # template shown above, the following is correct:
71
+ #
72
+ # ```ruby
73
+ # template.fetch('1').text == '1953'
74
+ # template.fetch('2').text == '2'
75
+ # template.fetch('3').text == '19'
76
+ # template.fetch('df').text == 'y'
77
+ # ```
78
+ #
79
+ # Note also, that _named variables with simple text values_ are
80
+ # duplicated as a template node {Node#params}, so, the following is
81
+ # correct also:
82
+ #
83
+ # ```ruby
84
+ # template.params['df'] == 'y'
85
+ # template.params.has_key?('1') == false
86
+ # ```
87
+ #
88
+ # For more advanced topics, like subclassing templates by names and
89
+ # converting them to inline text, please read {Templates} module's
90
+ # documentation.
39
91
  class Template < Compound
40
- attr_reader :name, :variables
92
+ # Template name, designating its contents structure.
93
+ #
94
+ # See also {Linkable#url #url}, which you can navigate to read template's
95
+ # definition (and, in Wikipedia and many other projects, its
96
+ # documentation).
97
+ #
98
+ # @return [String]
99
+ attr_reader :name
100
+
101
+ # Template variables list.
102
+ #
103
+ # See {Var} class to understand what you can do with them.
104
+ #
105
+ # @return [Nodes<Var>]
106
+ attr_reader :variables
41
107
 
42
108
  def initialize(name, variables = Nodes[])
43
109
  super(Nodes[], extract_params(variables))
@@ -51,6 +117,97 @@ module Infoboxer
51
117
  variables.map{|var| var.to_tree(level+1)}.join
52
118
  end
53
119
 
120
+ # Returns list of template variables with numeric names (which
121
+ # are treated as "unnamed" variables by MediaWiki templates, see
122
+ # {Template class docs} for explanation).
123
+ #
124
+ # @return [Nodes<Var>]
125
+ def unnamed_variables
126
+ variables.find(name: /^\d+$/)
127
+ end
128
+
129
+ # Fetches template variable(s) by name(s) or patterns.
130
+ #
131
+ # Usage:
132
+ #
133
+ # ```ruby
134
+ # argentina.infobox.fetch('leader_title_1') # => one Var node
135
+ # argentina.infobox.fetch('leader_title_1',
136
+ # 'leader_name_1') # => two Var nodes
137
+ # argentina.infobox.fetch(/leader_title_\d+/) # => several Var nodes
138
+ # ```
139
+ #
140
+ # @return [Nodes<Var>]
141
+ def fetch(*patterns)
142
+ Nodes[*patterns.map{|p| variables.find(name: p)}.flatten]
143
+ end
144
+
145
+ # Fetches hash `{name => variable}`, by same patterns as {#fetch}.
146
+ #
147
+ # @return [Hash<String => Var>]
148
+ def fetch_hash(*patterns)
149
+ fetch(*patterns).map{|v| [v.name, v]}.to_h
150
+ end
151
+
152
+ # Fetches date by list of variable names containing date components.
153
+ #
154
+ # _(Experimental, subject to change or enhancement.)_
155
+ #
156
+ # Explanation: if you have template like
157
+ # ```
158
+ # {{birth date and age|1953|2|19|df=y}}
159
+ # ```
160
+ # ...there is a short way to obtain date from it:
161
+ # ```ruby
162
+ # template.fetch_date('1', '2', '3') # => Date.new(1953,2,19)
163
+ # ```
164
+ #
165
+ # @return [Date]
166
+ def fetch_date(*patterns)
167
+ components = fetch(*patterns)
168
+ components.pop while components.last.nil? && !components.empty?
169
+
170
+ if components.empty?
171
+ nil
172
+ else
173
+ Date.new(*components.map{|v| v.to_s.to_i})
174
+ end
175
+ end
176
+
177
+ include Linkable
178
+
179
+ # @!method follow
180
+ # Extracts template source and returns it parsed (or nil,
181
+ # if template not found).
182
+ #
183
+ # **NB**: Infoboxer does NO variable substitution or other template
184
+ # evaluation actions. Moreover, it will almost certainly NOT parse
185
+ # template definitions correctly. You should use this method ONLY
186
+ # for "transclusion" templates (parts of content, which are
187
+ # included into other pages "as is").
188
+ #
189
+ # Look for example at [this page's](https://en.wikipedia.org/wiki/Tropical_and_subtropical_coniferous_forests)
190
+ # [source](https://en.wikipedia.org/w/index.php?title=Tropical_and_subtropical_coniferous_forests&action=edit):
191
+ # each subtable about some region is just a transclusion of
192
+ # template. This can be processed like:
193
+ #
194
+ # ```ruby
195
+ # Infoboxer.wp.get('Tropical and subtropical coniferous forests').
196
+ # templates(name: /forests$/).
197
+ # follow.tables #.and_so_on
198
+ # ```
199
+ #
200
+ # @return {MediaWiki::Page}
201
+ #
202
+ # **See also** {Linkable#follow} for general notes on the following links.
203
+
204
+ # Wikilink name of this template's source.
205
+ def link
206
+ # FIXME: super-naive for now, doesn't think about subpages and stuff.
207
+ "Template:#{name}"
208
+ end
209
+
210
+ # @private
54
211
  # Internal, used by {Parser}.
55
212
  def empty?
56
213
  false
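The variable naming rules and the `fetch`/`fetch_date` helpers documented in this file can be tried out without the gem. This is a naive sketch: `Var`, `parse_vars` and the free-standing `fetch`/`fetch_date` functions are hypothetical, and the parser handles only flat templates without nested markup.

```ruby
require 'date'

# Hypothetical minimal variable record.
Var = Struct.new(:name, :value)

# Splits a flat template into its name and variables; unnamed
# variables get string names '1', '2', ... as in MediaWiki.
def parse_vars(source)
  parts = source[2...-2].split('|')
  name = parts.shift
  index = 0
  vars = parts.map do |part|
    if part.include?('=')
      key, value = part.split('=', 2)
      Var.new(key.strip, value.strip)
    else
      index += 1
      Var.new(index.to_s, part.strip)
    end
  end
  [name, vars]
end

# Selects variables whose names match any string or regexp pattern.
def fetch(vars, *patterns)
  patterns.flat_map { |pattern| vars.select { |v| pattern === v.name } }
end

# Builds a Date from the numeric values of the matched variables.
def fetch_date(vars, *patterns)
  components = fetch(vars, *patterns).map { |v| v.value.to_i }
  components.empty? ? nil : Date.new(*components)
end

_name, vars = parse_vars('{{birth date and age|1953|2|19|df=y}}')
puts fetch_date(vars, '1', '2', '3')
```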
@@ -29,11 +29,13 @@ module Infoboxer
29
29
  "#{indent(level)}#{text} <#{descr}>\n"
30
30
  end
31
31
 
32
+ # @private
32
33
  # Internal, used by {Parser}
33
34
  def can_merge?(other)
34
35
  other.is_a?(String) || other.is_a?(Text)
35
36
  end
36
37
 
38
+ # @private
37
39
  # Internal, used by {Parser}
38
40
  def merge!(other)
39
41
  if other.is_a?(String)
@@ -45,6 +47,7 @@ module Infoboxer
45
47
  end
46
48
  end
47
49
 
50
+ # @private
48
51
  # Internal, used by {Parser}
49
52
  def empty?
50
53
  raw_text.empty?
@@ -1,4 +1,6 @@
1
1
  # encoding: utf-8
2
+ require_relative 'linkable'
3
+
2
4
  module Infoboxer
3
5
  module Tree
4
6
  # Internal MediaWiki link class.
@@ -6,6 +8,8 @@ module Infoboxer
6
8
  # See [Wikipedia docs](https://en.wikipedia.org/wiki/Help:Link#Wikilinks)
7
9
  # for extensive explanation of Wikilink concept.
8
10
  #
11
+ # Note, that Wikilink is {Linkable}, so you can {Linkable#follow #follow}
12
+ # it to obtain linked pages.
9
13
  class Wikilink < Link
10
14
  def initialize(*)
11
15
  super
@@ -37,20 +41,7 @@ module Infoboxer
37
41
  # See {#topic} for explanation.
38
42
  attr_reader :refinement
39
43
 
40
- # Extracts wiki page by this link and returns it parsed (or nil,
41
- # if page not found).
42
- #
43
- # @return {MediaWiki::Page}
44
- #
45
- # **See also**:
46
- # * {Tree::Nodes#follow} for extracting multiple links at once;
47
- # * {MediaWiki#get} for basic information on page extraction.
48
- def follow
49
- page = lookup_parents(MediaWiki::Page).first or
50
- fail("Not in a page from real source")
51
- page.client or fail("MediaWiki client not set")
52
- page.client.get(link)
53
- end
44
+ include Linkable
54
45
 
55
46
  private
56
47
 
@@ -1,4 +1,4 @@
1
1
  # encoding: utf-8
2
2
  module Infoboxer
3
- VERSION = '0.1.0'
3
+ VERSION = '0.1.1'
4
4
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: infoboxer
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.0
4
+ version: 0.1.1
5
5
  platform: ruby
6
6
  authors:
7
7
  - Victor Shepelev
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2015-08-07 00:00:00.000000000 Z
11
+ date: 2015-08-11 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: htmlentities
@@ -178,6 +178,20 @@ dependencies:
178
178
  - - ">="
179
179
  - !ruby/object:Gem::Version
180
180
  version: '0'
181
+ - !ruby/object:Gem::Dependency
182
+ name: inch
183
+ requirement: !ruby/object:Gem::Requirement
184
+ requirements:
185
+ - - ">="
186
+ - !ruby/object:Gem::Version
187
+ version: '0'
188
+ type: :development
189
+ prerelease: false
190
+ version_requirements: !ruby/object:Gem::Requirement
191
+ requirements:
192
+ - - ">="
193
+ - !ruby/object:Gem::Version
194
+ version: '0'
181
195
  description: |2
182
196
  Infoboxer is library targeting use of Wikipedia (or any other
183
197
  MediaWiki-based wiki) as a rich powerful data source.
@@ -188,6 +202,8 @@ extra_rdoc_files: []
188
202
  files:
189
203
  - ".dokaz"
190
204
  - ".yardopts"
205
+ - CHANGELOG.md
206
+ - CONTRIBUTING.md
191
207
  - LICENSE.txt
192
208
  - Parsing.md
193
209
  - README.md
@@ -225,6 +241,7 @@ files:
225
241
  - lib/infoboxer/tree/html.rb
226
242
  - lib/infoboxer/tree/image.rb
227
243
  - lib/infoboxer/tree/inline.rb
244
+ - lib/infoboxer/tree/linkable.rb
228
245
  - lib/infoboxer/tree/list.rb
229
246
  - lib/infoboxer/tree/node.rb
230
247
  - lib/infoboxer/tree/nodes.rb
@@ -245,7 +262,7 @@ files:
245
262
  - regression/pages/south_america.wiki
246
263
  - regression/pages/ukraine.wiki
247
264
  - regression/pages/usa.wiki
248
- homepage: https://github.com/zverok/infoboxer
265
+ homepage: https://github.com/molybdenum-99/infoboxer
249
266
  licenses:
250
267
  - MIT
251
268
  metadata: {}