wikiscript 0.3.1 → 0.3.2
- checksums.yaml +4 -4
- data/README.md +193 -6
- data/Rakefile +1 -1
- data/lib/wikiscript/page.rb +15 -3
- data/lib/wikiscript/version.rb +1 -1
- data/test/test_page_reader.rb +51 -14
- metadata +6 -4
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: a34183c24ce2eac79cf72edcec39c562e6c74065
+  data.tar.gz: 56fff65e58fc1fbbbc2eeaa111d22d6ebce84f63
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 85fa8e16dffbdfdaf6b683dae530a692eb6c570a03345ddf2a013b248bbb51aaab9ce495850bdd564a163e35c37f0f80b3f6d47b5ee994e8effe02c413ac8ba8
+  data.tar.gz: ca8d9133a251f634db722575100d260c79859437fb393d9ac5b8181d0725f7d54a5290b4dfa0436a13aa73338c818adeea5c89c06b03b2f2e7360aa102f3e1c4
data/README.md
CHANGED
@@ -12,16 +12,203 @@ Read-only access to wikikpedia pages.
 Example - Get wikitext source (via `en.wikipedia.org/w/index.php?action=raw&title=<title>`):
 
 
+``` ruby
+page = Wikiscript::Page.get( '2022_FIFA_World_Cup' ) # same as Wikiscript.get
+page.text
 ```
->> page = Wikiscript::Page.new( '2014_FIFA_World_Cup_squads' )
->> page.text
 
-
-
-
-
+prints
+
+```
+The '''2022 FIFA World Cup''' is scheduled to be the 22nd edition of the [[FIFA World Cup]],
+the quadrennial international men's [[association football]] championship contested by the
+[[List of men's national association football teams|national teams]] of the member associations of [[FIFA]].
+It is scheduled to take place in [[Qatar]] in 2022. This will be the first World Cup ever to be held
+in the [[Arab world]] and the first in a Muslim-majority country...
+```
+
+Or build your own page from scratch (no download):
+
+``` ruby
+page = Wikiscript::Page.new( <<TXT, title: '2022_FIFA_World_Cup' )
+The '''2022 FIFA World Cup''' is scheduled to be the 22nd edition of the [[FIFA World Cup]],
+the quadrennial international men's [[association football]] championship contested by the
+[[List of men's national association football teams|national teams]] of the member associations of [[FIFA]].
+It is scheduled to take place in [[Qatar]] in 2022. This will be the first World Cup ever to be held
+in the [[Arab world]] and the first in a Muslim-majority country...
+TXT
+page.text
+```
+
+prints
+
+```
+The '''2022 FIFA World Cup''' is scheduled to be the 22nd edition of the [[FIFA World Cup]],
+the quadrennial international men's [[association football]] championship contested by the
+[[List of men's national association football teams|national teams]] of the member associations of [[FIFA]].
+It is scheduled to take place in [[Qatar]] in 2022. This will be the first World Cup ever to be held
+in the [[Arab world]] and the first in a Muslim-majority country...
+```
+
+
+### Tables
+
+Parse wiki tables into an array. Example:
+
+``` ruby
+table = Wikiscript.parse_table( <<TXT )
+{|
+|-
+! header1
+! header2
+! header3
+|-
+| row1cell1
+| row1cell2
+| row1cell3
+|-
+| row2cell1
+| row2cell2
+| row2cell3
+|}
+TXT
+
+# -or-
+
+table = Wikiscript.parse_table( <<TXT )
+{|
+! header1 !! header2 !! header3
+|-
+| row1cell1 || row1cell2 || row1cell3
+|-
+| row2cell1 || row2cell2 || row2cell3
+|}
+TXT
+
+# -or-
+
+table = Wikiscript.parse_table( <<TXT )
+{|
+|-
+!
+header1
+!
+header2
+!
+header3
+|-
+|
+row1cell1
+|
+row1cell2
+|
+row1cell3
+|-
+|
+row2cell1
+|
+row2cell2
+|
+row2cell3
+|}
+TXT
+```
+
+resulting in:
+
+``` ruby
+pp table
+#=> [["header1", "header2", "header3"],
+#    ["row1cell1", "row1cell2", "row1cell3"],
+#    ["row2cell1", "row2cell2", "row2cell3"]]
 ```
 
+Note: `parse_table` will strip/remove (leading) style attributes (e.g. `àttribute="value" |` and (inline) bold and italic emphases (e.g. `''`) from the (cell) text. Example:
+
+``` ruby
+table = Wikiscript.parse_table( <<TXT )
+{|
+|-
+! style="width:200px;"|Club
+! style="width:150px;"|City
+|-
+|[[Biu Chun Rangers]]||[[Sham Shui Po]]
+|-
+|bgcolor=#ffff44 |''[[Eastern Sports Club|Eastern]]''||[[Mong Kok]]
+|-
+|[[HKFC Soccer Section]]||[[Happy Valley, Hong Kong|Happy Valley]]
+|}
+TXT
+```
+
+resulting in:
+
+``` ruby
+pp table
+#=> [["Club", "City"],
+#    ["[[Biu Chun Rangers]]", "[[Sham Shui Po]]"],
+#    ["[[Eastern Sports Club|Eastern]]", "[[Mong Kok]]"],
+#    ["[[HKFC Soccer Section]]", "[[Happy Valley, Hong Kong|Happy Valley]]"]]
+```
+
+### Links
+
+Split links into two parts. Note: The alternate link title is optional. Example:
+
+``` ruby
+link, title = Wikiscript.parse_link( '[[La Florida, Chile|La Florida]]' )
+link #=> "La Florida, Chile"
+title #=> "La Florida"
+
+link, title = Wikiscript.parse_link( '[[ La Florida, Chile]]' )
+link #=> "La Florida, Chile"
+title #=> nil
+
+link, title = Wikiscript.parse_link( 'La Florida' )
+link #=> nil
+title #=> nil
+```
+
+### Document Element Structure
+
+Get the document's element structure.
+Note: For now only section headings (`h1`, `h2`, `h3`, ...) and tables are supported.
+Example:
+
+``` ruby
+nodes = Wikiscript.parse( <<TXT )
+=Heading 1==
+==Heading 2==
+===Heading 3===
+
+{|
+|-
+! header1
+! header2
+! header3
+|-
+| row1cell1
+| row1cell2
+| row1cell3
+|-
+| row2cell1
+| row2cell2
+| row2cell3
+|}
+TXT
+
+pp nodes
+#=> [[:h1, "Heading 1"],
+#    [:h2, "Heading 2"],
+#    [:h3, "Heading 3"],
+#    [:table, [["header1", "header2", "header3"],
+#              ["row1cell1", "row1cell2", "row1cell3"],
+#              ["row2cell1", "row2cell2", "row2cell3"]]]
+```
+
+
+That's all for now. More functionality will get added over time.
+
 
 
 ## Install
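Note on the Links section added to the README: the three `parse_link` results shown there can be reproduced with a single regular expression. The following is only a sketch of that behaviour, not the gem's code; `split_wiki_link` and `WIKI_LINK_RE` are hypothetical names introduced here for illustration.

```ruby
# Hypothetical helper (NOT part of wikiscript): split a wiki link of the
# form [[target|title]] into its two parts. The |title part is optional.
WIKI_LINK_RE = /\A\[\[\s*([^|\]]+?)\s*(?:\|\s*([^\]]+?)\s*)?\]\]\z/

def split_wiki_link( text )
  m = WIKI_LINK_RE.match( text.strip )
  return [nil, nil] unless m   # input is not a [[...]] link
  [m[1], m[2]]                 # m[2] is nil when no |title part is present
end

p split_wiki_link( '[[La Florida, Chile|La Florida]]' )  #=> ["La Florida, Chile", "La Florida"]
p split_wiki_link( '[[ La Florida, Chile]]' )            #=> ["La Florida, Chile", nil]
p split_wiki_link( 'La Florida' )                        #=> [nil, nil]
```

Returning a two-element array keeps the `link, title = ...` destructuring from the README usable even when nothing matches.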
data/Rakefile
CHANGED
@@ -5,7 +5,7 @@ Hoe.spec 'wikiscript' do
 
   self.version = Wikiscript::VERSION
 
-  self.summary =
+  self.summary = "wikiscript - scripts for wikipedia (get wikitext for page, parse tables 'n' links, etc.)"
   self.description = summary
 
   self.urls = ['https://github.com/wikiscript/wikiscript']
data/lib/wikiscript/page.rb
CHANGED
@@ -33,11 +33,23 @@ module Wikiscript
     end
 
     def text
-      @text ||= get
+      @text ||= get # cache text (from request)
+    end
+
+    def nodes
+      @nodes ||= parse # cache text (from parse)
+    end
+
+    def each ## loop over all nodes / elements -note: nodes is a (flat) list (array) for now
+      nodes.each do |node|
+        yield( node )
+      end
     end
 
 
     def get ## "force" refresh text (get/fetch/download)
+      @nodes = nil ## note: reset cached parsed nodes too
+
       @text = Client.new.text( @title, lang: @lang )
       @text
     end
@@ -45,9 +57,9 @@ module Wikiscript
     alias_method :download, :get
 
 
-
     def parse ## todo/change: use/find a different name e.g. doc/elements/etc. - why? why not?
-      PageReader.parse( text )
+      @nodes = PageReader.parse( text )
+      @nodes
     end
   end # class Page
 end # Wikiscript
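Note on the `page.rb` change: `nodes` memoizes the parse result, `each` iterates over it, and `get` clears the cached nodes so a refreshed text is parsed again. The pattern can be tried in isolation with the sketch below. `CachedPage`, the injected `fetcher`/`parser` lambdas, and the `include Enumerable` line are assumptions made for this illustration; they stand in for `Wikiscript::Page`, `Client`, and `PageReader` and are not taken from the gem.

```ruby
# Standalone sketch of the caching pattern in the hunk above.
# CachedPage and its fetcher/parser arguments are hypothetical stand-ins.
class CachedPage
  include Enumerable   # assumption (not in the diff): adds map/select/to_a on top of each

  def initialize( fetcher:, parser: )
    @fetcher = fetcher
    @parser  = parser
  end

  def text
    @text ||= get      # fetch once, then reuse
  end

  def nodes
    @nodes ||= parse   # parse once, then reuse
  end

  def each( &block )
    nodes.each( &block )
  end

  def get              # forced refresh: drop cached nodes, fetch text again
    @nodes = nil
    @text  = @fetcher.call
  end

  def parse
    @nodes = @parser.call( text )
  end
end

fetches = 0
page = CachedPage.new(
  fetcher: -> { fetches += 1; "=A=\n==B==" },
  parser:  ->( text ) { text.lines.map( &:chomp ) }
)

page.nodes   # first call fetches and parses
page.nodes   # served from the cache, no second fetch
page.get     # forced refresh, cached nodes are dropped
```

Resetting `@nodes` inside `get` is what keeps the two caches consistent: without it, `nodes` would keep returning the parse of the old text after a refresh.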
data/lib/wikiscript/version.rb
CHANGED
data/test/test_page_reader.rb
CHANGED
@@ -11,7 +11,7 @@ require 'helper'
 class TestPageReader < MiniTest::Test
 
   def test_basic
-
+    nodes = Wikiscript.parse( <<TXT )
 =Heading 1==
 ==Heading 2==
 ===Heading 3===
@@ -32,15 +32,15 @@ class TestPageReader < MiniTest::Test
 |}
 TXT
 
-    pp
+    pp nodes
 
-    assert_equal 4,
-    assert_equal [:h1, 'Heading 1'],
-    assert_equal [:h2, 'Heading 2'],
-    assert_equal [:h3, 'Heading 3'],
+    assert_equal 4, nodes.size
+    assert_equal [:h1, 'Heading 1'], nodes[0]
+    assert_equal [:h2, 'Heading 2'], nodes[1]
+    assert_equal [:h3, 'Heading 3'], nodes[2]
     assert_equal [:table, [['header1', 'header2', 'header3'],
                            ['row1cell1', 'row1cell2', 'row1cell3'],
-                           ['row2cell1', 'row2cell2', 'row2cell3']]],
+                           ['row2cell1', 'row2cell2', 'row2cell3']]], nodes[3]
   end
 
   def test_parse
@@ -65,16 +65,53 @@ TXT
 |}
 TXT
 
-
-    pp
+    nodes = page.parse
+    pp nodes
 
-    assert_equal 4,
-    assert_equal [:h1, 'Heading 1'],
-    assert_equal [:h2, 'Heading 2'],
-    assert_equal [:h3, 'Heading 3'],
+    assert_equal 4, nodes.size
+    assert_equal [:h1, 'Heading 1'], nodes[0]
+    assert_equal [:h2, 'Heading 2'], nodes[1]
+    assert_equal [:h3, 'Heading 3'], nodes[2]
     assert_equal [:table, [['header1', 'header2', 'header3'],
                            ['row1cell1', 'row1cell2', 'row1cell3'],
-                           ['row2cell1', 'row2cell2', 'row2cell3']]],
+                           ['row2cell1', 'row2cell2', 'row2cell3']]], nodes[3]
+  end
+
+  def test_each
+    page = Wikiscript::Page.new( <<TXT )
+=Heading 1==
+==Heading 2==
+===Heading 3===
+
+{|
+|-
+! header1
+! header2
+! header3
+|-
+| row1cell1
+| row1cell2
+| row1cell3
+|-
+| row2cell1
+| row2cell2
+| row2cell3
+|}
+TXT
+
+    nodes = []
+    page.each do |node|
+      nodes << node
+    end
+    pp nodes
+
+    assert_equal 4, nodes.size
+    assert_equal [:h1, 'Heading 1'], nodes[0]
+    assert_equal [:h2, 'Heading 2'], nodes[1]
+    assert_equal [:h3, 'Heading 3'], nodes[2]
+    assert_equal [:table, [['header1', 'header2', 'header3'],
+                           ['row1cell1', 'row1cell2', 'row1cell3'],
+                           ['row2cell1', 'row2cell2', 'row2cell3']]], nodes[3]
   end
 
 end # class TestPageReader
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: wikiscript
 version: !ruby/object:Gem::Version
-  version: 0.3.
+  version: 0.3.2
 platform: ruby
 authors:
 - Gerald Bauer
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2019-09-
+date: 2019-09-22 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: logutils
@@ -66,7 +66,8 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '3.16'
-description: wikiscript - scripts for wikipedia (get wikitext for page
+description: wikiscript - scripts for wikipedia (get wikitext for page, parse tables
+  'n' links, etc.)
 email: opensport@googlegroups.com
 executables: []
 extensions: []
@@ -120,5 +121,6 @@ rubyforge_project:
 rubygems_version: 2.5.2
 signing_key:
 specification_version: 4
-summary: wikiscript - scripts for wikipedia (get wikitext for page
+summary: wikiscript - scripts for wikipedia (get wikitext for page, parse tables 'n'
+  links, etc.)
 test_files: []