wikiscript 0.3.1 → 0.3.2
- checksums.yaml +4 -4
- data/README.md +193 -6
- data/Rakefile +1 -1
- data/lib/wikiscript/page.rb +15 -3
- data/lib/wikiscript/version.rb +1 -1
- data/test/test_page_reader.rb +51 -14
- metadata +6 -4
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: a34183c24ce2eac79cf72edcec39c562e6c74065
+  data.tar.gz: 56fff65e58fc1fbbbc2eeaa111d22d6ebce84f63
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 85fa8e16dffbdfdaf6b683dae530a692eb6c570a03345ddf2a013b248bbb51aaab9ce495850bdd564a163e35c37f0f80b3f6d47b5ee994e8effe02c413ac8ba8
+  data.tar.gz: ca8d9133a251f634db722575100d260c79859437fb393d9ac5b8181d0725f7d54a5290b4dfa0436a13aa73338c818adeea5c89c06b03b2f2e7360aa102f3e1c4
data/README.md
CHANGED
@@ -12,16 +12,203 @@ Read-only access to wikikpedia pages.
 Example - Get wikitext source (via `en.wikipedia.org/w/index.php?action=raw&title=<title>`):
 
 
+``` ruby
+page = Wikiscript::Page.get( '2022_FIFA_World_Cup' ) # same as Wikiscript.get
+page.text
 ```
->> page = Wikiscript::Page.new( '2014_FIFA_World_Cup_squads' )
->> page.text
 
-
-
-
-
+prints
+
+```
+The '''2022 FIFA World Cup''' is scheduled to be the 22nd edition of the [[FIFA World Cup]],
+the quadrennial international men's [[association football]] championship contested by the
+[[List of men's national association football teams|national teams]] of the member associations of [[FIFA]].
+It is scheduled to take place in [[Qatar]] in 2022. This will be the first World Cup ever to be held
+in the [[Arab world]] and the first in a Muslim-majority country...
+```
+
+Or build your own page from scratch (no download):
+
+``` ruby
+page = Wikiscript::Page.new( <<TXT, title: '2022_FIFA_World_Cup' )
+The '''2022 FIFA World Cup''' is scheduled to be the 22nd edition of the [[FIFA World Cup]],
+the quadrennial international men's [[association football]] championship contested by the
+[[List of men's national association football teams|national teams]] of the member associations of [[FIFA]].
+It is scheduled to take place in [[Qatar]] in 2022. This will be the first World Cup ever to be held
+in the [[Arab world]] and the first in a Muslim-majority country...
+TXT
+page.text
+```
+
+prints
+
+```
+The '''2022 FIFA World Cup''' is scheduled to be the 22nd edition of the [[FIFA World Cup]],
+the quadrennial international men's [[association football]] championship contested by the
+[[List of men's national association football teams|national teams]] of the member associations of [[FIFA]].
+It is scheduled to take place in [[Qatar]] in 2022. This will be the first World Cup ever to be held
+in the [[Arab world]] and the first in a Muslim-majority country...
+```
+
+
+### Tables
+
+Parse wiki tables into an array. Example:
+
+``` ruby
+table = Wikiscript.parse_table( <<TXT )
+{|
+|-
+! header1
+! header2
+! header3
+|-
+| row1cell1
+| row1cell2
+| row1cell3
+|-
+| row2cell1
+| row2cell2
+| row2cell3
+|}
+TXT
+
+# -or-
+
+table = Wikiscript.parse_table( <<TXT )
+{|
+! header1 !! header2 !! header3
+|-
+| row1cell1 || row1cell2 || row1cell3
+|-
+| row2cell1 || row2cell2 || row2cell3
+|}
+TXT
+
+# -or-
+
+table = Wikiscript.parse_table( <<TXT )
+{|
+|-
+!
+header1
+!
+header2
+!
+header3
+|-
+|
+row1cell1
+|
+row1cell2
+|
+row1cell3
+|-
+|
+row2cell1
+|
+row2cell2
+|
+row2cell3
+|}
+TXT
+```
+
+resulting in:
+
+``` ruby
+pp table
+#=> [["header1", "header2", "header3"],
+#    ["row1cell1", "row1cell2", "row1cell3"],
+#    ["row2cell1", "row2cell2", "row2cell3"]]
 ```
 
+Note: `parse_table` will strip/remove (leading) style attributes (e.g. `àttribute="value" |` and (inline) bold and italic emphases (e.g. `''`) from the (cell) text. Example:
+
+``` ruby
+table = Wikiscript.parse_table( <<TXT )
+{|
+|-
+! style="width:200px;"|Club
+! style="width:150px;"|City
+|-
+|[[Biu Chun Rangers]]||[[Sham Shui Po]]
+|-
+|bgcolor=#ffff44 |''[[Eastern Sports Club|Eastern]]''||[[Mong Kok]]
+|-
+|[[HKFC Soccer Section]]||[[Happy Valley, Hong Kong|Happy Valley]]
+|}
+TXT
+```
+
+resulting in:
+
+``` ruby
+pp table
+#=> [["Club", "City"],
+#    ["[[Biu Chun Rangers]]", "[[Sham Shui Po]]"],
+#    ["[[Eastern Sports Club|Eastern]]", "[[Mong Kok]]"],
+#    ["[[HKFC Soccer Section]]", "[[Happy Valley, Hong Kong|Happy Valley]]"]]
+```
+
+### Links
+
+Split links into two parts. Note: The alternate link title is optional. Example:
+
+``` ruby
+link, title = Wikiscript.parse_link( '[[La Florida, Chile|La Florida]]' )
+link #=> "La Florida, Chile"
+title #=> "La Florida"
+
+link, title = Wikiscript.parse_link( '[[ La Florida, Chile]]' )
+link #=> "La Florida, Chile"
+title #=> nil
+
+link, title = Wikiscript.parse_link( 'La Florida' )
+link #=> nil
+title #=> nil
+```
+
+### Document Element Structure
+
+Get the document's element structure.
+Note: For now only section headings (`h1`, `h2`, `h3`, ...) and tables are supported.
+Example:
+
+``` ruby
+nodes = Wikiscript.parse( <<TXT )
+=Heading 1==
+==Heading 2==
+===Heading 3===
+
+{|
+|-
+! header1
+! header2
+! header3
+|-
+| row1cell1
+| row1cell2
+| row1cell3
+|-
+| row2cell1
+| row2cell2
+| row2cell3
+|}
+TXT
+
+pp nodes
+#=> [[:h1, "Heading 1"],
+#    [:h2, "Heading 2"],
+#    [:h3, "Heading 3"],
+#    [:table, [["header1", "header2", "header3"],
+#              ["row1cell1", "row1cell2", "row1cell3"],
+#              ["row2cell1", "row2cell2", "row2cell3"]]]
+```
+
+
+That's all for now. More functionality will get added over time.
+
 
 
 ## Install
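For readers of this diff, the behavior the new README sections describe for `Wikiscript.parse_link` and `Wikiscript.parse_table` can be approximated in plain Ruby. The sketch below is not the gem's implementation: `sketch_parse_link` and `sketch_parse_table` are hypothetical names chosen for illustration, and the table version assumes simple markup and leaves out the style-attribute and emphasis stripping mentioned in the README note.

```ruby
# Hypothetical stand-ins for illustration; not the wikiscript gem's own code.

# Split '[[link|title]]' into [link, title]. The title is nil when absent,
# and both values are nil when the text is not a wiki link at all.
def sketch_parse_link(text)
  m = text.match(/\[\[([^\]|]+)(?:\|([^\]]+))?\]\]/)
  return [nil, nil] unless m

  [m[1].strip, m[2] && m[2].strip]
end

# Turn wiki table markup into an array of rows (arrays of cell strings).
# Assumption: no style attributes, captions, or nested tables.
def sketch_parse_table(text)
  rows = []
  row  = nil
  text.each_line do |raw|
    line = raw.strip
    next if line.empty? || line.start_with?('{|')   # blank line or table start
    break if line.start_with?('|}')                 # table end

    if line.start_with?('|-')                       # row separator
      rows << row if row && !row.empty?
      row = []
    elsif line.start_with?('!', '|')                # header or data cell(s)
      row ||= []
      sep   = line.start_with?('!') ? '!!' : '||'
      cells = line[1..-1].split(sep).map(&:strip)
      cells = [''] if cells.empty?                  # bare marker; text follows on the next line
      row.concat(cells)
    elsif row && !row.empty?                        # continuation of the last cell
      row[-1] = "#{row[-1]} #{line}".strip
    end
  end
  rows << row if row && !row.empty?
  rows
end
```

Both helpers accept the same input strings as the README examples, so they can be used to check the three equivalent table notations against each other.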
data/Rakefile
CHANGED
@@ -5,7 +5,7 @@ Hoe.spec 'wikiscript' do
 
   self.version = Wikiscript::VERSION
 
-  self.summary =
+  self.summary = "wikiscript - scripts for wikipedia (get wikitext for page, parse tables 'n' links, etc.)"
   self.description = summary
 
   self.urls = ['https://github.com/wikiscript/wikiscript']
data/lib/wikiscript/page.rb
CHANGED
@@ -33,11 +33,23 @@ module Wikiscript
     end
 
     def text
-      @text ||= get
+      @text ||= get # cache text (from request)
+    end
+
+    def nodes
+      @nodes ||= parse # cache text (from parse)
+    end
+
+    def each ## loop over all nodes / elements -note: nodes is a (flat) list (array) for now
+      nodes.each do |node|
+        yield( node )
+      end
     end
 
 
     def get ## "force" refresh text (get/fetch/download)
+      @nodes = nil ## note: reset cached parsed nodes too
+
       @text = Client.new.text( @title, lang: @lang )
       @text
     end
@@ -45,9 +57,9 @@ module Wikiscript
     alias_method :download, :get
 
 
-
     def parse ## todo/change: use/find a different name e.g. doc/elements/etc. - why? why not?
-      PageReader.parse( text )
+      @nodes = PageReader.parse( text )
+      @nodes
     end
   end # class Page
 end # Wikiscript
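The caching that this change introduces (`text` and `nodes` memoized with `||=`, and `get` clearing the parsed nodes) can be tried in isolation. The following is a minimal sketch under stated assumptions: `SketchPage` is a hypothetical stand-in, the "download" is simulated by copying a string passed to the constructor, and the line-based parser is a placeholder rather than `PageReader`.

```ruby
# Hypothetical stand-in for Wikiscript::Page; no network access involved.
class SketchPage
  attr_reader :downloads

  def initialize(source)
    @source    = source   # stands in for the remote wikitext
    @downloads = 0        # counts simulated downloads
  end

  def text
    @text ||= get         # fetch once, then reuse the cached text
  end

  def nodes
    @nodes ||= parse      # parse once, then reuse the cached nodes
  end

  def each(&block)
    nodes.each(&block)
  end

  def get                 # "force" refresh
    @nodes = nil          # stale nodes must not survive a fresh download
    @downloads += 1
    @text = @source.dup
  end

  def parse               # placeholder parser: one node per line
    @nodes = text.lines.map { |line| [:line, line.strip] }
  end
end

page = SketchPage.new("=A=\n=B=")
page.nodes
page.nodes
puts page.downloads       # prints 1: the second call hit both caches
```

Calling `page.get` afterwards bumps the counter to 2 and forces the next `nodes` call to parse again, which is the effect of the `@nodes = nil` line added to `get` in this diff.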
data/lib/wikiscript/version.rb
CHANGED
data/test/test_page_reader.rb
CHANGED
@@ -11,7 +11,7 @@ require 'helper'
 class TestPageReader < MiniTest::Test
 
   def test_basic
-
+    nodes = Wikiscript.parse( <<TXT )
 =Heading 1==
 ==Heading 2==
 ===Heading 3===
@@ -32,15 +32,15 @@ class TestPageReader < MiniTest::Test
 |}
 TXT
 
-    pp
+    pp nodes
 
-    assert_equal 4,
-    assert_equal [:h1, 'Heading 1'],
-    assert_equal [:h2, 'Heading 2'],
-    assert_equal [:h3, 'Heading 3'],
+    assert_equal 4, nodes.size
+    assert_equal [:h1, 'Heading 1'], nodes[0]
+    assert_equal [:h2, 'Heading 2'], nodes[1]
+    assert_equal [:h3, 'Heading 3'], nodes[2]
     assert_equal [:table, [['header1', 'header2', 'header3'],
                            ['row1cell1', 'row1cell2', 'row1cell3'],
-                           ['row2cell1', 'row2cell2', 'row2cell3']]],
+                           ['row2cell1', 'row2cell2', 'row2cell3']]], nodes[3]
   end
 
   def test_parse
@@ -65,16 +65,53 @@ TXT
 |}
 TXT
 
-
-    pp
+    nodes = page.parse
+    pp nodes
 
-    assert_equal 4,
-    assert_equal [:h1, 'Heading 1'],
-    assert_equal [:h2, 'Heading 2'],
-    assert_equal [:h3, 'Heading 3'],
+    assert_equal 4, nodes.size
+    assert_equal [:h1, 'Heading 1'], nodes[0]
+    assert_equal [:h2, 'Heading 2'], nodes[1]
+    assert_equal [:h3, 'Heading 3'], nodes[2]
     assert_equal [:table, [['header1', 'header2', 'header3'],
                            ['row1cell1', 'row1cell2', 'row1cell3'],
-                           ['row2cell1', 'row2cell2', 'row2cell3']]],
+                           ['row2cell1', 'row2cell2', 'row2cell3']]], nodes[3]
+  end
+
+  def test_each
+    page = Wikiscript::Page.new( <<TXT )
+=Heading 1==
+==Heading 2==
+===Heading 3===
+
+{|
+|-
+! header1
+! header2
+! header3
+|-
+| row1cell1
+| row1cell2
+| row1cell3
+|-
+| row2cell1
+| row2cell2
+| row2cell3
+|}
+TXT
+
+    nodes = []
+    page.each do |node|
+      nodes << node
+    end
+    pp nodes
+
+    assert_equal 4, nodes.size
+    assert_equal [:h1, 'Heading 1'], nodes[0]
+    assert_equal [:h2, 'Heading 2'], nodes[1]
+    assert_equal [:h3, 'Heading 3'], nodes[2]
+    assert_equal [:table, [['header1', 'header2', 'header3'],
+                           ['row1cell1', 'row1cell2', 'row1cell3'],
+                           ['row2cell1', 'row2cell2', 'row2cell3']]], nodes[3]
   end
 
 end # class TestPageReader
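Because the tests above pin the node list down as a flat array of `[type, data]` pairs, downstream code can walk it with ordinary array methods. As one possible use, the sketch below groups tables under the nearest preceding heading. `tables_by_heading` is a hypothetical helper, not part of the gem, and it assumes that every node type starting with `h` is a heading.

```ruby
# Hypothetical helper; operates on plain arrays shaped like the nodes in the tests.
def tables_by_heading(nodes)
  current = nil                       # most recent heading text, nil before the first one
  grouped = {}
  nodes.each do |type, data|
    if type == :table
      (grouped[current] ||= []) << data
    elsif type.to_s.start_with?('h')  # :h1, :h2, :h3, ...
      current = data
    end
  end
  grouped
end

nodes = [
  [:h1, 'Heading 1'],
  [:h2, 'Heading 2'],
  [:h3, 'Heading 3'],
  [:table, [%w[header1 header2 header3],
            %w[row1cell1 row1cell2 row1cell3],
            %w[row2cell1 row2cell2 row2cell3]]]
]

grouped = tables_by_heading(nodes)
puts grouped.keys.inspect             # prints ["Heading 3"]
```

The same loop body would work inside `page.each do |type, data| ... end`, since `each` yields the nodes one by one.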
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: wikiscript
 version: !ruby/object:Gem::Version
-  version: 0.3.
+  version: 0.3.2
 platform: ruby
 authors:
 - Gerald Bauer
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2019-09-
+date: 2019-09-22 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: logutils
@@ -66,7 +66,8 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '3.16'
-description: wikiscript - scripts for wikipedia (get wikitext for page
+description: wikiscript - scripts for wikipedia (get wikitext for page, parse tables
+  'n' links, etc.)
 email: opensport@googlegroups.com
 executables: []
 extensions: []
@@ -120,5 +121,6 @@ rubyforge_project:
 rubygems_version: 2.5.2
 signing_key:
 specification_version: 4
-summary: wikiscript - scripts for wikipedia (get wikitext for page
+summary: wikiscript - scripts for wikipedia (get wikitext for page, parse tables 'n'
+  links, etc.)
 test_files: []