wikiscript 0.3.1 → 0.4.0
Sign up to get free protection for your applications and to get access to all the features.
- checksums.yaml +5 -5
- data/CHANGELOG.md +2 -0
- data/Manifest.txt +1 -8
- data/README.md +193 -6
- data/Rakefile +5 -5
- data/lib/wikiscript/client.rb +11 -28
- data/lib/wikiscript/outline_reader.rb +97 -0
- data/lib/wikiscript/page.rb +17 -6
- data/lib/wikiscript/page_reader.rb +3 -4
- data/lib/wikiscript/table_reader.rb +2 -3
- data/lib/wikiscript/version.rb +3 -3
- data/lib/wikiscript.rb +14 -15
- metadata +25 -27
- data/LICENSE.md +0 -116
- data/NOTES.md +0 -52
- data/test/helper.rb +0 -8
- data/test/test_link.rb +0 -31
- data/test/test_page.rb +0 -54
- data/test/test_page_de.rb +0 -38
- data/test/test_page_reader.rb +0 -80
- data/test/test_table_reader.rb +0 -109
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
|
-
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
2
|
+
SHA256:
|
3
|
+
metadata.gz: 9c6ea507fa295074d82e5d81eb88828c4473b35e5aee7f04b79cfbc774bb9e72
|
4
|
+
data.tar.gz: 56a5e69536495a90e0c1635799c8023ca57fdfbc554f818d40f86046972f13fb
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 5d0bc2f5ec90d5a3c3f6c0c72c145f33a4faf1acccb28a3487059c715a8c60b5a5ed43fdd54318fe30669b1f64601a48eab0bfbba1f30da494f0b89a526cf054
|
7
|
+
data.tar.gz: ad591ed36382240029ddf9cf3e120fb938d3a253b8766cf04e0875b222b2349ee7473a1b0c851ee337afcd501d8830cea1da39d30eb78e60f9aa539d33408a59
|
data/CHANGELOG.md
CHANGED
data/Manifest.txt
CHANGED
@@ -1,18 +1,11 @@
|
|
1
1
|
CHANGELOG.md
|
2
|
-
LICENSE.md
|
3
2
|
Manifest.txt
|
4
|
-
NOTES.md
|
5
3
|
README.md
|
6
4
|
Rakefile
|
7
5
|
lib/wikiscript.rb
|
8
6
|
lib/wikiscript/client.rb
|
7
|
+
lib/wikiscript/outline_reader.rb
|
9
8
|
lib/wikiscript/page.rb
|
10
9
|
lib/wikiscript/page_reader.rb
|
11
10
|
lib/wikiscript/table_reader.rb
|
12
11
|
lib/wikiscript/version.rb
|
13
|
-
test/helper.rb
|
14
|
-
test/test_link.rb
|
15
|
-
test/test_page.rb
|
16
|
-
test/test_page_de.rb
|
17
|
-
test/test_page_reader.rb
|
18
|
-
test/test_table_reader.rb
|
data/README.md
CHANGED
@@ -12,16 +12,203 @@ Read-only access to wikikpedia pages.
|
|
12
12
|
Example - Get wikitext source (via `en.wikipedia.org/w/index.php?action=raw&title=<title>`):
|
13
13
|
|
14
14
|
|
15
|
+
``` ruby
|
16
|
+
page = Wikiscript::Page.get( '2022_FIFA_World_Cup' ) # same as Wikiscript.get
|
17
|
+
page.text
|
15
18
|
```
|
16
|
-
>> page = Wikiscript::Page.new( '2014_FIFA_World_Cup_squads' )
|
17
|
-
>> page.text
|
18
19
|
|
19
|
-
|
20
|
-
|
21
|
-
|
22
|
-
|
20
|
+
prints
|
21
|
+
|
22
|
+
```
|
23
|
+
The '''2022 FIFA World Cup''' is scheduled to be the 22nd edition of the [[FIFA World Cup]],
|
24
|
+
the quadrennial international men's [[association football]] championship contested by the
|
25
|
+
[[List of men's national association football teams|national teams]] of the member associations of [[FIFA]].
|
26
|
+
It is scheduled to take place in [[Qatar]] in 2022. This will be the first World Cup ever to be held
|
27
|
+
in the [[Arab world]] and the first in a Muslim-majority country...
|
28
|
+
```
|
29
|
+
|
30
|
+
Or build your own page from scratch (no download):
|
31
|
+
|
32
|
+
``` ruby
|
33
|
+
page = Wikiscript::Page.new( <<TXT, title: '2022_FIFA_World_Cup' )
|
34
|
+
The '''2022 FIFA World Cup''' is scheduled to be the 22nd edition of the [[FIFA World Cup]],
|
35
|
+
the quadrennial international men's [[association football]] championship contested by the
|
36
|
+
[[List of men's national association football teams|national teams]] of the member associations of [[FIFA]].
|
37
|
+
It is scheduled to take place in [[Qatar]] in 2022. This will be the first World Cup ever to be held
|
38
|
+
in the [[Arab world]] and the first in a Muslim-majority country...
|
39
|
+
TXT
|
40
|
+
page.text
|
41
|
+
```
|
42
|
+
|
43
|
+
prints
|
44
|
+
|
45
|
+
```
|
46
|
+
The '''2022 FIFA World Cup''' is scheduled to be the 22nd edition of the [[FIFA World Cup]],
|
47
|
+
the quadrennial international men's [[association football]] championship contested by the
|
48
|
+
[[List of men's national association football teams|national teams]] of the member associations of [[FIFA]].
|
49
|
+
It is scheduled to take place in [[Qatar]] in 2022. This will be the first World Cup ever to be held
|
50
|
+
in the [[Arab world]] and the first in a Muslim-majority country...
|
51
|
+
```
|
52
|
+
|
53
|
+
|
54
|
+
### Tables
|
55
|
+
|
56
|
+
Parse wiki tables into an array. Example:
|
57
|
+
|
58
|
+
``` ruby
|
59
|
+
table = Wikiscript.parse_table( <<TXT )
|
60
|
+
{|
|
61
|
+
|-
|
62
|
+
! header1
|
63
|
+
! header2
|
64
|
+
! header3
|
65
|
+
|-
|
66
|
+
| row1cell1
|
67
|
+
| row1cell2
|
68
|
+
| row1cell3
|
69
|
+
|-
|
70
|
+
| row2cell1
|
71
|
+
| row2cell2
|
72
|
+
| row2cell3
|
73
|
+
|}
|
74
|
+
TXT
|
75
|
+
|
76
|
+
# -or-
|
77
|
+
|
78
|
+
table = Wikiscript.parse_table( <<TXT )
|
79
|
+
{|
|
80
|
+
! header1 !! header2 !! header3
|
81
|
+
|-
|
82
|
+
| row1cell1 || row1cell2 || row1cell3
|
83
|
+
|-
|
84
|
+
| row2cell1 || row2cell2 || row2cell3
|
85
|
+
|}
|
86
|
+
TXT
|
87
|
+
|
88
|
+
# -or-
|
89
|
+
|
90
|
+
table = Wikiscript.parse_table( <<TXT )
|
91
|
+
{|
|
92
|
+
|-
|
93
|
+
!
|
94
|
+
header1
|
95
|
+
!
|
96
|
+
header2
|
97
|
+
!
|
98
|
+
header3
|
99
|
+
|-
|
100
|
+
|
|
101
|
+
row1cell1
|
102
|
+
|
|
103
|
+
row1cell2
|
104
|
+
|
|
105
|
+
row1cell3
|
106
|
+
|-
|
107
|
+
|
|
108
|
+
row2cell1
|
109
|
+
|
|
110
|
+
row2cell2
|
111
|
+
|
|
112
|
+
row2cell3
|
113
|
+
|}
|
114
|
+
TXT
|
115
|
+
```
|
116
|
+
|
117
|
+
resulting in:
|
118
|
+
|
119
|
+
``` ruby
|
120
|
+
pp table
|
121
|
+
#=> [["header1", "header2", "header3"],
|
122
|
+
# ["row1cell1", "row1cell2", "row1cell3"],
|
123
|
+
# ["row2cell1", "row2cell2", "row2cell3"]]
|
23
124
|
```
|
24
125
|
|
126
|
+
Note: `parse_table` will strip/remove (leading) style attributes (e.g. `àttribute="value" |` and (inline) bold and italic emphases (e.g. `''`) from the (cell) text. Example:
|
127
|
+
|
128
|
+
``` ruby
|
129
|
+
table = Wikiscript.parse_table( <<TXT )
|
130
|
+
{|
|
131
|
+
|-
|
132
|
+
! style="width:200px;"|Club
|
133
|
+
! style="width:150px;"|City
|
134
|
+
|-
|
135
|
+
|[[Biu Chun Rangers]]||[[Sham Shui Po]]
|
136
|
+
|-
|
137
|
+
|bgcolor=#ffff44 |''[[Eastern Sports Club|Eastern]]''||[[Mong Kok]]
|
138
|
+
|-
|
139
|
+
|[[HKFC Soccer Section]]||[[Happy Valley, Hong Kong|Happy Valley]]
|
140
|
+
|}
|
141
|
+
TXT
|
142
|
+
```
|
143
|
+
|
144
|
+
resulting in:
|
145
|
+
|
146
|
+
``` ruby
|
147
|
+
pp table
|
148
|
+
#=> [["Club", "City"],
|
149
|
+
# ["[[Biu Chun Rangers]]", "[[Sham Shui Po]]"],
|
150
|
+
# ["[[Eastern Sports Club|Eastern]]", "[[Mong Kok]]"],
|
151
|
+
# ["[[HKFC Soccer Section]]", "[[Happy Valley, Hong Kong|Happy Valley]]"]]
|
152
|
+
```
|
153
|
+
|
154
|
+
### Links
|
155
|
+
|
156
|
+
Split links into two parts. Note: The alternate link title is optional. Example:
|
157
|
+
|
158
|
+
``` ruby
|
159
|
+
link, title = Wikiscript.parse_link( '[[La Florida, Chile|La Florida]]' )
|
160
|
+
link #=> "La Florida, Chile"
|
161
|
+
title #=> "La Florida"
|
162
|
+
|
163
|
+
link, title = Wikiscript.parse_link( '[[ La Florida, Chile]]' )
|
164
|
+
link #=> "La Florida, Chile"
|
165
|
+
title #=> nil
|
166
|
+
|
167
|
+
link, title = Wikiscript.parse_link( 'La Florida' )
|
168
|
+
link #=> nil
|
169
|
+
title #=> nil
|
170
|
+
```
|
171
|
+
|
172
|
+
### Document Element Structure
|
173
|
+
|
174
|
+
Get the document's element structure.
|
175
|
+
Note: For now only section headings (`h1`, `h2`, `h3`, ...) and tables are supported.
|
176
|
+
Example:
|
177
|
+
|
178
|
+
``` ruby
|
179
|
+
nodes = Wikiscript.parse( <<TXT )
|
180
|
+
=Heading 1==
|
181
|
+
==Heading 2==
|
182
|
+
===Heading 3===
|
183
|
+
|
184
|
+
{|
|
185
|
+
|-
|
186
|
+
! header1
|
187
|
+
! header2
|
188
|
+
! header3
|
189
|
+
|-
|
190
|
+
| row1cell1
|
191
|
+
| row1cell2
|
192
|
+
| row1cell3
|
193
|
+
|-
|
194
|
+
| row2cell1
|
195
|
+
| row2cell2
|
196
|
+
| row2cell3
|
197
|
+
|}
|
198
|
+
TXT
|
199
|
+
|
200
|
+
pp nodes
|
201
|
+
#=> [[:h1, "Heading 1"],
|
202
|
+
# [:h2, "Heading 2"],
|
203
|
+
# [:h3, "Heading 3"],
|
204
|
+
# [:table, [["header1", "header2", "header3"],
|
205
|
+
# ["row1cell1", "row1cell2", "row1cell3"],
|
206
|
+
# ["row2cell1", "row2cell2", "row2cell3"]]]
|
207
|
+
```
|
208
|
+
|
209
|
+
|
210
|
+
That's all for now. More functionality will get added over time.
|
211
|
+
|
25
212
|
|
26
213
|
|
27
214
|
## Install
|
data/Rakefile
CHANGED
@@ -5,26 +5,26 @@ Hoe.spec 'wikiscript' do
|
|
5
5
|
|
6
6
|
self.version = Wikiscript::VERSION
|
7
7
|
|
8
|
-
self.summary =
|
8
|
+
self.summary = "wikiscript - scripts for wikipedia (get wikitext for page, parse tables 'n' links, etc.)"
|
9
9
|
self.description = summary
|
10
10
|
|
11
|
-
self.urls =
|
11
|
+
self.urls = { home: 'https://github.com/wikiscript/wikiscript' }
|
12
12
|
|
13
13
|
self.author = 'Gerald Bauer'
|
14
|
-
self.email = '
|
14
|
+
self.email = 'gerald.bauer@gmail.com'
|
15
15
|
|
16
16
|
# switch extension to .markdown for gihub formatting
|
17
17
|
self.readme_file = 'README.md'
|
18
18
|
self.history_file = 'CHANGELOG.md'
|
19
19
|
|
20
20
|
self.extra_deps = [
|
21
|
+
['cocos'],
|
21
22
|
['logutils' ],
|
22
|
-
['fetcher']
|
23
23
|
]
|
24
24
|
|
25
25
|
self.licenses = ['Public Domain']
|
26
26
|
|
27
27
|
self.spec_extras = {
|
28
|
-
required_ruby_version: '>=
|
28
|
+
required_ruby_version: '>= 3.1.0'
|
29
29
|
}
|
30
30
|
end
|
data/lib/wikiscript/client.rb
CHANGED
@@ -1,19 +1,13 @@
|
|
1
|
-
# encoding: utf-8
|
2
1
|
|
3
2
|
module Wikiscript
|
4
3
|
|
5
4
|
class Client
|
5
|
+
include Logging
|
6
6
|
|
7
|
-
|
8
|
-
|
9
|
-
SITE_BASE = 'http://{lang}.wikipedia.org/w/index.php'
|
7
|
+
SITE_BASE = 'https://{lang}.wikipedia.org/w/index.php'
|
10
8
|
|
11
9
|
### API_BASE = 'http://en.wikipedia.org/w/api.php'
|
12
10
|
|
13
|
-
def initialize( opts={} )
|
14
|
-
@opts = opts
|
15
|
-
@worker = Fetcher::Worker.new
|
16
|
-
end
|
17
11
|
|
18
12
|
## change to: wikitext or raw why? why not? or to raw? why? why not?
|
19
13
|
def text( title, lang: Wikiscript.lang )
|
@@ -24,7 +18,7 @@ module Wikiscript
|
|
24
18
|
# becomes
|
25
19
|
# http://en.wikipedia.org/w/index.php or
|
26
20
|
# http://de.wikipedia.org/w/index.php etc
|
27
|
-
base_url = SITE_BASE.
|
21
|
+
base_url = SITE_BASE.sub( "{lang}", lang )
|
28
22
|
params = { action: 'raw',
|
29
23
|
title: title }
|
30
24
|
|
@@ -33,6 +27,10 @@ module Wikiscript
|
|
33
27
|
|
34
28
|
private
|
35
29
|
def build_query( h )
|
30
|
+
|
31
|
+
## todo/fix - check what to use for params encode
|
32
|
+
## e.g. escape_component or such?
|
33
|
+
## fix add params upstream to weblclient - why? why not?
|
36
34
|
h.map do |k,v|
|
37
35
|
"#{CGI.escape(k.to_s)}=#{CGI.escape(v.to_s)}"
|
38
36
|
end.join( '&' )
|
@@ -48,27 +46,12 @@ private
|
|
48
46
|
## fix: pass in uri (add to fetcher check for is_a? URI etc.)
|
49
47
|
uri_string = "#{base_url}?#{query}"
|
50
48
|
|
51
|
-
response =
|
52
|
-
|
53
|
-
if response.code == '200'
|
54
|
-
t = response.body
|
55
|
-
###
|
56
|
-
# NB: Net::HTTP will NOT set encoding UTF-8 etc.
|
57
|
-
# will mostly be ASCII
|
58
|
-
# - try to change encoding to UTF-8 ourselves
|
59
|
-
logger.debug "t.encoding.name (before): #{t.encoding.name}"
|
60
|
-
#####
|
61
|
-
# NB: ASCII-8BIT == BINARY == Encoding Unknown; Raw Bytes Here
|
49
|
+
response = Webclient.get( uri_string )
|
62
50
|
|
63
|
-
|
64
|
-
|
65
|
-
# - note: force_encoding will NOT change the chars only change the assumed encoding w/o translation
|
66
|
-
t = t.force_encoding( Encoding::UTF_8 )
|
67
|
-
logger.debug "t.encoding.name (after): #{t.encoding.name}"
|
68
|
-
## pp t
|
69
|
-
t
|
51
|
+
if response.status.ok?
|
52
|
+
response.text
|
70
53
|
else
|
71
|
-
logger.error "
|
54
|
+
logger.error "HTTP ERROR - #{response.status.code} #{response.status.message}"
|
72
55
|
exit 1 ### exit for now on error - why? why not?
|
73
56
|
## nil
|
74
57
|
end
|
@@ -0,0 +1,97 @@
|
|
1
|
+
|
2
|
+
module Wikiscript
|
3
|
+
|
4
|
+
class OutlineReader
|
5
|
+
|
6
|
+
def self.read( path )
|
7
|
+
txt = File.open( path, 'r:utf-8' ) { |f| f.read }
|
8
|
+
parse( txt )
|
9
|
+
end
|
10
|
+
|
11
|
+
def self.parse( txt )
|
12
|
+
new( txt ).parse
|
13
|
+
end
|
14
|
+
|
15
|
+
def initialize( txt )
|
16
|
+
@txt = txt
|
17
|
+
end
|
18
|
+
|
19
|
+
|
20
|
+
HEADING_RE = %r{\A
|
21
|
+
(?<marker>={1,}) ## 1. leading ======
|
22
|
+
[ ]*
|
23
|
+
(?<text>[^=]+?) ## 2. text (note: for now no "inline" = allowed)
|
24
|
+
[ ]*
|
25
|
+
=* ## 3. (optional) trailing ====
|
26
|
+
\z}x
|
27
|
+
|
28
|
+
def parse
|
29
|
+
outline = [] ## outline / page structure
|
30
|
+
|
31
|
+
start_para = true ## start new para(graph) on new text line?
|
32
|
+
|
33
|
+
@txt.each_line do |line|
|
34
|
+
|
35
|
+
##
|
36
|
+
## (auto-)sanitize first
|
37
|
+
## - => "vanilla" space
|
38
|
+
## - 1–2 => 1-2 - "vanilla" dash
|
39
|
+
## todo - move up into txt!!!
|
40
|
+
line = line.gsub( ' ', ' ' )
|
41
|
+
line = line.gsub( /[–]/, '-' )
|
42
|
+
|
43
|
+
|
44
|
+
line = line.strip ## todo/fix: keep leading and trailing spaces - why? why not?
|
45
|
+
|
46
|
+
if line.empty? ## todo/fix: keep blank line nodes?? and just remove comments and process headings?! - why? why not?
|
47
|
+
start_para = true
|
48
|
+
next
|
49
|
+
end
|
50
|
+
|
51
|
+
break if line == '__END__'
|
52
|
+
|
53
|
+
next if line.start_with?( '#' ) ## skip comments too
|
54
|
+
## strip inline (until end-of-line) comments too
|
55
|
+
## e.g Eupen | KAS Eupen ## [de]
|
56
|
+
## => Eupen | KAS Eupen
|
57
|
+
## e.g bq Bonaire, BOE # CONCACAF
|
58
|
+
## => bq Bonaire, BOE
|
59
|
+
line = line.sub( /#.*/, '' ).strip
|
60
|
+
|
61
|
+
## note: like in wikimedia markup (and markdown) all optional trailing ==== too
|
62
|
+
if m=HEADING_RE.match( line )
|
63
|
+
start_para = true
|
64
|
+
|
65
|
+
heading_marker = m[:marker]
|
66
|
+
heading_level = heading_marker.length ## count number of = for heading level
|
67
|
+
heading = m[:text].strip
|
68
|
+
|
69
|
+
outline << [:"h#{heading_level}", heading]
|
70
|
+
elsif line == '----' ## make more generic/flexible - why? why not?
|
71
|
+
start_para = true
|
72
|
+
outline << [:hr]
|
73
|
+
## The horizontal rule represents a paragraph-level thematic break. Do not use in article content,
|
74
|
+
## as rules are used only after main sections, and this is automatic.
|
75
|
+
## HTML equivalent: <hr /> (which can be indented,
|
76
|
+
## whereas ---- always starts at the left margin.)
|
77
|
+
else ## assume it's a (plain/regular) text line
|
78
|
+
if start_para
|
79
|
+
outline << [:p, [line]]
|
80
|
+
start_para = false
|
81
|
+
else
|
82
|
+
node = outline[-1] ## get last entry
|
83
|
+
if node[0] == :p ## assert it's a p(aragraph) node!!!
|
84
|
+
node[1] << line ## add line to p(aragraph)
|
85
|
+
else
|
86
|
+
puts "!! ERROR - invalid outline state / format - expected p(aragraph) node; got:"
|
87
|
+
pp node
|
88
|
+
exit 1
|
89
|
+
end
|
90
|
+
end
|
91
|
+
end
|
92
|
+
end
|
93
|
+
outline
|
94
|
+
end # method parse
|
95
|
+
end # class OutlineReader
|
96
|
+
|
97
|
+
end # module Wikiscript
|
data/lib/wikiscript/page.rb
CHANGED
@@ -1,10 +1,9 @@
|
|
1
|
-
# encoding: utf-8
|
2
1
|
|
3
2
|
module Wikiscript
|
4
3
|
|
5
4
|
class Page
|
6
5
|
|
7
|
-
include
|
6
|
+
include Logging
|
8
7
|
|
9
8
|
attr_reader :title, :lang
|
10
9
|
|
@@ -16,7 +15,7 @@ module Wikiscript
|
|
16
15
|
end
|
17
16
|
|
18
17
|
def self.read( path )
|
19
|
-
text = File.open( path, 'r:utf-8' ).read
|
18
|
+
text = File.open( path, 'r:utf-8' ) { |f| f.read }
|
20
19
|
o = new( text, title: "File:#{path}" ) ## use auto-generated File:<path> title path - why? why not?
|
21
20
|
o
|
22
21
|
end
|
@@ -33,11 +32,23 @@ module Wikiscript
|
|
33
32
|
end
|
34
33
|
|
35
34
|
def text
|
36
|
-
@text ||= get
|
35
|
+
@text ||= get # cache text (from request)
|
36
|
+
end
|
37
|
+
|
38
|
+
def nodes
|
39
|
+
@nodes ||= parse # cache text (from parse)
|
40
|
+
end
|
41
|
+
|
42
|
+
def each ## loop over all nodes / elements -note: nodes is a (flat) list (array) for now
|
43
|
+
nodes.each do |node|
|
44
|
+
yield( node )
|
45
|
+
end
|
37
46
|
end
|
38
47
|
|
39
48
|
|
40
49
|
def get ## "force" refresh text (get/fetch/download)
|
50
|
+
@nodes = nil ## note: reset cached parsed nodes too
|
51
|
+
|
41
52
|
@text = Client.new.text( @title, lang: @lang )
|
42
53
|
@text
|
43
54
|
end
|
@@ -45,9 +56,9 @@ module Wikiscript
|
|
45
56
|
alias_method :download, :get
|
46
57
|
|
47
58
|
|
48
|
-
|
49
59
|
def parse ## todo/change: use/find a different name e.g. doc/elements/etc. - why? why not?
|
50
|
-
PageReader.parse( text )
|
60
|
+
@nodes = PageReader.parse( text )
|
61
|
+
@nodes
|
51
62
|
end
|
52
63
|
end # class Page
|
53
64
|
end # Wikiscript
|
@@ -1,11 +1,10 @@
|
|
1
|
-
# encoding: utf-8
|
2
1
|
|
3
2
|
module Wikiscript
|
4
3
|
|
5
4
|
class PageReader
|
6
5
|
|
7
|
-
def self.read( path )
|
8
|
-
txt = File.open( path, 'r:utf-8' ).read
|
6
|
+
def self.read( path )
|
7
|
+
txt = File.open( path, 'r:utf-8' ) { |f| f.read }
|
9
8
|
parse( txt )
|
10
9
|
end
|
11
10
|
|
@@ -54,7 +53,7 @@ class PageReader
|
|
54
53
|
table_txt << line << "\n"
|
55
54
|
else
|
56
55
|
## note: skip unknown line types for now
|
57
|
-
|
56
|
+
|
58
57
|
## puts "** !!! ERROR !!! unknown line type in wiki page:"
|
59
58
|
## pp line
|
60
59
|
## exit 1
|
@@ -1,11 +1,10 @@
|
|
1
|
-
# encoding: utf-8
|
2
1
|
|
3
2
|
module Wikiscript
|
4
3
|
|
5
4
|
class TableReader
|
6
5
|
|
7
|
-
def self.read( path )
|
8
|
-
txt = File.open( path, 'r:utf-8' ).read
|
6
|
+
def self.read( path )
|
7
|
+
txt = File.open( path, 'r:utf-8' ) { |f| f.read }
|
9
8
|
parse( txt )
|
10
9
|
end
|
11
10
|
|
data/lib/wikiscript/version.rb
CHANGED
@@ -1,12 +1,12 @@
|
|
1
1
|
|
2
2
|
module Wikiscript
|
3
|
-
VERSION = '0.
|
3
|
+
VERSION = '0.4.0'
|
4
4
|
|
5
5
|
def self.banner
|
6
|
-
"wikiscript/#{VERSION} on Ruby #{RUBY_VERSION} (#{RUBY_RELEASE_DATE}) [#{RUBY_PLATFORM}]"
|
6
|
+
"wikiscript/#{VERSION} on Ruby #{RUBY_VERSION} (#{RUBY_RELEASE_DATE}) [#{RUBY_PLATFORM}] in (#{root})"
|
7
7
|
end
|
8
8
|
|
9
9
|
def self.root
|
10
|
-
|
10
|
+
File.expand_path( File.dirname(File.dirname(File.dirname(__FILE__))) )
|
11
11
|
end
|
12
12
|
end
|
data/lib/wikiscript.rb
CHANGED
@@ -1,26 +1,25 @@
|
|
1
|
-
|
2
|
-
|
3
|
-
## stdlibs
|
4
|
-
|
5
|
-
require 'net/http'
|
6
|
-
require 'uri'
|
7
|
-
require 'cgi'
|
8
|
-
require 'pp'
|
1
|
+
## stdlibs via cocos
|
2
|
+
require 'cocos'
|
9
3
|
|
10
4
|
|
11
5
|
## 3rd party gems/libs
|
12
6
|
## require 'props'
|
13
7
|
|
14
8
|
require 'logutils'
|
15
|
-
require 'fetcher'
|
16
9
|
|
17
|
-
|
10
|
+
module Wikiscript
|
11
|
+
Logging = LogUtils::Logging
|
12
|
+
end
|
18
13
|
|
19
|
-
|
20
|
-
|
21
|
-
|
22
|
-
|
23
|
-
|
14
|
+
|
15
|
+
|
16
|
+
# our own code
|
17
|
+
require_relative 'wikiscript/version' # let it always go first
|
18
|
+
require_relative 'wikiscript/client'
|
19
|
+
require_relative 'wikiscript/table_reader'
|
20
|
+
require_relative 'wikiscript/page_reader'
|
21
|
+
require_relative 'wikiscript/outline_reader'
|
22
|
+
require_relative 'wikiscript/page'
|
24
23
|
|
25
24
|
|
26
25
|
|
metadata
CHANGED
@@ -1,17 +1,17 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: wikiscript
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.
|
4
|
+
version: 0.4.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Gerald Bauer
|
8
|
-
autorequire:
|
8
|
+
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date:
|
11
|
+
date: 2024-09-01 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
|
-
name:
|
14
|
+
name: cocos
|
15
15
|
requirement: !ruby/object:Gem::Requirement
|
16
16
|
requirements:
|
17
17
|
- - ">="
|
@@ -25,7 +25,7 @@ dependencies:
|
|
25
25
|
- !ruby/object:Gem::Version
|
26
26
|
version: '0'
|
27
27
|
- !ruby/object:Gem::Dependency
|
28
|
-
name:
|
28
|
+
name: logutils
|
29
29
|
requirement: !ruby/object:Gem::Requirement
|
30
30
|
requirements:
|
31
31
|
- - ">="
|
@@ -42,64 +42,62 @@ dependencies:
|
|
42
42
|
name: rdoc
|
43
43
|
requirement: !ruby/object:Gem::Requirement
|
44
44
|
requirements:
|
45
|
-
- - "
|
45
|
+
- - ">="
|
46
46
|
- !ruby/object:Gem::Version
|
47
47
|
version: '4.0'
|
48
|
+
- - "<"
|
49
|
+
- !ruby/object:Gem::Version
|
50
|
+
version: '7'
|
48
51
|
type: :development
|
49
52
|
prerelease: false
|
50
53
|
version_requirements: !ruby/object:Gem::Requirement
|
51
54
|
requirements:
|
52
|
-
- - "
|
55
|
+
- - ">="
|
53
56
|
- !ruby/object:Gem::Version
|
54
57
|
version: '4.0'
|
58
|
+
- - "<"
|
59
|
+
- !ruby/object:Gem::Version
|
60
|
+
version: '7'
|
55
61
|
- !ruby/object:Gem::Dependency
|
56
62
|
name: hoe
|
57
63
|
requirement: !ruby/object:Gem::Requirement
|
58
64
|
requirements:
|
59
65
|
- - "~>"
|
60
66
|
- !ruby/object:Gem::Version
|
61
|
-
version: '
|
67
|
+
version: '4.1'
|
62
68
|
type: :development
|
63
69
|
prerelease: false
|
64
70
|
version_requirements: !ruby/object:Gem::Requirement
|
65
71
|
requirements:
|
66
72
|
- - "~>"
|
67
73
|
- !ruby/object:Gem::Version
|
68
|
-
version: '
|
69
|
-
description: wikiscript - scripts for wikipedia (get wikitext for page
|
70
|
-
|
74
|
+
version: '4.1'
|
75
|
+
description: wikiscript - scripts for wikipedia (get wikitext for page, parse tables
|
76
|
+
'n' links, etc.)
|
77
|
+
email: gerald.bauer@gmail.com
|
71
78
|
executables: []
|
72
79
|
extensions: []
|
73
80
|
extra_rdoc_files:
|
74
81
|
- CHANGELOG.md
|
75
|
-
- LICENSE.md
|
76
82
|
- Manifest.txt
|
77
|
-
- NOTES.md
|
78
83
|
- README.md
|
79
84
|
files:
|
80
85
|
- CHANGELOG.md
|
81
|
-
- LICENSE.md
|
82
86
|
- Manifest.txt
|
83
|
-
- NOTES.md
|
84
87
|
- README.md
|
85
88
|
- Rakefile
|
86
89
|
- lib/wikiscript.rb
|
87
90
|
- lib/wikiscript/client.rb
|
91
|
+
- lib/wikiscript/outline_reader.rb
|
88
92
|
- lib/wikiscript/page.rb
|
89
93
|
- lib/wikiscript/page_reader.rb
|
90
94
|
- lib/wikiscript/table_reader.rb
|
91
95
|
- lib/wikiscript/version.rb
|
92
|
-
- test/helper.rb
|
93
|
-
- test/test_link.rb
|
94
|
-
- test/test_page.rb
|
95
|
-
- test/test_page_de.rb
|
96
|
-
- test/test_page_reader.rb
|
97
|
-
- test/test_table_reader.rb
|
98
96
|
homepage: https://github.com/wikiscript/wikiscript
|
99
97
|
licenses:
|
100
98
|
- Public Domain
|
101
99
|
metadata: {}
|
102
|
-
post_install_message:
|
100
|
+
post_install_message:
|
103
101
|
rdoc_options:
|
104
102
|
- "--main"
|
105
103
|
- README.md
|
@@ -109,16 +107,16 @@ required_ruby_version: !ruby/object:Gem::Requirement
|
|
109
107
|
requirements:
|
110
108
|
- - ">="
|
111
109
|
- !ruby/object:Gem::Version
|
112
|
-
version:
|
110
|
+
version: 3.1.0
|
113
111
|
required_rubygems_version: !ruby/object:Gem::Requirement
|
114
112
|
requirements:
|
115
113
|
- - ">="
|
116
114
|
- !ruby/object:Gem::Version
|
117
115
|
version: '0'
|
118
116
|
requirements: []
|
119
|
-
|
120
|
-
|
121
|
-
signing_key:
|
117
|
+
rubygems_version: 3.4.10
|
118
|
+
signing_key:
|
122
119
|
specification_version: 4
|
123
|
-
summary: wikiscript - scripts for wikipedia (get wikitext for page
|
120
|
+
summary: wikiscript - scripts for wikipedia (get wikitext for page, parse tables 'n'
|
121
|
+
links, etc.)
|
124
122
|
test_files: []
|
data/LICENSE.md
DELETED
@@ -1,116 +0,0 @@
|
|
1
|
-
CC0 1.0 Universal
|
2
|
-
|
3
|
-
Statement of Purpose
|
4
|
-
|
5
|
-
The laws of most jurisdictions throughout the world automatically confer
|
6
|
-
exclusive Copyright and Related Rights (defined below) upon the creator and
|
7
|
-
subsequent owner(s) (each and all, an "owner") of an original work of
|
8
|
-
authorship and/or a database (each, a "Work").
|
9
|
-
|
10
|
-
Certain owners wish to permanently relinquish those rights to a Work for the
|
11
|
-
purpose of contributing to a commons of creative, cultural and scientific
|
12
|
-
works ("Commons") that the public can reliably and without fear of later
|
13
|
-
claims of infringement build upon, modify, incorporate in other works, reuse
|
14
|
-
and redistribute as freely as possible in any form whatsoever and for any
|
15
|
-
purposes, including without limitation commercial purposes. These owners may
|
16
|
-
contribute to the Commons to promote the ideal of a free culture and the
|
17
|
-
further production of creative, cultural and scientific works, or to gain
|
18
|
-
reputation or greater distribution for their Work in part through the use and
|
19
|
-
efforts of others.
|
20
|
-
|
21
|
-
For these and/or other purposes and motivations, and without any expectation
|
22
|
-
of additional consideration or compensation, the person associating CC0 with a
|
23
|
-
Work (the "Affirmer"), to the extent that he or she is an owner of Copyright
|
24
|
-
and Related Rights in the Work, voluntarily elects to apply CC0 to the Work
|
25
|
-
and publicly distribute the Work under its terms, with knowledge of his or her
|
26
|
-
Copyright and Related Rights in the Work and the meaning and intended legal
|
27
|
-
effect of CC0 on those rights.
|
28
|
-
|
29
|
-
1. Copyright and Related Rights. A Work made available under CC0 may be
|
30
|
-
protected by copyright and related or neighboring rights ("Copyright and
|
31
|
-
Related Rights"). Copyright and Related Rights include, but are not limited
|
32
|
-
to, the following:
|
33
|
-
|
34
|
-
i. the right to reproduce, adapt, distribute, perform, display, communicate,
|
35
|
-
and translate a Work;
|
36
|
-
|
37
|
-
ii. moral rights retained by the original author(s) and/or performer(s);
|
38
|
-
|
39
|
-
iii. publicity and privacy rights pertaining to a person's image or likeness
|
40
|
-
depicted in a Work;
|
41
|
-
|
42
|
-
iv. rights protecting against unfair competition in regards to a Work,
|
43
|
-
subject to the limitations in paragraph 4(a), below;
|
44
|
-
|
45
|
-
v. rights protecting the extraction, dissemination, use and reuse of data in
|
46
|
-
a Work;
|
47
|
-
|
48
|
-
vi. database rights (such as those arising under Directive 96/9/EC of the
|
49
|
-
European Parliament and of the Council of 11 March 1996 on the legal
|
50
|
-
protection of databases, and under any national implementation thereof,
|
51
|
-
including any amended or successor version of such directive); and
|
52
|
-
|
53
|
-
vii. other similar, equivalent or corresponding rights throughout the world
|
54
|
-
based on applicable law or treaty, and any national implementations thereof.
|
55
|
-
|
56
|
-
2. Waiver. To the greatest extent permitted by, but not in contravention of,
|
57
|
-
applicable law, Affirmer hereby overtly, fully, permanently, irrevocably and
|
58
|
-
unconditionally waives, abandons, and surrenders all of Affirmer's Copyright
|
59
|
-
and Related Rights and associated claims and causes of action, whether now
|
60
|
-
known or unknown (including existing as well as future claims and causes of
|
61
|
-
action), in the Work (i) in all territories worldwide, (ii) for the maximum
|
62
|
-
duration provided by applicable law or treaty (including future time
|
63
|
-
extensions), (iii) in any current or future medium and for any number of
|
64
|
-
copies, and (iv) for any purpose whatsoever, including without limitation
|
65
|
-
commercial, advertising or promotional purposes (the "Waiver"). Affirmer makes
|
66
|
-
the Waiver for the benefit of each member of the public at large and to the
|
67
|
-
detriment of Affirmer's heirs and successors, fully intending that such Waiver
|
68
|
-
shall not be subject to revocation, rescission, cancellation, termination, or
|
69
|
-
any other legal or equitable action to disrupt the quiet enjoyment of the Work
|
70
|
-
by the public as contemplated by Affirmer's express Statement of Purpose.
|
71
|
-
|
72
|
-
3. Public License Fallback. Should any part of the Waiver for any reason be
|
73
|
-
judged legally invalid or ineffective under applicable law, then the Waiver
|
74
|
-
shall be preserved to the maximum extent permitted taking into account
|
75
|
-
Affirmer's express Statement of Purpose. In addition, to the extent the Waiver
|
76
|
-
is so judged Affirmer hereby grants to each affected person a royalty-free,
|
77
|
-
non transferable, non sublicensable, non exclusive, irrevocable and
|
78
|
-
unconditional license to exercise Affirmer's Copyright and Related Rights in
|
79
|
-
the Work (i) in all territories worldwide, (ii) for the maximum duration
|
80
|
-
provided by applicable law or treaty (including future time extensions), (iii)
|
81
|
-
in any current or future medium and for any number of copies, and (iv) for any
|
82
|
-
purpose whatsoever, including without limitation commercial, advertising or
|
83
|
-
promotional purposes (the "License"). The License shall be deemed effective as
|
84
|
-
of the date CC0 was applied by Affirmer to the Work. Should any part of the
|
85
|
-
License for any reason be judged legally invalid or ineffective under
|
86
|
-
applicable law, such partial invalidity or ineffectiveness shall not
|
87
|
-
invalidate the remainder of the License, and in such case Affirmer hereby
|
88
|
-
affirms that he or she will not (i) exercise any of his or her remaining
|
89
|
-
Copyright and Related Rights in the Work or (ii) assert any associated claims
|
90
|
-
and causes of action with respect to the Work, in either case contrary to
|
91
|
-
Affirmer's express Statement of Purpose.
|
92
|
-
|
93
|
-
4. Limitations and Disclaimers.
|
94
|
-
|
95
|
-
a. No trademark or patent rights held by Affirmer are waived, abandoned,
|
96
|
-
surrendered, licensed or otherwise affected by this document.
|
97
|
-
|
98
|
-
b. Affirmer offers the Work as-is and makes no representations or warranties
|
99
|
-
of any kind concerning the Work, express, implied, statutory or otherwise,
|
100
|
-
including without limitation warranties of title, merchantability, fitness
|
101
|
-
for a particular purpose, non infringement, or the absence of latent or
|
102
|
-
other defects, accuracy, or the present or absence of errors, whether or not
|
103
|
-
discoverable, all to the greatest extent permissible under applicable law.
|
104
|
-
|
105
|
-
c. Affirmer disclaims responsibility for clearing rights of other persons
|
106
|
-
that may apply to the Work or any use thereof, including without limitation
|
107
|
-
any person's Copyright and Related Rights in the Work. Further, Affirmer
|
108
|
-
disclaims responsibility for obtaining any necessary consents, permissions
|
109
|
-
or other rights required for any use of the Work.
|
110
|
-
|
111
|
-
d. Affirmer understands and acknowledges that Creative Commons is not a
|
112
|
-
party to this document and has no duty or obligation with respect to this
|
113
|
-
CC0 or use of the Work.
|
114
|
-
|
115
|
-
For more information, please see
|
116
|
-
<http://creativecommons.org/publicdomain/zero/1.0/>
|
data/NOTES.md
DELETED
@@ -1,52 +0,0 @@
|
|
1
|
-
# Notes
|
2
|
-
|
3
|
-
|
4
|
-
## Alternatives
|
5
|
-
|
6
|
-
Wikipedia
|
7
|
-
|
8
|
-
- [wikipedia-client](https://rubygems.org/gems/wikipedia-client) by Ken Pratt et al - ruby client for the Wikipedia API
|
9
|
-
- <https://github.com/kenpratt/wikipedia-client>
|
10
|
-
- <https://www.rubydoc.info/gems/wikipedia-client>
|
11
|
-
|
12
|
-
<!-- break -->
|
13
|
-
|
14
|
-
- [infoboxer](https://rubygems.org/gems/infoboxer) by Victor Shepelev et al - pure-Ruby Wikipedia (and generic MediaWiki) client and parser, targeting information extraction
|
15
|
-
- <https://github.com/molybdenum-99/infoboxer>
|
16
|
-
- <https://www.rubydoc.info/gems/infoboxer>
|
17
|
-
|
18
|
-
<!-- break -->
|
19
|
-
|
20
|
-
More
|
21
|
-
|
22
|
-
- <https://github.com/molybdenum-99/reality>
|
23
|
-
- https://github.com/molybdenum-99/mediawiktory
|
24
|
-
|
25
|
-
|
26
|
-
Wikidata
|
27
|
-
|
28
|
-
- [wikidata](https://rubygems.org/gems/wikidata) by Wil Gieseler
|
29
|
-
- <https://github.com/wilg/wikidata>
|
30
|
-
- <https://www.rubydoc.info/gems/wikidata>
|
31
|
-
|
32
|
-
<!-- break -->
|
33
|
-
|
34
|
-
- [wikidata-fetcher](https://rubygems.org/gems/wikidata-fetcher)
|
35
|
-
- <https://github.com/everypolitician/wikidata-fetcher>
|
36
|
-
|
37
|
-
<!-- break -->
|
38
|
-
|
39
|
-
- [mediawiki_api-wikidata](https://rubygems.org/gems/mediawiki_api-wikidata)
|
40
|
-
- <https://github.com/wmde/WikidataApiGem>
|
41
|
-
|
42
|
-
|
43
|
-
|
44
|
-
**Python**
|
45
|
-
|
46
|
-
- <https://pypi.org/project/wptools/> - Wikipedia tools (for Humans)
|
47
|
-
- <https://github.com/siznax/wptools/>
|
48
|
-
|
49
|
-
|
50
|
-
## Wikipedia
|
51
|
-
|
52
|
-
- Wikipedia API reference: <http://en.wikipedia.org/w/api.php>
|
data/test/helper.rb
DELETED
data/test/test_link.rb
DELETED
@@ -1,31 +0,0 @@
|
|
1
|
-
# encoding: utf-8
|
2
|
-
|
3
|
-
###
|
4
|
-
# to run use
|
5
|
-
# ruby -I ./lib -I ./test test/test_link.rb
|
6
|
-
|
7
|
-
|
8
|
-
require 'helper'
|
9
|
-
|
10
|
-
|
11
|
-
class TestLink < MiniTest::Test
|
12
|
-
|
13
|
-
def test_unlink
|
14
|
-
assert_equal 'Santiago (La Florida)', Wikiscript.unlink( '[[Santiago]] ([[La Florida, Chile|La Florida]])' )
|
15
|
-
end
|
16
|
-
|
17
|
-
def test_parse_link
|
18
|
-
link, title = Wikiscript.parse_link( '[[La Florida, Chile|La Florida]]' )
|
19
|
-
assert_equal 'La Florida, Chile', link
|
20
|
-
assert_equal 'La Florida', title
|
21
|
-
|
22
|
-
link, title = Wikiscript.parse_link( '[[ La Florida, Chile | La Florida ]]' )
|
23
|
-
assert_equal 'La Florida, Chile', link
|
24
|
-
assert_equal 'La Florida', title
|
25
|
-
|
26
|
-
link, title = Wikiscript.parse_link( 'La Florida' )
|
27
|
-
assert link == nil
|
28
|
-
assert title == nil
|
29
|
-
end
|
30
|
-
|
31
|
-
end # class TestLink
|
data/test/test_page.rb
DELETED
@@ -1,54 +0,0 @@
|
|
1
|
-
# encoding: utf-8
|
2
|
-
|
3
|
-
###
|
4
|
-
# to run use
|
5
|
-
# ruby -I ./lib -I ./test test/test_page.rb
|
6
|
-
|
7
|
-
|
8
|
-
require 'helper'
|
9
|
-
|
10
|
-
|
11
|
-
class TestPage < MiniTest::Test
|
12
|
-
|
13
|
-
def setup
|
14
|
-
Wikiscript.lang = :en
|
15
|
-
end
|
16
|
-
|
17
|
-
def test_austria_en
|
18
|
-
page = Wikiscript::Page.get( 'Austria' )
|
19
|
-
# [debug] GET /w/index.php?action=raw&title=Austria uri=http://en.wikipedia.org/w/index.php?action=raw&title=Austria, redirect_limit=5
|
20
|
-
# [debug] 301 TLS Redirect location=https://en.wikipedia.org/w/index.php?action=raw&title=Austria
|
21
|
-
# [debug] GET /w/index.php?action=raw&title=Austria uri=https://en.wikipedia.org/w/index.php?action=raw&title=Austria, redirect_limit=4
|
22
|
-
# [debug] 200 OK
|
23
|
-
|
24
|
-
text = page.text
|
25
|
-
|
26
|
-
## print first 600 chars
|
27
|
-
pp text[0..600]
|
28
|
-
|
29
|
-
## check for some snippets
|
30
|
-
assert /{{Infobox country/ =~ text
|
31
|
-
assert /common_name = Austria/ =~ text
|
32
|
-
assert /capital = \[\[Vienna\]\]/ =~ text
|
33
|
-
# assert /The origins of modern-day Austria date back to the time/ =~ text
|
34
|
-
end
|
35
|
-
|
36
|
-
def test_sankt_poelten_en
|
37
|
-
page = Wikiscript::Page.get( 'Sankt_Pölten' )
|
38
|
-
# [debug] GET /w/index.php?action=raw&title=Sankt_P%C3%B6lten uri=http://en.wikipedia.org/w/index.php?action=raw&title=Sankt_P%C3%B6lten, redirect_limit=5
|
39
|
-
# [debug] 301 TLS Redirect location=https://en.wikipedia.org/w/index.php?action=raw&title=Sankt_P%C3%B6lten
|
40
|
-
# [debug] GET /w/index.php?action=raw&title=Sankt_P%C3%B6lten uri=https://en.wikipedia.org/w/index.php?action=raw&title=Sankt_P%C3%B6lten, redirect_limit=4
|
41
|
-
# [debug] 200 OK
|
42
|
-
|
43
|
-
text = page.text
|
44
|
-
|
45
|
-
## print first 600 chars
|
46
|
-
pp text[0..600]
|
47
|
-
|
48
|
-
## check for some snippets
|
49
|
-
assert /{{Infobox settlement/ =~ text
|
50
|
-
assert /name\s+=\s+Sankt Pölten/ =~ text
|
51
|
-
# assert /'''Sankt Pölten''' \(''St. Pölten''\) is the capital city of/ =~ text
|
52
|
-
end
|
53
|
-
|
54
|
-
end # class TestPage
|
data/test/test_page_de.rb
DELETED
@@ -1,38 +0,0 @@
|
|
1
|
-
# encoding: utf-8
|
2
|
-
|
3
|
-
|
4
|
-
###
|
5
|
-
# to run use
|
6
|
-
# ruby -I ./lib -I ./test test/test_page_de.rb
|
7
|
-
|
8
|
-
|
9
|
-
require 'helper'
|
10
|
-
|
11
|
-
|
12
|
-
class TestPageDe < MiniTest::Test
|
13
|
-
|
14
|
-
def setup
|
15
|
-
Wikiscript.lang = :de
|
16
|
-
end
|
17
|
-
|
18
|
-
def test_st_poelten_de
|
19
|
-
page = Wikiscript::Page.get( 'St._Pölten' )
|
20
|
-
# [debug] GET /w/index.php?action=raw&title=St._P%C3%B6lten uri=http://de.wikipedia.org/w/index.php?action=raw&title=St._P%C3%B6lten, redirect_limit=5
|
21
|
-
# [debug] 301 TLS Redirect location=https://de.wikipedia.org/w/index.php?action=raw&title=St._P%C3%B6lten
|
22
|
-
# [debug] GET /w/index.php?action=raw&title=St._P%C3%B6lten uri=https://de.wikipedia.org/w/index.php?action=raw&title=St._P%C3%B6lten, redirect_limit=4
|
23
|
-
# [debug] 200 OK
|
24
|
-
|
25
|
-
|
26
|
-
text = page.text
|
27
|
-
|
28
|
-
## print first 600 chars
|
29
|
-
pp text[0..600]
|
30
|
-
|
31
|
-
## check for some snippets
|
32
|
-
assert /{{Infobox Gemeinde in Österreich/ =~ text
|
33
|
-
assert /Name\s+=\s+St\. Pölten/ =~ text
|
34
|
-
assert /'''St\. Pölten''' \(amtlicher Name,/ =~ text
|
35
|
-
## assert /Die Stadt liegt am Fluss \[\[Traisen \(Fluss\)\|Traisen\]\]/ =~ text
|
36
|
-
end
|
37
|
-
|
38
|
-
end # class TestPageDe
|
data/test/test_page_reader.rb
DELETED
@@ -1,80 +0,0 @@
|
|
1
|
-
# encoding: utf-8
|
2
|
-
|
3
|
-
###
|
4
|
-
# to run use
|
5
|
-
# ruby -I ./lib -I ./test test/test_page_reader.rb
|
6
|
-
|
7
|
-
|
8
|
-
require 'helper'
|
9
|
-
|
10
|
-
|
11
|
-
class TestPageReader < MiniTest::Test
|
12
|
-
|
13
|
-
def test_basic
|
14
|
-
el = Wikiscript.parse( <<TXT )
|
15
|
-
=Heading 1==
|
16
|
-
==Heading 2==
|
17
|
-
===Heading 3===
|
18
|
-
|
19
|
-
{|
|
20
|
-
|-
|
21
|
-
! header1
|
22
|
-
! header2
|
23
|
-
! header3
|
24
|
-
|-
|
25
|
-
| row1cell1
|
26
|
-
| row1cell2
|
27
|
-
| row1cell3
|
28
|
-
|-
|
29
|
-
| row2cell1
|
30
|
-
| row2cell2
|
31
|
-
| row2cell3
|
32
|
-
|}
|
33
|
-
TXT
|
34
|
-
|
35
|
-
pp el
|
36
|
-
|
37
|
-
assert_equal 4, el.size
|
38
|
-
assert_equal [:h1, 'Heading 1'], el[0]
|
39
|
-
assert_equal [:h2, 'Heading 2'], el[1]
|
40
|
-
assert_equal [:h3, 'Heading 3'], el[2]
|
41
|
-
assert_equal [:table, [['header1', 'header2', 'header3'],
|
42
|
-
['row1cell1', 'row1cell2', 'row1cell3'],
|
43
|
-
['row2cell1', 'row2cell2', 'row2cell3']]], el[3]
|
44
|
-
end
|
45
|
-
|
46
|
-
def test_parse
|
47
|
-
page = Wikiscript::Page.new( <<TXT )
|
48
|
-
=Heading 1==
|
49
|
-
==Heading 2==
|
50
|
-
===Heading 3===
|
51
|
-
|
52
|
-
{|
|
53
|
-
|-
|
54
|
-
! header1
|
55
|
-
! header2
|
56
|
-
! header3
|
57
|
-
|-
|
58
|
-
| row1cell1
|
59
|
-
| row1cell2
|
60
|
-
| row1cell3
|
61
|
-
|-
|
62
|
-
| row2cell1
|
63
|
-
| row2cell2
|
64
|
-
| row2cell3
|
65
|
-
|}
|
66
|
-
TXT
|
67
|
-
|
68
|
-
el = page.parse
|
69
|
-
pp el
|
70
|
-
|
71
|
-
assert_equal 4, el.size
|
72
|
-
assert_equal [:h1, 'Heading 1'], el[0]
|
73
|
-
assert_equal [:h2, 'Heading 2'], el[1]
|
74
|
-
assert_equal [:h3, 'Heading 3'], el[2]
|
75
|
-
assert_equal [:table, [['header1', 'header2', 'header3'],
|
76
|
-
['row1cell1', 'row1cell2', 'row1cell3'],
|
77
|
-
['row2cell1', 'row2cell2', 'row2cell3']]], el[3]
|
78
|
-
end
|
79
|
-
|
80
|
-
end # class TestPageReader
|
data/test/test_table_reader.rb
DELETED
@@ -1,109 +0,0 @@
|
|
1
|
-
# encoding: utf-8
|
2
|
-
|
3
|
-
###
|
4
|
-
# to run use
|
5
|
-
# ruby -I ./lib -I ./test test/test_table_reader.rb
|
6
|
-
|
7
|
-
|
8
|
-
require 'helper'
|
9
|
-
|
10
|
-
class TestTableReader < MiniTest::Test
|
11
|
-
|
12
|
-
def test_basic
|
13
|
-
table = Wikiscript.parse_table( <<TXT )
|
14
|
-
{|
|
15
|
-
|-
|
16
|
-
! header1
|
17
|
-
! header2
|
18
|
-
! header3
|
19
|
-
|-
|
20
|
-
| row1cell1
|
21
|
-
| row1cell2
|
22
|
-
| row1cell3
|
23
|
-
|-
|
24
|
-
| row2cell1
|
25
|
-
| row2cell2
|
26
|
-
| row2cell3
|
27
|
-
|}
|
28
|
-
TXT
|
29
|
-
|
30
|
-
assert_equal 3, table.size ## three rows
|
31
|
-
assert_equal ['header1', 'header2', 'header3'], table[0]
|
32
|
-
assert_equal ['row1cell1', 'row1cell2', 'row1cell3'], table[1]
|
33
|
-
assert_equal ['row2cell1', 'row2cell2', 'row2cell3'], table[2]
|
34
|
-
end
|
35
|
-
|
36
|
-
def test_basic_ii ## with optional (missing) row divider before headers
|
37
|
-
table = Wikiscript.parse_table( <<TXT )
|
38
|
-
{|
|
39
|
-
! header1 !! header2 !! header3
|
40
|
-
|-
|
41
|
-
| row1cell1 || row1cell2 || row1cell3
|
42
|
-
|-
|
43
|
-
| row2cell1 || row2cell2 || row2cell3
|
44
|
-
|}
|
45
|
-
TXT
|
46
|
-
|
47
|
-
assert_equal 3, table.size ## three rows
|
48
|
-
assert_equal ['header1', 'header2', 'header3'], table[0]
|
49
|
-
assert_equal ['row1cell1', 'row1cell2', 'row1cell3'], table[1]
|
50
|
-
assert_equal ['row2cell1', 'row2cell2', 'row2cell3'], table[2]
|
51
|
-
end
|
52
|
-
|
53
|
-
def test_basic_iii # with continuing header column lines
|
54
|
-
table = Wikiscript.parse_table( <<TXT )
|
55
|
-
{|
|
56
|
-
|-
|
57
|
-
!
|
58
|
-
header1
|
59
|
-
!
|
60
|
-
header2
|
61
|
-
!
|
62
|
-
header3
|
63
|
-
|-
|
64
|
-
|
|
65
|
-
row1cell1
|
66
|
-
|
|
67
|
-
row1cell2
|
68
|
-
|
|
69
|
-
row1cell3
|
70
|
-
|-
|
71
|
-
|
|
72
|
-
row2cell1
|
73
|
-
|
|
74
|
-
row2cell2
|
75
|
-
|
|
76
|
-
row2cell3
|
77
|
-
|}
|
78
|
-
TXT
|
79
|
-
|
80
|
-
assert_equal 3, table.size ## three rows
|
81
|
-
assert_equal ['header1', 'header2', 'header3'], table[0]
|
82
|
-
assert_equal ['row1cell1', 'row1cell2', 'row1cell3'], table[1]
|
83
|
-
assert_equal ['row2cell1', 'row2cell2', 'row2cell3'], table[2]
|
84
|
-
end
|
85
|
-
|
86
|
-
|
87
|
-
def test_strip_attributes_and_emphases
|
88
|
-
table = Wikiscript.parse_table( <<TXT )
|
89
|
-
{|
|
90
|
-
|-
|
91
|
-
! style="width:200px;"|Club
|
92
|
-
! style="width:150px;"|City
|
93
|
-
|-
|
94
|
-
|[[Biu Chun Rangers]]||[[Sham Shui Po]]
|
95
|
-
|-
|
96
|
-
|bgcolor=#ffff44 |''[[Eastern Sports Club|Eastern]]''||[[Mong Kok]]
|
97
|
-
|-
|
98
|
-
|[[HKFC Soccer Section]]||[[Happy Valley, Hong Kong|Happy Valley]]
|
99
|
-
|}
|
100
|
-
TXT
|
101
|
-
|
102
|
-
assert_equal 4, table.size ## four rows
|
103
|
-
assert_equal ['Club', 'City'], table[0]
|
104
|
-
assert_equal ['[[Biu Chun Rangers]]', '[[Sham Shui Po]]'], table[1]
|
105
|
-
assert_equal ['[[Eastern Sports Club|Eastern]]', '[[Mong Kok]]'], table[2]
|
106
|
-
assert_equal ['[[HKFC Soccer Section]]', '[[Happy Valley, Hong Kong|Happy Valley]]'], table[3]
|
107
|
-
end
|
108
|
-
|
109
|
-
end # class TestTableReader
|