rsssf 0.0.1 → 0.1.0
- checksums.yaml +4 -4
- data/.gemtest +0 -0
- data/Manifest.txt +11 -0
- data/README.md +171 -2
- data/lib/rsssf.rb +13 -0
- data/lib/rsssf/fetch.rb +80 -0
- data/lib/rsssf/html2txt.rb +157 -0
- data/lib/rsssf/page.rb +295 -0
- data/lib/rsssf/patch.rb +28 -0
- data/lib/rsssf/repo.rb +220 -0
- data/lib/rsssf/reports/page.rb +64 -0
- data/lib/rsssf/reports/schedule.rb +77 -0
- data/lib/rsssf/schedule.rb +31 -0
- data/lib/rsssf/utils.rb +75 -0
- data/lib/rsssf/version.rb +2 -2
- data/test/helper.rb +12 -0
- data/test/test_utils.rb +83 -0
- metadata +13 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 9c33c4bef56bf6e79c9339b5010a3129b98dc9fa
|
4
|
+
data.tar.gz: 4d97b1862dfbaab48282593947c6d7b28d047dfe
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 8da437c2a72c364f81d53ca1491cdfca4c11c0f187912702e653523cb54f636ec910f8690f9af690cf632f0c98335b9ab10983676d0e67e1a59f3c7d7a5919c7
|
7
|
+
data.tar.gz: b8a13466f301863e6ae0e4e449f1da08d5ae17a7481bf2055e12976f6f14fda79a8ebd38aa5cfb5f097d21b36e43bc322f362483d1c522a005c9cf49e576f732
|
data/.gemtest
ADDED
File without changes
|
data/Manifest.txt
CHANGED
@@ -3,4 +3,15 @@ Manifest.txt
|
|
3
3
|
README.md
|
4
4
|
Rakefile
|
5
5
|
lib/rsssf.rb
|
6
|
+
lib/rsssf/fetch.rb
|
7
|
+
lib/rsssf/html2txt.rb
|
8
|
+
lib/rsssf/page.rb
|
9
|
+
lib/rsssf/patch.rb
|
10
|
+
lib/rsssf/repo.rb
|
11
|
+
lib/rsssf/reports/page.rb
|
12
|
+
lib/rsssf/reports/schedule.rb
|
13
|
+
lib/rsssf/schedule.rb
|
14
|
+
lib/rsssf/utils.rb
|
6
15
|
lib/rsssf/version.rb
|
16
|
+
test/helper.rb
|
17
|
+
test/test_utils.rb
|
data/README.md
CHANGED
@@ -1,16 +1,172 @@
|
|
1
1
|
# rsssf - tools 'n' scripts for RSSSF (Rec.Sport.Soccer Statistics Foundation) archive data
|
2
2
|
|
3
3
|
|
4
|
-
* home :: [github.com/sportdb/rsssf](https://github.com/sportdb/
|
4
|
+
* home :: [github.com/sportdb/rsssf](https://github.com/sportdb/rsssf)
|
5
5
|
* bugs :: [github.com/sportdb/rsssf/issues](https://github.com/sportdb/rsssf/issues)
|
6
6
|
* gem :: [rubygems.org/gems/rsssf](https://rubygems.org/gems/rsssf)
|
7
7
|
* rdoc :: [rubydoc.info/gems/rsssf](http://rubydoc.info/gems/rsssf)
|
8
8
|
* forum :: [opensport](http://groups.google.com/group/opensport)
|
9
9
|
|
10
10
|
|
11
|
+
## What's the Rec.Sport.Soccer Statistics Foundation (RSSSF)?
|
12
|
+
|
13
|
+
The RSSSF collects and offers football (soccer) league tables, match results and more
|
14
|
+
from all over the world online in plain text.
|
15
|
+
|
16
|
+
Example:
|
17
|
+
|
18
|
+
```
|
19
|
+
Round 1
|
20
|
+
[May 25]
|
21
|
+
Vasco da Gama 1-0 Portuguesa
|
22
|
+
[Carlos Tenório 47']
|
23
|
+
Vitória 2-2 Internacional
|
24
|
+
[Maxi Biancucchi 2', Gabriel Paulista 11'; Diego Forlán 29', Fred 63']
|
25
|
+
Corinthians 1-1 Botafogo
|
26
|
+
[Paulinho 73'; Rafael Marques 24']
|
27
|
+
[May 26]
|
28
|
+
Grêmio 2-0 Náutico [played in Caxias do Sul-RS]
|
29
|
+
[Zé Roberto 15', Elano 70']
|
30
|
+
Ponte Preta 0-2 São Paulo
|
31
|
+
[Lúcio 9', Jádson 44'p]
|
32
|
+
Criciúma 3-1 Bahia
|
33
|
+
[Matheus Ferraz 45'+1', Lins 46', João Vítor 82'; Diones 72']
|
34
|
+
Santos 0-0 Flamengo [played in Brasília-DF]
|
35
|
+
Fluminense 2-1 Atlético/PR [played in Macaé-RJ]
|
36
|
+
[Rafael Sóbis 15'p, Samuel 53'; Manoel 28']
|
37
|
+
Cruzeiro 5-0 Goiás
|
38
|
+
[Diego Souza 5', Bruno Rodrigo 30', Nílton 40',79', Borges 42']
|
39
|
+
Coritiba 2-1 Atlético/MG
|
40
|
+
[Deivid 53', Arthur 90'+1'; Diego Tardelli 51']
|
41
|
+
```
|
42
|
+
|
43
|
+
[Find out more about the Rec.Sport.Soccer Statistics Foundation (RSSSF) »](http://www.rsssf.com)
|
44
|
+
|
45
|
+
|
46
|
+
|
11
47
|
## Usage
|
12
48
|
|
13
|
-
|
49
|
+
### Working with Pages
|
50
|
+
|
51
|
+
To fetch pages from the world wide web use:
|
52
|
+
|
53
|
+
``` ruby
|
54
|
+
page = RsssfPage.from_url( 'http://www.rsssf.com/tablese/eng2015.html')
|
55
|
+
```
|
56
|
+
|
57
|
+
Note: The `RsssfPageFetcher` will convert the rsssf archive page
|
58
|
+
from hypertext (HTML) to plain text e.g.
|
59
|
+
|
60
|
+
```
|
61
|
+
<hr>
|
62
|
+
<a href="#premier">Premier League</A><br>
|
63
|
+
<a href="#cups">Cup Tournaments</A><br>
|
64
|
+
<a href="#champ">Championship</A><br>
|
65
|
+
<a href="#first">Division 1</A><br>
|
66
|
+
<a href="#second">Division 2</A><br>
|
67
|
+
<a href="#conf">Conference</A>
|
68
|
+
<hr>
|
69
|
+
<h4><a name="premier">Premier League</A></h4>
|
70
|
+
<pre>
|
71
|
+
Final Table:
|
72
|
+
|
73
|
+
1.Chelsea 38 26 9 3 73-32 87 Champions
|
74
|
+
2.Manchester City 38 24 7 7 83-38 79
|
75
|
+
3.Arsenal 38 22 9 7 71-36 75
|
76
|
+
...
|
77
|
+
```
|
78
|
+
|
79
|
+
will become
|
80
|
+
|
81
|
+
```
|
82
|
+
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
|
83
|
+
‹Premier League›
|
84
|
+
‹Cup Tournaments›
|
85
|
+
‹Championship›
|
86
|
+
‹Division 1›
|
87
|
+
‹Division 2›
|
88
|
+
‹Conference›
|
89
|
+
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
|
90
|
+
|
91
|
+
#### Premier League
|
92
|
+
|
93
|
+
|
94
|
+
Final Table:
|
95
|
+
|
96
|
+
1.Chelsea 38 26 9 3 73-32 87 Champions
|
97
|
+
2.Manchester City 38 24 7 7 83-38 79
|
98
|
+
3.Arsenal 38 22 9 7 71-36 75
|
99
|
+
...
|
100
|
+
```
|
101
|
+
|
102
|
+
|
103
|
+
### Working with Repos
|
104
|
+
|
105
|
+
To fetch pages from the world wide web for many seasons in one batch, set up and use a repo.
|
106
|
+
|
107
|
+
Step 1: List all archive pages
|
108
|
+
|
109
|
+
In the `tables/config.yml` list all archive pages to fetch. Example:
|
110
|
+
|
111
|
+
``` yaml
|
112
|
+
2010-11: tablese/eng2011.html
|
113
|
+
2011-12: tablese/eng2012.html
|
114
|
+
2012-13: tablese/eng2013.html
|
115
|
+
2013-14: tablese/eng2014.html
|
116
|
+
2014-15: tablese/eng2015.html
|
117
|
+
```
|
118
|
+
|
119
|
+
Step 2: Fetch all archive pages
|
120
|
+
|
121
|
+
Use:
|
122
|
+
|
123
|
+
``` ruby
|
124
|
+
repo = RsssfRepo.new( './eng-england', title: 'England (and Wales)' )
|
125
|
+
repo.fetch_pages
|
126
|
+
```
|
127
|
+
|
128
|
+
Bonus: To create a summary of all pages fetched (e.g. authors, last_updated, sections, etc.).
|
129
|
+
Use:
|
130
|
+
|
131
|
+
``` ruby
|
132
|
+
repo.make_pages_report
|
133
|
+
```
|
134
|
+
|
135
|
+
Example - `tables/README.md`:
|
136
|
+
|
137
|
+
|
138
|
+
football.db RSSSF Archive Data Summary for England (and Wales)
|
139
|
+
|
140
|
+
_Last Update: 2015-11-26 18:22:22 +0200_
|
141
|
+
|
142
|
+
| Season | File | Authors | Last Updated | Lines (Chars) | Sections |
|
143
|
+
| :------ | :------ | :------- | :----------- | ------------: | :------- |
|
144
|
+
| 2014-15 | [eng2015.txt](https://github.com/rsssf/eng-england/blob/master/tables/eng2015.txt) | Ian King and Karel Stokkermans | 4 Jun 2015 | 1249 (34138) | Premier League, Cup Tournaments, Championship, Division 1, Division 2, Conference |
|
145
|
+
| 2013-14 | [eng2014.txt](https://github.com/rsssf/eng-england/blob/master/tables/eng2014.txt) | Ian King and Karel Stokkermans | 5 Feb 2015 | 1254 (34294) | Premier League, Cup Tournaments, Championship, Division 1, Division 2, Conference |
|
146
|
+
| 2012-13 | [eng2013.txt](https://github.com/rsssf/eng-england/blob/master/tables/eng2013.txt) | Karel Stokkermans | 5 Feb 2015 | 1269 (34531) | Premiership, Cup Tournaments, Championship, Division 1, Division 2, Conference |
|
147
|
+
| 2011-12 | [eng2012.txt](https://github.com/rsssf/eng-england/blob/master/tables/eng2012.txt) | Karel Stokkermans | 5 Feb 2015 | 691 (21925) | Premiership, Cup Tournaments, Championship, Division 1, Division 2, Conference |
|
148
|
+
| 2010-11 | [eng2011.txt](https://github.com/rsssf/eng-england/blob/master/tables/eng2011.txt) | Ian King, Karel Stokkermans and Jan Schoenmakers | 5 Feb 2015 | 959 (37393) | Premiership, Cup Tournaments, Championship, Division 1, Division 2, Conference |
|
149
|
+
|
150
|
+
|
151
|
+
That's it.
|
152
|
+
|
153
|
+
|
154
|
+
### Preparing Archive Pages for SQL Database Imports (e.g. football.db)
|
155
|
+
|
156
|
+
To import match schedules (fixtures and results) and more using the football.db machinery
|
157
|
+
prepare "simple" single league (or cup) pages with standings tables etc. stripped out.
|
158
|
+
For example, to break out the Premier League and FA Cup from the `eng2015.txt`
|
159
|
+
archive page use:
|
160
|
+
|
161
|
+
``` ruby
|
162
|
+
page = RsssfPage.from_url( 'http://www.rsssf.com/tablese/eng2015.html')
|
163
|
+
|
164
|
+
schedule = page.find_schedule( header: 'Premier League') ## returns RsssfSchedule obj
|
165
|
+
schedule.save( './1-premierleague.txt' )
|
166
|
+
|
167
|
+
schedule = page.find_schedule( header: 'FA Cup', cup: true )
|
168
|
+
schedule.save( './facup.txt' )
|
169
|
+
```
|
14
170
|
|
15
171
|
|
16
172
|
|
@@ -21,6 +177,19 @@ Just install the gem:
|
|
21
177
|
$ gem install rsssf
|
22
178
|
|
23
179
|
|
180
|
+
|
181
|
+
## RSSSF Datasets
|
182
|
+
|
183
|
+
See the rsssf github org for pre-processed ready-to-import datasets. Prepared repos include:
|
184
|
+
|
185
|
+
- [`eng-england`](https://github.com/rsssf/eng-england) - rsssf archive data for England - Premier League, Championship, FA Cup etc.
|
186
|
+
- [`de-deutschland`](https://github.com/rsssf/de-deutschland) - rsssf archive data for Germany (Deutschland) - Deutsche Bundesliga, 2. Bundesliga, 3. Liga, DFB Pokal etc.
|
187
|
+
- [`es-espana`](https://github.com/rsssf/es-espana) - rsssf archive data for España (Spain) - Primera División / La Liga, Copa del Rey, etc.
|
188
|
+
- [`at-austria`](https://github.com/rsssf/at-austria) - rsssf archive data for Austria (Österreich) - Österr. Bundesliga, Erste Liga, ÖFB Pokal etc.
|
189
|
+
- [`br-brazil`](https://github.com/rsssf/br-brazil) - rsssf archive data for Brazil (Brasil) - Campeonato Brasileiro Série A / Brasileirão etc.
|
190
|
+
- and more
|
191
|
+
|
192
|
+
|
24
193
|
## License
|
25
194
|
|
26
195
|
The `rsssf` scripts are dedicated to the public domain.
|
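Reviewer note on the README's "Step 1" above: the `tables/config.yml` file is a plain season-to-page mapping. The sketch below shows one way such a mapping could be turned into full archive URLs. It is not taken from the gem; `BASE_URL` and `archive_urls` are hypothetical names, and the base URL is only assumed from the `from_url` examples in the README.

```ruby
require 'yaml'

# Assumption: archive pages live under this base URL (as in the README examples).
BASE_URL = 'http://www.rsssf.com'

# Sample config in the format shown in the README (season => archive page path).
config_yaml = <<~YAML
  2013-14: tablese/eng2014.html
  2014-15: tablese/eng2015.html
YAML

# Hypothetical helper: parse the mapping and build one URL per season.
def archive_urls( yaml_text, base_url: BASE_URL )
  config = YAML.safe_load( yaml_text )   # season keys stay plain strings
  config.map { |season, path| [season, "#{base_url}/#{path}"] }.to_h
end

urls = archive_urls( config_yaml )
urls.each { |season, url| puts "#{season} => #{url}" }
```

How `RsssfRepo#fetch_pages` reads the config internally is not visible in this part of the diff, so treat the helper as an illustration of the file format only.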
data/lib/rsssf.rb
CHANGED
@@ -14,6 +14,19 @@ require 'fetcher' ## used for Fetcher::Worker.new.fetch etc.
|
|
14
14
|
## our own code
|
15
15
|
require 'rsssf/version' # note: let version always go first
|
16
16
|
|
17
|
+
require 'rsssf/utils' # include Utils - goes first
|
18
|
+
require 'rsssf/html2txt' # include Filters - goes first
|
19
|
+
|
20
|
+
require 'rsssf/fetch'
|
21
|
+
require 'rsssf/page'
|
22
|
+
require 'rsssf/schedule'
|
23
|
+
require 'rsssf/patch'
|
24
|
+
|
25
|
+
require 'rsssf/reports/schedule'
|
26
|
+
require 'rsssf/reports/page'
|
27
|
+
|
28
|
+
require 'rsssf/repo'
|
29
|
+
|
17
30
|
|
18
31
|
|
19
32
|
|
data/lib/rsssf/fetch.rb
ADDED
@@ -0,0 +1,80 @@
|
|
1
|
+
# encoding: utf-8
|
2
|
+
|
3
|
+
module Rsssf
|
4
|
+
|
5
|
+
class PageFetcher
|
6
|
+
|
7
|
+
include Filters # e.g. html_to_txt, sanitize etc.
|
8
|
+
|
9
|
+
|
10
|
+
def initialize
|
11
|
+
@worker = Fetcher::Worker.new
|
12
|
+
end
|
13
|
+
|
14
|
+
def fetch( src_url )
|
15
|
+
|
16
|
+
## note: assume plain 7-bit ascii for now
|
17
|
+
## -- assume rsssf uses ISO_8859_15 (updated version of ISO_8859_1) -- does NOT use utf-8 character encoding!!!
|
18
|
+
html = @worker.read( src_url )
|
19
|
+
|
20
|
+
### todo/fix: first check if html is all ascii-7bit e.g.
|
21
|
+
## includes only chars from 0 to 127!!!
|
22
|
+
|
23
|
+
## normalize newlines
|
24
|
+
## remove \r (carriage return) used by Windows; just use \n (new line)
|
25
|
+
html = html.gsub( "\r", '' )
|
26
|
+
|
27
|
+
## note:
|
28
|
+
## assume (default) to ISO 8859-15 (an updated version of ISO 8859-1) for now
|
29
|
+
##
|
30
|
+
## other possible alternatives - try:
|
31
|
+
## - Windows CP 1252 or
|
32
|
+
## - ISO 8859-2 (for eastern european languages )
|
33
|
+
##
|
34
|
+
## note: german umlaut use the same code (int)
|
35
|
+
## in ISO 8859-1/15 and 2 and Windows CP1252 (other chars ARE different!!!)
|
36
|
+
|
37
|
+
html = html.force_encoding( Encoding::ISO_8859_15 )
|
38
|
+
html = html.encode( Encoding::UTF_8 ) # try conversion to utf-8
|
39
|
+
|
40
|
+
## check for html entities
|
41
|
+
html = html.gsub( "&auml;", 'ä' )
|
42
|
+
html = html.gsub( "&ouml;", 'ö' )
|
43
|
+
html = html.gsub( "&uuml;", 'ü' )
|
44
|
+
html = html.gsub( "&Auml;", 'Ä' )
|
45
|
+
html = html.gsub( "&Ouml;", 'Ö' )
|
46
|
+
html = html.gsub( "&Uuml;", 'Ü' )
|
47
|
+
html = html.gsub( "&szlig;", 'ß' )
|
48
|
+
|
49
|
+
html = html.gsub( "&oulm;", 'ö' ) ## support typo in entity (ö)
|
50
|
+
html = html.gsub( "&slig;", "ß" ) ## support typo in entity (ß)
|
51
|
+
|
52
|
+
html = html.gsub( "&Eacute;", 'É' )
|
53
|
+
html = html.gsub( "&oslash;", 'ø' )
|
54
|
+
|
55
|
+
## check for more entities
|
56
|
+
html = html.gsub( /&[^;]+;/) do |match|
|
57
|
+
puts "*** found unencoded html entity #{match}"
|
58
|
+
match ## pass through as is (1:1)
|
59
|
+
end
|
60
|
+
## todo/fix: add more entities
|
61
|
+
|
62
|
+
|
63
|
+
txt = html_to_txt( html )
|
64
|
+
|
65
|
+
header = <<EOS
|
66
|
+
<!--
|
67
|
+
source: #{src_url}
|
68
|
+
-->
|
69
|
+
|
70
|
+
EOS
|
71
|
+
|
72
|
+
header+txt ## return txt w/ header
|
73
|
+
end ## method fetch
|
74
|
+
|
75
|
+
end ## class PageFetcher
|
76
|
+
end ## module Rsssf
|
77
|
+
|
78
|
+
## add (shortcut) alias
|
79
|
+
RsssfPageFetcher = Rsssf::PageFetcher
|
80
|
+
|
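Reviewer note on the encoding step in `fetch` above: the comments say the German umlauts share the same byte value across the candidate legacy encodings while other characters differ. A minimal sketch, using only Ruby's built-in `String#force_encoding` and `String#encode`, illustrates that claim; `to_utf8` is a hypothetical helper name and the sample bytes are made up for the example.

```ruby
# Hypothetical helper: reinterpret raw bytes in a legacy encoding, then convert to UTF-8.
def to_utf8( bytes, from )
  bytes.dup.force_encoding( from ).encode( Encoding::UTF_8 )
end

# "Köln" as it might arrive from a legacy page: ö is the single byte 0xF6.
raw_city = "K\xF6ln".b

[Encoding::ISO_8859_1, Encoding::ISO_8859_15,
 Encoding::ISO_8859_2, Encoding::Windows_1252].each do |enc|
  puts "#{enc.name}: #{to_utf8( raw_city, enc )}"   # same result for all four
end

# Byte 0xA4 is one of the positions where ISO 8859-1 and ISO 8859-15 differ.
raw_sign = "\xA4".b
latin1_sign = to_utf8( raw_sign, Encoding::ISO_8859_1 )    # currency sign (U+00A4)
latin9_sign = to_utf8( raw_sign, Encoding::ISO_8859_15 )   # euro sign (U+20AC)
puts "0xA4 as ISO-8859-1: #{latin1_sign}, as ISO-8859-15: #{latin9_sign}"
```

Because the umlauts convert identically either way, a wrong guess between these encodings would only show up on the few code points that differ, which is presumably why the source treats ISO 8859-15 as a workable default.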
data/lib/rsssf/html2txt.rb
ADDED
@@ -0,0 +1,157 @@
|
|
1
|
+
# encoding: utf-8
|
2
|
+
|
3
|
+
module Rsssf
|
4
|
+
module Filters
|
5
|
+
|
6
|
+
def html_to_txt( html )
|
7
|
+
|
8
|
+
###
|
9
|
+
# todo: check if any tags (still) present??
|
10
|
+
|
11
|
+
|
12
|
+
## cut off everything before body
|
13
|
+
html = html.sub( /.+?<BODY>\s*/im, '' )
|
14
|
+
|
15
|
+
## cut off everything after body (closing)
|
16
|
+
html = html.sub( /<\/BODY>.*/im, '' )
|
17
|
+
|
18
|
+
|
19
|
+
## remove cite
|
20
|
+
html = html.gsub( /<CITE>([^<]+)<\/CITE>/im ) do |_|
|
21
|
+
puts " remove cite >#{$1}<"
|
22
|
+
"#{$1}"
|
23
|
+
end
|
24
|
+
|
25
|
+
html = html.gsub( /\s*<HR>\s*/im ) do |match|
|
26
|
+
match = match.gsub( "\n", '$$' ) ## make newlines visible for debugging
|
27
|
+
puts " replace horizontal rule (hr) - >#{match}<"
|
28
|
+
"\n=-=-=-=-=-=-=-=-=-=-=-=-=-=-=\n" ## check what hr to use - . - . - or =-=-=-= or something distinct?
|
29
|
+
end
|
30
|
+
|
31
|
+
## replace break (br)
|
32
|
+
## note: do NOT use m/multiline for now - why? why not??
|
33
|
+
html = html.gsub( /<BR>\s*/i ) do |match| ## note: include (swallow) "extra" newline
|
34
|
+
match = match.gsub( "\n", '$$' ) ## make newlines visible for debugging
|
35
|
+
puts " replace break (br) - >#{match}<"
|
36
|
+
"\n"
|
37
|
+
end
|
38
|
+
|
39
|
+
## remove anchors (a name)
|
40
|
+
html = html.gsub( /<A NAME[^>]*>(.+?)<\/A>/im ) do |match| ## note: use .+? non-greedy match
|
41
|
+
title = $1.to_s ## note: "save" capture first; gets replaced by gsub (next regex call)
|
42
|
+
match = match.gsub( "\n", '$$' ) ## make newlines visible for debugging
|
43
|
+
puts " replace anchor (a) name >#{title}< - >#{match}<"
|
44
|
+
"#{title}"
|
45
|
+
end
|
46
|
+
|
47
|
+
## remove anchors (a href)
|
48
|
+
# note: heading 4 includes anchor (thus, let anchors go first)
|
49
|
+
# note: <a \newline href is used for authors email - thus incl. support for newline as space
|
50
|
+
html = html.gsub( /<A\s+HREF[^>]*>(.+?)<\/A>/im ) do |_| ## note: use .+? non-greedy match
|
51
|
+
puts " replace anchor (a) href >#{$1}<"
|
52
|
+
"‹#{$1}›"
|
53
|
+
end
|
54
|
+
|
55
|
+
## replace paragraph (p)
|
56
|
+
html = html.gsub( /\s*<P>\s*/im ) do |match| ## note: include (swallow) "extra" newline
|
57
|
+
match = match.gsub( "\n", '$$' ) ## make newlines visible for debugging
|
58
|
+
puts " replace paragraph (p) - >#{match}<"
|
59
|
+
"\n\n"
|
60
|
+
end
|
61
|
+
html = html.gsub( /<\/P>/i, '' ) ## replace paragraph (p) closing w/ nothing for now
|
62
|
+
|
63
|
+
## remove i
|
64
|
+
html = html.gsub( /<I>([^<]+)<\/I>/im ) do |_|
|
65
|
+
puts " remove italic (i) >#{$1}<"
|
66
|
+
"#{$1}"
|
67
|
+
end
|
68
|
+
|
69
|
+
|
70
|
+
## heading 2
|
71
|
+
html = html.gsub( /\s*<H2>([^<]+)<\/H2>\s*/im ) do |_|
|
72
|
+
puts " replace heading 2 (h2) >#{$1}<"
|
73
|
+
"\n\n## #{$1}\n\n" ## note: make sure to always add two newlines
|
74
|
+
end
|
75
|
+
|
76
|
+
## heading 4
|
77
|
+
html = html.gsub( /\s*<H4>([^<]+)<\/H4>\s*/im ) do |_|
|
78
|
+
puts " replace heading 4 (h4) >#{$1}<"
|
79
|
+
"\n\n#### #{$1}\n\n" ## note: make sure to always add two newlines
|
80
|
+
end
|
81
|
+
|
82
|
+
|
83
|
+
## replace b (bold) w/ **...** - note: might include anchors (thus, call after anchors)
|
84
|
+
html = html.gsub( /<B>([^<]+)<\/B>/im ) do |_|
|
85
|
+
puts " remove bold (b) >#{$1}<"
|
86
|
+
"**#{$1}**"
|
87
|
+
end
|
88
|
+
|
89
|
+
## replace preformatted (pre)
|
90
|
+
html = html.gsub( /<PRE>|<\/PRE>/i ) do |_|
|
91
|
+
puts " replace preformatted (pre)"
|
92
|
+
'' # replace w/ nothing for now (keep surrounding newlines)
|
93
|
+
end
|
94
|
+
|
95
|
+
=begin
|
96
|
+
puts
|
97
|
+
puts
|
98
|
+
puts "html:"
|
99
|
+
puts html[0..2000]
|
100
|
+
puts "-- snip --"
|
101
|
+
puts html[-1000..-1] ## print last thousand chars
|
102
|
+
=end
|
103
|
+
|
104
|
+
|
105
|
+
## cleanup whitespaces
|
106
|
+
## todo/fix: convert newline in space first
|
107
|
+
## and then collapse spaces etc.!!!
|
108
|
+
txt = ''
|
109
|
+
html.each_line do |line|
|
110
|
+
line = line.gsub( "\t", '  ' ) # replace all tabs w/ two spaces for now
|
111
|
+
line = line.rstrip # remove trailing whitespace (incl. newline/formfeed)
|
112
|
+
|
113
|
+
txt << line
|
114
|
+
txt << "\n"
|
115
|
+
end
|
116
|
+
|
117
|
+
### remove emails etc.
|
118
|
+
txt = sanitize( txt )
|
119
|
+
|
120
|
+
txt
|
121
|
+
end # method html_to_txt
|
122
|
+
|
123
|
+
|
124
|
+
|
125
|
+
def sanitize( txt )
|
126
|
+
### remove emails for (spam/privacy) protection
|
127
|
+
## e.g. (selamm@example.es)
|
128
|
+
## (buuu@mscs.dal.ca)
|
129
|
+
## (kaxx@rsssf.com)
|
130
|
+
## (Manu_Maya@yakoo.co)
|
131
|
+
|
132
|
+
## note: adds support for optional ‹› enclosure (used for a href mailto: links converted by html2txt)
|
133
|
+
## e.g. (‹selamm@example.es›)
|
134
|
+
|
135
|
+
email_pattern = "\\(‹?[a-z][a-z0-9_]+@[a-z]+(\\.[a-z]+)+›?\\)" ## note: just a string; needs to escape \\ twice!!!
|
136
|
+
|
137
|
+
## check for "free-standing e.g. on its own line" emails only for now
|
138
|
+
txt = txt.gsub( /\n#{email_pattern}\n/i ) do |match|
|
139
|
+
puts "removing (free-standing) email >#{match}<"
|
140
|
+
"\n" # return empty line
|
141
|
+
end
|
142
|
+
|
143
|
+
txt = txt.gsub( /#{email_pattern}/i ) do |match|
|
144
|
+
puts "remove email >#{match}<"
|
145
|
+
''
|
146
|
+
end
|
147
|
+
|
148
|
+
txt
|
149
|
+
end # method sanitize
|
150
|
+
|
151
|
+
end # module Filters
|
152
|
+
end # module Rsssf
|
153
|
+
|
154
|
+
## add (shortcut) alias
|
155
|
+
RsssfFilters = Rsssf::Filters
|
156
|
+
|
157
|
+
|
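Reviewer note on `html_to_txt` above: the order of the replacements matters (anchors before headings, since the h4 pattern only accepts tag-free content). The sketch below replays a reduced subset of those steps on the README's sample markup. `mini_html_to_txt` is a hypothetical, cut-down stand-in written for this note; it omits the debug output, body trimming, whitespace cleanup and `sanitize` call of the real method.

```ruby
# Reduced sample, modelled on the README example.
sample = <<~HTML
  <hr>
  <a href="#premier">Premier League</A><br>
  <a href="#cups">Cup Tournaments</A>
  <hr>
  <h4><a name="premier">Premier League</A></h4>
  <pre>
  Final Table:
HTML

# Hypothetical cut-down version of the conversion steps, in the same order.
def mini_html_to_txt( html )
  txt = html.gsub( /\s*<HR>\s*/im, "\n=-=-=-=-=-=-=-=-=-=-=-=-=-=-=\n" )   # horizontal rule
  txt = txt.gsub( /<BR>\s*/i, "\n" )                                       # line break
  txt = txt.gsub( /<A NAME[^>]*>(.+?)<\/A>/im ) { $1 }                     # named anchor: keep title only
  txt = txt.gsub( /<A\s+HREF[^>]*>(.+?)<\/A>/im ) { "‹#{$1}›" }            # link: wrap title in ‹›
  txt = txt.gsub( /\s*<H4>([^<]+)<\/H4>\s*/im ) { "\n\n#### #{$1}\n\n" }   # heading 4
  txt.gsub( /<PRE>|<\/PRE>/i, '' )                                         # drop pre tags
end

txt = mini_html_to_txt( sample )
puts txt
```

If the heading step ran before the anchor step, `<h4><a name=...>` would not match `([^<]+)` and the heading would pass through unconverted, which appears to be the reason for the "let anchors go first" note in the source.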