rsssf 0.0.1 → 0.1.0
- checksums.yaml +4 -4
- data/.gemtest +0 -0
- data/Manifest.txt +11 -0
- data/README.md +171 -2
- data/lib/rsssf.rb +13 -0
- data/lib/rsssf/fetch.rb +80 -0
- data/lib/rsssf/html2txt.rb +157 -0
- data/lib/rsssf/page.rb +295 -0
- data/lib/rsssf/patch.rb +28 -0
- data/lib/rsssf/repo.rb +220 -0
- data/lib/rsssf/reports/page.rb +64 -0
- data/lib/rsssf/reports/schedule.rb +77 -0
- data/lib/rsssf/schedule.rb +31 -0
- data/lib/rsssf/utils.rb +75 -0
- data/lib/rsssf/version.rb +2 -2
- data/test/helper.rb +12 -0
- data/test/test_utils.rb +83 -0
- metadata +13 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 9c33c4bef56bf6e79c9339b5010a3129b98dc9fa
|
4
|
+
data.tar.gz: 4d97b1862dfbaab48282593947c6d7b28d047dfe
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 8da437c2a72c364f81d53ca1491cdfca4c11c0f187912702e653523cb54f636ec910f8690f9af690cf632f0c98335b9ab10983676d0e67e1a59f3c7d7a5919c7
|
7
|
+
data.tar.gz: b8a13466f301863e6ae0e4e449f1da08d5ae17a7481bf2055e12976f6f14fda79a8ebd38aa5cfb5f097d21b36e43bc322f362483d1c522a005c9cf49e576f732
|
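The updated `checksums.yaml` records SHA1 and SHA512 digests for `metadata.gz` and `data.tar.gz`. As a minimal sketch of how such digests could be checked with Ruby's standard `digest` library: the helper name `verify_digests` and the in-memory sample data are assumptions for illustration, not part of the gem or of RubyGems.

```ruby
require 'digest'

# Hypothetical helper (not part of rsssf or RubyGems): compare the digests
# of a byte string against expected hex strings, keyed by algorithm name.
def verify_digests( bytes, expected )
  actual = {
    'SHA1'   => Digest::SHA1.hexdigest( bytes ),
    'SHA512' => Digest::SHA512.hexdigest( bytes ),
  }
  expected.all? { |algo, hex| actual[algo] == hex }
end

# Usage with in-memory sample data; for a real gem you would pass
# File.binread( 'data.tar.gz' ) and the values listed in checksums.yaml.
sample   = 'hello'
expected = {
  'SHA1'   => Digest::SHA1.hexdigest( sample ),
  'SHA512' => Digest::SHA512.hexdigest( sample ),
}
puts verify_digests( sample, expected )        # digests match
puts verify_digests( sample + '!', expected )  # data was changed
```

Checking both algorithms mirrors the two sections of the file; a mismatch in either one makes the helper return `false`.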
data/.gemtest
ADDED
File without changes
|
data/Manifest.txt
CHANGED
@@ -3,4 +3,15 @@ Manifest.txt
|
|
3
3
|
README.md
|
4
4
|
Rakefile
|
5
5
|
lib/rsssf.rb
|
6
|
+
lib/rsssf/fetch.rb
|
7
|
+
lib/rsssf/html2txt.rb
|
8
|
+
lib/rsssf/page.rb
|
9
|
+
lib/rsssf/patch.rb
|
10
|
+
lib/rsssf/repo.rb
|
11
|
+
lib/rsssf/reports/page.rb
|
12
|
+
lib/rsssf/reports/schedule.rb
|
13
|
+
lib/rsssf/schedule.rb
|
14
|
+
lib/rsssf/utils.rb
|
6
15
|
lib/rsssf/version.rb
|
16
|
+
test/helper.rb
|
17
|
+
test/test_utils.rb
|
data/README.md
CHANGED
@@ -1,16 +1,172 @@
|
|
1
1
|
# rsssf - tools 'n' scripts for RSSSF (Rec.Sport.Soccer Statistics Foundation) archive data
|
2
2
|
|
3
3
|
|
4
|
-
* home :: [github.com/sportdb/rsssf](https://github.com/sportdb/
|
4
|
+
* home :: [github.com/sportdb/rsssf](https://github.com/sportdb/rsssf)
|
5
5
|
* bugs :: [github.com/sportdb/rsssf/issues](https://github.com/sportdb/rsssf/issues)
|
6
6
|
* gem :: [rubygems.org/gems/rsssf](https://rubygems.org/gems/rsssf)
|
7
7
|
* rdoc :: [rubydoc.info/gems/rsssf](http://rubydoc.info/gems/rsssf)
|
8
8
|
* forum :: [opensport](http://groups.google.com/group/opensport)
|
9
9
|
|
10
10
|
|
11
|
+
## What's the Rec.Sport.Soccer Statistics Foundation (RSSSF)?
|
12
|
+
|
13
|
+
The RSSSF collects and offers football (soccer) league tables, match results and more
|
14
|
+
from all over the world online in plain text.
|
15
|
+
|
16
|
+
Example:
|
17
|
+
|
18
|
+
```
|
19
|
+
Round 1
|
20
|
+
[May 25]
|
21
|
+
Vasco da Gama 1-0 Portuguesa
|
22
|
+
[Carlos Tenório 47']
|
23
|
+
Vitória 2-2 Internacional
|
24
|
+
[Maxi Biancucchi 2', Gabriel Paulista 11'; Diego Forlán 29', Fred 63']
|
25
|
+
Corinthians 1-1 Botafogo
|
26
|
+
[Paulinho 73'; Rafael Marques 24']
|
27
|
+
[May 26]
|
28
|
+
Grêmio 2-0 Náutico [played in Caxias do Sul-RS]
|
29
|
+
[Zé Roberto 15', Elano 70']
|
30
|
+
Ponte Preta 0-2 São Paulo
|
31
|
+
[Lúcio 9', Jádson 44'p]
|
32
|
+
Criciúma 3-1 Bahia
|
33
|
+
[Matheus Ferraz 45'+1', Lins 46', João Vítor 82'; Diones 72']
|
34
|
+
Santos 0-0 Flamengo [played in Brasília-DF]
|
35
|
+
Fluminense 2-1 Atlético/PR [played in Macaé-RJ]
|
36
|
+
[Rafael Sóbis 15'p, Samuel 53'; Manoel 28']
|
37
|
+
Cruzeiro 5-0 Goiás
|
38
|
+
[Diego Souza 5', Bruno Rodrigo 30', Nílton 40',79', Borges 42']
|
39
|
+
Coritiba 2-1 Atlético/MG
|
40
|
+
[Deivid 53', Arthur 90'+1'; Diego Tardelli 51']
|
41
|
+
```
|
42
|
+
|
43
|
+
[Find out more about the Rec.Sport.Soccer Statistics Foundation (RSSSF) »](http://www.rsssf.com)
|
44
|
+
|
45
|
+
|
46
|
+
|
11
47
|
## Usage
|
12
48
|
|
13
|
-
|
49
|
+
### Working with Pages
|
50
|
+
|
51
|
+
To fetch pages from the world wide web use:
|
52
|
+
|
53
|
+
``` ruby
|
54
|
+
page = RsssfPage.from_url( 'http://www.rsssf.com/tablese/eng2015.html')
|
55
|
+
```
|
56
|
+
|
57
|
+
Note: The `RsssfPageFetcher` will convert the rsssf archive page
|
58
|
+
from hypertext (HTML) to plain text e.g.
|
59
|
+
|
60
|
+
```
|
61
|
+
<hr>
|
62
|
+
<a href="#premier">Premier League</A><br>
|
63
|
+
<a href="#cups">Cup Tournaments</A><br>
|
64
|
+
<a href="#champ">Championship</A><br>
|
65
|
+
<a href="#first">Division 1</A><br>
|
66
|
+
<a href="#second">Division 2</A><br>
|
67
|
+
<a href="#conf">Conference</A>
|
68
|
+
<hr>
|
69
|
+
<h4><a name="premier">Premier League</A></h4>
|
70
|
+
<pre>
|
71
|
+
Final Table:
|
72
|
+
|
73
|
+
1.Chelsea 38 26 9 3 73-32 87 Champions
|
74
|
+
2.Manchester City 38 24 7 7 83-38 79
|
75
|
+
3.Arsenal 38 22 9 7 71-36 75
|
76
|
+
...
|
77
|
+
```
|
78
|
+
|
79
|
+
will become
|
80
|
+
|
81
|
+
```
|
82
|
+
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
|
83
|
+
‹Premier League›
|
84
|
+
‹Cup Tournaments›
|
85
|
+
‹Championship›
|
86
|
+
‹Division 1›
|
87
|
+
‹Division 2›
|
88
|
+
‹Conference›
|
89
|
+
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
|
90
|
+
|
91
|
+
#### Premier League
|
92
|
+
|
93
|
+
|
94
|
+
Final Table:
|
95
|
+
|
96
|
+
1.Chelsea 38 26 9 3 73-32 87 Champions
|
97
|
+
2.Manchester City 38 24 7 7 83-38 79
|
98
|
+
3.Arsenal 38 22 9 7 71-36 75
|
99
|
+
...
|
100
|
+
```
|
101
|
+
|
102
|
+
|
103
|
+
### Working with Repos
|
104
|
+
|
105
|
+
To fetch pages from the world wide web for many seasons in batch, set up and use a repo.
|
106
|
+
|
107
|
+
Step 1: List all archive pages
|
108
|
+
|
109
|
+
In the `tables/config.yml` list all archive pages to fetch. Example:
|
110
|
+
|
111
|
+
``` yaml
|
112
|
+
2010-11: tablese/eng2011.html
|
113
|
+
2011-12: tablese/eng2012.html
|
114
|
+
2012-13: tablese/eng2013.html
|
115
|
+
2013-14: tablese/eng2014.html
|
116
|
+
2014-15: tablese/eng2015.html
|
117
|
+
```
|
118
|
+
|
119
|
+
Step 2: Fetch all archive pages
|
120
|
+
|
121
|
+
Use:
|
122
|
+
|
123
|
+
``` ruby
|
124
|
+
repo = RsssfRepo.new( './eng-england', title: 'England (and Wales)' )
|
125
|
+
repo.fetch_pages
|
126
|
+
```
|
127
|
+
|
128
|
+
Bonus: To create a summary of all pages fetched (e.g. authors, last_updated, sections, etc.).
|
129
|
+
Use:
|
130
|
+
|
131
|
+
``` ruby
|
132
|
+
repo.make_pages_report
|
133
|
+
```
|
134
|
+
|
135
|
+
Example - `tables/README.md`:
|
136
|
+
|
137
|
+
|
138
|
+
football.db RSSSF Archive Data Summary for England (and Wales)
|
139
|
+
|
140
|
+
_Last Update: 2015-11-26 18:22:22 +0200_
|
141
|
+
|
142
|
+
| Season | File | Authors | Last Updated | Lines (Chars) | Sections |
|
143
|
+
| :------ | :------ | :------- | :----------- | ------------: | :------- |
|
144
|
+
| 2014-15 | [eng2015.txt](https://github.com/rsssf/eng-england/blob/master/tables/eng2015.txt) | Ian King and Karel Stokkermans | 4 Jun 2015 | 1249 (34138) | Premier League, Cup Tournaments, Championship, Division 1, Division 2, Conference |
|
145
|
+
| 2013-14 | [eng2014.txt](https://github.com/rsssf/eng-england/blob/master/tables/eng2014.txt) | Ian King and Karel Stokkermans | 5 Feb 2015 | 1254 (34294) | Premier League, Cup Tournaments, Championship, Division 1, Division 2, Conference |
|
146
|
+
| 2012-13 | [eng2013.txt](https://github.com/rsssf/eng-england/blob/master/tables/eng2013.txt) | Karel Stokkermans | 5 Feb 2015 | 1269 (34531) | Premiership, Cup Tournaments, Championship, Division 1, Division 2, Conference |
|
147
|
+
| 2011-12 | [eng2012.txt](https://github.com/rsssf/eng-england/blob/master/tables/eng2012.txt) | Karel Stokkermans | 5 Feb 2015 | 691 (21925) | Premiership, Cup Tournaments, Championship, Division 1, Division 2, Conference |
|
148
|
+
| 2010-11 | [eng2011.txt](https://github.com/rsssf/eng-england/blob/master/tables/eng2011.txt) | Ian King, Karel Stokkermans and Jan Schoenmakers | 5 Feb 2015 | 959 (37393) | Premiership, Cup Tournaments, Championship, Division 1, Division 2, Conference |
|
149
|
+
|
150
|
+
|
151
|
+
That's it.
|
152
|
+
|
153
|
+
|
154
|
+
### Preparing Archive Pages for SQL Database Imports (e.g. football.db)
|
155
|
+
|
156
|
+
To import match schedules (fixtures and results) and more using the football.db machinery
|
157
|
+
prepare "simple" single league (or cup) pages with standings tables etc. stripped out.
|
158
|
+
For example, to break-out the Premier League and FA Cup from the `eng2015.txt`
|
159
|
+
archive page use:
|
160
|
+
|
161
|
+
``` ruby
|
162
|
+
page = RsssfPage.from_url( 'http://www.rsssf.com/tablese/eng2015.html')
|
163
|
+
|
164
|
+
schedule = page.find_schedule( header: 'Premier League') ## returns RsssfSchedule obj
|
165
|
+
schedule.save( './1-premierleague.txt' )
|
166
|
+
|
167
|
+
schedule = page.find_schedule( header: 'FA Cup', cup: true )
|
168
|
+
schedule.save( './facup.txt' )
|
169
|
+
```
|
14
170
|
|
15
171
|
|
16
172
|
|
@@ -21,6 +177,19 @@ Just install the gem:
|
|
21
177
|
$ gem install rsssf
|
22
178
|
|
23
179
|
|
180
|
+
|
181
|
+
## RSSSF Datasets
|
182
|
+
|
183
|
+
See the rsssf github org for pre-processed ready-to-import datasets. Prepared repos include:
|
184
|
+
|
185
|
+
- [`eng-england`](https://github.com/rsssf/eng-england) - rsssf archive data for England - Premier League, Championship, FA Cup etc.
|
186
|
+
- [`de-deutschland`](https://github.com/rsssf/de-deutschland) - rsssf archive data for Germany (Deutschland) - Deutsche Bundesliga, 2. Bundesliga, 3. Liga, DFB Pokal etc.
|
187
|
+
- [`es-espana`](https://github.com/rsssf/es-espana) - rsssf archive data for España (Spain) - Primera División / La Liga, Copa del Rey, etc.
|
188
|
+
- [`at-austria`](https://github.com/rsssf/at-austria) - rsssf archive data for Austria (Österreich) - Österr. Bundesliga, Erste Liga, ÖFB Pokal etc.
|
189
|
+
- [`br-brazil`](https://github.com/rsssf/br-brazil) - rsssf archive data for Brazil (Brasil) - Campeonato Brasileiro Série A / Brasileirão etc.
|
190
|
+
- and more
|
191
|
+
|
192
|
+
|
24
193
|
## License
|
25
194
|
|
26
195
|
The `rsssf` scripts are dedicated to the public domain.
|
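The repo workflow in the README above maps season keys to archive page paths in `tables/config.yml`. A rough sketch of how such a mapping could be expanded into source URLs and local file names follows. The `BASE_URL` constant, the `archive_pages` helper and the `.txt` naming rule are assumptions inferred from the README examples, not the gem's actual `RsssfRepo` code.

```ruby
require 'yaml'

BASE_URL = 'http://www.rsssf.com'   # assumption: taken from the README example URL

# Hypothetical helper: expand a season => path mapping into
# [season, source url, local text file name] triples.
def archive_pages( config_yaml )
  YAML.load( config_yaml ).map do |season, path|
    basename = File.basename( path, '.html' )
    [season.to_s, "#{BASE_URL}/#{path}", "#{basename}.txt"]
  end
end

config = <<CONFIG
2013-14: tablese/eng2014.html
2014-15: tablese/eng2015.html
CONFIG

archive_pages( config ).each do |season, url, file|
  puts "#{season}: #{url} => #{file}"
end
```

Keeping the mapping in YAML means adding a season is a one-line config change rather than a code change.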
data/lib/rsssf.rb
CHANGED
@@ -14,6 +14,19 @@ require 'fetcher' ## used for Fetcher::Worker.new.fetch etc.
|
|
14
14
|
## our own code
|
15
15
|
require 'rsssf/version' # note: let version always go first
|
16
16
|
|
17
|
+
require 'rsssf/utils' # include Utils - goes first
|
18
|
+
require 'rsssf/html2txt' # include Filters - goes first
|
19
|
+
|
20
|
+
require 'rsssf/fetch'
|
21
|
+
require 'rsssf/page'
|
22
|
+
require 'rsssf/schedule'
|
23
|
+
require 'rsssf/patch'
|
24
|
+
|
25
|
+
require 'rsssf/reports/schedule'
|
26
|
+
require 'rsssf/reports/page'
|
27
|
+
|
28
|
+
require 'rsssf/repo'
|
29
|
+
|
17
30
|
|
18
31
|
|
19
32
|
|
data/lib/rsssf/fetch.rb
ADDED
@@ -0,0 +1,80 @@
|
|
1
|
+
# encoding: utf-8
|
2
|
+
|
3
|
+
module Rsssf
|
4
|
+
|
5
|
+
class PageFetcher
|
6
|
+
|
7
|
+
include Filters # e.g. html_to_txt, sanitize etc.
|
8
|
+
|
9
|
+
|
10
|
+
def initialize
|
11
|
+
@worker = Fetcher::Worker.new
|
12
|
+
end
|
13
|
+
|
14
|
+
def fetch( src_url )
|
15
|
+
|
16
|
+
## note: assume plain 7-bit ascii for now
|
17
|
+
## -- assume rsssf uses ISO_8859_15 (updated version of ISO_8859_1) -- does NOT use utf-8 character encoding!!!
|
18
|
+
html = @worker.read( src_url )
|
19
|
+
|
20
|
+
### todo/fix: first check if html is all ascii-7bit e.g.
|
21
|
+
## includes only chars from 0 to 127!!!
|
22
|
+
|
23
|
+
## normalize newlines
|
24
|
+
## remove \r (carriage return) used by Windows; just use \n (new line)
|
25
|
+
html = html.gsub( "\r", '' )
|
26
|
+
|
27
|
+
## note:
|
28
|
+
## assume (default) to ISO 8859-15 (an updated version of ISO 8859-1) for now
|
29
|
+
##
|
30
|
+
## other possible alternatives - try:
|
31
|
+
## - Windows CP 1252 or
|
32
|
+
## - ISO 8859-2 (for eastern european languages )
|
33
|
+
##
|
34
|
+
## note: German umlauts use the same code (int)
|
35
|
+
## in ISO 8859-1/15 and 2 and Windows CP1252 (other chars ARE different!!!)
|
36
|
+
|
37
|
+
html = html.force_encoding( Encoding::ISO_8859_15 )
|
38
|
+
html = html.encode( Encoding::UTF_8 ) # try conversion to utf-8
|
39
|
+
|
40
|
+
## check for html entities
|
41
|
+
html = html.gsub( "&auml;", 'ä' )
|
42
|
+
html = html.gsub( "&ouml;", 'ö' )
|
43
|
+
html = html.gsub( "&uuml;", 'ü' )
|
44
|
+
html = html.gsub( "&Auml;", 'Ä' )
|
45
|
+
html = html.gsub( "&Ouml;", 'Ö' )
|
46
|
+
html = html.gsub( "&Uuml;", 'Ü' )
|
47
|
+
html = html.gsub( "&szlig;", 'ß' )
|
48
|
+
|
49
|
+
html = html.gsub( "&oulm;", 'ö' ) ## support typo in entity (ö)
|
50
|
+
html = html.gsub( "&slig;", "ß" ) ## support typo in entity (ß)
|
51
|
+
|
52
|
+
html = html.gsub( "&Eacute;", 'É' )
|
53
|
+
html = html.gsub( "&oslash;", 'ø' )
|
54
|
+
|
55
|
+
## check for more entities
|
56
|
+
html = html.gsub( /&[^;]+;/) do |match|
|
57
|
+
puts "*** found unencoded html entity #{match}"
|
58
|
+
match ## pass through as is (1:1)
|
59
|
+
end
|
60
|
+
## todo/fix: add more entities
|
61
|
+
|
62
|
+
|
63
|
+
txt = html_to_txt( html )
|
64
|
+
|
65
|
+
header = <<EOS
|
66
|
+
<!--
|
67
|
+
source: #{src_url}
|
68
|
+
-->
|
69
|
+
|
70
|
+
EOS
|
71
|
+
|
72
|
+
header+txt ## return txt w/ header
|
73
|
+
end ## method fetch
|
74
|
+
|
75
|
+
end ## class PageFetcher
|
76
|
+
end ## module Rsssf
|
77
|
+
|
78
|
+
## add (shortcut) alias
|
79
|
+
RsssfPageFetcher = Rsssf::PageFetcher
|
80
|
+
|
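The `fetch` method above relabels the downloaded bytes as ISO-8859-15, transcodes them to UTF-8, and then replaces a handful of named HTML entities. The sketch below shows the same two steps on a small in-memory string. The sample bytes and the `ENTITIES` lookup table are assumptions for illustration; the table is a swapped-in alternative to the gem's chain of `gsub` calls, not its actual code.

```ruby
# Bytes as they might arrive from an ISO-8859-15 encoded page:
# 0xFC is "ü" and 0xA4 is the euro sign in ISO-8859-15.
raw = "M\xFCnchen 5 \xA4 K&ouml;ln".b

# Relabel the bytes as ISO-8859-15 (no bytes change), then transcode to UTF-8.
html = raw.force_encoding( Encoding::ISO_8859_15 ).encode( Encoding::UTF_8 )

# Replace a few named entities via a lookup table (hypothetical, incomplete);
# entities missing from the table pass through unchanged.
ENTITIES = { '&ouml;' => 'ö', '&uuml;' => 'ü', '&szlig;' => 'ß' }
html = html.gsub( /&[^;\s]+;/ ) { |entity| ENTITIES.fetch( entity, entity ) }

puts html
puts html.encoding
```

Note that `force_encoding` only changes the label on the bytes, while `encode` rewrites them; calling `encode` alone on a string already labelled UTF-8 would not convert anything.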
data/lib/rsssf/html2txt.rb
ADDED
@@ -0,0 +1,157 @@
|
|
1
|
+
# encoding: utf-8
|
2
|
+
|
3
|
+
module Rsssf
|
4
|
+
module Filters
|
5
|
+
|
6
|
+
def html_to_txt( html )
|
7
|
+
|
8
|
+
###
|
9
|
+
# todo: check if any tags (still) present??
|
10
|
+
|
11
|
+
|
12
|
+
## cut off everything before body
|
13
|
+
html = html.sub( /.+?<BODY>\s*/im, '' )
|
14
|
+
|
15
|
+
## cut off everything after body (closing)
|
16
|
+
html = html.sub( /<\/BODY>.*/im, '' )
|
17
|
+
|
18
|
+
|
19
|
+
## remove cite
|
20
|
+
html = html.gsub( /<CITE>([^<]+)<\/CITE>/im ) do |_|
|
21
|
+
puts " remove cite >#{$1}<"
|
22
|
+
"#{$1}"
|
23
|
+
end
|
24
|
+
|
25
|
+
html = html.gsub( /\s*<HR>\s*/im ) do |match|
|
26
|
+
match = match.gsub( "\n", '$$' ) ## make newlines visible for debugging
|
27
|
+
puts " replace horizontal rule (hr) - >#{match}<"
|
28
|
+
"\n=-=-=-=-=-=-=-=-=-=-=-=-=-=-=\n" ## check what hr to use - . - . - or =-=-=-= or something distinct?
|
29
|
+
end
|
30
|
+
|
31
|
+
## replace break (br)
|
32
|
+
## note: do NOT use m/multiline for now - why? why not??
|
33
|
+
html = html.gsub( /<BR>\s*/i ) do |match| ## note: include (swallow) "extra" newline
|
34
|
+
match = match.gsub( "\n", '$$' ) ## make newlines visible for debugging
|
35
|
+
puts " replace break (br) - >#{match}<"
|
36
|
+
"\n"
|
37
|
+
end
|
38
|
+
|
39
|
+
## remove anchors (a name)
|
40
|
+
html = html.gsub( /<A NAME[^>]*>(.+?)<\/A>/im ) do |match| ## note: use .+? non-greedy match
|
41
|
+
title = $1.to_s ## note: "save" capture first; gets replaced by gsub (next regex call)
|
42
|
+
match = match.gsub( "\n", '$$' ) ## make newlines visible for debugging
|
43
|
+
puts " replace anchor (a) name >#{title}< - >#{match}<"
|
44
|
+
"#{title}"
|
45
|
+
end
|
46
|
+
|
47
|
+
## remove anchors (a href)
|
48
|
+
# note: heading 4 includes anchor (thus, let anchors go first)
|
49
|
+
# note: <a \newline href is used for authors email - thus incl. support for newline as space
|
50
|
+
html = html.gsub( /<A\s+HREF[^>]*>(.+?)<\/A>/im ) do |_| ## note: use .+? non-greedy match
|
51
|
+
puts " replace anchor (a) href >#{$1}<"
|
52
|
+
"‹#{$1}›"
|
53
|
+
end
|
54
|
+
|
55
|
+
## replace paragraph (p)
|
56
|
+
html = html.gsub( /\s*<P>\s*/im ) do |match| ## note: include (swallow) "extra" newline
|
57
|
+
match = match.gsub( "\n", '$$' ) ## make newlines visible for debugging
|
58
|
+
puts " replace paragraph (p) - >#{match}<"
|
59
|
+
"\n\n"
|
60
|
+
end
|
61
|
+
html = html.gsub( /<\/P>/i, '' ) ## replace paragraph (p) closing w/ nothing for now
|
62
|
+
|
63
|
+
## remove i
|
64
|
+
html = html.gsub( /<I>([^<]+)<\/I>/im ) do |_|
|
65
|
+
puts " remove italic (i) >#{$1}<"
|
66
|
+
"#{$1}"
|
67
|
+
end
|
68
|
+
|
69
|
+
|
70
|
+
## heading 2
|
71
|
+
html = html.gsub( /\s*<H2>([^<]+)<\/H2>\s*/im ) do |_|
|
72
|
+
puts " replace heading 2 (h2) >#{$1}<"
|
73
|
+
"\n\n## #{$1}\n\n" ## note: make sure to always add two newlines
|
74
|
+
end
|
75
|
+
|
76
|
+
## heading 4
|
77
|
+
html = html.gsub( /\s*<H4>([^<]+)<\/H4>\s*/im ) do |_|
|
78
|
+
puts " replace heading 4 (h4) >#{$1}<"
|
79
|
+
"\n\n#### #{$1}\n\n" ## note: make sure to always add two newlines
|
80
|
+
end
|
81
|
+
|
82
|
+
|
83
|
+
## replace b w/ ** (markdown bold) - note: might include anchors (thus, call after anchors)
|
84
|
+
html = html.gsub( /<B>([^<]+)<\/B>/im ) do |_|
|
85
|
+
puts " remove bold (b) >#{$1}<"
|
86
|
+
"**#{$1}**"
|
87
|
+
end
|
88
|
+
|
89
|
+
## replace preformatted (pre)
|
90
|
+
html = html.gsub( /<PRE>|<\/PRE>/i ) do |_|
|
91
|
+
puts " replace preformatted (pre)"
|
92
|
+
'' # replace w/ nothing for now (keep surrounding newlines)
|
93
|
+
end
|
94
|
+
|
95
|
+
=begin
|
96
|
+
puts
|
97
|
+
puts
|
98
|
+
puts "html:"
|
99
|
+
puts html[0..2000]
|
100
|
+
puts "-- snip --"
|
101
|
+
puts html[-1000..-1] ## print last thousand chars
|
102
|
+
=end
|
103
|
+
|
104
|
+
|
105
|
+
## cleanup whitespaces
|
106
|
+
## todo/fix: convert newlines to spaces first
|
107
|
+
## and then collapse spaces etc.!!!
|
108
|
+
txt = ''
|
109
|
+
html.each_line do |line|
|
110
|
+
line = line.gsub( "\t", '  ' ) # replace all tabs w/ two spaces for now
|
111
|
+
line = line.rstrip # remove trailing whitespace (incl. newline/formfeed)
|
112
|
+
|
113
|
+
txt << line
|
114
|
+
txt << "\n"
|
115
|
+
end
|
116
|
+
|
117
|
+
### remove emails etc.
|
118
|
+
txt = sanitize( txt )
|
119
|
+
|
120
|
+
txt
|
121
|
+
end # method html_to_txt
|
122
|
+
|
123
|
+
|
124
|
+
|
125
|
+
def sanitize( txt )
|
126
|
+
### remove emails for (spam/privacy) protection
|
127
|
+
## e.g. (selamm@example.es)
|
128
|
+
## (buuu@mscs.dal.ca)
|
129
|
+
## (kaxx@rsssf.com)
|
130
|
+
## (Manu_Maya@yakoo.co)
|
131
|
+
|
132
|
+
## note add support for optional ‹› enclosure (used by html2txt converted a href :mailto links)
|
133
|
+
## e.g. (‹selamm@example.es›)
|
134
|
+
|
135
|
+
email_pattern = "\\(‹?[a-z][a-z0-9_]+@[a-z]+(\\.[a-z]+)+›?\\)" ## note: just a string; needs to escape \\ twice!!!
|
136
|
+
|
137
|
+
## check for "free-standing e.g. on its own line" emails only for now
|
138
|
+
txt = txt.gsub( /\n#{email_pattern}\n/i ) do |match|
|
139
|
+
puts "removing (free-standing) email >#{match}<"
|
140
|
+
"\n" # return empty line
|
141
|
+
end
|
142
|
+
|
143
|
+
txt = txt.gsub( /#{email_pattern}/i ) do |match|
|
144
|
+
puts "remove email >#{match}<"
|
145
|
+
''
|
146
|
+
end
|
147
|
+
|
148
|
+
txt
|
149
|
+
end # method sanitize
|
150
|
+
|
151
|
+
end # module Filters
|
152
|
+
end # module Rsssf
|
153
|
+
|
154
|
+
## add (shortcut) alias
|
155
|
+
RsssfFilters = Rsssf::Filters
|
156
|
+
|
157
|
+
|
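The `sanitize` method above removes parenthesised email addresses, optionally wrapped in `‹›` by the anchor conversion. A minimal sketch of that idea, using a regex literal instead of an interpolated string, is shown here. The `EMAIL_PATTERN` constant, the `strip_emails` helper and the sample addresses are assumptions for illustration, not the gem's code.

```ruby
# Same shape as the pattern in sanitize: a parenthesised address with an
# optional ‹› enclosure. A regex literal avoids the double escaping that
# the string form needs.
EMAIL_PATTERN = /\(‹?[a-z][a-z0-9_]+@[a-z]+(\.[a-z]+)+›?\)/i

# Hypothetical helper: drop every match, leaving surrounding text as is.
def strip_emails( txt )
  txt.gsub( EMAIL_PATTERN, '' )
end

sample = "Prepared by Jane Doe (‹jane_doe@example.com›)\n" \
         "Thanks to John (john@example.org) for updates"
puts strip_emails( sample )
```

Like the original, this pattern is deliberately narrow: addresses with dots or hyphens in the local part, or digits in the domain, are not matched and stay in the text.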