rsssf 0.0.1 → 0.1.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 0e916c424ef3d9ebd62d5c2fb1e09122806c5765
4
- data.tar.gz: 232f06dea9fd3f878e554b1d4a2029531468f842
3
+ metadata.gz: 9c33c4bef56bf6e79c9339b5010a3129b98dc9fa
4
+ data.tar.gz: 4d97b1862dfbaab48282593947c6d7b28d047dfe
5
5
  SHA512:
6
- metadata.gz: 898399cdcedd7890145278b17d6d7ab5d8f305160513def86152b69a345de5c19ddb4e2e770753c091216c203bf31a3bd2bd70ca33b0087a8d003292cdd97acf
7
- data.tar.gz: 0b76813d36aee3c59a83c2dfce1d916ee4677da585d52464a531dc2563e5bf96d33b345afe766bb4af962c03f4abcfed0a5b8523e485c85c7e85e406fd9ab8c6
6
+ metadata.gz: 8da437c2a72c364f81d53ca1491cdfca4c11c0f187912702e653523cb54f636ec910f8690f9af690cf632f0c98335b9ab10983676d0e67e1a59f3c7d7a5919c7
7
+ data.tar.gz: b8a13466f301863e6ae0e4e449f1da08d5ae17a7481bf2055e12976f6f14fda79a8ebd38aa5cfb5f097d21b36e43bc322f362483d1c522a005c9cf49e576f732
File without changes
@@ -3,4 +3,15 @@ Manifest.txt
3
3
  README.md
4
4
  Rakefile
5
5
  lib/rsssf.rb
6
+ lib/rsssf/fetch.rb
7
+ lib/rsssf/html2txt.rb
8
+ lib/rsssf/page.rb
9
+ lib/rsssf/patch.rb
10
+ lib/rsssf/repo.rb
11
+ lib/rsssf/reports/page.rb
12
+ lib/rsssf/reports/schedule.rb
13
+ lib/rsssf/schedule.rb
14
+ lib/rsssf/utils.rb
6
15
  lib/rsssf/version.rb
16
+ test/helper.rb
17
+ test/test_utils.rb
data/README.md CHANGED
@@ -1,16 +1,172 @@
1
1
  # rsssf - tools 'n' scripts for RSSSF (Rec.Sport.Soccer Statistics Foundation) archive data
2
2
 
3
3
 
4
- * home :: [github.com/sportdb/rsssf](https://github.com/sportdb/mrhyde)
4
+ * home :: [github.com/sportdb/rsssf](https://github.com/sportdb/rsssf)
5
5
  * bugs :: [github.com/sportdb/rsssf/issues](https://github.com/sportdb/rsssf/issues)
6
6
  * gem :: [rubygems.org/gems/rsssf](https://rubygems.org/gems/rsssf)
7
7
  * rdoc :: [rubydoc.info/gems/rsssf](http://rubydoc.info/gems/rsssf)
8
8
  * forum :: [opensport](http://groups.google.com/group/opensport)
9
9
 
10
10
 
11
+ ## What's the Rec.Sport.Soccer Statistics Foundation (RSSSF)?
12
+
13
+ The RSSSF collects and offers football (soccer) league tables, match results and more
14
+ from all over the world online in plain text.
15
+
16
+ Example:
17
+
18
+ ```
19
+ Round 1
20
+ [May 25]
21
+ Vasco da Gama 1-0 Portuguesa
22
+ [Carlos Tenório 47']
23
+ Vitória 2-2 Internacional
24
+ [Maxi Biancucchi 2', Gabriel Paulista 11'; Diego Forlán 29', Fred 63']
25
+ Corinthians 1-1 Botafogo
26
+ [Paulinho 73'; Rafael Marques 24']
27
+ [May 26]
28
+ Grêmio 2-0 Náutico [played in Caxias do Sul-RS]
29
+ [Zé Roberto 15', Elano 70']
30
+ Ponte Preta 0-2 São Paulo
31
+ [Lúcio 9', Jádson 44'p]
32
+ Criciúma 3-1 Bahia
33
+ [Matheus Ferraz 45'+1', Lins 46', João Vítor 82'; Diones 72']
34
+ Santos 0-0 Flamengo [played in Brasília-DF]
35
+ Fluminense 2-1 Atlético/PR [played in Macaé-RJ]
36
+ [Rafael Sóbis 15'p, Samuel 53'; Manoel 28']
37
+ Cruzeiro 5-0 Goiás
38
+ [Diego Souza 5', Bruno Rodrigo 30', Nílton 40',79', Borges 42']
39
+ Coritiba 2-1 Atlético/MG
40
+ [Deivid 53', Arthur 90'+1'; Diego Tardelli 51']
41
+ ```
42
+
43
+ [Find out more about the Rec.Sport.Soccer Statistics Foundation (RSSSF) »](http://www.rsssf.com)
44
+
45
+
46
+
11
47
  ## Usage
12
48
 
13
- To be done
49
+ ### Working with Pages
50
+
51
+ To fetch pages from the world wide web use:
52
+
53
+ ``` ruby
54
+ page = RsssfPage.from_url( 'http://www.rsssf.com/tablese/eng2015.html')
55
+ ```
56
+
57
+ Note: The `RsssfPageFetcher` will convert the rsssf archive page
58
+ from hypertext (HTML) to plain text e.g.
59
+
60
+ ```
61
+ <hr>
62
+ <a href="#premier">Premier League</A><br>
63
+ <a href="#cups">Cup Tournaments</A><br>
64
+ <a href="#champ">Championship</A><br>
65
+ <a href="#first">Division 1</A><br>
66
+ <a href="#second">Division 2</A><br>
67
+ <a href="#conf">Conference</A>
68
+ <hr>
69
+ <h4><a name="premier">Premier League</A></h4>
70
+ <pre>
71
+ Final Table:
72
+
73
+ 1.Chelsea 38 26 9 3 73-32 87 Champions
74
+ 2.Manchester City 38 24 7 7 83-38 79
75
+ 3.Arsenal 38 22 9 7 71-36 75
76
+ ...
77
+ ```
78
+
79
+ will become
80
+
81
+ ```
82
+ =-=-=-=-=-=-=-=-=-=-=-=-=-=-=
83
+ ‹Premier League›
84
+ ‹Cup Tournaments›
85
+ ‹Championship›
86
+ ‹Division 1›
87
+ ‹Division 2›
88
+ ‹Conference›
89
+ =-=-=-=-=-=-=-=-=-=-=-=-=-=-=
90
+
91
+ #### Premier League
92
+
93
+
94
+ Final Table:
95
+
96
+ 1.Chelsea 38 26 9 3 73-32 87 Champions
97
+ 2.Manchester City 38 24 7 7 83-38 79
98
+ 3.Arsenal 38 22 9 7 71-36 75
99
+ ...
100
+ ```
101
+
102
+
103
+ ### Working with Repos
104
+
105
+ To fetch pages from the world wide web for many seasons in batch, set up and use a repo.
106
+
107
+ Step 1: List all archive pages
108
+
109
+ In the `tables/config.yml` list all archive pages to fetch. Example:
110
+
111
+ ``` yaml
112
+ 2010-11: tablese/eng2011.html
113
+ 2011-12: tablese/eng2012.html
114
+ 2012-13: tablese/eng2013.html
115
+ 2013-14: tablese/eng2014.html
116
+ 2014-15: tablese/eng2015.html
117
+ ```
118
+
119
+ Step 2: Fetch all archive pages
120
+
121
+ Use:
122
+
123
+ ``` ruby
124
+ repo = RsssfRepo.new( './eng-england', title: 'England (and Wales)' )
125
+ repo.fetch_pages
126
+ ```
127
+
128
+ Bonus: To create a summary of all pages fetched (e.g. authors, last_updated, sections, etc.)
129
+ use:
130
+
131
+ ``` ruby
132
+ repo.make_pages_report
133
+ ```
134
+
135
+ Example - `tables/README.md`:
136
+
137
+
138
+ football.db RSSSF Archive Data Summary for England (and Wales)
139
+
140
+ _Last Update: 2015-11-26 18:22:22 +0200_
141
+
142
+ | Season | File | Authors | Last Updated | Lines (Chars) | Sections |
143
+ | :------ | :------ | :------- | :----------- | ------------: | :------- |
144
+ | 2014-15 | [eng2015.txt](https://github.com/rsssf/eng-england/blob/master/tables/eng2015.txt) | Ian King and Karel Stokkermans | 4 Jun 2015 | 1249 (34138) | Premier League, Cup Tournaments, Championship, Division 1, Division 2, Conference |
145
+ | 2013-14 | [eng2014.txt](https://github.com/rsssf/eng-england/blob/master/tables/eng2014.txt) | Ian King and Karel Stokkermans | 5 Feb 2015 | 1254 (34294) | Premier League, Cup Tournaments, Championship, Division 1, Division 2, Conference |
146
+ | 2012-13 | [eng2013.txt](https://github.com/rsssf/eng-england/blob/master/tables/eng2013.txt) | Karel Stokkermans | 5 Feb 2015 | 1269 (34531) | Premiership, Cup Tournaments, Championship, Division 1, Division 2, Conference |
147
+ | 2011-12 | [eng2012.txt](https://github.com/rsssf/eng-england/blob/master/tables/eng2012.txt) | Karel Stokkermans | 5 Feb 2015 | 691 (21925) | Premiership, Cup Tournaments, Championship, Division 1, Division 2, Conference |
148
+ | 2010-11 | [eng2011.txt](https://github.com/rsssf/eng-england/blob/master/tables/eng2011.txt) | Ian King, Karel Stokkermans and Jan Schoenmakers | 5 Feb 2015 | 959 (37393) | Premiership, Cup Tournaments, Championship, Division 1, Division 2, Conference |
149
+
150
+
151
+ That's it.
152
+
153
+
154
+ ### Preparing Archive Pages for SQL Database Imports (e.g. football.db)
155
+
156
+ To import match schedules (fixtures and results) and more using the football.db machinery
157
+ prepare "simple" single league (or cup) pages with standings tables etc. stripped out.
158
+ For example, to break out the Premier League and FA Cup from the `eng2015.txt`
159
+ archive page use:
160
+
161
+ ``` ruby
162
+ page = RsssfPage.from_url( 'http://www.rsssf.com/tablese/eng2015.html')
163
+
164
+ schedule = page.find_schedule( header: 'Premier League') ## returns RsssfSchedule obj
165
+ schedule.save( './1-premierleague.txt' )
166
+
167
+ schedule = page.find_schedule( header: 'FA Cup', cup: true )
168
+ schedule.save( './facup.txt' )
169
+ ```
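Note on "Working with Repos", Step 1: the README does not show how the relative paths in `tables/config.yml` are turned into full page addresses. The following is a minimal sketch, assuming the paths are simply appended to `http://www.rsssf.com/` (the host used in the `RsssfPage.from_url` examples). `BASE_URL` and the `urls` hash are illustrative names, not part of the gem.

```ruby
require 'yaml'

# Assumption: archive paths are relative to the rsssf.com root
BASE_URL = 'http://www.rsssf.com/'

# Two entries copied from the README example config
config_text = <<~TXT
  2010-11: tablese/eng2011.html
  2014-15: tablese/eng2015.html
TXT

# Parse the season => relative path mapping
config = YAML.safe_load( config_text )

# Build one absolute url per season
urls = config.map { |season, path| [season.to_s, "#{BASE_URL}#{path}"] }.to_h

urls.each { |season, url| puts "#{season} => #{url}" }
```

Each resulting url could then be passed to `RsssfPage.from_url`, which is roughly what a batch fetch has to do for every configured season.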
14
170
 
15
171
 
16
172
 
@@ -21,6 +177,19 @@ Just install the gem:
21
177
  $ gem install rsssf
22
178
 
23
179
 
180
+
181
+ ## RSSSF Datasets
182
+
183
+ See the rsssf GitHub org for pre-processed ready-to-import datasets. Prepared repos include:
184
+
185
+ - [`eng-england`](https://github.com/rsssf/eng-england) - rsssf archive data for England - Premier League, Championship, FA Cup etc.
186
+ - [`de-deutschland`](https://github.com/rsssf/de-deutschland) - rsssf archive data for Germany (Deutschland) - Deutsche Bundesliga, 2. Bundesliga, 3. Liga, DFB Pokal etc.
187
+ - [`es-espana`](https://github.com/rsssf/es-espana) - rsssf archive data for España (Spain) - Primera División / La Liga, Copa del Rey, etc.
188
+ - [`at-austria`](https://github.com/rsssf/at-austria) - rsssf archive data for Austria (Österreich) - Österr. Bundesliga, Erste Liga, ÖFB Pokal etc.
189
+ - [`br-brazil`](https://github.com/rsssf/br-brazil) - rsssf archive data for Brazil (Brasil) - Campeonato Brasileiro Série A / Brasileirão etc.
190
+ - and more
191
+
192
+
24
193
  ## License
25
194
 
26
195
  The `rsssf` scripts are dedicated to the public domain.
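Note on the README's HTML to plain text example ("Working with Pages"): the sketch below reproduces only the table-of-contents part of that conversion (horizontal rules, line breaks and links). It is a reduced illustration built from the regular expressions in `lib/rsssf/html2txt.rb`, without the debug output. `RULE` and `links_and_rules_to_txt` are names made up for this example.

```ruby
# encoding: utf-8

# Text separator used in place of <hr>
RULE = '=-=-=-=-=-=-=-=-=-=-=-=-=-=-='

def links_and_rules_to_txt( html )
  # horizontal rule => text separator (swallows surrounding whitespace)
  txt = html.gsub( /\s*<HR>\s*/i, "\n#{RULE}\n" )
  # line break => newline (swallows the "extra" newline after <br>)
  txt = txt.gsub( /<BR>\s*/i, "\n" )
  # link => ‹title› (non-greedy match, case-insensitive tags)
  txt.gsub( /<A\s+HREF[^>]*>(.+?)<\/A>/im ) { "‹#{$1}›" }
end

html = <<~HTML
  <hr>
  <a href="#premier">Premier League</A><br>
  <a href="#cups">Cup Tournaments</A>
  <hr>
HTML

puts links_and_rules_to_txt( html )
```

The order matters here in the same way as in the gem: rules and breaks are replaced first so that the link titles end up one per line.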
@@ -14,6 +14,19 @@ require 'fetcher' ## used for Fetcher::Worker.new.fetch etc.
14
14
  ## our own code
15
15
  require 'rsssf/version' # note: let version always go first
16
16
 
17
+ require 'rsssf/utils' # include Utils - goes first
18
+ require 'rsssf/html2txt' # include Filters - goes first
19
+
20
+ require 'rsssf/fetch'
21
+ require 'rsssf/page'
22
+ require 'rsssf/schedule'
23
+ require 'rsssf/patch'
24
+
25
+ require 'rsssf/reports/schedule'
26
+ require 'rsssf/reports/page'
27
+
28
+ require 'rsssf/repo'
29
+
17
30
 
18
31
 
19
32
 
@@ -0,0 +1,80 @@
1
+ # encoding: utf-8
2
+
3
+ module Rsssf
4
+
5
+ class PageFetcher
6
+
7
+ include Filters # e.g. html_to_txt, sanitize etc.
8
+
9
+
10
+ def initialize
11
+ @worker = Fetcher::Worker.new
12
+ end
13
+
14
+ def fetch( src_url )
15
+
16
+ ## note: assume plain 7-bit ascii for now
17
+ ## -- assume rsssf uses ISO_8859_15 (updated version of ISO_8859_1) -- does NOT use utf-8 character encoding!!!
18
+ html = @worker.read( src_url )
19
+
20
+ ### todo/fix: first check if html is all ascii-7bit e.g.
21
+ ## includes only chars from 0 to 127!!!
22
+
23
+ ## normalize newlines
24
+ ## remove \r (carriage return) used by Windows; just use \n (new line)
25
+ html = html.gsub( "\r", '' )
26
+
27
+ ## note:
28
+ ## assume (default) to ISO 8859-15 (an updated version of ISO 8859-1) for now
29
+ ##
30
+ ## other possible alternatives - try:
31
+ ## - Windows CP 1252 or
32
+ ## - ISO 8859-2 (for eastern european languages )
33
+ ##
34
+ ## note: german umlaut use the same code (int)
35
+ ## in ISO 8859-1/15 and 2 and Windows CP1252 (other chars ARE different!!!)
36
+
37
+ html = html.force_encoding( Encoding::ISO_8859_15 )
38
+ html = html.encode( Encoding::UTF_8 ) # try conversion to utf-8
39
+
40
+ ## check for html entities
41
+ html = html.gsub( "&auml;", 'ä' )
42
+ html = html.gsub( "&ouml;", 'ö' )
43
+ html = html.gsub( "&uuml;", 'ü' )
44
+ html = html.gsub( "&Auml;", 'Ä' )
45
+ html = html.gsub( "&Ouml;", 'Ö' )
46
+ html = html.gsub( "&Uuml;", 'Ü' )
47
+ html = html.gsub( "&szlig;", 'ß' )
48
+
49
+ html = html.gsub( "&oulm;", 'ö' ) ## support typo in entity (&ouml;)
50
+ html = html.gsub( "&slig;", "ß" ) ## support typo in entity (&szlig;)
51
+
52
+ html = html.gsub( "&Eacute;", 'É' )
53
+ html = html.gsub( "&oslash;", 'ø' )
54
+
55
+ ## check for more entities
56
+ html = html.gsub( /&[^;]+;/) do |match|
57
+ puts "*** found unencoded html entity #{match}"
58
+ match ## pass through as is (1:1)
59
+ end
60
+ ## todo/fix: add more entities
61
+
62
+
63
+ txt = html_to_txt( html )
64
+
65
+ header = <<EOS
66
+ <!--
67
+ source: #{src_url}
68
+ -->
69
+
70
+ EOS
71
+
72
+ header+txt ## return txt w/ header
73
+ end ## method fetch
74
+
75
+ end ## class PageFetcher
76
+ end ## module Rsssf
77
+
78
+ ## add (shortcut) alias
79
+ RsssfPageFetcher = Rsssf::PageFetcher
80
+
@@ -0,0 +1,157 @@
1
+ # encoding: utf-8
2
+
3
+ module Rsssf
4
+ module Filters
5
+
6
+ def html_to_txt( html )
7
+
8
+ ###
9
+ # todo: check if any tags (still) present??
10
+
11
+
12
+ ## cut off everything before body
13
+ html = html.sub( /.+?<BODY>\s*/im, '' )
14
+
15
+ ## cut off everything after body (closing)
16
+ html = html.sub( /<\/BODY>.*/im, '' )
17
+
18
+
19
+ ## remove cite
20
+ html = html.gsub( /<CITE>([^<]+)<\/CITE>/im ) do |_|
21
+ puts " remove cite >#{$1}<"
22
+ "#{$1}"
23
+ end
24
+
25
+ html = html.gsub( /\s*<HR>\s*/im ) do |match|
26
+ match = match.gsub( "\n", '$$' ) ## make newlines visible for debugging
27
+ puts " replace horizontal rule (hr) - >#{match}<"
28
+ "\n=-=-=-=-=-=-=-=-=-=-=-=-=-=-=\n" ## check what hr to use - . - . - or =-=-=-= or something distinct?
29
+ end
30
+
31
+ ## replace break (br)
32
+ ## note: do NOT use m/multiline for now - why? why not??
33
+ html = html.gsub( /<BR>\s*/i ) do |match| ## note: include (swallow) "extra" newline
34
+ match = match.gsub( "\n", '$$' ) ## make newlines visible for debugging
35
+ puts " replace break (br) - >#{match}<"
36
+ "\n"
37
+ end
38
+
39
+ ## remove anchors (a name)
40
+ html = html.gsub( /<A NAME[^>]*>(.+?)<\/A>/im ) do |match| ## note: use .+? non-greedy match
41
+ title = $1.to_s ## note: "save" capture first; gets replaced by gsub (next regex call)
42
+ match = match.gsub( "\n", '$$' ) ## make newlines visible for debugging
43
+ puts " replace anchor (a) name >#{title}< - >#{match}<"
44
+ "#{title}"
45
+ end
46
+
47
+ ## remove anchors (a href)
48
+ # note: heading 4 includes anchor (thus, let anchors go first)
49
+ # note: <a \newline href is used for authors email - thus incl. support for newline as space
50
+ html = html.gsub( /<A\s+HREF[^>]*>(.+?)<\/A>/im ) do |_| ## note: use .+? non-greedy match
51
+ puts " replace anchor (a) href >#{$1}<"
52
+ "‹#{$1}›"
53
+ end
54
+
55
+ ## replace paragraph (p)
56
+ html = html.gsub( /\s*<P>\s*/im ) do |match| ## note: include (swallow) "extra" newline
57
+ match = match.gsub( "\n", '$$' ) ## make newlines visible for debugging
58
+ puts " replace paragraph (p) - >#{match}<"
59
+ "\n\n"
60
+ end
61
+ html = html.gsub( /<\/P>/i, '' ) ## replace paragraph (p) closing w/ nothing for now
62
+
63
+ ## remove i
64
+ html = html.gsub( /<I>([^<]+)<\/I>/im ) do |_|
65
+ puts " remove italic (i) >#{$1}<"
66
+ "#{$1}"
67
+ end
68
+
69
+
70
+ ## heading 2
71
+ html = html.gsub( /\s*<H2>([^<]+)<\/H2>\s*/im ) do |_|
72
+ puts " replace heading 2 (h2) >#{$1}<"
73
+ "\n\n## #{$1}\n\n" ## note: make sure to always add two newlines
74
+ end
75
+
76
+ ## heading 4
77
+ html = html.gsub( /\s*<H4>([^<]+)<\/H4>\s*/im ) do |_|
78
+ puts " replace heading 4 (h4) >#{$1}<"
79
+ "\n\n#### #{$1}\n\n" ## note: make sure to always add two newlines
80
+ end
81
+
82
+
83
+ ## replace b w/ markdown bold (**) - note: might include anchors (thus, call after anchors)
84
+ html = html.gsub( /<B>([^<]+)<\/B>/im ) do |_|
85
+ puts " remove bold (b) >#{$1}<"
86
+ "**#{$1}**"
87
+ end
88
+
89
+ ## replace preformatted (pre)
90
+ html = html.gsub( /<PRE>|<\/PRE>/i ) do |_|
91
+ puts " replace preformatted (pre)"
92
+ '' # replace w/ nothing for now (keep surrounding newlines)
93
+ end
94
+
95
+ =begin
96
+ puts
97
+ puts
98
+ puts "html:"
99
+ puts html[0..2000]
100
+ puts "-- snip --"
101
+ puts html[-1000..-1] ## print last thousand chars
102
+ =end
103
+
104
+
105
+ ## cleanup whitespaces
106
+ ## todo/fix: convert newline in space first
107
+ ## and then collapse spaces etc.!!!
108
+ txt = ''
109
+ html.each_line do |line|
110
+ line = line.gsub( "\t", ' ' ) # replace all tabs w/ two spaces for now
111
+ line = line.rstrip # remove trailing whitespace (incl. newline/formfeed)
112
+
113
+ txt << line
114
+ txt << "\n"
115
+ end
116
+
117
+ ### remove emails etc.
118
+ txt = sanitize( txt )
119
+
120
+ txt
121
+ end # method html_to_txt
122
+
123
+
124
+
125
+ def sanitize( txt )
126
+ ### remove emails for (spam/privacy) protection
127
+ ## e.g. (selamm@example.es)
128
+ ## (buuu@mscs.dal.ca)
129
+ ## (kaxx@rsssf.com)
130
+ ## (Manu_Maya@yakoo.co)
131
+
132
+ ## note: add support for optional ‹› enclosure (used by html2txt for converted a href mailto: links)
133
+ ## e.g. (‹selamm@example.es›)
134
+
135
+ email_pattern = "\\(‹?[a-z][a-z0-9_]+@[a-z]+(\\.[a-z]+)+›?\\)" ## note: just a string; needs to escape \\ twice!!!
136
+
137
+ ## check for "free-standing e.g. on its own line" emails only for now
138
+ txt = txt.gsub( /\n#{email_pattern}\n/i ) do |match|
139
+ puts "removing (free-standing) email >#{match}<"
140
+ "\n" # return empty line
141
+ end
142
+
143
+ txt = txt.gsub( /#{email_pattern}/i ) do |match|
144
+ puts "remove email >#{match}<"
145
+ ''
146
+ end
147
+
148
+ txt
149
+ end # method sanitize
150
+
151
+ end # module Filters
152
+ end # module Rsssf
153
+
154
+ ## add (shortcut) alias
155
+ RsssfFilters = Rsssf::Filters
156
+
157
+
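Note on `Filters#sanitize`: a short sketch of how the two-pass email removal behaves, using the same pattern written as a regexp literal instead of an escaped string. `EMAIL_PATTERN` and `strip_emails` are names made up for this example, and the debug output is left out.

```ruby
# encoding: utf-8

# Same pattern as in sanitize: an email in parentheses, optionally wrapped in ‹›
EMAIL_PATTERN = /\(‹?[a-z][a-z0-9_]+@[a-z]+(\.[a-z]+)+›?\)/i

def strip_emails( txt )
  # pass 1: drop emails on a line of their own (keeps a single newline)
  txt = txt.gsub( /\n#{EMAIL_PATTERN}\n/, "\n" )
  # pass 2: drop any remaining inline emails
  txt.gsub( EMAIL_PATTERN, '' )
end

txt = "Author: Somebody (‹selamm@example.es›)\n(kaxx@rsssf.com)\nRound 1\n"
puts strip_emails( txt )
```

Handling the free-standing case first avoids leaving an empty line behind where an email used to be the only content.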