rsssf 0.0.1 → 0.1.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 0e916c424ef3d9ebd62d5c2fb1e09122806c5765
4
- data.tar.gz: 232f06dea9fd3f878e554b1d4a2029531468f842
3
+ metadata.gz: 9c33c4bef56bf6e79c9339b5010a3129b98dc9fa
4
+ data.tar.gz: 4d97b1862dfbaab48282593947c6d7b28d047dfe
5
5
  SHA512:
6
- metadata.gz: 898399cdcedd7890145278b17d6d7ab5d8f305160513def86152b69a345de5c19ddb4e2e770753c091216c203bf31a3bd2bd70ca33b0087a8d003292cdd97acf
7
- data.tar.gz: 0b76813d36aee3c59a83c2dfce1d916ee4677da585d52464a531dc2563e5bf96d33b345afe766bb4af962c03f4abcfed0a5b8523e485c85c7e85e406fd9ab8c6
6
+ metadata.gz: 8da437c2a72c364f81d53ca1491cdfca4c11c0f187912702e653523cb54f636ec910f8690f9af690cf632f0c98335b9ab10983676d0e67e1a59f3c7d7a5919c7
7
+ data.tar.gz: b8a13466f301863e6ae0e4e449f1da08d5ae17a7481bf2055e12976f6f14fda79a8ebd38aa5cfb5f097d21b36e43bc322f362483d1c522a005c9cf49e576f732
File without changes
@@ -3,4 +3,15 @@ Manifest.txt
3
3
  README.md
4
4
  Rakefile
5
5
  lib/rsssf.rb
6
+ lib/rsssf/fetch.rb
7
+ lib/rsssf/html2txt.rb
8
+ lib/rsssf/page.rb
9
+ lib/rsssf/patch.rb
10
+ lib/rsssf/repo.rb
11
+ lib/rsssf/reports/page.rb
12
+ lib/rsssf/reports/schedule.rb
13
+ lib/rsssf/schedule.rb
14
+ lib/rsssf/utils.rb
6
15
  lib/rsssf/version.rb
16
+ test/helper.rb
17
+ test/test_utils.rb
data/README.md CHANGED
@@ -1,16 +1,172 @@
1
1
  # rsssf - tools 'n' scripts for RSSSF (Rec.Sport.Soccer Statistics Foundation) archive data
2
2
 
3
3
 
4
- * home :: [github.com/sportdb/rsssf](https://github.com/sportdb/mrhyde)
4
+ * home :: [github.com/sportdb/rsssf](https://github.com/sportdb/rsssf)
5
5
  * bugs :: [github.com/sportdb/rsssf/issues](https://github.com/sportdb/rsssf/issues)
6
6
  * gem :: [rubygems.org/gems/rsssf](https://rubygems.org/gems/rsssf)
7
7
  * rdoc :: [rubydoc.info/gems/rsssf](http://rubydoc.info/gems/rsssf)
8
8
  * forum :: [opensport](http://groups.google.com/group/opensport)
9
9
 
10
10
 
11
+ ## What's the Rec.Sport.Soccer Statistics Foundation (RSSSF)?
12
+
13
+ The RSSSF collects and offers football (soccer) league tables, match results and more
14
+ from all over the world online in plain text.
15
+
16
+ Example:
17
+
18
+ ```
19
+ Round 1
20
+ [May 25]
21
+ Vasco da Gama 1-0 Portuguesa
22
+ [Carlos Tenório 47']
23
+ Vitória 2-2 Internacional
24
+ [Maxi Biancucchi 2', Gabriel Paulista 11'; Diego Forlán 29', Fred 63']
25
+ Corinthians 1-1 Botafogo
26
+ [Paulinho 73'; Rafael Marques 24']
27
+ [May 26]
28
+ Grêmio 2-0 Náutico [played in Caxias do Sul-RS]
29
+ [Zé Roberto 15', Elano 70']
30
+ Ponte Preta 0-2 São Paulo
31
+ [Lúcio 9', Jádson 44'p]
32
+ Criciúma 3-1 Bahia
33
+ [Matheus Ferraz 45'+1', Lins 46', João Vítor 82'; Diones 72']
34
+ Santos 0-0 Flamengo [played in Brasília-DF]
35
+ Fluminense 2-1 Atlético/PR [played in Macaé-RJ]
36
+ [Rafael Sóbis 15'p, Samuel 53'; Manoel 28']
37
+ Cruzeiro 5-0 Goiás
38
+ [Diego Souza 5', Bruno Rodrigo 30', Nílton 40',79', Borges 42']
39
+ Coritiba 2-1 Atlético/MG
40
+ [Deivid 53', Arthur 90'+1'; Diego Tardelli 51']
41
+ ```
42
+
43
+ [Find out more about the Rec.Sport.Soccer Statistics Foundation (RSSSF) »](http://www.rsssf.com)
44
+
45
+
46
+
11
47
  ## Usage
12
48
 
13
- To be done
49
+ ### Working with Pages
50
+
51
+ To fetch pages from the world wide web use:
52
+
53
+ ``` ruby
54
+ page = RsssfPage.from_url( 'http://www.rsssf.com/tablese/eng2015.html')
55
+ ```
56
+
57
+ Note: The `RsssfPageFetcher` will convert the rsssf archive page
58
+ from hypertext (HTML) to plain text e.g.
59
+
60
+ ```
61
+ <hr>
62
+ <a href="#premier">Premier League</A><br>
63
+ <a href="#cups">Cup Tournaments</A><br>
64
+ <a href="#champ">Championship</A><br>
65
+ <a href="#first">Division 1</A><br>
66
+ <a href="#second">Division 2</A><br>
67
+ <a href="#conf">Conference</A>
68
+ <hr>
69
+ <h4><a name="premier">Premier League</A></h4>
70
+ <pre>
71
+ Final Table:
72
+
73
+ 1.Chelsea 38 26 9 3 73-32 87 Champions
74
+ 2.Manchester City 38 24 7 7 83-38 79
75
+ 3.Arsenal 38 22 9 7 71-36 75
76
+ ...
77
+ ```
78
+
79
+ will become
80
+
81
+ ```
82
+ =-=-=-=-=-=-=-=-=-=-=-=-=-=-=
83
+ ‹Premier League›
84
+ ‹Cup Tournaments›
85
+ ‹Championship›
86
+ ‹Division 1›
87
+ ‹Division 2›
88
+ ‹Conference›
89
+ =-=-=-=-=-=-=-=-=-=-=-=-=-=-=
90
+
91
+ #### Premier League
92
+
93
+
94
+ Final Table:
95
+
96
+ 1.Chelsea 38 26 9 3 73-32 87 Champions
97
+ 2.Manchester City 38 24 7 7 83-38 79
98
+ 3.Arsenal 38 22 9 7 71-36 75
99
+ ...
100
+ ```
101
+
102
+
103
+ ### Working with Repos
104
+
105
+ To fetch pages from the world wide web for many seasons in batch, set up and use a repo.
106
+
107
+ Step 1: List all archive pages
108
+
109
+ In the `tables/config.yml` list all archive pages to fetch. Example:
110
+
111
+ ``` yaml
112
+ 2010-11: tablese/eng2011.html
113
+ 2011-12: tablese/eng2012.html
114
+ 2012-13: tablese/eng2013.html
115
+ 2013-14: tablese/eng2014.html
116
+ 2014-15: tablese/eng2015.html
117
+ ```
118
+
119
+ Step 2: Fetch all archive pages
120
+
121
+ Use:
122
+
123
+ ``` ruby
124
+ repo = RsssfRepo.new( './eng-england', title: 'England (and Wales)' )
125
+ repo.fetch_pages
126
+ ```
127
+
128
+ Bonus: To create a summary of all pages fetched (e.g. authors, last_updated, sections, etc.),
129
+ use:
130
+
131
+ ``` ruby
132
+ repo.make_pages_report
133
+ ```
134
+
135
+ Example - `tables/README.md`:
136
+
137
+
138
+ football.db RSSSF Archive Data Summary for England (and Wales)
139
+
140
+ _Last Update: 2015-11-26 18:22:22 +0200_
141
+
142
+ | Season | File | Authors | Last Updated | Lines (Chars) | Sections |
143
+ | :------ | :------ | :------- | :----------- | ------------: | :------- |
144
+ | 2014-15 | [eng2015.txt](https://github.com/rsssf/eng-england/blob/master/tables/eng2015.txt) | Ian King and Karel Stokkermans | 4 Jun 2015 | 1249 (34138) | Premier League, Cup Tournaments, Championship, Division 1, Division 2, Conference |
145
+ | 2013-14 | [eng2014.txt](https://github.com/rsssf/eng-england/blob/master/tables/eng2014.txt) | Ian King and Karel Stokkermans | 5 Feb 2015 | 1254 (34294) | Premier League, Cup Tournaments, Championship, Division 1, Division 2, Conference |
146
+ | 2012-13 | [eng2013.txt](https://github.com/rsssf/eng-england/blob/master/tables/eng2013.txt) | Karel Stokkermans | 5 Feb 2015 | 1269 (34531) | Premiership, Cup Tournaments, Championship, Division 1, Division 2, Conference |
147
+ | 2011-12 | [eng2012.txt](https://github.com/rsssf/eng-england/blob/master/tables/eng2012.txt) | Karel Stokkermans | 5 Feb 2015 | 691 (21925) | Premiership, Cup Tournaments, Championship, Division 1, Division 2, Conference |
148
+ | 2010-11 | [eng2011.txt](https://github.com/rsssf/eng-england/blob/master/tables/eng2011.txt) | Ian King, Karel Stokkermans and Jan Schoenmakers | 5 Feb 2015 | 959 (37393) | Premiership, Cup Tournaments, Championship, Division 1, Division 2, Conference |
149
+
150
+
151
+ That's it.
152
+
153
+
154
+ ### Preparing Archive Pages for SQL Database Imports (e.g. football.db)
155
+
156
+ To import match schedules (fixtures and results) and more using the football.db machinery
157
+ prepare "simple" single league (or cup) pages with standings tables etc. stripped out.
158
+ For example, to break-out the Premier League and FA Cup from the `eng2015.txt`
159
+ archive page use:
160
+
161
+ ``` ruby
162
+ page = RsssfPage.from_url( 'http://www.rsssf.com/tablese/eng2015.html')
163
+
164
+ schedule = page.find_schedule( header: 'Premier League') ## returns RsssfSchedule obj
165
+ schedule.save( './1-premierleague.txt' )
166
+
167
+ schedule = page.find_schedule( header: 'FA Cup', cup: true )
168
+ schedule.save( './facup.txt' )
169
+ ```
14
170
 
15
171
 
16
172
 
@@ -21,6 +177,19 @@ Just install the gem:
21
177
  $ gem install rsssf
22
178
 
23
179
 
180
+
181
+ ## RSSSF Datasets
182
+
183
+ See the rsssf github org for pre-processed ready-to-import datasets. Prepared repos include:
184
+
185
+ - [`eng-england`](https://github.com/rsssf/eng-england) - rsssf archive data for England - Premier League, Championship, FA Cup etc.
186
+ - [`de-deutschland`](https://github.com/rsssf/de-deutschland) - rsssf archive data for Germany (Deutschland) - Deutsche Bundesliga, 2. Bundesliga, 3. Liga, DFB Pokal etc.
187
+ - [`es-espana`](https://github.com/rsssf/es-espana) - rsssf archive data for España (Spain) - Primera División / La Liga, Copa del Rey, etc.
188
+ - [`at-austria`](https://github.com/rsssf/at-austria) - rsssf archive data for Austria (Österreich) - Österr. Bundesliga, Erste Liga, ÖFB Pokal etc.
189
+ - [`br-brazil`](https://github.com/rsssf/br-brazil) - rsssf archive data for Brazil (Brasil) - Campeonato Brasileiro Série A / Brasileirão etc.
190
+ - and more
191
+
192
+
24
193
  ## License
25
194
 
26
195
  The `rsssf` scripts are dedicated to the public domain.
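
The `tables/config.yml` mapping in the README above pairs a season with an archive page path. As a rough illustration of how such a mapping could be expanded into full page URLs, here is a minimal sketch. The helper `season_urls` and the `BASE_URL` constant are hypothetical names made up for this example; they are not part of the rsssf gem, and the gem's own `RsssfRepo#fetch_pages` may work differently.

```ruby
require 'yaml'

# Assumption: base URL taken from the README examples.
BASE_URL = 'http://www.rsssf.com'

# Hypothetical helper (not part of the rsssf gem):
# turns the season => path mapping into season => full URL.
def season_urls( config_text, base_url: BASE_URL )
  config = YAML.safe_load( config_text )   # e.g. { '2014-15' => 'tablese/eng2015.html' }
  config.map { |season, path| [season.to_s, "#{base_url}/#{path}"] }.to_h
end

config_text = <<~YAML
  2013-14: tablese/eng2014.html
  2014-15: tablese/eng2015.html
YAML

season_urls( config_text ).each do |season, url|
  puts "#{season} => #{url}"
end
```

Each resulting URL could then be passed to `RsssfPage.from_url`, as shown in the README.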
@@ -14,6 +14,19 @@ require 'fetcher' ## used for Fetcher::Worker.new.fetch etc.
14
14
  ## our own code
15
15
  require 'rsssf/version' # note: let version always go first
16
16
 
17
+ require 'rsssf/utils' # include Utils - goes first
18
+ require 'rsssf/html2txt' # include Filters - goes first
19
+
20
+ require 'rsssf/fetch'
21
+ require 'rsssf/page'
22
+ require 'rsssf/schedule'
23
+ require 'rsssf/patch'
24
+
25
+ require 'rsssf/reports/schedule'
26
+ require 'rsssf/reports/page'
27
+
28
+ require 'rsssf/repo'
29
+
17
30
 
18
31
 
19
32
 
@@ -0,0 +1,80 @@
1
+ # encoding: utf-8
2
+
3
+ module Rsssf
4
+
5
+ class PageFetcher
6
+
7
+ include Filters # e.g. html_to_txt, sanitize etc.
8
+
9
+
10
+ def initialize
11
+ @worker = Fetcher::Worker.new
12
+ end
13
+
14
+ def fetch( src_url )
15
+
16
+ ## note: assume plain 7-bit ascii for now
17
+ ## -- assume rsssf uses ISO_8859_15 (updated version of ISO_8859_1) -- does NOT use utf-8 character encoding!!!
18
+ html = @worker.read( src_url )
19
+
20
+ ### todo/fix: first check if html is all ascii-7bit e.g.
21
+ ## includes only chars from 0 to 127!!!
22
+
23
+ ## normalize newlines
24
+ ## remove \r (carriage return) used by Windows; just use \n (new line)
25
+ html = html.gsub( "\r", '' )
26
+
27
+ ## note:
28
+ ## assume (default) to ISO 8859-15 (an updated version of ISO 8859-1) for now
29
+ ##
30
+ ## other possible alternatives - try:
31
+ ## - Windows CP 1252 or
32
+ ## - ISO 8859-2 (for eastern european languages )
33
+ ##
34
+ ## note: german umlaut use the same code (int)
35
+ ## in ISO 8859-1/15 and 2 and Windows CP1252 (other chars ARE different!!!)
36
+
37
+ html = html.force_encoding( Encoding::ISO_8859_15 )
38
+ html = html.encode( Encoding::UTF_8 ) # try conversion to utf-8
39
+
40
+ ## check for html entities
41
+ html = html.gsub( "&auml;", 'ä' )
42
+ html = html.gsub( "&ouml;", 'ö' )
43
+ html = html.gsub( "&uuml;", 'ü' )
44
+ html = html.gsub( "&Auml;", 'Ä' )
45
+ html = html.gsub( "&Ouml;", 'Ö' )
46
+ html = html.gsub( "&Uuml;", 'Ü' )
47
+ html = html.gsub( "&szlig;", 'ß' )
48
+
49
+ html = html.gsub( "&oulm;", 'ö' ) ## support typo in entity (&ouml;)
50
+ html = html.gsub( "&slig;", "ß" ) ## support typo in entity (&szlig;)
51
+
52
+ html = html.gsub( "&Eacute;", 'É' )
53
+ html = html.gsub( "&oslash;", 'ø' )
54
+
55
+ ## check for more entities
56
+ html = html.gsub( /&[^;]+;/) do |match|
57
+ puts "*** found unencoded html entity #{match}"
58
+ match ## pass through as is (1:1)
59
+ end
60
+ ## todo/fix: add more entities
61
+
62
+
63
+ txt = html_to_txt( html )
64
+
65
+ header = <<EOS
66
+ <!--
67
+ source: #{src_url}
68
+ -->
69
+
70
+ EOS
71
+
72
+ header+txt ## return txt w/ header
73
+ end ## method fetch
74
+
75
+ end ## class PageFetcher
76
+ end ## module Rsssf
77
+
78
+ ## add (shortcut) alias
79
+ RsssfPageFetcher = Rsssf::PageFetcher
80
+
@@ -0,0 +1,157 @@
1
+ # encoding: utf-8
2
+
3
+ module Rsssf
4
+ module Filters
5
+
6
+ def html_to_txt( html )
7
+
8
+ ###
9
+ # todo: check if any tags (still) present??
10
+
11
+
12
+ ## cut off everything before body
13
+ html = html.sub( /.+?<BODY>\s*/im, '' )
14
+
15
+ ## cut off everything after body (closing)
16
+ html = html.sub( /<\/BODY>.*/im, '' )
17
+
18
+
19
+ ## remove cite
20
+ html = html.gsub( /<CITE>([^<]+)<\/CITE>/im ) do |_|
21
+ puts " remove cite >#{$1}<"
22
+ "#{$1}"
23
+ end
24
+
25
+ html = html.gsub( /\s*<HR>\s*/im ) do |match|
26
+ match = match.gsub( "\n", '$$' ) ## make newlines visible for debugging
27
+ puts " replace horizontal rule (hr) - >#{match}<"
28
+ "\n=-=-=-=-=-=-=-=-=-=-=-=-=-=-=\n" ## check what hr to use - . - . - or =-=-=-= or something distinct?
29
+ end
30
+
31
+ ## replace break (br)
32
+ ## note: do NOT use m/multiline for now - why? why not??
33
+ html = html.gsub( /<BR>\s*/i ) do |match| ## note: include (swallow) "extra" newline
34
+ match = match.gsub( "\n", '$$' ) ## make newlines visible for debugging
35
+ puts " replace break (br) - >#{match}<"
36
+ "\n"
37
+ end
38
+
39
+ ## remove anchors (a name)
40
+ html = html.gsub( /<A NAME[^>]*>(.+?)<\/A>/im ) do |match| ## note: use .+? non-greedy match
41
+ title = $1.to_s ## note: "save" capture first; gets replaced by gsub (next regex call)
42
+ match = match.gsub( "\n", '$$' ) ## make newlines visible for debugging
43
+ puts " replace anchor (a) name >#{title}< - >#{match}<"
44
+ "#{title}"
45
+ end
46
+
47
+ ## remove anchors (a href)
48
+ # note: heading 4 includes anchor (thus, let anchors go first)
49
+ # note: <a \newline href is used for authors email - thus incl. support for newline as space
50
+ html = html.gsub( /<A\s+HREF[^>]*>(.+?)<\/A>/im ) do |_| ## note: use .+? non-greedy match
51
+ puts " replace anchor (a) href >#{$1}<"
52
+ "‹#{$1}›"
53
+ end
54
+
55
+ ## replace paragraph (p)
56
+ html = html.gsub( /\s*<P>\s*/im ) do |match| ## note: include (swallow) "extra" newline
57
+ match = match.gsub( "\n", '$$' ) ## make newlines visible for debugging
58
+ puts " replace paragraph (p) - >#{match}<"
59
+ "\n\n"
60
+ end
61
+ html = html.gsub( /<\/P>/i, '' ) ## replace paragraph (p) closing w/ nothing for now
62
+
63
+ ## remove i
64
+ html = html.gsub( /<I>([^<]+)<\/I>/im ) do |_|
65
+ puts " remove italic (i) >#{$1}<"
66
+ "#{$1}"
67
+ end
68
+
69
+
70
+ ## heading 2
71
+ html = html.gsub( /\s*<H2>([^<]+)<\/H2>\s*/im ) do |_|
72
+ puts " replace heading 2 (h2) >#{$1}<"
73
+ "\n\n## #{$1}\n\n" ## note: make sure to always add two newlines
74
+ end
75
+
76
+ ## heading 4
77
+ html = html.gsub( /\s*<H4>([^<]+)<\/H4>\s*/im ) do |_|
78
+ puts " replace heading 4 (h4) >#{$1}<"
79
+ "\n\n#### #{$1}\n\n" ## note: make sure to always add two newlines
80
+ end
81
+
82
+
83
+ ## replace b (bold) w/ markdown **...** - note: might include anchors (thus, call after anchors)
84
+ html = html.gsub( /<B>([^<]+)<\/B>/im ) do |_|
85
+ puts " remove bold (b) >#{$1}<"
86
+ "**#{$1}**"
87
+ end
88
+
89
+ ## replace preformatted (pre)
90
+ html = html.gsub( /<PRE>|<\/PRE>/i ) do |_|
91
+ puts " replace preformatted (pre)"
92
+ '' # replace w/ nothing for now (keep surrounding newlines)
93
+ end
94
+
95
+ =begin
96
+ puts
97
+ puts
98
+ puts "html:"
99
+ puts html[0..2000]
100
+ puts "-- snip --"
101
+ puts html[-1000..-1] ## print last thousand chars
102
+ =end
103
+
104
+
105
+ ## cleanup whitespaces
106
+ ## todo/fix: convert newline in space first
107
+ ## and then collapse spaces etc.!!!
108
+ txt = ''
109
+ html.each_line do |line|
110
+ line = line.gsub( "\t", '  ' ) # replace all tabs w/ two spaces for now
111
+ line = line.rstrip # remove trailing whitespace (incl. newline/formfeed)
112
+
113
+ txt << line
114
+ txt << "\n"
115
+ end
116
+
117
+ ### remove emails etc.
118
+ txt = sanitize( txt )
119
+
120
+ txt
121
+ end # method html_to_txt
122
+
123
+
124
+
125
+ def sanitize( txt )
126
+ ### remove emails for (spam/privacy) protection
127
+ ## e.g. (selamm@example.es)
128
+ ## (buuu@mscs.dal.ca)
129
+ ## (kaxx@rsssf.com)
130
+ ## (Manu_Maya@yakoo.co)
131
+
132
+ ## note add support for optional ‹› enclosure (used by html2txt converted a href :mailto links)
133
+ ## e.g. (‹selamm@example.es›)
134
+
135
+ email_pattern = "\\(‹?[a-z][a-z0-9_]+@[a-z]+(\\.[a-z]+)+›?\\)" ## note: just a string; needs to escape \\ twice!!!
136
+
137
+ ## check for "free-standing e.g. on its own line" emails only for now
138
+ txt = txt.gsub( /\n#{email_pattern}\n/i ) do |match|
139
+ puts "removing (free-standing) email >#{match}<"
140
+ "\n" # return empty line
141
+ end
142
+
143
+ txt = txt.gsub( /#{email_pattern}/i ) do |match|
144
+ puts "remove email >#{match}<"
145
+ ''
146
+ end
147
+
148
+ txt
149
+ end # method sanitize
150
+
151
+ end # module Filters
152
+ end # module Rsssf
153
+
154
+ ## add (shortcut) alias
155
+ RsssfFilters = Rsssf::Filters
156
+
157
+
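
The email removal in `sanitize` above can be sketched as a standalone snippet. This is a simplified variant under stated assumptions: the pattern is written as a `Regexp` literal rather than an escaped string, and `strip_emails` and `EMAIL_IN_PARENS` are hypothetical names used only here. Like the original pattern, it only matches simple addresses (no dots in the local part, no digits or hyphens in the domain).

```ruby
# encoding: utf-8

# Parenthesised email, optionally wrapped in ‹ › (as produced for a href links).
EMAIL_IN_PARENS = /\(‹?[a-z][a-z0-9_]+@[a-z]+(?:\.[a-z]+)+›?\)/i

# Hypothetical helper mirroring the two passes in sanitize.
def strip_emails( txt )
  # pass 1: an email on its own line collapses to a single newline
  txt = txt.gsub( /\n#{EMAIL_IN_PARENS}\n/, "\n" )
  # pass 2: any remaining inline email is dropped
  txt.gsub( EMAIL_IN_PARENS, '' )
end

puts strip_emails( "Authors\n(‹kaxx@rsssf.com›)\nFinal Table:" )
```

Interpolating a `Regexp` object into another regex keeps its own flags, so the case-insensitive match still applies in the first pass.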