rsssf 0.1.0 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
- SHA1:
3
- metadata.gz: 9c33c4bef56bf6e79c9339b5010a3129b98dc9fa
4
- data.tar.gz: 4d97b1862dfbaab48282593947c6d7b28d047dfe
2
+ SHA256:
3
+ metadata.gz: 92c74803ad71cb9cac8376ef3e0e01890352fc0a0edb01208eab3a9c41f60767
4
+ data.tar.gz: 874bdc292143352c88e23b44ed23abb98312d4afd2fd2cc797d078bce1eef0ed
5
5
  SHA512:
6
- metadata.gz: 8da437c2a72c364f81d53ca1491cdfca4c11c0f187912702e653523cb54f636ec910f8690f9af690cf632f0c98335b9ab10983676d0e67e1a59f3c7d7a5919c7
7
- data.tar.gz: b8a13466f301863e6ae0e4e449f1da08d5ae17a7481bf2055e12976f6f14fda79a8ebd38aa5cfb5f097d21b36e43bc322f362483d1c522a005c9cf49e576f732
6
+ metadata.gz: 0cc04f3a78663d870ed8a4d5b77813a1601e49a67fe3ea9265972f8da5b43b88adde78d184bedc320edb29439d7abb6439e6b045181bedea48c6d6be037d1a86
7
+ data.tar.gz: e92c7acc956eacd756665e83baee9373cf2af422da5deddd51e91e1266d7498e005a8d0b2e88868a23b3929c7b1900ef5bf20c7bd02e260bd07152e3e53d850a
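The checksum hunk above swaps SHA1 for SHA256 and keeps SHA512. A minimal sketch of checking such a digest locally, assuming the gem has been unpacked so that `metadata.gz` and `data.tar.gz` are plain files; `checksum_matches?` is a hypothetical helper, not part of rsssf or RubyGems:

```ruby
require 'digest'

## hypothetical helper (not part of rsssf or rubygems):
##   compare a file's SHA256 or SHA512 digest against an expected hex string
def checksum_matches?( path, expected, algo: 'SHA256' )
  digest = case algo
           when 'SHA256' then Digest::SHA256.file( path ).hexdigest
           when 'SHA512' then Digest::SHA512.file( path ).hexdigest
           else raise ArgumentError, "unsupported algorithm #{algo}"
           end
  digest == expected.downcase
end
```

For example, `checksum_matches?( 'data.tar.gz', expected_hex )` would be called with the SHA256 value listed in `checksums.yaml`.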
@@ -1,3 +1,5 @@
1
+ ### 0.2.0
2
+
1
3
  ### 0.0.1 / 2015-09-15
2
4
 
3
5
  * Everything is new. First release
data/Manifest.txt CHANGED
@@ -1,17 +1,14 @@
1
- HISTORY.md
1
+ CHANGELOG.md
2
2
  Manifest.txt
3
3
  README.md
4
4
  Rakefile
5
5
  lib/rsssf.rb
6
- lib/rsssf/fetch.rb
7
- lib/rsssf/html2txt.rb
6
+ lib/rsssf/convert.rb
7
+ lib/rsssf/download.rb
8
8
  lib/rsssf/page.rb
9
- lib/rsssf/patch.rb
10
9
  lib/rsssf/repo.rb
11
10
  lib/rsssf/reports/page.rb
12
11
  lib/rsssf/reports/schedule.rb
13
12
  lib/rsssf/schedule.rb
14
13
  lib/rsssf/utils.rb
15
14
  lib/rsssf/version.rb
16
- test/helper.rb
17
- test/test_utils.rb
data/README.md CHANGED
@@ -1,19 +1,18 @@
1
1
  # rsssf - tools 'n' scripts for RSSSF (Rec.Sport.Soccer Statistics Foundation) archive data
2
2
 
3
3
 
4
- * home :: [github.com/sportdb/rsssf](https://github.com/sportdb/rsssf)
5
- * bugs :: [github.com/sportdb/rsssf/issues](https://github.com/sportdb/rsssf/issues)
4
+ * home :: [github.com/sportdb/sport.db.sources](https://github.com/sportdb/sport.db.sources)
5
+ * bugs :: [github.com/sportdb/sport.db.sources/issues](https://github.com/sportdb/sport.db.sources/issues)
6
6
  * gem :: [rubygems.org/gems/rsssf](https://rubygems.org/gems/rsssf)
7
7
  * rdoc :: [rubydoc.info/gems/rsssf](http://rubydoc.info/gems/rsssf)
8
- * forum :: [opensport](http://groups.google.com/group/opensport)
9
8
 
10
9
 
11
- ## What's the Rec.Sport.Soccer Statistics Foundation (RSSSF)?
12
10
 
13
- The RSSSF collects and offers football (soccer) league tables, match results and more
14
- from all over the world online in plain text.
15
11
 
16
- Example:
12
+ ## What's the Rec.Sport.Soccer Statistics Foundation (RSSSF)?
13
+
14
+ The RSSSF collects and offers football (soccer) league tables, match results and more
15
+ from all over the world online in plain text. Example:
17
16
 
18
17
  ```
19
18
  Round 1
@@ -46,15 +45,41 @@ Coritiba 2-1 Atlético/MG
46
45
 
47
46
  ## Usage
48
47
 
49
- ### Working with Pages
50
48
 
51
- To fetch pages from the world wide web use:
49
+ ### Download (and Cache) Pages
50
+
51
+ To download (and cache) pages from the world wide web use:
52
52
 
53
53
  ``` ruby
54
- page = RsssfPage.from_url( 'http://www.rsssf.com/tablese/eng2015.html')
54
+ Rsssf.download_page( 'https://rsssf.org/tablese/eng2024.html',
55
+ encoding: 'Windows-1252' )
56
+
57
+ Rsssf.download_page( 'https://rsssf.org/tablesb/braz2024.html',
58
+ encoding: 'Windows-1252' )
59
+ ```
60
+
61
+ Note: Most pages on rsssf.org use the Windows-1252 (character) encoding.
62
+ To "auto-magically" convert to unicode (utf-8)
63
+ add the encoding option (default is `UTF-8`).
64
+
65
+ Or as a convenience shortcut download (pre-configured table) pages by country code (e.g. `eng` - England, `es` - Spain (España), `de` - Germany (Deutschland), `br` - Brazil (Brasil) etc.)
66
+ and season (e.g. `2023/24` or `2024` etc.)
67
+
68
+ ``` ruby
69
+ Rsssf.download_table( 'eng', season: '2023/24' )
70
+
71
+ Rsssf.download_table( 'br', season: '2024' )
55
72
  ```
56
73
 
57
- Note: The `RsssfPageFetcher` will convert the rsssf archive page
74
+
75
+ Note: The rsssf machinery uses a built-in web cache. All downloads get "auto-magically" cached (in `./cache/rsssf.org`).
76
+
77
+
78
+
79
+ ### Working with Pages
80
+
81
+
82
+ Note: The `RsssfPage` machinery will convert the rsssf archive page
58
83
  from hypertext (HTML) to plain text e.g.
59
84
 
60
85
  ```
@@ -121,7 +146,7 @@ Step 2: Fetch all archive pages
121
146
  Use:
122
147
 
123
148
  ``` ruby
124
- repo = RsssfRepo.new( './eng-england', title: 'England (and Wales)' )
149
+ repo = RsssfRepo.new( './england', title: 'England (and Wales)' )
125
150
  repo.fetch_pages
126
151
  ```
127
152
 
@@ -139,7 +164,7 @@ football.db RSSSF Archive Data Summary for England (and Wales)
139
164
 
140
165
  _Last Update: 2015-11-26 18:22:22 +0200_
141
166
 
142
- | Season | File | Authors | Last Updated | Lines (Chars) | Sections |
167
+ | Season | File | Authors | Last Updated | Lines (Chars) | Sections |
143
168
  | :------ | :------ | :------- | :----------- | ------------: | :------- |
144
169
  | 2014-15 | [eng2015.txt](https://github.com/rsssf/eng-england/blob/master/tables/eng2015.txt) | Ian King and Karel Stokkermans | 4 Jun 2015 | 1249 (34138) | Premier League, Cup Tournaments, Championship, Division 1, Division 2, Conference |
145
170
  | 2013-14 | [eng2014.txt](https://github.com/rsssf/eng-england/blob/master/tables/eng2014.txt) | Ian King and Karel Stokkermans | 5 Feb 2015 | 1254 (34294) | Premier League, Cup Tournaments, Championship, Division 1, Division 2, Conference |
@@ -170,23 +195,15 @@ schedule.save( './facup.txt' )
170
195
 
171
196
 
172
197
 
173
- ## Install
174
-
175
- Just install the gem:
176
-
177
- $ gem install rsssf
178
-
179
-
180
-
181
198
  ## RSSSF Datasets
182
199
 
183
200
  See the rsssf github org for pre-processed ready-to-import datasets. Prepared repos include:
184
201
 
185
- - [`eng-england`](https://github.com/rsssf/eng-england) - rsssf archive data for England - Premier League, Championship, FA Cup etc.
186
- - [`de-deutschland`](https://github.com/rsssf/de-deutschland) - rsssf archive data for Germany (Deutschland) - Deutsche Bundesliga, 2. Bundesliga, 3. Liga, DFB Pokal etc.
187
- - [`es-espana`](https://github.com/rsssf/es-espana) - rsssf archive data for España (Spain) - Primera División / La Liga, Copa de Rey, etc.
188
- - [`at-austria`](https://github.com/rsssf/at-austria) - rsssf archive data for Austria (Österreich) - Österr. Bundesliga, Erste Liga, ÖFB Pokal etc.
189
- - [`br-brazil`](https://github.com/rsssf/br-brazil) - rsssf archive data for Brazil (Brasil) - Campeonato Brasileiro Série A / Brasileirão etc.
202
+ - [`england`](https://github.com/rsssf/england) - rsssf archive data for England - Premier League, Championship, FA Cup etc.
203
+ - [`deutschland`](https://github.com/rsssf/deutschland) - rsssf archive data for Germany (Deutschland) - Deutsche Bundesliga, 2. Bundesliga, 3. Liga, DFB Pokal etc.
204
+ - [`espana`](https://github.com/rsssf/espana) - rsssf archive data for España (Spain) - Primera División / La Liga, Copa de Rey, etc.
205
+ - [`austria`](https://github.com/rsssf/austria) - rsssf archive data for Austria (Österreich) - Österr. Bundesliga, Erste Liga, ÖFB Pokal etc.
206
+ - [`brazil`](https://github.com/rsssf/brazil) - rsssf archive data for Brazil (Brasil) - Campeonato Brasileiro Série A / Brasileirão etc.
190
207
  - and more
191
208
 
192
209
 
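The encoding note in the README hunk above says most rsssf.org pages use Windows-1252 and get converted to UTF-8. A minimal sketch of that conversion in plain Ruby, assuming the raw page bytes are already in hand; `to_utf8` is a hypothetical helper and not the gem's API:

```ruby
## hypothetical helper (not part of the rsssf gem):
##   reinterpret raw bytes in the given source encoding and transcode to UTF-8
def to_utf8( bytes, encoding: 'Windows-1252' )
  bytes.dup.force_encoding( encoding ).encode( 'UTF-8' )
end

raw = "Atl\xE9tico/MG".b    ## byte 0xE9 is é in Windows-1252
puts to_utf8( raw )
```

Passing the wrong source encoding does not raise for most byte values; it silently yields different characters, which is why the README asks for the encoding to be stated per download.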
data/Rakefile CHANGED
@@ -8,25 +8,26 @@ Hoe.spec 'rsssf' do
8
8
  self.summary = "rsssf - tools 'n' scripts for RSSSF (Rec.Sport.Soccer Statistics Foundation) archive data"
9
9
  self.description = summary
10
10
 
11
- self.urls = ['https://github.com/sportdb/rsssf']
11
+ self.urls = { home: 'https://github.com/sportdb/sport.db.sources' }
12
12
 
13
13
  self.author = 'Gerald Bauer'
14
- self.email = 'opensport@googlegroups.com'
14
+ self.email = 'gerald.bauer@gmail.com'
15
15
 
16
16
  # switch extension to .markdown for github formatting
17
17
  self.readme_file = 'README.md'
18
- self.history_file = 'HISTORY.md'
18
+ self.history_file = 'CHANGELOG.md'
19
19
 
20
20
  self.extra_deps = [
21
- ['logutils'],
22
- ['textutils'],
23
- ['fetcher'],
21
+ ['cocos'],
22
+ ['season-formats'],
23
+ ['rsssf-parser'], ## add rsssf parser machinery & tool
24
24
  ]
25
25
 
26
+
26
27
  self.licenses = ['Public Domain']
27
28
 
28
29
  self.spec_extras = {
29
- required_ruby_version: '>= 1.9.2'
30
+ required_ruby_version: '>= 2.2.2'
30
31
  }
31
32
 
32
33
  end
@@ -0,0 +1,495 @@
1
+
2
+ module Rsssf
3
+ class PageConverter
4
+
5
+ ## convenience helper
6
+ def self.convert( html, url: )
7
+ @@converter ||= new ## use a "shared" built-in converter
8
+ @@converter.convert( html, url: url )
9
+ end
10
+
11
+ ##
12
+ ## add anchor: options or such
13
+ ## lets you toggle adding anchors (§premier etc.) - why? why not?
14
+
15
+ def convert( html, url: )
16
+ ### todo/fix: first check if html is all ascii-7bit e.g.
17
+ ## includes only chars from 64 to 127!!!
18
+
19
+ ## normalize newlines
20
+ ## remove \r (carriage return) used by Windows; just use \n (new line)
21
+ html = html.gsub( "\r", '' )
22
+
23
+ ## check for html entities
24
+ html = html.gsub( "&auml;", 'ä' )
25
+ html = html.gsub( "&ouml;", 'ö' )
26
+ html = html.gsub( "&uuml;", 'ü' )
27
+ html = html.gsub( "&Auml;", 'Ä' )
28
+ html = html.gsub( "&Ouml;", 'Ö' )
29
+ html = html.gsub( "&Uuml;", 'Ü' )
30
+ html = html.gsub( "&szlig;", 'ß' )
31
+
32
+ ## typos / autofix - keep - why? why not?
33
+ html = html.gsub( "&oulm;", 'ö' ) ## support typo in entity (ö)
34
+ html = html.gsub( "&uml;", 'ü' ) ## support typo in entity (ü) - why? why not?
35
+ html = html.gsub( "&slig;", "ß" ) ## support typo in entity (ß)
36
+ html = html.gsub( "&aaacute;", "á" ) ## typo for á
37
+
38
+
39
+ html = html.gsub( "&Eacute;", 'É' )
40
+ html = html.gsub( "&oslash;", 'ø' )
41
+ html = html.gsub( "&atilde;", 'ã' )
42
+ html = html.gsub( "&otilde;", 'õ' )
43
+ html = html.gsub( "&ocirc;", 'ô' )
44
+
45
+ entities = %w[
46
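+ À &Agrave;
47
+ Á &Aacute;
48
+ Â &Acirc;
49
+ Ã &Atilde;
50
+ Ä &Auml;
51
+ Å &Aring;
52
+ à &agrave;
53
+ á &aacute;
54
+ â &acirc;
55
+ ã &atilde;
56
+ ä &auml;
57
+ å &aring;
58
+ Æ &AElig;
59
+ æ &aelig;
60
+ ß &szlig;
61
+ Ç &Ccedil;
62
+ ç &ccedil;
63
+ È &Egrave;
64
+ É &Eacute;
65
+ Ê &Ecirc;
66
+ Ë &Euml;
67
+ è &egrave;
68
+ é &eacute;
69
+ ê &ecirc;
70
+ ë &euml;
71
+ Ì &Igrave;
72
+ Í &Iacute;
73
+ Î &Icirc;
74
+ Ï &Iuml;
75
+ ì &igrave;
76
+ í &iacute;
77
+ î &icirc;
78
+ ï &iuml;
79
+ Ñ &Ntilde;
80
+ ñ &ntilde;
81
+ Ò &Ograve;
82
+ Ó &Oacute;
83
+ Ô &Ocirc;
84
+ Õ &Otilde;
85
+ Ö &Ouml;
86
+ ò &ograve;
87
+ ó &oacute;
88
+ ô &ocirc;
89
+ õ &otilde;
90
+ ö &ouml;
91
+ Ø &Oslash;
92
+ ø &oslash;
93
+ Ù &Ugrave;
94
+ Ú &Uacute;
95
+ Û &Ucirc;
96
+ Ü &Uuml;
97
+ ù &ugrave;
98
+ ú &uacute;
99
+ û &ucirc;
100
+ ü &uuml;
101
+ Ý &Yacute;
102
+ ý &yacute;
103
+ ÿ &yuml;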
104
+
105
+ < &lt;
106
+ > &gt;
107
+ & &amp;
108
+ © &copy;
109
+ ® &reg;
110
+
111
+ Š &#352;
112
+ š &#353;
113
+ č &#269;
114
+ ć &#263;
115
+ Ž &#381;
116
+ ’ &#8217;
117
+ ]
118
+
119
+
120
+
121
+ entities.each_slice(2) do |str, entity|
122
+ html = html.gsub( entity, str )
123
+ end
124
+
125
+
126
+
127
+ ##############
128
+ ## check for more entities
129
+ ## limit &---; to length 10 - why? why not?
130
+ html = html.gsub( /&[^; ]{1,10};/) do |match|
131
+
132
+ match = if match == '&#307;' ## use like Van D&#307;k -> Van Dijk
133
+ 'ij'
134
+ else
135
+ msg = "found unencoded html entity #{match}"
136
+ puts "*** WARN - #{msg}"
137
+ log( msg ) ## log too (see logs.txt)
138
+
139
+ match ## pass through as is (1:1)
140
+ end
141
+
142
+ match
143
+ end
144
+ ## todo/fix: add more entities
145
+
146
+ ###################################
147
+ ### smart quotes quick fixes
148
+ ### convert all "smart" quote to (standard) single quotes
149
+ ## D´Alessandro => D'Alessandro
150
+
151
+ html = html.gsub( '´', "'" )
152
+
153
+ html = html.gsub( '’', "'" )
154
+ html = html.gsub( '‘', "'" )
155
+ html = html.gsub( '“', '"' )
156
+ html = html.gsub( '”', '"' )
157
+
158
+ ### convert fancy dashes/hyphens to plain dash/hyphen
159
+ html = html.gsub( '–', '-' )
160
+
161
+
162
+
163
+ txt = html_to_txt( html )
164
+
165
+ header = <<EOS
166
+ <!--
167
+ source: #{url}
168
+ -->
169
+
170
+ EOS
171
+
172
+ header+txt ## return txt w/ header
173
+ end ## method convert
174
+
175
+
176
+ ## todo/fix - use generic heading regex for all h2/h3/h4 etc.
177
+ ## exclude h1 - why? why not?
178
+ ## note - include leading and trailing spaces !!!
179
+ ##
180
+ ## note - for content use non-greedy to allow
181
+ ## match of tags inside content too
182
+ HEADING2_RE = %r{ \s*
183
+ <H2>
184
+ (?<title>.+?)
185
+ </H2>
186
+ \s*
187
+ }imx
188
+
189
+ HEADING4_RE = %r{ \s*
190
+ <H4>
191
+ (?<title>.+?)
192
+ </H4>
193
+ \s*
194
+ }imx
195
+
196
+ def replace_h2( html )
197
+ html.gsub( HEADING2_RE ) do |_|
198
+ m = Regexp.last_match
199
+ puts " replace heading 2 (h2) >#{m[:title]}<"
200
+ "\n\n## #{m[:title]}\n\n" ## note: make sure to always add two newlines
201
+ end
202
+ end
203
+
204
+ def replace_h4( html )
205
+ html.gsub( HEADING4_RE ) do |_|
206
+ m = Regexp.last_match
207
+ puts " replace heading 4 (h4) >#{m[:title]}<"
208
+ "\n\n#### #{m[:title]}\n\n" ## note: make sure to always add two newlines
209
+ end
210
+ end
211
+
212
+
213
+ def squish( str )
214
+ ## squish more than one white space to one space
215
+ str.gsub( /[ \r\t\n]+/, ' ' )
216
+ end
217
+
218
+
219
+ def patch_about( html )
220
+ # <A name=about>
221
+ # <H2>About this document</H2></A>
222
+ # or
223
+ # <A NAME="about"><H2>About this document</H2></A>
224
+ # => change to (possible?)
225
+ # <H2><A name=about>About this document</A></H2>
226
+
227
+ html.sub( %r{<A [ ] name=(about|"about")> \s*
228
+ <H2>About [ ] this [ ] document</H2></A>
229
+ }ixm,
230
+ "<H2><A name=about>About this document</A></H2>"
231
+ )
232
+ end
233
+
234
+ # <a name="sa">Série A</a>
235
+ # <a name="sd">Série D</a>
236
+
237
+ # <A name=about>
238
+ # <H2>About this document</H2></A>
239
+ # => change to (possible?)
240
+ # <H2><A name=about>About this document</A></H2>
241
+ #
242
+ #
243
+ # <h4><a name="cb">Copa do Brasil</a></h4>
244
+
245
+ ## note - for content use non-greedy to allow
246
+ ## match of tags inside content too
247
+
248
+ A_NAME_RE = %r{<A [ ]+ NAME [ ]* =
249
+ (?<name>[^>]+?)
250
+ >
251
+ (?<title>.+?)
252
+ </A>
253
+ }imx
254
+
255
+ # <a href="#sa">Série A</a><br>
256
+ #
257
+ # <A href="http://www.rsssf.org/">Rec.Sport.Soccer
258
+ # Statistics Foundation</A>
259
+ # <A href="http://www.rsssfbrasil.com">RSSSF
260
+ # Brazil</A>
261
+ #
262
+ # and Daniel Dalence (<A
263
+ # href="mailto:danielballack@terra.com.br">danielballack@terra.com.br</A>)
264
+
265
+
266
+ A_HREF_RE = %r{<A \s+ HREF [ ]* =
267
+ (?<href>[^>]+?)
268
+ >
269
+ (?<title>.+?)
270
+ <\/A>
271
+ }imx
272
+
273
+
274
+ def replace_a_href( html )
275
+ ## remove anchors (a href)
276
+ # note: heading 4 includes anchor (thus, let anchors go first)
277
+ # note: <a \newline href is used for authors email - thus incl. support for newline as space
278
+ html.gsub( A_HREF_RE ) do |match| ## note: use .+? non-greedy match
279
+ m = Regexp.last_match
280
+ href = m[:href].gsub( /["']/, '' ).strip ## remove ("" or '')
281
+ title = m[:title].strip ## note: "save" capture first; gets replaced by gsub (next regex call)
282
+
283
+
284
+ ## e.g.
285
+ ## ‹Larsen23@gmx.de, see page mailto:Larsen23@gmx.de›
286
+ ## ‹danielballack@terra.com.br, see page mailto:danielballack@terra.com.br›
287
+ ## ‹zja70@aol.com, see page mailto:zja70@aol.com›)
288
+ if href.start_with?( 'mailto:')
289
+ puts " blank mailto - anchor (a) href >#{href}, >#{title}<"
290
+ '‹mailto›' ## delete/remove email
291
+ else
292
+ puts " replace anchor (a) href >#{href}, >#{title}<"
293
+
294
+ ## convert href to xref
295
+ xref = if href.start_with?('#') ## in-page ref
296
+ ", see §#{href[1..-1]}"
297
+ elsif href.start_with?( /https?:/ ) ## external page ref
298
+ ## skip - keep empty - why? why not? (or add url domain?)
299
+ ''
300
+ else
301
+ ## hack - check for some custom excludes
302
+ if title.start_with?( 'Rec.Sport.Soccer' )
303
+ ## skip - keep empty
304
+ ''
305
+ else
306
+ ## strip (ending) .htm|html
307
+ ", see page #{href.sub( /\.html?$/,'')}"
308
+ end
309
+ end
310
+
311
+ "‹#{squish(title)}#{xref}›"
312
+ end
313
+ end
314
+ end
315
+
316
+ def replace_a_name( html )
317
+ ##
318
+ ## remove (named) anchors
319
+ html.gsub( A_NAME_RE ) do |match| ## note: use .+? non-greedy match
320
+ m = Regexp.last_match
321
+ name = m[:name].gsub( /["']/, '' ).strip ## remove ("" or '')
322
+ title = m[:title].strip ## note: "save" capture first; gets replaced by gsub (next regex call)
323
+ match = match.gsub( "\n", '$$' ) ## make newlines visible for debugging
324
+ puts " replace anchor (a) name >#{name}<, >#{title}< - >#{match}<"
325
+
326
+
327
+ ##
328
+ ## todo - report WARN if title incl. tags
329
+ ## assumes text only for now - why? why not?
330
+ ## add a name inside heading !!!
331
+ ## do NOT add heading inside a name !!!
332
+
333
+ "#{title} ‹§#{name}›" ## note - use two spaces min (between title & name)
334
+ end
335
+ end
336
+
337
+
338
+ EMAIL_RE = %r{ \s*
339
+ \(
340
+ [a-z][a-z0-9_]+
341
+ @[a-z]+(\.[a-z]+)+
342
+ \)
343
+ }imx
344
+
345
+
346
+ def remove_emails( html )
347
+ ### remove converted ("blinded") mailto anchors
348
+ ## note usually inside () e.g.
349
+ ## (‹mailto›)
350
+ ## plus slurp up all leading whitespace (incl. newline) - why? why not?
351
+ html = html.gsub( /\s*
352
+ \(‹mailto›\)
353
+ /xm, '' )
354
+
355
+ ###
356
+ ## remove "regular" emails too e.g.
357
+ ##
358
+ ## Thanks to Marcelo Leme de Arruda (___@___.__.br),
359
+ ## Ricardo FF Pontes (___@____.com),
360
+ ## Santiago Reis (____@____.com.br),
361
+ ## Marcos Lacerda Queiroz (___@____.com.br)
362
+ ## etc.
363
+
364
+ ## check for "free-standing e.g. on its own line" emails only for now
365
+ html = html.gsub( EMAIL_RE ) do |match|
366
+ puts "removing email >#{match}<"
367
+ ''
368
+ end
369
+ html
370
+ end
371
+
372
+
373
+
374
+ def html_to_txt( html )
375
+
376
+ ###
377
+ # todo: check if any tags (still) present??
378
+
379
+
380
+ ## cut off everything before body
381
+ html = html.sub( /.+?<BODY>\s*/im, '' )
382
+
383
+ ## cut off everything after body (closing)
384
+ html = html.sub( /<\/BODY>.*/im, '' )
385
+
386
+ html = patch_about( html )
387
+
388
+ ## remove cite
389
+ html = html.gsub( /<CITE>([^<]+)<\/CITE>/im ) do |_|
390
+ puts " remove cite >#{$1}<"
391
+ "#{$1}"
392
+ end
393
+
394
+ html = html.gsub( /\s*<HR>\s*/im ) do |match|
395
+ match = match.gsub( "\n", '$$' ) ## make newlines visible for debugging
396
+ puts " replace horizontal rule (hr) - >#{match}<"
397
+ "\n=-=-=-=-=-=-=-=-=-=-=-=-=-=-=\n" ## check what hr to use - . - . - or =-=-=-= or something distinct?
398
+ end
399
+
400
+ ## replace break (br)
401
+ ## note: do NOT use m/multiline for now - why? why not??
402
+ html = html.gsub( /<BR>\s*/i ) do |match| ## note: include (swallow) "extra" newline
403
+ match = match.gsub( "\n", '$$' ) ## make newlines visible for debugging
404
+ puts " replace break (br) - >#{match}<"
405
+ "\n"
406
+ end
407
+
408
+
409
+
410
+ html = replace_a_href( html )
411
+ ## note a name="about" includes more a hrefs etc.
412
+ # let it go first (before a href)
413
+ html = replace_a_name( html )
414
+
415
+
416
+
417
+ ## replace paragraph (p)
418
+ html = html.gsub( /\s*<P>\s*/im ) do |match| ## note: include (swallow) "extra" newline
419
+ match = match.gsub( "\n", '$$' ) ## make newlines visible for debugging
420
+ puts " replace paragraph (p) - >#{match}<"
421
+ "\n\n"
422
+ end
423
+ html = html.gsub( /<\/P>/i, '' ) ## replace paragraph (p) closing w/ nothing for now
424
+
425
+ ## remove i
426
+ html = html.gsub( /<I>([^<]+)<\/I>/im ) do |_|
427
+ puts " remove italic (i) >#{$1}<"
428
+ "#{$1}"
429
+ end
430
+
431
+
432
+ html = replace_h2( html )
433
+ html = replace_h4( html )
434
+
435
+
436
+
437
+
438
+ ## remove b - note: might include anchors (thus, call after anchors)
439
+ html = html.gsub( /<B>([^<]+)<\/B>/im ) do |_|
440
+ puts " remove bold (b) >#{$1}<"
441
+ "**#{$1}**"
442
+ end
443
+
444
+ ## replace preformatted (pre)
445
+ html = html.gsub( /<PRE>|<\/PRE>/i ) do |_|
446
+ puts " replace preformatted (pre)"
447
+ '' # replace w/ nothing for now (keep surrounding newlines)
448
+ end
449
+
450
+ =begin
451
+ puts
452
+ puts
453
+ puts "html:"
454
+ puts html[0..2000]
455
+ puts "-- snip --"
456
+ puts html[-1000..-1] ## print last thousand chars
457
+ =end
458
+
459
+
460
+ html = remove_emails( html )
461
+
462
+
463
+ ## cleanup whitespaces
464
+ ## todo/fix: convert newline in space first
465
+ ## and then collapse spaces etc.!!!
466
+ txt = String.new
467
+ html.each_line do |line|
468
+ line = line.gsub( "\t", ' ' ) # replace all tabs w/ two spaces for now
469
+ line = line.rstrip # remove trailing whitespace (incl. newline/formfeed)
470
+
471
+ txt << line
472
+ txt << "\n"
473
+ end
474
+
475
+ txt
476
+ end # method html_to_txt
477
+
478
+
479
+
480
+ ###
481
+ # more helpers
482
+ def log( msg )
483
+ ## append msg to ./logs.txt
484
+ ## use ./errors.txt - why? why not?
485
+ File.open( './logs.txt', 'a:utf-8' ) do |f|
486
+ f.write( msg )
487
+ f.write( "\n" )
488
+ end
489
+ end
490
+
491
+
492
+
493
+ end # class PageConverter
494
+ end # module Rsssf
495
+
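The converter above decodes HTML entities from a flat character/entity pair list and rewrites headings with named-capture regexes. A minimal standalone sketch of those two steps, assuming a trimmed-down entity table and a single generic heading pattern in the spirit of the todo note; `ENTITY_PAIRS`, `decode_entities`, `HEADING_RE` and `headings_to_markdown` are hypothetical names, not part of `Rsssf::PageConverter`:

```ruby
## trimmed-down pair list for illustration (character first, entity second)
##   note: &amp; goes last so that an escaped entity is not decoded twice
ENTITY_PAIRS = %w[
  ä &auml;
  ö &ouml;
  ü &uuml;
  é &eacute;
  & &amp;
]

def decode_entities( html )
  ENTITY_PAIRS.each_slice( 2 ) do |char, entity|
    html = html.gsub( entity, char )
  end
  html
end

## one pattern for h2, h3 and h4;
##   the closing tag must repeat the captured level (named backreference)
HEADING_RE = %r{ \s*
                 <H(?<level>[2-4])>
                   (?<title>.+?)
                 </H\k<level>>
                 \s*
               }imx

def headings_to_markdown( html )
  html.gsub( HEADING_RE ) do
    m = Regexp.last_match
    ## one hash mark per heading level, always surrounded by blank lines
    "\n\n#{'#' * m[:level].to_i} #{m[:title]}\n\n"
  end
end

puts headings_to_markdown( decode_entities( "<H2>S&eacute;rie A</H2>" ) )
```

The title capture is non-greedy, as in the original patterns, so tags nested inside a heading still end up in the title text.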