metrocot 1.0.0 → 1.0.2

Files changed (6)
  1. data/History.txt +10 -0
  2. data/README.txt +27 -27
  3. data/Rakefile +4 -1
  4. data/lib/metrocot.rb +313 -145
  5. data/test/test_metrocot.rb +71 -7
  6. metadata +4 -4
data/History.txt CHANGED
@@ -1,3 +1,13 @@
1
+ === 1.0.2 / 2009-01-08
2
+
3
+ * added docs and examples
4
+ * created tests for examples and made those work as well
5
+ * added StrippingTextScanner
6
+
7
+ === 1.0.1 / 2009-01-03
8
+
9
+ * checked in with some initial docs
10
+
1
11
  === 1.0.0 / 2009-01-02
2
12
 
3
13
  * First working version
data/README.txt CHANGED
@@ -4,9 +4,9 @@
4
4
 
5
5
  == DESCRIPTION:
6
6
 
7
- Metrocot builds on top of Hpricot to allow scraping of list data from HTML pages
8
- with a minimum of code and page specific information. The specification is done
9
- is a very compact readable format.
7
+ Metrocot builds on Hpricot to allow scraping of list data from HTML pages
8
+ with a minimum of code and page specific information. The specification is
9
+ done in a very compact readable format.
10
10
 
11
11
 
12
12
  == FEATURES/PROBLEMS:
@@ -19,39 +19,39 @@ is a very compact readable format.
19
19
 
20
20
  == SYNOPSIS:
21
21
 
22
- require 'rubygems'
23
- require 'metrocot'
22
+ require 'rubygems'
23
+ require 'metrocot'
24
24
 
25
- class Event < Object
25
+ class Event < Object
26
26
 
27
- attr_accessor :starts_at, :title, :description, :url
27
+ attr_accessor :starts_at, :title, :description, :url
28
28
 
29
- def initialize( starts_at, title, description, url )
30
- @starts_at = starts_at
31
- @title = title
32
- @description = description
33
- @url = url
29
+ def initialize( starts_at, title, description, url )
30
+ @starts_at = starts_at
31
+ @title = title
32
+ @description = description
33
+ @url = url
34
+ end
35
+
34
36
  end
35
-
36
- end
37
37
 
38
- mce_url = "http://www.musiccorner.ca/calendar.html"
39
- mce_doc = open(URI.parse(mce_url)) { |data| Hpricot(data) }
38
+ mce_url = "http://www.musiccorner.ca/calendar.html"
39
+ mce_doc = open(URI.parse(mce_url)) { |data| Hpricot(data) }
40
40
 
41
- scraper = Metrocot.new(
41
+ scraper = Metrocot.new(
42
42
  :starts_at => Metrocot::Scanners::DateTimeScanner,
43
43
  :description => Metrocot::Scanners::TextScanner,
44
- :title => Metrocot::Scanners::TextScanner
45
- )
44
+ :title => Metrocot::Scanners::StrippingTextScanner
45
+ )
46
46
 
47
- mce_events = scraper.scrape(mce_doc).descend("//div[@id='content']/table/tr/td") { |td|
48
- td.collect( "starts_at=.//h3 ... title=.//h2 ... description=((.//p )+)" ) { |starts_at, title, description| Event.new( starts_at, title, description, mce_url ) }
49
- }.values.flatten
50
-
51
- puts "Found #{mce_events.size} mce events:"
52
- mce_events.each_with_index { |event, event_index|
53
- puts "%3d %20s %s" % [event_index, event.starts_at, event.title]
54
- }
47
+ mce_events = scraper.scrape(mce_doc).descend("//div[@id='content']/table/tr/td") { |td|
48
+ td.collect( "starts_at=.//h3 ... title=.//h2 ... description=((.//p )+)" ) { |starts_at, title, description| Event.new( starts_at, title, description, mce_url ) }
49
+ }.values.flatten
50
+
51
+ puts "Found #{mce_events.size} mce events:"
52
+ mce_events.each_with_index { |event, event_index|
53
+ puts "%3d %20s %s" % [event_index, event.starts_at, event.title]
54
+ }
55
55
 
56
56
 
57
57
  == REQUIREMENTS:
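The one functional change in the synopsis is that `:title` now uses `StrippingTextScanner` instead of `TextScanner`. The following is a minimal stand-alone sketch of the difference. `PlainText` and `StrippedText` are hypothetical stand-ins that copy only the `to_s`/`strip` part of the scanners in this diff; the `Hpricot::Elem` branch is left out on the assumption that the input is already a String, so the sketch runs without Hpricot.

```ruby
# Stand-in for Metrocot::Scanners::TextScanner (String input only).
class PlainText
  def scan(data)
    data.to_s
  end
end

# Stand-in for Metrocot::Scanners::StrippingTextScanner (String input only).
class StrippedText
  def scan(data)
    text = data.to_s
    text && text.strip
  end
end

raw = "\n   Jazz Night  \n"

# The plain scanner keeps the surrounding whitespace of the text node,
# the stripping scanner removes it.
puts PlainText.new.scan(raw).inspect
puts StrippedText.new.scan(raw).inspect
```

For fields such as titles, which usually sit inside indented markup, the stripped form is likely what a caller wants to store.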
data/Rakefile CHANGED
@@ -5,8 +5,11 @@ require 'hoe'
5
5
  require './lib/metrocot.rb'
6
6
 
7
7
  Hoe.new('metrocot', Metrocot::VERSION) do |p|
8
- p.rubyforge_name = 'metrocot' # if different than lowercase project name
8
+ p.rubyforge_name = 'metrocot'
9
9
  p.developer('Helmut Hissen', 'helmut@zeebar.com')
10
+ p.summary = "An Hpricot based tool for harvesting list-like data from HTML pages."
11
+ p.remote_rdoc_dir = '' # Release to root
10
12
  end
11
13
 
14
+
12
15
  # vim: syntax=Ruby
data/lib/metrocot.rb CHANGED
@@ -1,4 +1,8 @@
1
1
 
2
+ # Author:: Helmut Hissen (mailto:helmut@zeebar.com)
3
+ # Copyright:: Copyright(c) 2009 Metro Cascade Media, Inc.
4
+ # License:: Distributed under the BSD open source license
5
+ #
2
6
  #############################################################################
3
7
  #
4
8
  # Copyright (c) 2009 Metro Cascade Media Inc
@@ -24,9 +28,6 @@
24
28
  #
25
29
  #############################################################################
26
30
  #
27
- # Helmut Hissen <helmut@zeebar.com> (Metro Cascade Media Inc)
28
- # January 1 2009
29
- #
30
31
  #############################################################################
31
32
  #
32
33
  # We are like tiny pleasantly chirping hex bugs coding away on the
@@ -38,9 +39,39 @@
38
39
  #############################################################################
39
40
  #
40
41
 
42
+ # == Purpose
43
+ # This class implements the main Metrocot HTML scanner and a number of handy
44
+ # input scanners (for grabbing time, numbers, or text from HTML). The purpose
45
+ # of the Metrocot is to scan an XML dom for the patterns specified in the
46
+ # Metrocot pattern language.
47
+ #
48
+ # == Pattern Language
49
+ # The Metrocot pattern language allows for the following types of patterns:
50
+ #
51
+ # +...+:: matches anything
52
+ # +"some string"+:: matches that string
53
+ # +/(some|pattern)/+:: matches that regexp pattern
54
+ # +./HPRICOT_PATH+:: matches a certain type of dom subtree
55
+ # +SPACE+:: matches zero or more white spaces
56
+ # +(PATTERN_A PATTERN_B)+:: matches PATTERN_A followed by PATTERN_B
57
+ # +PATTERN\++:: matches one or more occurrences of PATTERN
58
+ #
59
+ # == Usage
60
+ # 0) create a Metrocot and define the types of fields you want to extract (and their names).
61
+ # 1) use Hpricot to get the doc's dom
62
+ # 2) use descend(xpath) to create a NodeScraper rooted at the Hpricot node(s) matching the xpath
63
+ # 3) use collect(pattern) to collect all entries found in the HTML which match the Metrocot pattern
64
+
41
65
  class Metrocot < Object
42
66
 
43
- VERSION = '1.0.0'
67
+ VERSION = '1.0.2'
68
+
69
+
70
+ # represents a subtree within a metrocot dom. the semantics are roughly equivalent
71
+ # to what you get from select-dragging your mouse pointer through a section of an html
72
+ # doc in your web browser. That is, a range specifies the first and last node in the
73
+ # pre-fix traversal of the dom. Additionally, the first and last node may be truncated
74
+ # (at their tail and head respectively) if they are text nodes.
44
75
 
45
76
  class MatchRange
46
77
 
@@ -126,6 +157,9 @@ class Metrocot < Object
126
157
  end
127
158
 
128
159
 
160
+ #
161
+ # base class for all other patterns. Provides some reasonable default behaviours.
162
+ #
129
163
 
130
164
  class BasePattern
131
165
 
@@ -156,7 +190,11 @@ class Metrocot < Object
156
190
  end
157
191
 
158
192
  def description
159
- self.class.name
193
+ if name
194
+ "#{self.class.name} \"#{name}\""
195
+ else
196
+ self.class.name
197
+ end
160
198
  end
161
199
 
162
200
 
@@ -247,6 +285,8 @@ class Metrocot < Object
247
285
  end
248
286
 
249
287
 
288
+ # matches a certain Hpricot path
289
+
250
290
  class PathPattern < BasePattern
251
291
 
252
292
  def initialize( source, path )
@@ -311,6 +351,8 @@ class Metrocot < Object
311
351
 
312
352
  end
313
353
 
354
+
355
+ # matches zero or more white spaces
314
356
 
315
357
  class OptSpacePattern < BasePattern
316
358
 
@@ -332,47 +374,58 @@ class Metrocot < Object
332
374
  end
333
375
 
334
376
  def priority
335
- -7
377
+ -8
336
378
  end
337
379
 
338
380
  def each_match( match_range, match_map )
381
+
382
+ result = nil
383
+
339
384
  super(match_range, match_map)
340
- match_start_index = match_range.start_index
385
+
386
+ match_start_index = match_range.start_index
341
387
  match_start_offset = match_range.start_offset
342
- match_end_index = match_range.start_index
343
- match_end_offset = match_range.start_offset
388
+ match_end_index = match_range.start_index
389
+ match_end_offset = match_range.start_offset
390
+
391
+ raise "negative range #{match_range}" if match_start_index > match_end_index
392
+ raise "negative range #{match_range}" if match_start_index == match_end_index && match_start_offset > match_end_offset
344
393
 
345
394
  # consume rest of first text node
346
395
 
347
396
  hnodes = match_range.hnodes
397
+ hnode_text = nil
348
398
 
349
399
  if hnodes[match_start_index] && hnodes[match_start_index].text?
350
- hnode_text = hnodes[match_start_index].inner_text
351
- while match_end_offset < hnode_text.size && (/\s+/.=== hnode_text[match_start_offset .. match_end_offset])
352
- match_end_offset += 1
353
- end
400
+
401
+ hnode_text = hnodes[match_start_index].inner_text
402
+ while match_end_offset < hnode_text.size && (/^\s+$/.=== hnode_text[match_start_offset .. match_end_offset])
403
+ match_end_offset += 1
404
+ end
354
405
 
355
- if match_end_offset > match_start_offset
356
- if match_end_offset >= hnode_text.size
357
- match_range = match_range.tail( match_end_index + 1, 0 )
358
- log( "matched entire string of #{match_end_offset - match_start_offset} spaces" )
359
- else
360
- match_range = match_range.tail( match_end_index, match_end_offset )
361
- log( "matched first #{match_end_offset - match_start_offset} leading spaces" )
362
- end
406
+ if match_end_offset >= match_start_offset
407
+ if match_end_offset >= hnode_text.size
408
+ log( "matched entire string of #{match_end_offset - match_start_offset} spaces" )
409
+ else
410
+ log( "matched first #{match_end_offset - match_start_offset} leading spaces" )
363
411
  end
412
+ end
364
413
  end
365
414
 
415
+ sp_match_range = match_range.crop( match_start_index, match_start_offset, match_end_index, match_end_offset )
416
+
366
417
  result = with_scanned_match_data( match_map, hnodes[match_start_index ... match_end_index] ) { |match_map|
367
- yield( match_range, match_map )
418
+ yield( sp_match_range, match_map )
368
419
  }
369
- result
370
420
 
421
+ result
371
422
  end
372
423
 
373
424
  end
374
425
 
375
426
 
427
+ # matches one or more occurrences of some other pattern
428
+
376
429
  class OneOrMorePattern < BasePattern
377
430
 
378
431
  def initialize(repeatee)
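Review note on the `OptSpacePattern#each_match` hunk above: the whitespace test on the growing slice changes from `/\s+/` to `/^\s+$/`. The sketch below is a stand-alone illustration of why the anchoring matters. The helper name `leading_space_count` is hypothetical, and it uses the whole-string anchors `\A` and `\z` rather than the line anchors `^` and `$` that appear in the diff.

```ruby
# Counts how many characters an OptSpacePattern-style loop consumes from
# +start_offset+ in +text+. +space_re+ is the test applied to the growing
# slice, mirroring the while loop in the hunk.
def leading_space_count(text, space_re, start_offset = 0)
  end_offset = start_offset
  while end_offset < text.size && space_re === text[start_offset..end_offset]
    end_offset += 1
  end
  end_offset - start_offset
end

text = "  ab c"

# Unanchored: any slice that merely contains whitespace passes the test,
# so the loop keeps going past the non-space characters.
unanchored = leading_space_count(text, /\s+/)

# Anchored: the whole slice must be whitespace, so the loop stops at "a".
anchored = leading_space_count(text, /\A\s+\z/)

puts "unanchored test consumed #{unanchored} of #{text.size} chars"
puts "anchored test consumed #{anchored} of #{text.size} chars"
```

With the unanchored test the loop swallows the rest of the text node once it has seen a single space, which is presumably the behaviour the 1.0.2 change is meant to stop.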
@@ -458,6 +511,8 @@ class Metrocot < Object
458
511
  end
459
512
 
460
513
 
514
+ # Matches anything.
515
+
461
516
  class AnythingPattern < BasePattern
462
517
 
463
518
  def description
@@ -474,9 +529,25 @@ class Metrocot < Object
474
529
  # it just expands to fill whatever gap
475
530
 
476
531
  def each_match( match_range, match_map )
477
- with_scanned_match_data( match_map, match_range.hnodes[match_range.start_index .. match_range.end_index] ) { |match_map|
478
- yield( match_range, match_map )
479
- }
532
+
533
+ result = if match_range.start_index > match_range.end_index
534
+ log( "empty range" )
535
+ nil
536
+ elsif match_range.start_index > match_range.end_index && match_range.start_offset > match_range.end_offset
537
+ log( "empty node" )
538
+ nil
539
+ elsif match_range.start_index == match_range.end_index || (match_range.start_index + 1 == match_range.end_index && match_range.end_offset == 0) && match_range.hnodes[match_range.start_index].text?
540
+ log( "single text node range #{match_range.describe}" )
541
+ raise "bad range #{match_range.describe}" if match_range.hnodes[match_range.start_index].nil?
542
+ with_scanned_match_data( match_map, match_range.hnodes[match_range.start_index].inner_text[match_range.start_offset...match_range.end_offset] ) { |match_map|
543
+ yield( match_range, match_map )
544
+ }
545
+ else
546
+ log( "multi node range #{match_range.describe}" )
547
+ with_scanned_match_data( match_map, match_range.hnodes[match_range.start_index ... match_range.end_index] ) { |match_map|
548
+ yield( match_range, match_map )
549
+ }
550
+ end
480
551
  end
481
552
 
482
553
 
@@ -487,6 +558,8 @@ class Metrocot < Object
487
558
  end
488
559
 
489
560
 
561
+ # Matches a certain text string or regex pattern
562
+
490
563
  class TextPattern < BasePattern
491
564
 
492
565
  def initialize( source, text )
@@ -580,10 +653,9 @@ class Metrocot < Object
580
653
 
581
654
  super(match_range, match_map)
582
655
 
583
- match_start_index = match_range.start_index
584
- match_start_offset = match_range.start_offset
585
- match_end_index = match_range.start_index
586
- match_end_offset = match_range.start_offset
656
+ match_index = match_range.start_index
657
+ match_offset = match_range.start_offset
658
+
587
659
 
588
660
  # consume rest of first text node
589
661
 
@@ -591,62 +663,72 @@ class Metrocot < Object
591
663
 
592
664
  actual_match = nil
593
665
 
594
- while match_start_index < match_range.end_index
666
+ while match_index < match_range.end_index || (match_index == match_range.end_index && match_offset < match_range.end_offset)
595
667
 
596
- while match_start_index < match_range.end_index && ! hnodes[match_start_index].text?
597
- log( "not text: ##{match_start_index} #{hnodes[match_start_index].class}" )
598
- match_start_index += 1
599
- match_start_offset = 0
668
+ while (match_index < match_range.end_index || (match_index == match_range.end_index && match_offset < match_range.end_offset)) && ! hnodes[match_index].text?
669
+ log( "not text: ##{match_index} #{hnodes[match_index].class}" )
670
+ match_index += 1
671
+ match_offset = 0
600
672
  end
601
673
 
602
- unless match_start_index < match_range.end_index && hnodes[match_start_index].text?
674
+ unless (match_index < match_range.end_index || (match_index == match_range.end_index && match_offset < match_range.end_offset)) && hnodes[match_index].text?
603
675
  log( "no match found" )
604
676
  return nil
605
677
  end
606
678
 
607
- hnode_text = hnodes[match_start_index].inner_text
679
+ hnode_text = if match_index == match_range.end_index
680
+ hnodes[match_index].inner_text[0...match_range.end_offset]
681
+ else
682
+ hnodes[match_index].inner_text
683
+ end
608
684
 
609
- log( "trying text match on: #{hnode_text[match_start_offset .. -1]}" )
685
+ log( "trying text match on: #{hnode_text[match_offset .. -1]}" )
610
686
 
611
- match_offset = hnode_text.index( @text, match_start_offset )
687
+ next_match_offset = hnode_text.index( @text, match_offset )
612
688
 
613
- if match_offset
689
+ if next_match_offset.nil?
690
+ log( "no match found for #{@text}" )
691
+ match_index += 1
692
+ match_offset = 0
693
+ next
694
+ end
614
695
 
615
- actual_match = if @text.is_a? Regexp
616
- hnode_text[match_offset..-1][@text]
617
- else
618
- @text
619
- end
696
+ actual_match = if @text.is_a? Regexp
697
+ hnode_text[next_match_offset..-1][@text]
698
+ else
699
+ @text
700
+ end
701
+
702
+ log( "next text match at #{match_index}.#{next_match_offset}: #{actual_match}" )
620
703
 
621
- match_end_offset = match_start_offset + actual_match.size
622
- match_start_offset = match_start_offset + actual_match.size
704
+ match_start_offset = next_match_offset
705
+ match_end_offset = match_start_offset + actual_match.size
623
706
 
624
- if match_end_offset >= match_start_offset
625
- if match_end_offset >= hnode_text.size
626
- log( "matched entire string of #{match_end_offset - match_start_offset} chars" )
627
- else
628
- log( "matched first #{match_end_offset - match_start_offset} chars" )
629
- end
630
- break
631
- end
707
+ if match_end_offset >= hnode_text.size
708
+ log( "matched entire string of #{match_end_offset - match_start_offset} chars" )
709
+ else
710
+ log( "matched first #{match_end_offset - match_start_offset} chars" )
632
711
  end
633
712
 
634
- match_start_index += 1
635
- match_start_offset = 0
713
+ result = with_scanned_match_data( match_map, actual_match ) { |match_map|
714
+ yield( match_range.crop( match_index, match_start_offset, match_index, match_end_offset), match_map )
715
+ }
716
+
717
+ return result if result
718
+
719
+ match_offset = match_end_offset
636
720
 
637
721
  end
638
722
 
639
-
640
- result = with_scanned_match_data( match_map, actual_match ) { |match_map|
641
- yield( match_range.crop( match_start_index, match_start_offset, match_start_index, match_end_offset), match_map )
642
- }
643
- result
723
+ return nil
644
724
 
645
725
  end
646
726
 
647
727
  end
648
728
 
649
729
 
730
+ # Matches a series of patterns
731
+
650
732
  class CompositePattern < BasePattern
651
733
 
652
734
  attr_reader :parts
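Review note on the `TextPattern#each_match` hunk above: the rewritten loop takes the match start from the offset returned by `String#index` instead of from the search start, and resumes the search at the end of the previous match. A stand-alone sketch of that offset bookkeeping on a single string follows. The helper `text_matches` is hypothetical and ignores the multi-node traversal and the yield/retry protocol of the real method.

```ruby
# Returns [start_offset, end_offset, matched_text] triples for each
# successive occurrence of +pattern+ (a String or a Regexp) in +text+.
def text_matches(text, pattern)
  matches = []
  offset = 0
  while offset < text.size
    start = text.index(pattern, offset)
    break if start.nil?

    # For a Regexp the matched text has to be looked up; for a String
    # the match is the pattern itself.
    matched = pattern.is_a?(Regexp) ? text[start..-1][pattern] : pattern
    break if matched.empty? # guard against looping on zero-width matches

    finish = start + matched.size
    matches << [start, finish, matched]
    offset = finish # continue after this match
  end
  matches
end

p text_matches("ab ab", "ab")
p text_matches("a with b using c", /(with|using)/)
```

The important detail is that `finish` is computed from `start`, not from the offset the search began at; the removed lines derived both offsets from the search start.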
@@ -676,7 +758,7 @@ class Metrocot < Object
676
758
  end
677
759
 
678
760
 
679
- def each_split_match( match_range, match_map, parts_by_priority, ppx, part_matches )
761
+ def each_split_match( comp_match_range, match_map, parts_by_priority, ppx, part_matches )
680
762
 
681
763
  pattern = nil
682
764
 
@@ -690,10 +772,13 @@ class Metrocot < Object
690
772
 
691
773
  if ppx >= parts_by_priority.size
692
774
  log("comp nothing left to do")
693
- return yield( match_range, match_map )
775
+ return yield( comp_match_range, match_map )
694
776
  end
695
777
 
696
778
 
779
+ match_range = comp_match_range
780
+ log("comp matching sub-pattern #{pattern.description} within #{match_range.describe}")
781
+
697
782
  #
698
783
  # figure out which gap this pattern is supposed to fill
699
784
  #
@@ -718,15 +803,25 @@ class Metrocot < Object
718
803
  if matched_on_left
719
804
  log("comp matching must be right of #{matched_on_left.description}")
720
805
  match_range = match_range.tail(matched_on_left.matched.end_index, matched_on_left.matched.end_offset)
806
+ elsif matched_on_right
807
+ right_node = match_range.hnodes[matched_on_right.matched.start_index]
808
+ parent_of_right_node = right_node && right_node.parent
809
+ parent_ix_of_right_node = parent_of_right_node && node_scraper.hnode_index[parent_of_right_node]
810
+ if parent_ix_of_right_node && parent_ix_of_right_node >= match_range.start_index
811
+ match_range = match_range.tail(parent_ix_of_right_node + 1, 0)
812
+ log("restricting left boundary to #{match_range} because would otherwise include subtree with right peer")
813
+ end
721
814
  end
722
815
 
723
- log("comp matching sub-pattern: #{pattern.description} at #{match_range.describe}")
816
+ log("comp matching sub-pattern #{pattern.description} at #{match_range.describe}")
724
817
 
725
818
  pattern.each_match( match_range, match_map ) { |part_match_range, match_map|
726
819
 
727
820
  pattern.matched = part_match_range
728
821
 
729
- result = each_split_match( match_range, match_map, parts_by_priority, ppx + 1, part_matches ) { |sub_match_range, sub_match_map|
822
+ log("found sub-pattern #{pattern.description} at #{part_match_range.describe}")
823
+
824
+ result = each_split_match( comp_match_range, match_map, parts_by_priority, ppx + 1, part_matches ) { |sub_match_range, sub_match_map|
730
825
  yield( sub_match_range, match_map )
731
826
  }
732
827
 
@@ -785,6 +880,9 @@ class Metrocot < Object
785
880
  end
786
881
 
787
882
 
883
+ # rooted at a node in the dom, the node scraper is used to collect all matches of
884
+ # patterns.
885
+
788
886
  class NodeScraper
789
887
 
790
888
  attr_accessor :mcot, :root, :parent, :hnode, :pattern_classes, :top_part_names, :verbose
@@ -883,15 +981,16 @@ class Metrocot < Object
883
981
  result = nil
884
982
  pattern.each_match( match_range, {} ) { |sub_match_range, match_map|
885
983
  match_list = []
886
- block_args = if (call_with == :positional) && top_part_names.size > 0
887
- top_part_names.collect { |top_name|
984
+ result = if (call_with == :positional) && top_part_names.size > 0
985
+ block_args = top_part_names.collect { |top_name|
888
986
  match_map[top_name]
889
987
  }
988
+ log("calling pos scan block with: #{block_args.inspect}")
989
+ result = block.call( *block_args )
890
990
  else
891
- match_map
991
+ log("calling hash scan block with: #{match_map.inspect}")
992
+ result = block.call( match_map )
892
993
  end
893
- log("calling scan block with: #{block_args.join(", ")}")
894
- result = block.call( *block_args )
895
994
  if result
896
995
  results << result
897
996
  match_range = match_range.following( sub_match_range )
@@ -905,16 +1004,51 @@ class Metrocot < Object
905
1004
  results
906
1005
  end
907
1006
 
1007
+ # collects all occurrences of the data matching the pattern by calling the
1008
+ # yield block for every part of the dom subtree matching the pattern.
1009
+ # The block can reject the dom match by returning nil. Anything other than
1010
+ # nil will be appended to the list returned at the end.
1011
+ #
1012
+ # Unlike collect_hashed(), the block will be given a list of parameter
1013
+ # values matching the list of named fields in the pattern.
1014
+ #
1015
+ # === Example
1016
+ #
1017
+ # mcot.scrape(doc).descend( "//ul/li" ) { |li|
1018
+ # li.collect( "liker=... \"likes\" likee=..." ) { |likes, liked|
1019
+ # [ likes, liked ]
1020
+ # }
1021
+ # }
1022
+ #
908
1023
 
909
1024
  def collect( pattern_s, &block )
910
1025
  collect_gen( pattern_s, :positional, &block )
911
1026
  end
912
1027
 
913
1028
 
1029
+ # collects all occurrences of the data matching the pattern by calling the
1030
+ # yield block for every part of the dom subtree matching the pattern.
1031
+ # The block can reject the dom match by returning nil. Anything other than
1032
+ # nil will be appended to the list returned at the end.
1033
+ #
1034
+ # Unlike collect(), the block will be given a map of parameter values
1035
+ # keyed by the names of the named fields in the pattern.
1036
+ #
1037
+ # === Example
1038
+ #
1039
+ # mcot.scrape(doc).descend( "//ul/li" ) { |li|
1040
+ # li.collect_hashed( "killer=... verb=/(stabbed|shot|strangled)/ victim=... \"(with|using)\" weapon=..." ) { |map|
1041
+ # Murder.new( map )
1042
+ # }
1043
+ # }
1044
+
914
1045
  def collect_hashed( pattern_s, &block )
915
1046
  collect_gen( pattern_s, :map, &block )
916
1047
  end
917
1048
 
1049
+
1050
+ # returns the scanner declared with that name when the metrocot was created
1051
+
918
1052
  def scanner_by_name( name )
919
1053
  return mcot.scanner_by_name(name)
920
1054
  end
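Review note on the `collect` / `collect_hashed` documentation above: both delegate to `collect_gen` and differ only in how the block is called. A small stand-alone sketch of that dispatch follows. The helper `dispatch` is hypothetical; it mirrors the `:positional` versus `:map` branch shown in the `collect_gen` hunk and nothing else.

```ruby
# Calls +block+ either with positional values (in the order of +names+)
# or with the whole match map.
def dispatch(match_map, names, call_with, &block)
  if call_with == :positional && !names.empty?
    block.call(*names.map { |name| match_map[name] })
  else
    block.call(match_map)
  end
end

match_map = { :liker => "Alice", :likee => "Bob" }
names = [:liker, :likee]

# collect style: one block parameter per named field, in pattern order.
pair = dispatch(match_map, names, :positional) { |liker, likee| [liker, likee] }

# collect_hashed style: a single hash keyed by field name.
likee = dispatch(match_map, names, :map) { |map| map[:likee] }

p pair
p likee
```

The hashed form is the more convenient one when a pattern has many named fields or when the values are passed straight to a constructor that takes a hash.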
@@ -922,17 +1056,125 @@ class Metrocot < Object
922
1056
  end
923
1057
 
924
1058
 
1059
+ # Some useful scanners which should cover a good number of common
1060
+ # parsing scenarios.
1061
+
1062
+ module Scanners
1063
+
1064
+ class BaseScanner
1065
+ def scan(data)
1066
+ data.to_s
1067
+ end
1068
+ end
1069
+
1070
+
1071
+ # Scans the hpricot element or text for a date in one of the
1072
+ # various formats accepted by Time. Depending on the kinds of
1073
+ # dates expected in the doc, this may not be sufficient and
1074
+ # you may have to create a custom scanner.
1075
+
1076
+ class DateTimeScanner < BaseScanner
1077
+ def scan( data )
1078
+ if data.is_a? Hpricot::Elem
1079
+ data = data.inner_text
1080
+ end
1081
+ Time.parse(data)
1082
+ end
1083
+ end
1084
+
1085
+
1086
+ # Scans the dom subtree and converts it to textile where possible
1087
+ # returning a string containing the Textile version
1088
+
1089
+ class TextileScanner < BaseScanner
1090
+ def scan( data )
1091
+ if data.is_a? Hpricot::Elem
1092
+ data = data.inner_text
1093
+ end
1094
+ end
1095
+ end
1096
+
1097
+
1098
+ # just pulls out the plain text. This will probably be the most
1099
+ # commonly used scanner.
1100
+
1101
+ class TextScanner < BaseScanner
1102
+ def scan( data )
1103
+ if data.is_a? Hpricot::Elem
1104
+ data = data.inner_text
1105
+ else
1106
+ data = data.to_s
1107
+ end
1108
+ data
1109
+ end
1110
+ end
1111
+
1112
+
1113
+ # just pulls out the plain text and strips it. This will probably be one of the
1114
+ # most commonly used scanners.
1115
+
1116
+ class StrippingTextScanner < BaseScanner
1117
+ def scan( data )
1118
+ if data.is_a? Hpricot::Elem
1119
+ data = data.inner_text
1120
+ else
1121
+ data = data.to_s
1122
+ end
1123
+ data && data.strip
1124
+ end
1125
+ end
1126
+
1127
+ # pulls out text and then discards everything up to the end of line
1128
+
1129
+ class LineScanner < BaseScanner
1130
+ end
1131
+
1132
+ end
1133
+
925
1134
  def log( s )
926
1135
  puts( s ) if @verbose
927
1136
  end
928
1137
 
929
1138
 
1139
+ # given a Symbol of a scanner (name), return a handle for that scanner
1140
+ # declared when the metrocot was created.
930
1141
 
931
1142
  def scanner_by_name( name )
932
1143
  @scanners[name]
933
1144
  end
934
1145
 
935
1146
 
1147
+
1148
+ attr_accessor :verbose
1149
+
1150
+
1151
+ def initialize( scanners )
1152
+
1153
+ @scanners = {}
1154
+ @compiled_patterns = {}
1155
+
1156
+ scanners.each { |name, value|
1157
+ if value.is_a? Class
1158
+ @scanners[name] = value.new
1159
+ else
1160
+ @scanners[name] = value
1161
+ end
1162
+ }
1163
+
1164
+ @verbose = false
1165
+
1166
+ log("scanners: #{@scanners.inspect}")
1167
+
1168
+ end
1169
+
1170
+ #
1171
+ #
1172
+
1173
+ def scrape(doc)
1174
+ NodeScraper.new( self, nil, nil, doc )
1175
+ end
1176
+
1177
+
936
1178
  def compile_pattern( pattern_s, node_scraper )
937
1179
 
938
1180
  # if @compiled_patterns.key? pattern_s
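Review note on `Metrocot#initialize` in the hunk above: a scanner given as a Class is instantiated, and any other value is stored as passed. That suggests a custom scanner can be supplied either as a class or as a preconfigured instance. The sketch below illustrates this under the assumption, not confirmed by this diff, that Metrocot only ever calls `scan(data)` on a scanner. `build_scanners` and `PriceScanner` are hypothetical names.

```ruby
# Mirrors the loop in Metrocot#initialize: classes are instantiated,
# everything else is kept as given.
def build_scanners(spec)
  scanners = {}
  spec.each do |name, value|
    scanners[name] = value.is_a?(Class) ? value.new : value
  end
  scanners
end

# Hypothetical custom scanner: removes a currency marker and returns a Float.
class PriceScanner
  def initialize(currency = "$")
    @currency = currency
  end

  def scan(data)
    data.to_s.sub(@currency, "").strip.to_f
  end
end

scanners = build_scanners(
  :price     => PriceScanner,            # class, instantiated with defaults
  :cad_price => PriceScanner.new("CAD")  # instance, stored as given
)

p scanners[:price].scan(" $12.50 ")
p scanners[:cad_price].scan("CAD 8")
```

Passing an instance is the only way to hand constructor arguments to a scanner, since the class form is always created with a bare `new`.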
@@ -1031,80 +1273,6 @@ class Metrocot < Object
1031
1273
 
1032
1274
  end
1033
1275
 
1034
-
1035
- attr_accessor :verbose
1036
-
1037
-
1038
- def initialize( scanners )
1039
-
1040
- @scanners = {}
1041
- @compiled_patterns = {}
1042
-
1043
- scanners.each { |name, value|
1044
- if value.is_a? Class
1045
- @scanners[name] = value.new
1046
- else
1047
- @scanners[name] = value
1048
- end
1049
- }
1050
-
1051
- @verbose = false
1052
-
1053
- log("scanners: #{@scanners.inspect}")
1054
-
1055
- end
1056
-
1057
-
1058
- def scrape(doc)
1059
- NodeScraper.new( self, nil, nil, doc )
1060
- end
1061
-
1062
-
1063
- module Scanners
1064
-
1065
- class BaseScanner
1066
- def scan(data)
1067
- data.to_s
1068
- end
1069
- end
1070
-
1071
- class DateTimeScanner < BaseScanner
1072
- def scan( data )
1073
- if data.is_a? Hpricot::Elem
1074
- data = data.inner_text
1075
- end
1076
- Time.parse(data)
1077
- end
1078
- end
1079
-
1080
- class TextLookupScanner < BaseScanner
1081
- end
1082
-
1083
- class TextileScanner < BaseScanner
1084
- def scan( data )
1085
- if data.is_a? Hpricot::Elem
1086
- data = data.inner_text
1087
- end
1088
- end
1089
- end
1090
-
1091
- class TextScanner < BaseScanner
1092
- def scan( data )
1093
- if data.is_a? Hpricot::Elem
1094
- data = data.inner_text
1095
- else
1096
- data = data.to_s
1097
- end
1098
- data
1099
- end
1100
- end
1101
-
1102
- class LineScanner < BaseScanner
1103
- end
1104
-
1105
-
1106
- end
1107
-
1108
1276
  end
1109
1277
 
1110
1278
  #
data/test/test_metrocot.rb CHANGED
@@ -20,21 +20,32 @@ class TestMetrocot < Test::Unit::TestCase
20
20
  end
21
21
  end
22
22
 
23
+ class Murder < Object
24
+ attr_accessor :killer, :mod, :weapon, :victim
25
+ def initialize( map )
26
+ @killer = map[:killer]
27
+ @mod = map[:mod]
28
+ @weapon = map[:weapon]
29
+ @victim = map[:victim]
30
+ end
31
+ end
32
+
33
+
23
34
  def test_nothing
24
35
 
25
36
  html = "<html><head></head><body><h1>hello!</h1><h2>A</h2><p>a 1</p><p>a 2</p><h2>B</h2><p>b 1</p><p>b 2</p><h2>C</h2><p>c 1</p><p>c 2</p></body></html>"
26
37
 
27
38
  doc = Hpricot(html)
28
39
 
29
- scraper = Metrocot.new(
40
+ mcot = Metrocot.new(
30
41
  :a => Metrocot::Scanners::TextScanner,
31
42
  :b => Metrocot::Scanners::TextScanner,
32
43
  :c => Metrocot::Scanners::TextScanner
33
44
  )
34
45
 
35
- # scraper.verbose = true
46
+ # mcot.verbose = true
36
47
 
37
- assert_equal( [[]], scraper.scrape(doc).descend("//html/body") { |td|
48
+ assert_equal( [[]], mcot.scrape(doc).descend("//html/body") { |td|
38
49
  td.collect( "a=.//h3 b=.//p c=.//p" ) { |a, b, c|
39
50
  Abc.new( a, b, c )
40
51
  }
@@ -42,21 +53,22 @@ class TestMetrocot < Test::Unit::TestCase
42
53
 
43
54
  end
44
55
 
56
+
45
57
  def test_abc
46
58
 
47
- html = "<html><head></head><body><h1>hello!</h1><h2>A</h2><p>a 1</p><p>a 2</p><h2>B</h2><p>b 1</p><p>b 2</p><h2>C</h2><p>c 1</p><p>c 2</p></body></html>"
59
+ html = "<html><head></head><body><h1>hello!</h1><h2>A</h2><h3>where are the cartoon foxes?</h3><p>a 1</p><p>a 2</p><h2>B</h2><p>b 1</p><p>b 2</p><h2>C</h2><p>c 1</p><p>c 2</p><p>Inquiring minds want to know!</p></body></html>"
48
60
 
49
61
  doc = Hpricot(html)
50
62
 
51
- scraper = Metrocot.new(
63
+ mcot = Metrocot.new(
52
64
  :a => Metrocot::Scanners::TextScanner,
53
65
  :b => Metrocot::Scanners::TextScanner,
54
66
  :c => Metrocot::Scanners::TextScanner
55
67
  )
56
68
 
57
- # scraper.verbose = true
69
+ # mcot.verbose = true
58
70
 
59
- abcs = scraper.scrape(doc).descend("//html/body") { |td|
71
+ abcs = mcot.scrape(doc).descend("//html/body") { |td|
60
72
  td.collect( "a=.//h2 b=.//p c=.//p" ) { |a, b, c|
61
73
  Abc.new( a, b, c )
62
74
  }
@@ -66,5 +78,57 @@ class TestMetrocot < Test::Unit::TestCase
66
78
 
67
79
  end
68
80
 
81
+
82
+ def test_murder
83
+
84
+ html = %{
85
+ <html>
86
+ <head>
87
+ <title>Murder by Numbers</title>
88
+ </head>
89
+ <body>
90
+ <h1>The Who Killed Who Hoedown</h1>
91
+ <ul>
92
+ <li>Bob strangled the plumber with piano wire.</li>
93
+ <li>Collonel Clinket stabbed Aunt Elizabeth with a screw driver, while Dick shot Harry with a hunting rifle.</li>
94
+ <li>Other than that, not much happened here today. Honestly!</li>
95
+ <li>Accidents do and will happen and that's what happened last Friday.</li>
96
+ </ul>
97
+ <p>
98
+ ps: Collonel Clinket would have stabbed Aunt Elizabeth with an ice pick if that had been available.
99
+ </p>
100
+ </body>
101
+ </html>
102
+ }
103
+
104
+ doc = Hpricot(html)
105
+
106
+ mcot = Metrocot.new(
107
+ :killer => Metrocot::Scanners::StrippingTextScanner,
108
+ :victim => Metrocot::Scanners::StrippingTextScanner,
109
+ :mod => Metrocot::Scanners::StrippingTextScanner,
110
+ :weapon => Metrocot::Scanners::StrippingTextScanner
111
+ )
112
+
113
+ # mcot.verbose = true
114
+
115
+ murder_strings = []
116
+ mcot.scrape(doc).descend( "//ul/li" ) { |li|
117
+ li.collect_hashed( "killer=... mod=/(stabbed|shot|strangled)/ victim=... /(with|using)/ weapon=... /[.,]/" ) { |map| Murder.new( map ) }
118
+ }.each { |node, murders|
119
+ murders.each { |murder|
120
+ ms = " %-20s %-20s %-10s %-20s" % [murder.killer, murder.victim, murder.mod, murder.weapon]
121
+ # puts ms
122
+ murder_strings << ms
123
+ }
124
+ }
125
+
126
+ assert_equal([
127
+ " Bob the plumber strangled piano wire ",
128
+ " Collonel Clinket Aunt Elizabeth stabbed a screw driver ",
129
+ " while Dick Harry shot a hunting rifle "].sort, murder_strings.sort)
130
+
131
+ end
132
+
69
133
  end
70
134
 
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: metrocot
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.0.0
4
+ version: 1.0.2
5
5
  platform: ruby
6
6
  authors:
7
7
  - Helmut Hissen
@@ -9,7 +9,7 @@ autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
11
 
12
- date: 2009-01-05 00:00:00 -08:00
12
+ date: 2009-01-08 00:00:00 -08:00
13
13
  default_executable:
14
14
  dependencies:
15
15
  - !ruby/object:Gem::Dependency
@@ -22,7 +22,7 @@ dependencies:
22
22
  - !ruby/object:Gem::Version
23
23
  version: 1.8.2
24
24
  version:
25
- description: Metrocot builds on top of Hpricot to allow scraping of list data from HTML pages with a minimum of code and page specific information. The specification is done is a very compact readable format.
25
+ description: Metrocot builds on Hpricot to allow scraping of list data from HTML pages with a minimum of code and page specific information. The specification is done in a very compact readable format.
26
26
  email:
27
27
  - helmut@zeebar.com
28
28
  executables:
@@ -67,6 +67,6 @@ rubyforge_project: metrocot
67
67
  rubygems_version: 1.2.0
68
68
  signing_key:
69
69
  specification_version: 2
70
- summary: Metrocot builds on top of Hpricot to allow scraping of list data from HTML pages with a minimum of code and page specific information
70
+ summary: An Hpricot based tool for harvesting list-like data from HTML pages.
71
71
  test_files:
72
72
  - test/test_metrocot.rb
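Review note on `test_murder` in test/test_metrocot.rb: the pattern `killer=... mod=/(stabbed|shot|strangled)/ victim=... /(with|using)/ weapon=... /[.,]/` can be read as a sequence of named wildcards separated by literal anchors. As an analogy only, and not a description of how Metrocot matches, here is a plain Ruby regular expression that pulls the same fields out of a single sentence. `MURDER_RE` and `murders_in` are hypothetical names.

```ruby
# Each "..." in the Metrocot pattern becomes a lazy wildcard, and the
# unnamed /(with|using)/ and /[.,]/ parts become non-capturing anchors.
MURDER_RE = /(?<killer>.+?)(?<mod>stabbed|shot|strangled)(?<victim>.+?)(?:with|using)(?<weapon>.+?)[.,]/

# Returns one hash per match, with values stripped the way
# StrippingTextScanner strips them.
def murders_in(sentence)
  sentence.scan(MURDER_RE).map do |killer, mod, victim, weapon|
    { :killer => killer.strip, :mod => mod, :victim => victim.strip, :weapon => weapon.strip }
  end
end

sentences = [
  "Bob strangled the plumber with piano wire.",
  "Collonel Clinket stabbed Aunt Elizabeth with a screw driver, while Dick shot Harry with a hunting rifle.",
  "Other than that, not much happened here today. Honestly!"
]

sentences.each do |sentence|
  murders_in(sentence).each { |murder| p murder }
end
```

Like the test expectation, this yields "while Dick" as the third killer, because the leading wildcard takes everything between the previous match and the verb.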