extraloop 0.0.3 → 0.0.4

data/History.txt CHANGED
@@ -1,3 +1,6 @@
+ == 0.0.4 / 2012-01-14
+ * fixed a bug which prevented `IterativeScraper` from being subclassed
+
  == 0.0.3 / 2012-01-01
  * namespaced all classes into the ExtraLoop module
 
data/README.md CHANGED
@@ -20,9 +20,7 @@ A basic scraper that fetches the top 25 websites from Alexa's daily top 100 list
    extract(:site_name, "h2").
    extract(:url, "h2 a").
    extract(:description, ".description").
- on(:data, proc { |data|
-   results = data
- }).
+ on(:data, proc { |data| results = data }).
    run()
 
  An Iterative Scraper that fetches URL, title, and publisher from some 110 Google News articles mentioning the keyword 'Egypt'.
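The Google News example file itself (`examples/google_news_scraper.rb`, per the gem's file list below) is elided from this diff. As a rough sketch of what such an iterative scraper could look like, using only the API documented in the sections that follow — the URL, selectors, and offset values here are illustrative assumptions, not the bundled example's code:

    require 'extraloop'

    results = []

    # Illustrative only: selectors and offsets are guesses, not the gem's example.
    scraper = ExtraLoop::IterativeScraper.new("http://www.google.com/search?q=Egypt&tbm=nws")
    scraper.set_iteration(:start, [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

    scraper.
      loop_on("li.g").                      # one matched element per news story
      extract(:title, "h3 a").
      extract(:url, "h3 a", :href).         # attribute-name form of extract
      extract(:publisher, "span.source").
      on(:data, proc { |data| results += data }).
      run

    puts "scraped #{results.size} articles"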
@@ -57,8 +55,9 @@ An Iterative Scraper that fetches URL, title, and publisher from some 110 Google
 
  ### Extractors
 
- ExtraLoop allows to fetch structured data from online documents by looping through a list of elements matching a given selector. For each of the matched element, an arbitrary set of fields can be extracted. While the `loop_on` method sets up such loop, the `extract` method extracts a piece of information from an element (e.g. a story's title) and and stores it into a record's field.
-
+ ExtraLoop lets you fetch structured data from online documents by looping through a list of elements matching a given selector.
+ For each matched element, an arbitrary set of fields can be extracted. While the `loop_on` method sets up such a loop, the `extract`
+ method extracts a piece of information from an element (e.g. a story's title) and stores it in a record's field.
 
    # using a CSS3 or an XPath selector
    loop_on('div.post')
@@ -67,12 +66,12 @@ ExtraLoop lets you fetch structured data from online documents by looping throu
 
    loop_on(proc { |doc| doc.search('div.post') })
 
- # using both a selector and a proc (the result of applying a selector is passed on as the first argument of the proc)
+ # using both a selector and a proc (matched elements are passed in as the first argument of the proc)
 
- loop_on('div.post', proc { |posts| posts.reject {|post| post.attr(:class) == 'sticky' }})
+ loop_on('div.post', proc { |posts| posts.reject { |post| post.attr(:class) == 'sticky' }})
 
  Both the `loop_on` and the `extract` methods may be called with a selector, a proc, or a combination of the two. By default, when parsing DOM documents, `extract` will call
- `Nokogiri::XML::Node#text()`. Alternatively, `extract` can also be passed an attribute name or a proc as a third argument; this will be evaluated in the context of the matching element.
+ `Nokogiri::XML::Node#text()`. Alternatively, `extract` also accepts an attribute name or a proc as a third argument; this will be evaluated in the context of the matching element.
 
    # extract a story's title
    extract(:title, 'h3')
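To illustrate the third-argument forms just described, a short sketch (field names and selectors are illustrative, and the proc is assumed to receive the matched element):

    # attribute form: extract a story's URL from the link's href attribute
    extract(:url, 'h3 a', :href)

    # proc form: post-process the matched element's text
    extract(:title, 'h3', proc { |node| node.text.strip })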
@@ -87,21 +86,25 @@ Both the `loop_on` and the `extract` methods may be called with a selector, a pr
 
  #### Extracting from JSON Documents
 
- While processing each HTTP response, ExtraLoop tries to automatically detect the scraped document format by looking at the `ContentType` header sent by the server (this value may be overriden by providing a `:format` key in the scraper's initialization options).
- When the format is JSON, the document is parsed using the `yajl` parser and converted into a hash. In this case, both the `loop_on` and the `extract` methods still behave as documented above, with only the exception of the CSS3/XPath selector string, which is specific of DOM documents.
- When working with JSON documents, it is possible to loop over an arbitrary portion of a document by simply passing a proc to `loop_on`.
+ While processing each HTTP response, ExtraLoop tries to automatically detect the scraped document format by looking at
+ the `ContentType` header sent by the server (this value may be overridden by providing a `:format` key in the scraper's
+ initialization options). When the format is JSON, the document is parsed using the `yajl` parser and converted into a hash.
+ In this case, both the `loop_on` and the `extract` methods still behave as illustrated above, with the sole exception
+ of the CSS3/XPath selector string, which is specific to DOM documents.
+
+ When working with JSON data, you can simply pass a proc that returns the document elements you want to loop on.
 
    # Fetch a portion of a document using a proc
    loop_on(proc { |data| data['query']['categorymembers'] })
 
- Alternatively, the same loop can be defined by passing an array of nested keys, locating the position of the document fragments.
+ Alternatively, the same loop can be defined by passing an array of keys pointing to a value nested
+ several levels deep in the parsed document hash.
 
    # Fetch the same document portion above using a hash path
    loop_on(['query', 'categorymembers'])
 
- Passing an array of nested keys will also work fine with the `extract` method.
- When fetching fields from a JSON document fragment, `extract` will try to use the
- field name as a key if no key path or key string is provided.
+ When fetching fields from a portion of a JSON document, `extract` will use the
+ field name as a hash key if no key path or key string is provided.
 
    # current node:
    #
@@ -110,26 +113,30 @@ field name as a hash key if no key path or key string is provided.
    # 'text' => 'bla bla bla',
    # 'from_user_id'..
    # }
-
 
- extract(:from_user)
+ # >> extract(:from_user)
    # => "johndoe"
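By the same convention, an explicit key string can be supplied when the desired field name differs from the hash key; a minimal sketch against the node above (the signature is inferred from the "key path or key string" wording):

    # store the value of the 'from_user' key under the :author field
    extract(:author, 'from_user')
    # => "johndoe"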
 
 
  ### Iteration methods:
 
- The `IterativeScraper` class comes with two methods for defining how a scraper should loop over paginated content.
-
+ The `IterativeScraper` class comes with two methods for defining how a scraper should loop over paginated content.
 
- #set_iteration(iteration_parameter, array_range_or_proc)
+ _set_iteration(iteration_parameter, array_range_or_proc)_
 
  * `iteration_parameter` - A symbol identifying the request parameter that the scraper will use as an offset in order to iterate over the paginated content.
  * `array_range_or_proc` - Either an explicit set of values or a block of code. If provided, the block is called with the parsed document as its first argument. Its return value is then used to shift, at each iteration, the value of the iteration parameter. If the block fails to return a non-empty array, the iteration stops.
 
  The second iteration method, `#continue_with`, allows iteration to continue until an arbitrary block of code returns a positive, non-nil value.
 
- #continue_with(iteration_parameter, block)
+ _continue_with(iteration_parameter, block)_
 
  * `iteration_parameter` - the scraper's iteration parameter.
  * `block` - An arbitrary block of Ruby code; its return value will be used to determine the value of the next iteration's offset parameter.
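The recursive Wikipedia example added in this release (shown further down) drives its pagination with `continue_with`, passing the API's continuation parameter and a key path into the JSON response. Assuming the block, like `set_iteration`'s, receives the parsed document, a proc-based equivalent might look like this:

    # key-path form, as used in the Wikipedia example below
    continue_with(:cmcontinue, ['query-continue', 'categorymembers', 'cmcontinue'])

    # proc form (illustrative): iterate until the API stops sending a token
    continue_with(:cmcontinue, proc { |doc|
      continuation = doc['query-continue']
      continuation && continuation['categorymembers']['cmcontinue']
    })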
 
+ ### Running tests
+
+ ExtraLoop's test suite uses `rspec` and `rr`. It can be run by calling the `rspec` executable from within the `spec` directory:
+
+     cd spec
+     rspec *
data/examples/wikipedia_categories_recoursive.rb CHANGED
@@ -0,0 +1,57 @@
+ require '../lib/extraloop'
+ require 'pry'
+
+ class WikipediaCategoryScraper < ExtraLoop::IterativeScraper
+   attr_accessor :members
+   attr_reader :request_arguments
+
+   baseurl = "http://en.wikipedia.org"
+   endpoint_url = "/w/api.php"
+   @@api_url = baseurl + endpoint_url
+
+   def initialize(category, depth=2, parent=nil)
+
+     @members = []
+     @parent = parent
+
+     params = {
+       :action => 'query',
+       :list => 'categorymembers',
+       :format => 'json',
+       :cmtitle => "Category:#{category.gsub(/^Category\:/,'')}",
+       :cmlimit => "500",
+       :cmtype => 'page|subcat',
+       :cmdir => 'asc',
+       :cmprop => 'ids|title|type|timestamp'
+     }
+     options = {
+       :depth => depth,
+       :format => :json,
+       :log => false
+     }
+     request_arguments = { :params => params }
+
+     super(@@api_url, options, request_arguments)
+
+     loop_on(['query', 'categorymembers']).
+       extract(:title).
+       extract(:ns).
+       extract(:type).
+       extract(:timestamp).
+       on(:data, proc { |results|
+         puts "#{"\t" * (@options[:depth] - 2).abs} #{@scraper.request_arguments[:params][:cmtitle]}"
+         categories = results.select { |record| record.ns === 14 }.each { |category| results.delete(category) }
+
+         categories.each do |record|
+           # Instantiate a sub-scraper for each subcategory, unless the maximum depth has been reached.
+           WikipediaCategoryScraper.new(record.title, @options[:depth] - 1, @scraper.request_arguments[:params][:cmtitle]).run unless @options[:depth] <= 0
+         end
+       }).
+       continue_with(:cmcontinue, ['query-continue', 'categorymembers', 'cmcontinue'])
+   end
+ end
+
+ WikipediaCategoryScraper.new("Italian_media").run
data/lib/extraloop/extraction_environment.rb CHANGED
@@ -4,6 +4,7 @@ module ExtraLoop
 
    class ExtractionEnvironment
      attr_accessor :document
+     attr_reader :scraper
 
      def initialize(scraper=nil, document=nil, records=nil)
        if scraper
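Hook blocks are evaluated inside `ExtractionEnvironment` (the Wikipedia example above reaches for the `@scraper` instance variable directly), so the new reader presumably lets a hook call plain `scraper` instead; a hedged sketch (hook body illustrative):

    on(:data, proc { |records|
      # `scraper` now resolves through the attr_reader rather than @scraper
      puts "scraped #{records.size} records with #{scraper.class}"
    })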
data/lib/extraloop/extraction_loop.rb CHANGED
@@ -15,7 +15,7 @@ module ExtraLoop
      @document = @loop_extractor.parse(document)
      @records = []
      @hooks = hooks
-     @environment = ExtractionEnvironment.new(@scraper, @document, @records)
+     @environment = ExtractionEnvironment.new(scraper, @document, @records)
      self
    end
 
data/lib/extraloop/iterative_scraper.rb CHANGED
@@ -240,7 +240,7 @@ module ExtraLoop
    #
 
    def run_super(method, args=[])
-     self.class.superclass.instance_method(method).bind(self).call(*args)
+     ExtraLoop::ScraperBase.instance_method(method).bind(self).call(*args)
    end
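This pins the lookup to `ExtraLoop::ScraperBase` and is presumably the subclassing fix noted in History: with the old `self.class.superclass` lookup, any subclass of `IterativeScraper` (such as the new Wikipedia example) made `run_super` rebind `IterativeScraper`'s own method and recurse. A minimal sketch of that failure mode (class names illustrative):

    class Base
      def run; "Base#run"; end
    end

    class Middle < Base
      def run
        # intended: invoke Base#run, skipping Middle's own implementation
        self.class.superclass.instance_method(:run).bind(self).call
      end
    end

    class Child < Middle; end

    Middle.new.run # => "Base#run" (Middle.superclass is Base)
    Child.new.run  # SystemStackError: Child.superclass is Middle,
                   # so Middle#run keeps rebinding itself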
 
 
data/spec/extraction_loop_spec.rb CHANGED
@@ -31,6 +31,10 @@ describe ExtractionLoop do
    describe "run" do
      before(:each) do
 
+       @fake_scraper = Object.new
+       stub(@fake_scraper).options {{}}
+       stub(@fake_scraper).results { }
+
        @extractors = [:a, :b].map do |field_name|
          object = Object.new
          stub(object).extract_field { |node, record| node[field_name] }
@@ -55,6 +59,7 @@ describe ExtractionLoop do
        mock(env).run.with_any_args.times(20 + 2)
      end
 
+
      @extraction_loop = ExtractionLoop.new(@loop_extractor, @extractors, "fake document", hooks, @fake_scraper).run
    end
 
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: extraloop
  version: !ruby/object:Gem::Version
-   version: 0.0.3
+   version: 0.0.4
  prerelease:
  platform: ruby
  authors:
@@ -9,11 +9,11 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2012-01-01 00:00:00.000000000Z
+ date: 2012-01-14 00:00:00.000000000Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: yajl-ruby
-   requirement: &10243900 !ruby/object:Gem::Requirement
+   requirement: &10100300 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -21,10 +21,10 @@ dependencies:
        version: 1.1.0
    type: :runtime
    prerelease: false
-   version_requirements: *10243900
+   version_requirements: *10100300
  - !ruby/object:Gem::Dependency
    name: nokogiri
-   requirement: &10242520 !ruby/object:Gem::Requirement
+   requirement: &10098200 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -32,10 +32,10 @@ dependencies:
        version: 1.5.0
    type: :runtime
    prerelease: false
-   version_requirements: *10242520
+   version_requirements: *10098200
  - !ruby/object:Gem::Dependency
    name: typhoeus
-   requirement: &10240780 !ruby/object:Gem::Requirement
+   requirement: &10095680 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -43,10 +43,10 @@ dependencies:
        version: 0.3.2
    type: :runtime
    prerelease: false
-   version_requirements: *10240780
+   version_requirements: *10095680
  - !ruby/object:Gem::Dependency
    name: logging
-   requirement: &10238820 !ruby/object:Gem::Requirement
+   requirement: &10094320 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -54,10 +54,10 @@ dependencies:
        version: 0.6.1
    type: :runtime
    prerelease: false
-   version_requirements: *10238820
+   version_requirements: *10094320
  - !ruby/object:Gem::Dependency
    name: rspec
-   requirement: &10233640 !ruby/object:Gem::Requirement
+   requirement: &10092700 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -65,10 +65,10 @@ dependencies:
        version: 2.7.1
    type: :development
    prerelease: false
-   version_requirements: *10233640
+   version_requirements: *10092700
  - !ruby/object:Gem::Dependency
    name: rr
-   requirement: &10231680 !ruby/object:Gem::Requirement
+   requirement: &10089100 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -76,10 +76,10 @@ dependencies:
        version: 1.0.4
    type: :development
    prerelease: false
-   version_requirements: *10231680
+   version_requirements: *10089100
  - !ruby/object:Gem::Dependency
    name: pry
-   requirement: &10229180 !ruby/object:Gem::Requirement
+   requirement: &10088040 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -87,7 +87,7 @@ dependencies:
        version: 0.9.7.4
    type: :development
    prerelease: false
-   version_requirements: *10229180
+   version_requirements: *10088040
  description: A Ruby library for extracting data from websites and web based APIs.
    Supports most common document formats (i.e. HTML, XML, and JSON), and comes with
    a handy mechanism for iterating over paginated datasets.
@@ -100,6 +100,7 @@ files:
  - README.md
  - examples/google_news_scraper.rb
  - examples/wikipedia_categories.rb
+ - examples/wikipedia_categories_recoursive.rb
  - lib/extraloop.rb
  - lib/extraloop/dom_extractor.rb
  - lib/extraloop/extraction_environment.rb