jsl-feedzirra 0.0.12.8 → 0.0.12.9
- data/LICENSE.rdoc +22 -0
- data/README.rdoc +63 -112
- data/lib/feedzirra/feed.rb +4 -5
- data/lib/feedzirra.rb +2 -2
- data/spec/feedzirra/feed_spec.rb +31 -0
- metadata +19 -12
data/LICENSE.rdoc
ADDED
@@ -0,0 +1,22 @@
+(The MIT License)
+
+Copyright (c) 2009 {Paul Dix}[http://pauldix.net]
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+'Software'), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
+CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
+TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
+SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.rdoc
CHANGED
@@ -1,6 +1,6 @@
-
+= Feedzirra
 
-
+== Description
 
 Feedzirra is a feed library that is designed to get and update many feeds as quickly as possible. This includes using libcurl-multi through the
 taf2-curb[link:http://github.com/taf2/curb/tree/master] gem for faster http gets, and libxml through
@@ -12,22 +12,20 @@ much control as you want in updating feeds. Feedzirra makes it easy to figure o
 of a feed in a key-value store. Feedzirra uses the the "moneta" gem, which is a unified interface to key-value storage systems, in order to provide
 access to many different types of stores depending on your requirements.
 
-
+== Installation
 
 For now Feedzirra exists only on github. It also has a few gem requirements that are only on github. Before you start you need to have
 libcurl[link:http://curl.haxx.se/] and libxml[link:http://xmlsoft.org/] installed. If you're on Leopard you have both. Otherwise, you'll need to
-grab them. Once you've got those libraries,
-that lives on github and not the Ruby Forge version of curb), and pauldix-feedzirra. The feedzirra gemspec has all the dependencies so you should
-be able to get up and running with the standard github gem install routine:
+grab them. Once you've got those libraries, you should be able to get up and running with the standard github gem install routine:
 
 gem sources -a http://gems.github.com # if you haven't already
-gem install
+gem install jsl-feedzirra
 
-
+== Usage
 
-This experimental branch offers a new interface to feed fetching with persistent back-end stores. This allows you to
-
-
+This experimental branch offers a new interface to feed fetching with persistent back-end stores. This allows you to easily run a script
+retrieving the feeds once per hour or once per day, and it will remember which feeds have been seenbefore and which are new. This feature
+uses the Feedzirra::Reader interface.
 
 You can create a Feedzirra::Reader object after the Feedzirra library (with require 'feedzirra') is loaded as follows:
 
@@ -51,84 +49,60 @@ and a Ruby Hash structure-based back end that doesn't attempt to persist any inf
 
 Once you've retrieved a single feed, you can use the accessors below to query the results.
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-Feedzirra is easily extended with custom parsing classes and persistent back-ends. You'll have to read the source to find out how, though, because we
-still haven't written the documentation. :(
-
-=== Benchmarks
-
-One of the goals of Feedzirra is speed. This includes not only parsing, but fetching multiple feeds as quickly as possible. I ran a benchmark getting 20 feeds 10 times using Feedzirra, rFeedParser, and FeedNormalizer. For more details the {benchmark code can be found in the project in spec/benchmarks/feedzirra_benchmarks.rb}[http://github.com/pauldix/feedzirra/blob/7fb5634c5c16e9c6ec971767b462c6518cd55f5d/spec/benchmarks/feedzirra_benchmarks.rb]
-
-feedzirra 5.170000 1.290000 6.460000 ( 18.917796)
-rfeedparser 104.260000 12.220000 116.480000 (244.799063)
-feed-normalizer 66.250000 4.010000 70.260000 (191.589862)
-
-The result of that benchmark is a bit sketchy because of the network variability. Running 10 times against the same 20 feeds was meant to smooth some of that out. However, there is also a {benchmark comparing parsing speed in spec/benchmarks/parsing_benchmark.rb}[http://github.com/pauldix/feedzirra/blob/7fb5634c5c16e9c6ec971767b462c6518cd55f5d/spec/benchmarks/parsing_benchmark.rb] on an atom feed.
-
-feedzirra 0.500000 0.030000 0.530000 ( 0.658744)
-rfeedparser 8.400000 1.110000 9.510000 ( 11.839827)
-feed-normalizer 5.980000 0.160000 6.140000 ( 7.576140)
-
-There's also a {benchmark that shows the results of using Feedzirra to perform updates on feeds}[http://github.com/pauldix/feedzirra/blob/45d64319544c61a4c9eb9f7f825c73b9f9030cb3/spec/benchmarks/updating_benchmarks.rb] you've already pulled in. I tested against 179 feeds. The first is the initial pull and the second is an update 65 seconds later. I'm not sure how many of them support etag and last-modified, so performance may be better or worse depending on what feeds you're requesting.
-
-feedzirra fetch and parse 4.010000 0.710000 4.720000 ( 15.110101)
-feedzirra update 0.660000 0.280000 0.940000 ( 5.152709)
-
-=== Discussion
+# feed and entries accessors
+feed.title # => "Paul Dix Explains Nothing"
+feed.url # => "http://www.pauldix.net"
+feed.feed_url # => "http://feeds.feedburner.com/PaulDixExplainsNothing"
+feed.etag # => "GunxqnEP4NeYhrqq9TyVKTuDnh0"
+feed.last_modified # => Sat Jan 31 17:58:16 -0500 2009 # it's a Time object
+
+entry = feed.entries.first
+entry.title # => "Ruby Http Client Library Performance"
+entry.url # => "http://www.pauldix.net/2009/01/ruby-http-client-library-performance.html"
+entry.author # => "Paul Dix"
+entry.summary # => "..."
+entry.content # => "..."
+entry.published # => Thu Jan 29 17:00:19 UTC 2009 # it's a Time object
+entry.categories # => ["...", "..."]
+
+# sanitizing an entry's content
+entry.title.sanitize # => returns the title with harmful stuff escaped
+entry.author.sanitize # => returns the author with harmful stuff escaped
+entry.content.sanitize # => returns the content with harmful stuff escaped
+entry.content.sanitize! # => returns content with harmful stuff escaped and replaces original (also exists for author and title)
+entry.sanitize! # => sanitizes the entry's title, author, and content in place (as in, it changes the value to clean versions)
+feed.sanitize_entries! # => sanitizes all entries in place
+
+# updating a single feed
+updated_feed = Feedzirra::Feed.update(feed)
+
+# an updated feed has the following extra accessors
+updated_feed.updated? # returns true if any of the feed attributes have been modified. will return false if only new entries
+updated_feed.new_entries # a collection of the entry objects that are newer than the latest in the feed before update
+
+# fetching multiple feeds
+feed_urls = ["http://feeds.feedburner.com/PaulDixExplainsNothing", "http://feeds.feedburner.com/trottercashion"]
+feeds = Feedzirra::Reader.new(feed_urls).fetch
+
+# feeds is now a hash with the feed_urls as keys and the parsed feed objects as values. If an error was thrown
+# there will be a Fixnum of the http response code instead of a feed object
+
+# updating multiple feeds. if you're using a persistent back-end, Feedzirra uses that to determine which entries are ones that you haven't seen before
+updated_feeds = Feedzirra::reader.new(urls).fetch
+
+# defining custom behavior on failure or success. note that a return status of 304 (not updated) will call the on_success handler
+feed = Feedzirra::Reader.new("http://feeds.feedburner.com/PaulDixExplainsNothing",
+:on_success => lambda {|feed| puts feed.title },
+:on_failure => lambda {|url, response_code, response_header, response_body| puts response_body }).fetch
+
+# if a collection was passed into the initializer, the handlers will be called for each one
+
+== Discussion
 
 I'd like feedback on the api and any bugs encountered on feeds in the wild. I've set up a
 {google group here}[http://groups.google.com/group/feedzirra].
 
-
+== Troubleshooting Installation
 
 *NOTE:*Some people have been reporting a few issues related to installation. First, the Ruby Forge version of curb is not what you want. It will not work. Nor will the curl-multi gem that lives on
 Ruby Forge. You have to get the taf2-curb[link:http://github.com/taf2/curb/tree/master] fork installed.
@@ -155,7 +129,7 @@ Another problem could be if you are running Mac Ports and you have libcurl insta
 
 If you're still having issues, please let me know on the mailing list. Also, {Todd Fisher (taf2)}[link:http://github.com/taf2] is working on fixing the gem install. Please send him a full error report.
 
-
+== TODO
 
 This thing needs to hammer on many different feeds in the wild. I'm sure there will be bugs. I want to find them and crush them. I didn't bother
 using the test suite for feedparser. i wanted to start fresh.
@@ -171,29 +145,6 @@ Here are some more specific TODOs.
 * Make the feed_spec actually mock stuff out so it doesn't hit the net.
 * Readdress how feeds determine if they can parse a document. Maybe I should use namespaces instead?
 
-
-
-
-
-Copyright (c) 2009:
-
-{Paul Dix}[http://pauldix.net]
-
-Permission is hereby granted, free of charge, to any person obtaining
-a copy of this software and associated documentation files (the
-'Software'), to deal in the Software without restriction, including
-without limitation the rights to use, copy, modify, merge, publish,
-distribute, sublicense, and/or sell copies of the Software, and to
-permit persons to whom the Software is furnished to do so, subject to
-the following conditions:
-
-The above copyright notice and this permission notice shall be
-included in all copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
-EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
-MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
-IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
-CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
-TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
-SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+== LICENSE
+
+This library is provided under the MIT License. See {the complete LICENSE}[link:files/LICENSE_rdoc.html] for details.
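The README text added in this file says that fetching several feeds returns a hash keyed by feed URL, with the HTTP response code stored in place of a feed object when a fetch fails. As a rough sketch of how a caller might split such a result, the snippet below uses a hand-built hash rather than a real Feedzirra call. `FakeFeed`, the example URLs, and the sample values are assumptions made for illustration, and `Integer` is used for the status check since `Fixnum` no longer exists in current Ruby.

```ruby
# Hypothetical stand-in for a parsed feed object; real ones would come from Feedzirra.
FakeFeed = Struct.new(:title)

# Hand-built result in the shape the README describes:
# feed URL => parsed feed, or an integer HTTP status code on failure.
results = {
  "http://example.com/ok.xml"      => FakeFeed.new("An example feed"),
  "http://example.com/missing.xml" => 404
}

# Separate failed fetches (integer status codes) from parsed feeds.
failed, fetched = results.partition { |_url, value| value.is_a?(Integer) }

fetched.each { |url, feed| puts "#{url}: #{feed.title}" }
failed.each  { |url, code| puts "#{url}: failed with HTTP #{code}" }
```

`Hash#partition` returns two arrays of `[url, value]` pairs, so both halves can be iterated the same way as the original hash.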
data/lib/feedzirra/feed.rb
CHANGED
@@ -19,7 +19,7 @@ module Feedzirra
     # A Hash if multiple URL's are passed. The key will be the URL, and the value the Feed object.
     def self.fetch_and_parse(urls, options = {})
       multi = Feedzirra::HttpMulti.new(urls, options)
-      multi.
+      multi.run
       urls.is_a?(String) ? multi.responses.values.first : multi.responses
     end
 
@@ -36,9 +36,8 @@ module Feedzirra
     #
     # A Hash if multiple Feeds are passed. The key will be the URL, and the value the updated Feed object.
     def self.update(feeds, options = {})
-      multi = Feedzirra::HttpMulti.new(
-
-      multi.perform
+      multi = Feedzirra::HttpMulti.new(feeds, options)
+      multi.run
       multi.responses.size == 1 ? multi.responses.values.first : multi.responses.values
     end
 
@@ -60,7 +59,7 @@ module Feedzirra
     # FIXME - Raw mode is currently not supported!
     def self.fetch_raw(urls, options = {})
       multi = Feedzirra::HttpMulti.new(urls, options.merge(:raw => true))
-      multi.
+      multi.run
       urls.is_a?(String) ? multi.responses.values.first : multi.responses
     end
 
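All three methods in this file now follow the same shape: build an `HttpMulti`, call `run`, then read `responses`. The sketch below imitates that flow with a canned stand-in so it can run without network access. `FakeMulti` and the injectable `multi_class` parameter are hypothetical and are not part of the gem; only the body of `fetch_and_parse` mirrors the lines shown in the diff.

```ruby
# Hypothetical stand-in for Feedzirra::HttpMulti: no HTTP, just canned responses.
class FakeMulti
  attr_reader :responses

  def initialize(urls, options = {})
    @urls = Array(urls)
    @options = options
    @responses = {}
  end

  # Fills in one canned response per URL instead of performing requests.
  def run
    @urls.each { |url| @responses[url] = "parsed feed for #{url}" }
  end
end

# Same shape as fetch_and_parse in the diff, with the multi class injectable
# (the third parameter is an assumption made for this sketch).
def fetch_and_parse(urls, options = {}, multi_class = FakeMulti)
  multi = multi_class.new(urls, options)
  multi.run
  urls.is_a?(String) ? multi.responses.values.first : multi.responses
end

single = fetch_and_parse("http://example.com/a.xml")
many   = fetch_and_parse(["http://example.com/a.xml", "http://example.com/b.xml"])

puts single.inspect
puts many.keys.inspect
```

In this sketch `responses` stays empty until `run` is called, which is why a single String URL yields one value and an Array yields the whole hash only after the `run` step.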
data/lib/feedzirra.rb
CHANGED
@@ -39,6 +39,6 @@ require 'feedzirra/parser/atom'
 require 'feedzirra/parser/atom_feed_burner'
 
 module Feedzirra
-  USER_AGENT = "feedzirra http://github.com/
-  VERSION = "0.0.12.
+  USER_AGENT = "feedzirra http://github.com/jsl/feedzirra/tree/master"
+  VERSION = "0.0.12.9"
 end
data/spec/feedzirra/feed_spec.rb
CHANGED
@@ -1,5 +1,36 @@
 require File.join(File.dirname(__FILE__), %w[.. spec_helper])
 
 describe Feedzirra::Feed do
+  describe "Feed.fetch_and_parse" do
+    it "should call #run on the HttpMulti object" do
+      multi = mock('httpmulti')
+      multi.expects(:run)
+      response = mock('response', :values => [ ])
+      multi.expects(:responses).returns(response)
+      Feedzirra::HttpMulti.expects(:new).with('foo', { }).returns(multi)
+      Feedzirra::Feed.fetch_and_parse('foo')
+    end
+  end
 
+  describe "Feed.update" do
+    it "should call #run on the HttpMulti object" do
+      multi = mock('httpmulti')
+      multi.expects(:run)
+      response = mock('response', :values => [ ], :size => 0)
+      multi.expects(:responses).returns(response).times(2)
+      Feedzirra::HttpMulti.expects(:new).with('foo', { }).returns(multi)
+      Feedzirra::Feed.update('foo')
+    end
+  end
+
+  describe "#Feed.fetch_raw" do
+    it "should call #run on the HttpMulti object" do
+      multi = mock('httpmulti')
+      multi.expects(:run)
+      response = mock('response', :values => [ ])
+      multi.expects(:responses).returns(response)
+      Feedzirra::HttpMulti.expects(:new).with('foo', { :raw => true}).returns(multi)
+      Feedzirra::Feed.fetch_raw('foo')
+    end
+  end
 end
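The new specs check one thing: that each entry point calls `run` on the multi object before reading `responses`. The same check can be sketched without a mocking library by using a hand-written recording double. `RecordingMulti` and `fetch_with` are hypothetical names, and `fetch_with` is a simplified stand-in for the methods under test rather than the gem's code.

```ruby
# Hypothetical recording double: remembers every message sent to it.
class RecordingMulti
  attr_reader :calls

  def initialize
    @calls = []
  end

  def run
    @calls << :run
  end

  def responses
    @calls << :responses
    { "foo" => :feed }
  end
end

# Simplified stand-in for the method under test; it takes the collaborator
# as an argument instead of constructing it.
def fetch_with(multi)
  multi.run
  multi.responses.values.first
end

double = RecordingMulti.new
result = fetch_with(double)

# The check the specs express with expects(:run): run was called, and called first.
raise "run was not called first" unless double.calls.first == :run
puts double.calls.inspect
```

Passing the collaborator in directly avoids stubbing the constructor, which the specs in the diff do with `Feedzirra::HttpMulti.expects(:new)`.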
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: jsl-feedzirra
 version: !ruby/object:Gem::Version
-  version: 0.0.12.
+  version: 0.0.12.9
 platform: ruby
 authors:
 - Paul Dix
@@ -9,7 +9,7 @@ autorequire:
 bindir: bin
 cert_chain: []
 
-date: 2009-
+date: 2009-05-19 00:00:00 -07:00
 default_executable:
 dependencies:
 - !ruby/object:Gem::Dependency
@@ -18,9 +18,9 @@ dependencies:
   version_requirement:
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - "
+    - - ">="
       - !ruby/object:Gem::Version
-        version: 0
+        version: "0"
     version:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -28,9 +28,9 @@ dependencies:
   version_requirement:
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - "
+    - - ">="
       - !ruby/object:Gem::Version
-        version: 0
+        version: "0"
     version:
 - !ruby/object:Gem::Dependency
   name: pauldix-sax-machine
@@ -90,7 +90,7 @@ dependencies:
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
-        version: 0
+        version: "0"
     version:
 description:
 email: paul@pauldix.net
@@ -98,8 +98,9 @@ executables: []
 
 extensions: []
 
-extra_rdoc_files:
-
+extra_rdoc_files:
+- README.rdoc
+- LICENSE.rdoc
 files:
 - lib/core_ext/date.rb
 - lib/core_ext/string.rb
@@ -122,6 +123,7 @@ files:
 - lib/feedzirra/parser/feed_utilities.rb
 - lib/feedzirra/parser/feed_entry_utilities.rb
 - README.rdoc
+- LICENSE.rdoc
 - Rakefile
 - spec/spec.opts
 - spec/spec_helper.rb
@@ -138,10 +140,15 @@ files:
 - spec/feedzirra/feed_utilities_spec.rb
 - spec/feedzirra/feed_entry_utilities_spec.rb
 has_rdoc: true
-homepage: http://github.com/
+homepage: http://github.com/jsl/feedzirra
 post_install_message:
-rdoc_options:
-
+rdoc_options:
+- --title
+- HashBack
+- --main
+- README.rdoc
+- --line-numbers
+- --inline-source
 require_paths:
 - lib
 required_ruby_version: !ruby/object:Gem::Requirement
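The metadata hunks record two version facts: the gem moves from 0.0.12.8 to 0.0.12.9, and the updated dependency entries carry the requirement `>= 0`. As a small illustration of how RubyGems treats those values, the sketch below uses only `Gem::Version` and `Gem::Requirement` from the standard RubyGems library. The variable names and the extra `0.0.1` sample version are chosen for this example.

```ruby
require "rubygems"

old_version = Gem::Version.new("0.0.12.8")
new_version = Gem::Version.new("0.0.12.9")

# The requirement written in the updated dependency entries.
requirement = Gem::Requirement.new(">= 0")

# Versions compare segment by segment, so the four-part bump orders as expected.
puts new_version > old_version

# ">= 0" accepts any non-negative release version, so it does not constrain the dependency.
puts requirement.satisfied_by?(new_version)
puts requirement.satisfied_by?(Gem::Version.new("0.0.1"))
```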