sporkmonger-sax-machine 0.1.0

@@ -0,0 +1,165 @@
+ <?xml version="1.0" encoding="UTF-8"?>
+ <?xml-stylesheet href="http://feeds.feedburner.com/~d/styles/atom10full.xsl" type="text/xsl" media="screen"?><?xml-stylesheet href="http://feeds.feedburner.com/~d/styles/itemcontent.css" type="text/css" media="screen"?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:thr="http://purl.org/syndication/thread/1.0" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
+ <title>Paul Dix Explains Nothing</title>
+
+ <link rel="alternate" type="text/html" href="http://www.pauldix.net/" />
+ <id>tag:typepad.com,2003:weblog-108605</id>
+ <updated>2008-09-04T16:07:19-04:00</updated>
+ <subtitle>Entrepreneurship, programming, software development, politics, NYC, and random thoughts.</subtitle>
+ <generator uri="http://www.typepad.com/">TypePad</generator>
+ <link rel="self" href="http://feeds.feedburner.com/PaulDixExplainsNothing" type="application/atom+xml" /><entry>
+ <title>Marshal data too short error with ActiveRecord</title>
+ <link rel="alternate" type="text/html" href="http://feeds.feedburner.com/~r/PaulDixExplainsNothing/~3/383536354/marshal-data-to.html" />
+ <link rel="replies" type="text/html" href="http://www.pauldix.net/2008/09/marshal-data-to.html" thr:count="2" thr:updated="2008-11-17T14:40:06-05:00" />
+ <id>tag:typepad.com,2003:post-55147740</id>
+ <published>2008-09-04T16:07:19-04:00</published>
+ <updated>2008-11-17T14:40:06-05:00</updated>
+ <summary>In my previous post about the speed of serializing data, I concluded that Marshal was the quickest way to get things done. So I set about using Marshal to store some data in an ActiveRecord object. Things worked great at...</summary>
+ <author>
+ <name>Paul Dix</name>
+ </author>
+ <category scheme="http://www.sixapart.com/ns/types#category" term="Tahiti" />
+
+
+ <content type="html" xml:lang="en-US" xml:base="http://www.pauldix.net/">
+ &lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;In my previous &lt;a href="http://www.pauldix.net/2008/08/serializing-dat.html"&gt;post about the speed of serializing data&lt;/a&gt;, I concluded that Marshal was the quickest way to get things done. So I set about using Marshal to store some data in an ActiveRecord object. Things worked great at first, but on some test data I got this error: marshal data too short. Luckily, &lt;a href="http://www.brynary.com/"&gt;Bryan Helmkamp&lt;/a&gt; had helpfully pointed out that there were sometimes problems with storing marshaled data in the database. He said it was best to base64 encode the marshal dump before storing.&lt;/p&gt;
+
+ &lt;p&gt;I was curious why it was working on some things and not others. It turns out that some types of data being marshaled were causing the error to pop up. Here's the test data I used in my specs:&lt;/p&gt;
+ &lt;pre&gt;{ :foo =&amp;gt; 3, :bar =&amp;gt; 2 } # hash with symbols for keys and integer values&lt;br /&gt;[3, 2.1, 4, 8]&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; # array with integer and float values&lt;/pre&gt;
+ &lt;p&gt;Everything worked when I switched the array values to all integers so it seems that floats were causing the problem. However, in the interest of keeping everything working regardless of data types, I base64 encoded before going into the database and decoded on the way out.&lt;/p&gt;
+
+ &lt;p&gt;I also ran the benchmarks again to determine what impact this would have on speed. Here are the results for 100 iterations on a 10k element array and a 10k element hash with and without base64 encode/decode:&lt;/p&gt;
+ &lt;pre&gt;&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp; user&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp; system&amp;nbsp; &amp;nbsp;&amp;nbsp; total&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp; real&lt;br /&gt;array marshal&amp;nbsp; 0.200000&amp;nbsp; &amp;nbsp;0.010000&amp;nbsp; &amp;nbsp;0.210000 (&amp;nbsp; 0.214018) (without Base64)&lt;br /&gt;array marshal&amp;nbsp; 0.220000&amp;nbsp; &amp;nbsp;0.010000&amp;nbsp; &amp;nbsp;0.230000 (&amp;nbsp; 0.250260)&lt;br /&gt;&lt;br /&gt;hash marshal&amp;nbsp; &amp;nbsp;1.830000&amp;nbsp; &amp;nbsp;0.040000&amp;nbsp; &amp;nbsp;1.870000 (&amp;nbsp; 1.892874) (without Base64)&lt;br /&gt;hash marshal&amp;nbsp; &amp;nbsp;2.040000&amp;nbsp; &amp;nbsp;0.100000&amp;nbsp; &amp;nbsp;2.140000 (&amp;nbsp; 2.170405)&lt;/pre&gt;
+ &lt;p&gt;As you can see the difference in speed is pretty negligible. I assume that the error has to do with AR cleaning the stuff that gets inserted into the database, but I'm not really sure. In the end it's just easier to use Base64.encode64 when serializing data into a text field in ActiveRecord using Marshal.&lt;/p&gt;
+
+ &lt;p&gt;I've also read people posting about this error when using the database session store. I can only assume that it's because they were trying to store either way too much data in their session (too much for a regular text field) or they were storing float values or some other data type that would cause this to pop up. Hopefully this helps.&lt;/p&gt;&lt;/div&gt;
+ &lt;div class="feedflare"&gt;
+ &lt;a href="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?a=rWfWO"&gt;&lt;img src="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?i=rWfWO" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?a=RaCqo"&gt;&lt;img src="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?i=RaCqo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?a=1CBLo"&gt;&lt;img src="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?i=1CBLo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
+ &lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/PaulDixExplainsNothing/~4/383536354" height="1" width="1"/&gt;</content>
+
+
+ <feedburner:origLink>http://www.pauldix.net/2008/09/marshal-data-to.html</feedburner:origLink></entry>
+ <entry>
+ <title>Serializing data speed comparison: Marshal vs. JSON vs. Eval vs. YAML</title>
+ <link rel="alternate" type="text/html" href="http://feeds.feedburner.com/~r/PaulDixExplainsNothing/~3/376401099/serializing-dat.html" />
+ <link rel="replies" type="text/html" href="http://www.pauldix.net/2008/08/serializing-dat.html" thr:count="5" thr:updated="2008-10-14T01:26:31-04:00" />
+ <id>tag:typepad.com,2003:post-54766774</id>
+ <published>2008-08-27T14:31:41-04:00</published>
+ <updated>2008-10-14T01:26:31-04:00</updated>
+ <summary>Last night at the NYC Ruby hackfest, I got into a discussion about serializing data. Brian mentioned the Marshal library to me, which for some reason had completely escaped my attention until last night. He said it was wicked fast...</summary>
+ <author>
+ <name>Paul Dix</name>
+ </author>
+ <category scheme="http://www.sixapart.com/ns/types#category" term="Tahiti" />
+
+
+ <content type="html" xml:lang="en-US" xml:base="http://www.pauldix.net/">
+ &lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;Last night at the &lt;a href="http://nycruby.org"&gt;NYC Ruby hackfest&lt;/a&gt;, I got into a discussion about serializing data. Brian mentioned the Marshal library to me, which for some reason had completely escaped my attention until last night. He said it was wicked fast so we decided to run a quick benchmark comparison.&lt;/p&gt;
+ &lt;p&gt;The test data is designed to roughly approximate what my &lt;a href="http://www.pauldix.net/2008/08/storing-many-cl.html"&gt;stored classifier data&lt;/a&gt; will look like. The different methods we decided to benchmark were Marshal, json, eval, and yaml. With each one we took the in-memory object and serialized it and then read it back in. With eval we had to convert the object to ruby code to serialize it then run eval against that. Here are the results for 100 iterations on a 10k element array and a hash with 10k key/value pairs run on my Macbook Pro 2.4 GHz Core 2 Duo:&lt;/p&gt;
+ &lt;pre&gt;&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; user&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;system&amp;nbsp; &amp;nbsp;&amp;nbsp; total&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp; real&lt;br /&gt;array marshal&amp;nbsp; 0.210000&amp;nbsp; &amp;nbsp;0.010000&amp;nbsp; &amp;nbsp;0.220000 (&amp;nbsp; 0.220701)&lt;br /&gt;array json&amp;nbsp; &amp;nbsp;&amp;nbsp; 2.180000&amp;nbsp; &amp;nbsp;0.050000&amp;nbsp; &amp;nbsp;2.230000 (&amp;nbsp; 2.288489)&lt;br /&gt;array eval&amp;nbsp; &amp;nbsp;&amp;nbsp; 2.090000&amp;nbsp; &amp;nbsp;0.060000&amp;nbsp; &amp;nbsp;2.150000 (&amp;nbsp; 2.240443)&lt;br /&gt;array yaml&amp;nbsp; &amp;nbsp; 26.650000&amp;nbsp; &amp;nbsp;0.350000&amp;nbsp; 27.000000 ( 27.810609)&lt;br /&gt;&lt;br /&gt;hash marshal&amp;nbsp; &amp;nbsp;2.000000&amp;nbsp; &amp;nbsp;0.050000&amp;nbsp; &amp;nbsp;2.050000 (&amp;nbsp; 2.114950)&lt;br /&gt;hash json&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;3.700000&amp;nbsp; &amp;nbsp;0.060000&amp;nbsp; &amp;nbsp;3.760000 (&amp;nbsp; 3.881716)&lt;br /&gt;hash eval&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;5.370000&amp;nbsp; &amp;nbsp;0.140000&amp;nbsp; &amp;nbsp;5.510000 (&amp;nbsp; 6.117947)&lt;br /&gt;hash yaml&amp;nbsp; &amp;nbsp;&amp;nbsp; 68.220000&amp;nbsp; &amp;nbsp;0.870000&amp;nbsp; 69.090000 ( 72.370784)&lt;/pre&gt;
+ &lt;p&gt;The order in which I tested them is pretty much the order in which they ranked for speed. Marshal was amazingly fast. JSON and eval came out roughly equal on the array with eval trailing quite a bit for the hash. Yaml was just slow as all hell. A note on the json: I used the 1.1.3 library which uses c to parse. I assume it would be quite a bit slower if I used the pure ruby implementation. Here's &lt;a href="http://gist.github.com/7549"&gt;a gist of the benchmark code&lt;/a&gt; if you're curious and want to run it yourself.&lt;/p&gt;
+
+
+
+ &lt;p&gt;If you're serializing user data, be super careful about using eval. It's probably best to avoid it completely. Finally, just for fun I took yaml out (it was too slow) and ran the benchmark again with 1k iterations:&lt;/p&gt;
+ &lt;pre&gt;&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp; user&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp;system&amp;nbsp; &amp;nbsp;&amp;nbsp; total&amp;nbsp; &amp;nbsp;&amp;nbsp; &amp;nbsp; real&lt;br /&gt;array marshal&amp;nbsp; 2.080000&amp;nbsp; &amp;nbsp;0.110000&amp;nbsp; &amp;nbsp;2.190000 (&amp;nbsp; 2.242235)&lt;br /&gt;array json&amp;nbsp; &amp;nbsp; 21.860000&amp;nbsp; &amp;nbsp;0.500000&amp;nbsp; 22.360000 ( 23.052403)&lt;br /&gt;array eval&amp;nbsp; &amp;nbsp; 20.730000&amp;nbsp; &amp;nbsp;0.570000&amp;nbsp; 21.300000 ( 21.992454)&lt;br /&gt;&lt;br /&gt;hash marshal&amp;nbsp; 19.510000&amp;nbsp; &amp;nbsp;0.500000&amp;nbsp; 20.010000 ( 20.794111)&lt;br /&gt;hash json&amp;nbsp; &amp;nbsp;&amp;nbsp; 39.770000&amp;nbsp; &amp;nbsp;0.670000&amp;nbsp; 40.440000 ( 41.689297)&lt;br /&gt;hash eval&amp;nbsp; &amp;nbsp;&amp;nbsp; 51.410000&amp;nbsp; &amp;nbsp;1.290000&amp;nbsp; 52.700000 ( 54.155711)&lt;/pre&gt;&lt;/div&gt;
+ &lt;div class="feedflare"&gt;
+ &lt;a href="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?a=zombO"&gt;&lt;img src="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?i=zombO" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?a=T3kqo"&gt;&lt;img src="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?i=T3kqo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?a=aI6Oo"&gt;&lt;img src="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?i=aI6Oo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
+ &lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/PaulDixExplainsNothing/~4/376401099" height="1" width="1"/&gt;</content>
+
+
+ <feedburner:origLink>http://www.pauldix.net/2008/08/serializing-dat.html</feedburner:origLink></entry>
+ <entry>
+ <title>Gotcha with cache_fu and permalinks</title>
+ <link rel="alternate" type="text/html" href="http://feeds.feedburner.com/~r/PaulDixExplainsNothing/~3/369250462/gotcha-with-cac.html" />
+ <link rel="replies" type="text/html" href="http://www.pauldix.net/2008/08/gotcha-with-cac.html" thr:count="2" thr:updated="2008-11-20T13:58:38-05:00" />
+ <id>tag:typepad.com,2003:post-54411628</id>
+ <published>2008-08-19T14:26:24-04:00</published>
+ <updated>2008-11-20T13:58:38-05:00</updated>
+ <summary>This is an issue I had recently in a project with cache_fu. Models that I found and cached based on permalinks weren't expiring the cache correctly when getting updated. Here's an example scenario. Say you have a blog with posts....</summary>
+ <author>
+ <name>Paul Dix</name>
+ </author>
+ <category scheme="http://www.sixapart.com/ns/types#category" term="Ruby on Rails" />
+
+
+ <content type="html" xml:lang="en-US" xml:base="http://www.pauldix.net/">
+ &lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;This is an issue I had recently in a project with &lt;a href="http://errtheblog.com/posts/57-kickin-ass-w-cachefu"&gt;cache_fu&lt;/a&gt;. Models that I found and cached based on permalinks weren't expiring the cache correctly when getting updated. Here's an example scenario.&lt;/p&gt;
+
+ &lt;p&gt;Say you have a blog with posts. However, instead of using a url like http://paulscoolblog.com/posts/23 you want something that's more search engine friendly and readable for the user. So you use a permalink (maybe using the &lt;a href="http://github.com/github/permalink_fu/tree/master"&gt;permalink_fu plugin&lt;/a&gt;) that's auto-generated based on the title of the post. This post would have a url that looks something like http://paulscoolblog.com/posts/gotcha-with-cache_fu-and-permalinks.&lt;/p&gt;
+
+ &lt;p&gt;In your controller's show method you'd probably find the post like this:&lt;/p&gt;
+ &lt;pre&gt;@post = Post.find_by_permalink(params[:permalink])&lt;/pre&gt;
+ &lt;p&gt;However, you'd want to do the caching thing so you'd actually do this:&lt;/p&gt;
+ &lt;pre&gt;@post = Post.cached(:find_by_permalink, :with =&amp;gt; params[:permalink])&lt;/pre&gt;
+ &lt;p&gt;The problem that I ran into, which is probably obvious to anyone familiar with cache_fu, was that when updating the post, it wouldn't expire the cache. That part of the post model looks like this:&lt;/p&gt;
+ &lt;pre&gt;class Post &amp;lt; ActiveRecord::Base&lt;br /&gt;&amp;nbsp; before_save :expire_cache&lt;br /&gt;&amp;nbsp; ...&lt;br /&gt;end&lt;/pre&gt;
+ &lt;p&gt;Do you see it? The issue is that when expire_cache gets called on the object, it expires the key &lt;strong&gt;Post:23&lt;/strong&gt; from the cache (assuming 23 was the id of the post). However, when the post was cached using the cached(:find_by_permalink ...) method, it put the post object into the cache with a key of &lt;strong&gt;Post:find_by_permalink:gotcha-with-cache_fu-and-permalinks&lt;/strong&gt;.&lt;/p&gt;
+ &lt;p&gt;Luckily, it's a fairly simple fix. If you have a model that is commonly accessed through permalinks, just write your own cache expiry method that looks for both keys and expires them.&lt;/p&gt;&lt;/div&gt;
+ &lt;div class="feedflare"&gt;
+ &lt;a href="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?a=V1ojO"&gt;&lt;img src="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?i=V1ojO" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?a=eu6Zo"&gt;&lt;img src="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?i=eu6Zo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?a=ddUho"&gt;&lt;img src="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?i=ddUho" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
+ &lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/PaulDixExplainsNothing/~4/369250462" height="1" width="1"/&gt;</content>
+
+
+ <feedburner:origLink>http://www.pauldix.net/2008/08/gotcha-with-cac.html</feedburner:origLink></entry>
+ <entry>
+ <title>Non-greedy mode in regex</title>
+ <link rel="alternate" type="text/html" href="http://feeds.feedburner.com/~r/PaulDixExplainsNothing/~3/365673983/non-greedy-mode.html" />
+ <link rel="replies" type="text/html" href="http://www.pauldix.net/2008/08/non-greedy-mode.html" thr:count="0" />
+ <id>tag:typepad.com,2003:post-54227244</id>
+ <published>2008-08-15T09:32:11-04:00</published>
+ <updated>2008-08-27T09:33:15-04:00</updated>
+ <summary>I was writing a regular expression yesterday and this popped up. It's just a quick note about greedy vs. non-greedy mode in regular expression matching. Say I have a regular expression that looks something like this: /(\[.*\])/ In English that...</summary>
+ <author>
+ <name>Paul Dix</name>
+ </author>
+ <category scheme="http://www.sixapart.com/ns/types#category" term="Ruby" />
+
+
+ <content type="html" xml:lang="en-US" xml:base="http://www.pauldix.net/">&lt;p&gt;I was writing a regular expression yesterday and this popped up. It's just a quick note about greedy vs. non-greedy mode in regular expression matching. Say I have a regular expression that looks something like this:&lt;/p&gt;&#xD;
+ &lt;pre&gt;/(\[.*\])/&lt;/pre&gt;&#xD;
+ &lt;p&gt;In English that says something roughly like: find an opening bracket [ with 0 or more of any character followed by a closing bracket. The backslashes are to escape the brackets and the parenthesis specify grouping so we can later access that matched text.&lt;/p&gt;&#xD;
+ &#xD;
+ &lt;p&gt;The greedy mode comes up with the 0 or more characters part of the match (the .* part of the expression). The default mode of greedy means that the parser will gobble up as many characters as it can and match the very last closing bracket. So if you have text like this:&lt;/p&gt;&#xD;
+ &#xD;
+ &lt;pre&gt;a = [:foo, :bar]&lt;br&gt;b = [:hello, :world]&lt;/pre&gt;&#xD;
+ &lt;p&gt;The resulting grouped match would be this:&lt;/p&gt;&#xD;
+ &lt;pre&gt;[:foo, :bar]&lt;br&gt;b = [:hello, :world]&lt;/pre&gt;&#xD;
+ &lt;p&gt;If you just wanted the [:foo, :bar] part, the solution is to parse in non-greedy mode. This means that it will match on the first closing bracket it sees. The modified regular expression looks like this:&lt;/p&gt;&#xD;
+ &lt;pre&gt;/(\[.*?\])/&lt;/pre&gt;&#xD;
+ &lt;p&gt;I love the regular expression engine in Ruby. It's one of the best things it ripped off from Perl. The one thing I don't like is the magic global variable that it places matched groups into. You can access that first match through the $1 variable. If you're unfamiliar with regular expressions, a good place to start is the &lt;a href="http://www.amazon.com/Programming-Perl-3rd-Larry-Wall/dp/0596000278/ref=pd_bbs_sr_1?ie=UTF8&amp;amp;s=books&amp;amp;qid=1218806755&amp;amp;sr=8-1"&gt;Camel book&lt;/a&gt;. It's about Perl, but the way they work is very similar. I actually haven't seen good coverage of regexes in a Ruby book.&lt;/p&gt;&lt;div class="feedflare"&gt;
+ &lt;a href="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?a=OkVmO"&gt;&lt;img src="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?i=OkVmO" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?a=iRpWo"&gt;&lt;img src="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?i=iRpWo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?a=pjRCo"&gt;&lt;img src="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?i=pjRCo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
+ &lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/PaulDixExplainsNothing/~4/365673983" height="1" width="1"/&gt;</content>
+
+
+ <feedburner:origLink>http://www.pauldix.net/2008/08/non-greedy-mode.html</feedburner:origLink></entry>
+ <entry>
+ <title>Storing many classification models</title>
+ <link rel="alternate" type="text/html" href="http://feeds.feedburner.com/~r/PaulDixExplainsNothing/~3/358530158/storing-many-cl.html" />
+ <link rel="replies" type="text/html" href="http://www.pauldix.net/2008/08/storing-many-cl.html" thr:count="3" thr:updated="2008-08-08T11:40:28-04:00" />
+ <id>tag:typepad.com,2003:post-53888232</id>
+ <published>2008-08-07T12:01:38-04:00</published>
+ <updated>2008-08-27T16:58:18-04:00</updated>
+ <summary>One of the things I need to do in Filterly is keep many trained classifiers. These are the machine learning models that determine if a blog post is on topic (Filterly separates information by topic). At the very least I...</summary>
+ <author>
+ <name>Paul Dix</name>
+ </author>
+ <category scheme="http://www.sixapart.com/ns/types#category" term="Tahiti" />
+
+
+ <content type="html" xml:lang="en-US" xml:base="http://www.pauldix.net/">&lt;p&gt;One of the things I need to do in &lt;a href="http://filterly.com/"&gt;Filterly&lt;/a&gt; is keep many trained &lt;a href="http://en.wikipedia.org/wiki/Statistical_classification"&gt;classifiers&lt;/a&gt;. These are the machine learning models that determine if a blog post is on topic (Filterly separates information by topic). At the very least I need one per topic in the system. If I want to do something like &lt;a href="http://en.wikipedia.org/wiki/Boosting"&gt;boosting&lt;/a&gt; then I need even more. The issue I'm wrestling with is how to store this data. I'll outline a specific approach and what the storage needs are.&lt;/p&gt;&#xD;
+ &#xD;
+ &lt;p&gt;Let's say I go with boosting and 10 &lt;a href="http://en.wikipedia.org/wiki/Perceptron"&gt;perceptrons&lt;/a&gt;. I'll also limit my feature space to the 10,000 most statistically significant features. So the storage for each perceptron is a 10k element array. However, I'll also have to keep another data structure to store what the 10k features are and their position in the array. In code I use a hash for this where the feature name is the key and the value is its position. I just need to store one of these hashes per topic.&lt;/p&gt;&#xD;
+ &#xD;
+ &lt;p&gt;That's not really a huge amount of data. I'm more concerned about the best way to store it. I don't think this kind of thing maps well to a relational database. I don't need to store the features individually. Generally when I'm running the thing I'll want the whole perceptron and feature set in memory for quick access. For now I'm just using a big text field and serializing each using JSON.&lt;/p&gt;&#xD;
+ &#xD;
+ &lt;p&gt;I don't really like this approach. The whole serializing into the database seems really inelegant. Combined with the time that it takes to parse these things. Each time I want to see if a new post is on topic I'd need to load up the classifier and parse the 10 10k arrays and the 10k key hash. I could keep each classifier running as a service, but then I've got a pretty heavy process running for each topic.&lt;/p&gt;&#xD;
+ &#xD;
+ &lt;p&gt;I guess I'll just use the stupid easy solution for the time being and worry about performance later. Anyone have thoughts on the best approach?&lt;/p&gt;&lt;div class="feedflare"&gt;
+ &lt;a href="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?a=DUT8O"&gt;&lt;img src="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?i=DUT8O" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?a=ZGjFo"&gt;&lt;img src="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?i=ZGjFo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?a=pH3Vo"&gt;&lt;img src="http://feeds.feedburner.com/~f/PaulDixExplainsNothing?i=pH3Vo" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
+ &lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/PaulDixExplainsNothing/~4/358530158" height="1" width="1"/&gt;</content>
+
+
+ <feedburner:origLink>http://www.pauldix.net/2008/08/storing-many-cl.html</feedburner:origLink></entry>
+
+ </feed>
@@ -0,0 +1,667 @@
+ require File.dirname(__FILE__) + '/../spec_helper'
+
+ describe "SAXMachine" do
+   describe "element" do
+     describe "when parsing a single element" do
+       before :each do
+         @klass = Class.new do
+           include SAXMachine
+           element :title
+         end
+       end
+
+       it "should provide an accessor" do
+         document = @klass.new
+         document.title = "Title"
+         document.title.should == "Title"
+       end
+
+       it "should allow introspection of the elements" do
+         @klass.column_names.should =~ [:title]
+       end
+
+       it "should not overwrite the setter if there is already one present" do
+         @klass = Class.new do
+           def title=(val)
+             @title = "#{val} **"
+           end
+           include SAXMachine
+           element :title
+         end
+         document = @klass.new
+         document.title = "Title"
+         document.title.should == "Title **"
+       end
+
+       describe "the class attribute" do
+         before(:each) do
+           @klass = Class.new do
+             include SAXMachine
+             element :date, :class => DateTime
+           end
+           @document = @klass.new
+           @document.date = DateTime.now.to_s
+         end
+
+         it "should be available" do
+           @klass.data_class(:date).should == DateTime
+         end
+       end
+
+       describe "the required attribute" do
+         it "should be available" do
+           @klass = Class.new do
+             include SAXMachine
+             element :date, :required => true
+           end
+           @klass.required?(:date).should be_true
+         end
+       end
+
+       it "should not overwrite the accessor when the element is not present" do
+         document = @klass.new
+         document.title = "Title"
+         document.parse("<foo></foo>")
+         document.title.should == "Title"
+       end
+
+       it "should not overwrite the value when the element is present" do
+         document = @klass.new
+         document.title = "Old title"
+         document.parse("<title>New title</title>")
+         document.title.should == "Old title"
+       end
+
+       it "should save the element text into an accessor" do
+         document = @klass.parse("<title>My Title</title>")
+         document.title.should == "My Title"
+       end
+
+       it "should save cdata into an accessor" do
+         document = @klass.parse("<title><![CDATA[A Title]]></title>")
+         document.title.should == "A Title"
+       end
+
+       it "should save the element text into an accessor when there are multiple elements" do
+         document = @klass.parse("<xml><title>My Title</title><foo>bar</foo></xml>")
+         document.title.should == "My Title"
+       end
+
+       it "should save the first element text when there are multiple of the same element" do
+         document = @klass.parse("<xml><title>My Title</title><title>bar</title></xml>")
+         document.title.should == "My Title"
+       end
+     end
+
+     describe "when parsing multiple elements" do
+       before :each do
+         @klass = Class.new do
+           include SAXMachine
+           element :title
+           element :name
+         end
+       end
+
+       it "should save the element text for a second tag" do
+         document = @klass.parse("<xml><title>My Title</title><name>Paul</name></xml>")
+         document.name.should == "Paul"
+         document.title.should == "My Title"
+       end
+     end
+
+     describe "when using options for parsing elements" do
+       describe "using the 'as' option" do
+         before :each do
+           @klass = Class.new do
+             include SAXMachine
+             element :description, :as => :summary
+           end
+         end
+
+         it "should provide an accessor using the 'as' name" do
+           document = @klass.new
+           document.summary = "a small summary"
+           document.summary.should == "a small summary"
+         end
+
+         it "should save the element text into the 'as' accessor" do
+           document = @klass.parse("<description>here is a description</description>")
+           document.summary.should == "here is a description"
+         end
+       end
+
+       describe "using the :with option" do
+         describe "and the :value option" do
+           before :each do
+             @klass = Class.new do
+               include SAXMachine
+               element :link, :value => :href, :with => {:foo => "bar"}
+             end
+           end
+
+           it "should unescape the ampersand correctly" do
+             document = @klass.parse("<link href='http://api.flickr.com/services/feeds/photos_public.gne?id=49724566@N00&amp;lang=en-us&amp;format=atom' foo='bar'>asdf</link>")
+             document.link.should == "http://api.flickr.com/services/feeds/photos_public.gne?id=49724566@N00&lang=en-us&format=atom"
+           end
+
+           it "should save the value of a matching element" do
+             document = @klass.parse("<link href='test' foo='bar'>asdf</link>")
+             document.link.should == "test"
+           end
+
+           it "should save the value of the first matching element" do
+             document = @klass.parse("<xml><link href='first' foo='bar' /><link href='second' foo='bar' /></xml>")
+             document.link.should == "first"
+           end
+
+           describe "and the :as option" do
+             before :each do
+               @klass = Class.new do
+                 include SAXMachine
+                 element :link, :value => :href, :as => :url, :with => {:foo => "bar"}
+                 element :link, :value => :href, :as => :second_url, :with => {:asdf => "jkl"}
+               end
+             end
+
+             it "should save the value of the first matching element" do
+               document = @klass.parse("<xml><link href='first' foo='bar' /><link href='second' asdf='jkl' /><link href='second' foo='bar' /></xml>")
+               document.url.should == "first"
+               document.second_url.should == "second"
+             end
+           end
+         end
+
+         describe "with only one element" do
+           before :each do
+             @klass = Class.new do
+               include SAXMachine
+               element :link, :with => {:foo => "bar"}
+             end
+           end
+
+           it "should save the text of an element that has matching attributes" do
+             document = @klass.parse("<link foo=\"bar\">match</link>")
+             document.link.should == "match"
+           end
+
+           it "should not save the text of an element that doesn't have matching attributes" do
+             document = @klass.parse("<link>no match</link>")
+             document.link.should be_nil
+           end
+
+           it "should save the text of an element that has matching attributes when it is the second of that type" do
+             document = @klass.parse("<xml><link>no match</link><link foo=\"bar\">match</link></xml>")
+             document.link.should == "match"
+           end
+
+           it "should save the text of an element that has matching attributes plus a few more" do
+             document = @klass.parse("<xml><link>no match</link><link asdf='jkl' foo='bar'>match</link></xml>")
+             document.link.should == "match"
+           end
+         end
+
+         describe "with multiple elements of same tag" do
+           before :each do
+             @klass = Class.new do
+               include SAXMachine
+               element :link, :as => :first, :with => {:foo => "bar"}
+               element :link, :as => :second, :with => {:asdf => "jkl"}
+             end
+           end
+
+           it "should match the first element" do
+             document = @klass.parse("<xml><link>no match</link><link foo=\"bar\">first match</link><link>no match</link></xml>")
+             document.first.should == "first match"
+           end
+
+           it "should match the second element" do
+             document = @klass.parse("<xml><link>no match</link><link foo='bar'>first match</link><link asdf='jkl'>second match</link><link>hi</link></xml>")
+             document.second.should == "second match"
+           end
+         end
+       end
+
224
+ describe "using the 'value' option" do
225
+ before :each do
226
+ @klass = Class.new do
227
+ include SAXMachine
228
+ element :link, :value => :foo
229
+ end
230
+ end
231
+
232
+ it "should save the attribute value" do
233
+ document = @klass.parse("<link foo='test'>hello</link>")
234
+ document.link.should == 'test'
235
+ end
236
+
237
+ it "should save the attribute value when there is no text enclosed by the tag" do
238
+ document = @klass.parse("<link foo='test'></link>")
239
+ document.link.should == 'test'
240
+ end
241
+
242
+ it "should save the attribute value when the tag is self-closing" do
243
+ document = @klass.parse("<link foo='test'/>")
244
+ document.link.should == 'test'
245
+ end
246
+
247
+ it "should save two different attribute values on a single tag" do
248
+ @klass = Class.new do
249
+ include SAXMachine
250
+ element :link, :value => :foo, :as => :first
251
+ element :link, :value => :bar, :as => :second
252
+ end
253
+ document = @klass.parse("<link foo='foo value' bar='bar value'></link>")
254
+ document.first.should == "foo value"
255
+ document.second.should == "bar value"
256
+ end
257
+
258
+ it "should not fail if one of the attributes hasn't been defined" do
259
+ @klass = Class.new do
260
+ include SAXMachine
261
+ element :link, :value => :foo, :as => :first
262
+ element :link, :value => :bar, :as => :second
263
+ end
264
+ document = @klass.parse("<link foo='foo value'></link>")
265
+ document.first.should == "foo value"
266
+ document.second.should be_nil
267
+ end
268
+ end
269
+
270
+ describe "when desiring both the content and attributes of an element" do
271
+ before :each do
272
+ @klass = Class.new do
273
+ include SAXMachine
274
+ element :link
275
+ element :link, :value => :foo, :as => :link_foo
276
+ element :link, :value => :bar, :as => :link_bar
277
+ end
278
+ end
279
+
280
+ it "should parse the element and attribute values" do
281
+ document = @klass.parse("<link foo='test1' bar='test2'>hello</link>")
282
+ document.link.should == 'hello'
283
+ document.link_foo.should == 'test1'
284
+ document.link_bar.should == 'test2'
285
+ end
286
+ end
287
+
288
+ describe "when specifying namespaces" do
289
+ before :all do
290
+ @klass = Class.new do
291
+ include SAXMachine
292
+ element :a, :xmlns => 'urn:test'
293
+ element :b, :xmlns => ['', 'urn:test']
294
+ end
295
+ end
296
+
297
+ it "should get the element with the xmlns" do
298
+ document = @klass.parse("<a xmlns='urn:test'>hello</a>")
299
+ document.a.should == 'hello'
300
+ end
301
+
302
+ it "shouldn't get the element without the xmlns" do
303
+ document = @klass.parse("<a>hello</a>")
304
+ document.a.should be_nil
305
+ end
306
+
307
+ it "shouldn't get the element with the wrong xmlns" do
308
+ document = @klass.parse("<a xmlns='urn:test2'>hello</a>")
309
+ document.a.should be_nil
310
+ end
311
+
312
+ it "should get an element without xmlns if the empty namespace is desired" do
313
+ document = @klass.parse("<b>hello</b>")
314
+ document.b.should == 'hello'
315
+ end
316
+
317
+ it "should get an element with the right prefix" do
318
+ document = @klass.parse("<p:a xmlns:p='urn:test'>hello</p:a>")
319
+ document.a.should == 'hello'
320
+ end
321
+
322
+ it "should not get an element with the wrong prefix" do
323
+ document = @klass.parse("<x:a xmlns:p='urn:test' xmlns:x='urn:test2'>hello</x:a>")
324
+ document.a.should be_nil
325
+ end
326
+
327
+ it "should get a prefixed element without xmlns if the empty namespace is desired" do
328
+ pending "this needs a less picky nokogiri push parser"
329
+ document = @klass.parse("<x:b>hello</x:b>")
330
+ document.b.should == 'hello'
331
+ end
332
+
333
+ it "should get the namespaced element even if it's not first" do
334
+ document = @klass.parse("<root xmlns:a='urn:test'><a>foo</a><a>foo</a><a:a>bar</a:a></root>")
335
+ document.a.should == 'bar'
336
+ end
337
+
338
+ it "should parse multiple namespaces" do
339
+ klass = Class.new do
340
+ include SAXMachine
341
+ element :a, :xmlns => 'urn:test'
342
+ element :b, :xmlns => 'urn:test2'
343
+ end
344
+ document = klass.parse("<root xmlns='urn:test' xmlns:b='urn:test2'><b:b>bar</b:b><a>foo</a></root>")
345
+ document.a.should == 'foo'
346
+ document.b.should == 'bar'
347
+ end
348
+
349
+ context "when passing a default namespace" do
350
+ before :all do
351
+ @xmlns = 'urn:test'
352
+ class Inner
353
+ include SAXMachine
354
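+ element :a, :xmlns => @xmlns # NOTE: inside the class body @xmlns is Inner's own class-level ivar (nil), not the 'urn:test' assigned in the before block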
355
+ end
356
+ @outer = Class.new do
357
+ include SAXMachine
358
+ elements :root, :default_xmlns => @xmlns, :class => Inner
359
+ end
360
+ end
361
+
362
+ it "should replace the empty namespace with a default" do
363
+ document = @outer.parse("<root><a>Hello</a></root>")
364
+ document.root[0].a.should == 'Hello'
365
+ end
366
+
367
+ it "should not replace another namespace" do
368
+ document = @outer.parse("<root xmlns='urn:test2'><a>Hello</a></root>")
369
+ document.root[0].a.should == 'Hello'
370
+ end
371
+ end
372
+ end
373
+
374
+ end
375
+ end
376
+
377
+ describe "elements" do
378
+ describe "when parsing multiple elements" do
379
+ before :all do
380
+ @klass = Class.new do
381
+ include SAXMachine
382
+ elements :entry, :as => :entries
383
+ end
384
+ end
385
+
386
+ it "should provide a collection accessor" do
387
+ document = @klass.new
388
+ document.entries << :foo
389
+ document.entries.should == [:foo]
390
+ end
391
+
392
+ it "should parse a single element" do
393
+ document = @klass.parse("<entry>hello</entry>")
394
+ document.entries.should == ["hello"]
395
+ end
396
+
397
+ it "should parse multiple elements" do
398
+ document = @klass.parse("<xml><entry>hello</entry><entry>world</entry></xml>")
399
+ document.entries.should == ["hello", "world"]
400
+ end
401
+
402
+ it "should parse multiple elements when taking an attribute value" do
403
+ attribute_klass = Class.new do
404
+ include SAXMachine
405
+ elements :entry, :as => :entries, :value => :foo
406
+ end
407
+ doc = attribute_klass.parse("<xml><entry foo='asdf' /><entry foo='jkl' /></xml>")
408
+ doc.entries.should == ["asdf", "jkl"]
409
+ end
410
+ end
411
+
412
+ describe "when using the class option" do
413
+ before :each do
414
+ class Foo
415
+ include SAXMachine
416
+ element :title
417
+ end
418
+ @klass = Class.new do
419
+ include SAXMachine
420
+ elements :entry, :as => :entries, :class => Foo
421
+ end
422
+ end
423
+
424
+ it "should parse a single element with children" do
425
+ document = @klass.parse("<entry><title>a title</title></entry>")
426
+ document.entries.size.should == 1
427
+ document.entries.first.title.should == "a title"
428
+ end
429
+
430
+ it "should parse multiple elements with children" do
431
+ document = @klass.parse("<xml><entry><title>title 1</title></entry><entry><title>title 2</title></entry></xml>")
432
+ document.entries.size.should == 2
433
+ document.entries.first.title.should == "title 1"
434
+ document.entries.last.title.should == "title 2"
435
+ end
436
+
437
+ it "should not parse a top level element that is specified only in a child" do
438
+ document = @klass.parse("<xml><title>no parse</title><entry><title>correct title</title></entry></xml>")
439
+ document.entries.size.should == 1
440
+ document.entries.first.title.should == "correct title"
441
+ end
442
+
443
+ it "should parse out an attribute value from the tag that starts the collection" do
444
+ class Foo
445
+ element :entry, :value => :href, :as => :url
446
+ end
447
+ document = @klass.parse("<xml><entry href='http://pauldix.net'><title>paul</title></entry></xml>")
448
+ document.entries.size.should == 1
449
+ document.entries.first.title.should == "paul"
450
+ document.entries.first.url.should == "http://pauldix.net"
451
+ end
452
+ end
453
+
454
+ describe "when desiring sax events" do
455
+ XHTML_XMLNS = "http://www.w3.org/1999/xhtml"
456
+
457
+ before :all do
458
+ @klass = Class.new do
459
+ include SAXMachine
460
+ elements :body, :events => true
461
+ end
462
+ end
463
+
464
+ it "should parse a simple child" do
465
+ document = @klass.parse("<body><p/></body>")
466
+ document.body[0].should == [[:start_element, "", "p", []],
467
+ [:end_element, "", "p"]]
468
+ end
469
+
470
+ it "should parse a simple child with text" do
471
+ document = @klass.parse("<body><p>Hello</p></body>")
472
+ document.body[0].should == [[:start_element, "", "p", []],
473
+ [:chars, "Hello"],
474
+ [:end_element, "", "p"]]
475
+ end
476
+
477
+ it "should parse nested children" do
478
+ document = @klass.parse("<body><p><span/></p></body>")
479
+ document.body[0].should == [[:start_element, "", "p", []],
480
+ [:start_element, "", "span", []],
481
+ [:end_element, "", "span"],
482
+ [:end_element, "", "p"]]
483
+ end
484
+
485
+ it "should parse multiple children" do
486
+ document = @klass.parse("<body><p>Hello</p><p>World</p></body>")
487
+ document.body[0].should == [[:start_element, "", "p", []],
488
+ [:chars, "Hello"],
489
+ [:end_element, "", "p"],
490
+ [:start_element, "", "p", []],
491
+ [:chars, "World"],
492
+ [:end_element, "", "p"]]
493
+ end
494
+
495
+ it "should pass namespaces" do
496
+ document = @klass.parse("<body xmlns='#{XHTML_XMLNS}'><p/></body>")
497
+ document.body[0].should == [[:start_element, XHTML_XMLNS, "p", []],
498
+ [:end_element, XHTML_XMLNS, "p"]]
499
+ end
500
+ end
501
+ end
502
+
503
+ describe "full example" do
504
+ XMLNS_ATOM = "http://www.w3.org/2005/Atom"
505
+ XMLNS_FEEDBURNER = "http://rssnamespace.org/feedburner/ext/1.0"
506
+
507
+ before :each do
508
+ @xml = File.read('spec/sax-machine/atom.xml')
509
+ class AtomEntry
510
+ include SAXMachine
511
+ element :title
512
+ element :name, :as => :author
513
+ element :origLink, :as => :orig_link, :xmlns => XMLNS_FEEDBURNER
514
+ element :summary
515
+ element :content
516
+ element :published
517
+ end
518
+
519
+ class Atom
520
+ include SAXMachine
521
+ element :title
522
+ element :link, :value => :href, :as => :url, :with => {:type => "text/html"}
523
+ element :link, :value => :href, :as => :feed_url, :with => {:type => "application/atom+xml"}
524
+ elements :entry, :as => :entries, :class => AtomEntry, :xmlns => XMLNS_ATOM
525
+ end
526
+ end
527
+
528
+ it "should parse the url" do
529
+ f = Atom.parse(@xml)
530
+ f.url.should == "http://www.pauldix.net/"
531
+ end
532
+
533
+ it "should parse all entries" do
534
+ f = Atom.parse(@xml)
535
+ f.entries.length.should == 5
536
+ end
537
+
538
+ it "should parse the feedburner:origLink" do
539
+ f = Atom.parse(@xml)
540
+ f.entries[0].orig_link.should == 'http://www.pauldix.net/2008/09/marshal-data-to.html'
541
+ end
542
+ end
543
+
544
+ describe "another full example" do
545
+ RSS_XMLNS = 'http://purl.org/rss/1.0/'
546
+ ATOM_XMLNS = 'http://www.w3.org/2005/Atom'
547
+
548
+ class Entry
549
+ include SAXMachine
550
+ element :title, :xmlns => RSS_XMLNS
551
+ element :title, :xmlns => ATOM_XMLNS
552
+ element :link, :xmlns => RSS_XMLNS
553
+ element :link, :xmlns => ATOM_XMLNS, :value => 'href'
554
+ end
555
+
556
+ class Channel
557
+ include SAXMachine
558
+ element :title, :xmlns => RSS_XMLNS
559
+ element :title, :xmlns => ATOM_XMLNS
560
+ element :link, :xmlns => RSS_XMLNS
561
+ element :link, :xmlns => ATOM_XMLNS, :value => 'href'
562
+ elements :entry, :as => :entries, :class => Entry
563
+ elements :item, :as => :entries, :class => Entry
564
+ end
565
+
566
+ class Root
567
+ include SAXMachine
568
+ elements :rss, :as => :channels, :default_xmlns => RSS_XMLNS, :class => Channel
569
+ elements :feed, :as => :channels, :default_xmlns => ATOM_XMLNS, :class => Channel
570
+ end
571
+
572
+ context "when parsing a complex example" do
573
+ before :all do
574
+ @document = Root.parse(<<-XML).channels[0]
575
+ <?xml version="1.0" encoding="UTF-8"?>
576
+ <rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"
577
+ xmlns:content="http://purl.org/rss/1.0/modules/content/"
578
+ xmlns:wfw="http://wellformedweb.org/CommentAPI/"
579
+ xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
580
+ xmlns:dc="http://purl.org/dc/elements/1.1/"
581
+ xmlns:cc="http://web.resource.org/cc/">
582
+ <channel>
583
+ <title>Delicious/tag/pubsubhubbub</title>
584
+ <atom:link rel="self" type="application/rss+xml" href="http://feeds.delicious.com/v2/rss/tag/pubsubhubbub?count=15"/>
585
+ <link>http://delicious.com/tag/pubsubhubbub</link>
586
+ <description>recent bookmarks tagged pubsubhubbub</description>
587
+ </channel>
588
+ </rss>
589
+ XML
590
+ end
591
+
592
+ it "should parse the title" do
593
+ @document.title.should == 'Delicious/tag/pubsubhubbub'
594
+ end
595
+
596
+ it "should parse the link" do
597
+ @document.link.should == 'http://feeds.delicious.com/v2/rss/tag/pubsubhubbub?count=15'
598
+ end
599
+ end
600
+ end
601
+
602
+ describe "yet another full example" do
603
+ context "when parsing a Twitter example" do
604
+ before :all do
605
+ RSS_XMLNS = ['http://purl.org/rss/1.0/', '']
606
+ ATOM_XMLNS = 'http://www.w3.org/2005/Atom' unless defined? ATOM_XMLNS
607
+
608
+ class Link
609
+ include SAXMachine
610
+ end
611
+
612
+ class Entry
613
+ include SAXMachine
614
+ element :title, :xmlns => RSS_XMLNS
615
+ element :link, :xmlns => RSS_XMLNS, :as => :entry_link
616
+ element :title, :xmlns => ATOM_XMLNS, :as => :title
617
+ elements :link, :xmlns => ATOM_XMLNS, :as => :links, :class => Link
618
+ end
619
+
620
+ class Feed
621
+ include SAXMachine
622
+ element :title, :xmlns => RSS_XMLNS, :as => :title
623
+ element :link, :xmlns => RSS_XMLNS, :as => :feed_link
624
+ elements :item, :xmlns => RSS_XMLNS, :as => :entries, :class => Entry
625
+ element :title, :xmlns => ATOM_XMLNS, :as => :title
626
+ elements :link, :xmlns => ATOM_XMLNS, :as => :links, :class => Link
627
+ end
628
+
629
+ @document = Feed.parse(<<-XML)
630
+ <?xml version="1.0" encoding="UTF-8"?>
631
+ <rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
632
+ <channel>
633
+ <atom:link type="application/rss+xml" rel="self" href="http://twitter.com/statuses/user_timeline/5381582.rss"/>
634
+ <title>Twitter / julien51</title>
635
+ <link>http://twitter.com/julien51</link>
636
+ <description>Twitter updates from julien / julien51.</description>
637
+ <language>en-us</language>
638
+ <ttl>40</ttl>
639
+ <item>
640
+ <title>julien51: @github : I get an error when trying to build one of my gems (julien51-sax-machine), it seems related to another gem's gemspec.</title>
641
+ <description>julien51: @github : I get an error when trying to build one of my gems (julien51-sax-machine), it seems related to another gem's gemspec.</description>
642
+ <pubDate>Thu, 30 Jul 2009 01:00:30 +0000</pubDate>
643
+ <guid>http://twitter.com/julien51/statuses/2920716033</guid>
644
+ <link>http://twitter.com/julien51/statuses/2920716033</link>
645
+ </item>
646
+ <item>
647
+ <title>julien51: Hum, San Francisco's summer are delightful. http://bit.ly/VeXt4</title>
648
+ <description>julien51: Hum, San Francisco's summer are delightful. http://bit.ly/VeXt4</description>
649
+ <pubDate>Wed, 29 Jul 2009 23:07:32 +0000</pubDate>
650
+ <guid>http://twitter.com/julien51/statuses/2918869948</guid>
651
+ <link>http://twitter.com/julien51/statuses/2918869948</link>
652
+ </item>
653
+ </channel>
654
+ </rss>
655
+ XML
656
+ end
657
+
658
+ it "should parse the title" do
659
+ @document.title.should == 'Twitter / julien51'
660
+ end
661
+
662
+ it "should find both entries" do
663
+ @document.entries.length.should == 2
664
+ end
665
+ end
666
+ end
667
+ end