wuclan 0.2.0 → 0.2.1
- data/README.textile +81 -6
- data/examples/twitter/parse/parse_twitter_requests.rb +24 -18
- data/examples/twitter/parse/parse_twitter_search_requests.rb +12 -16
- data/examples/twitter/parse/parse_twitter_stream_requests.rb +40 -0
- data/examples/twitter/scrape_twitter_search/README.textile +95 -0
- data/examples/twitter/scrape_twitter_search/edamame_global_config-template.yaml +17 -0
- data/examples/twitter/scrape_twitter_search/scrape_twitter_search.rb +1 -1
- data/examples/twitter/scrape_twitter_search/seed.tsv +19 -0
- data/examples/twitter/scrape_twitter_search/twitter_search_daemons.god +38 -18
- data/lib/wuclan/twitter.rb +2 -0
- data/lib/wuclan/twitter/model.rb +1 -0
- data/lib/wuclan/twitter/model/relationship.rb +34 -26
- data/lib/wuclan/twitter/model/tweet.rb +7 -0
- data/lib/wuclan/twitter/model/twitter_user.rb +5 -0
- data/lib/wuclan/twitter/parse.rb +3 -0
- data/lib/wuclan/twitter/parse/twitter_search_parse.rb +1 -0
- data/lib/wuclan/twitter/scrape.rb +2 -0
- data/lib/wuclan/twitter/scrape/old_skool_request_classes.rb +8 -5
- data/lib/wuclan/twitter/scrape/twitter_ff_ids_request.rb +4 -2
- data/lib/wuclan/twitter/scrape/twitter_json_response.rb +39 -6
- data/lib/wuclan/twitter/scrape/twitter_request_stream.rb +1 -0
- data/lib/wuclan/twitter/scrape/twitter_stream_request.rb +44 -0
- data/wuclan.gemspec +10 -2
- metadata +9 -2
data/README.textile
CHANGED
@@ -1,9 +1,29 @@
|
|
1
1
|
|
2
|
-
|
2
|
+
Wuclan uses "Wukong":http://mrflip.github.com/wukong (Hadoop massive-data processing made easy) and "Monkeyshines":http://mrflip.github.com/monkeyshines (massive-scale directed scraper) to grok the deep structure of social networks. It is designed to scrape in a way that is respectful of the terms and technical limits of each site while being aggressive and efficient with your resources. We use it in practice to collect and analyze social graphs as large as 50 million nodes, 1 billion edges, 500 GB raw data -- all of it actual data extracted in compliance with the site's terms of service.
|
3
3
|
|
4
|
-
|
4
|
+
Currently wuclan handles:
|
5
5
|
|
6
|
-
|
6
|
+
* Twitter -- API
|
7
|
+
* Twitter -- Search
|
8
|
+
* Twitter -- Hosebird
|
9
|
+
* Last.fm
|
10
|
+
* Opensocial
|
11
|
+
|
12
|
+
<notextile><div class="toggle"></notextile>
|
13
|
+
|
14
|
+
h2. Why?
|
15
|
+
|
16
|
+
APIs are nice and all, but they prevent any insight into a) global properties, or b) deep structure. You can't find global word frequency and dispersion, or average clustering coefficient, or calculate pagerank, or determine weighted-shortest-paths connections between two people through an API call. But with a 10 machine hadoop cluster and a good-sized collection of data, you can (and wuclan has scripts to help answer many of those questions).
|
17
|
+
|
18
|
+
Wuclan is strictly meant for such massive-scale investigations. Unless you're planning to do your final analysis on either hadoop or an enterprise-grade database system it's probably not worth the hassle.
|
19
|
+
|
20
|
+
<notextile></div><div class="toggle"></notextile>
|
21
|
+
|
22
|
+
h2. Wuclan: Scraping
|
23
|
+
|
24
|
+
is almost ready for public use. Check back shortly.
|
25
|
+
|
26
|
+
h3. lib/wuclan/*/models
|
7
27
|
|
8
28
|
Defines the Wukong objects we'll most often use
|
9
29
|
|
@@ -12,9 +32,7 @@ Defines the Wukong objects we'll most often use
|
|
12
32
|
* TwitterUser
|
13
33
|
* TwitterUserProfiles
|
14
34
|
|
15
|
-
|
16
|
-
|
17
|
-
h3. lib/wuclan/request
|
35
|
+
h3. lib/wuclan/*/request
|
18
36
|
|
19
37
|
|
20
38
|
* Request -- the basic request metadata
|
@@ -25,4 +43,61 @@ h3. lib/wuclan/request
|
|
25
43
|
ensures that the request is left alone while recordizing.
|
26
44
|
|
27
45
|
|
46
|
+
<notextile></div><div class="toggle"></notextile>
|
47
|
+
|
48
|
+
h2. Wuclan: Analysis
|
49
|
+
|
50
|
+
Most of this still lives in the imw_twitter_friends repo.
|
51
|
+
|
52
|
+
<notextile></div><div class="toggle"></notextile>
|
53
|
+
|
54
|
+
h2. Install
|
55
|
+
|
56
|
+
** "Main Install and Setup Documentation":http://mrflip.github.com/edamame/INSTALL.html **
|
57
|
+
|
58
|
+
h3. Get the code
|
59
|
+
|
60
|
+
We're still actively developing edamame. The newest version is available via "Git":http://git-scm.com on "github:":http://github.com/mrflip/edamame
|
61
|
+
|
62
|
+
pre. $ git clone git://github.com/mrflip/edamame
|
63
|
+
|
64
|
+
A gem is available from "gemcutter:":http://gemcutter.org/gems/edamame
|
65
|
+
|
66
|
+
pre. $ sudo gem install edamame --source=http://gemcutter.org
|
67
|
+
|
68
|
+
(don't use the gems.github.com version -- it's way out of date.)
|
69
|
+
|
70
|
+
You can instead download this project in either "zip":http://github.com/mrflip/edamame/zipball/master or "tar":http://github.com/mrflip/edamame/tarball/master formats.
|
71
|
+
|
72
|
+
h3. Get the Dependencies
|
73
|
+
|
74
|
+
To finish setting up, see the "detailed setup instructions":http://mrflip.github.com/edamame/INSTALL.html and then read the "usage notes":http://mrflip.github.com/edamame/usage.html
|
75
|
+
|
76
|
+
* "beanstalkd 1.3,":http://xph.us/dist/beanstalkd/ "libevent 1.4,":http://monkey.org/~provos/libevent/ and "beanstalk-client":http://github.com/dustin/beanstalk-client
|
77
|
+
* "Tokyo Tyrant,":http://tokyocabinet.sourceforge.net/tyrantdoc/ "Tokyo Tyrant Ruby libs,":http://tokyocabinet.sourceforge.net/tyrantrubydoc/ "Tokyo Cabinet,":http://tokyocabinet.sourceforge.net and "Tokyo Cabinet Ruby libs":http://tokyocabinet.sourceforge.net/tyrantdoc/
|
78
|
+
* Gems: "wukong":http://mrflip.github.com/wukong and "monkeyshines":http://mrflip.github.com/monkeyshines
|
79
|
+
|
80
|
+
See the "Detailed install instructions":http://mrflip.github.com/edamame/INSTALL.html (it also has hints about installing Tokyo*, Beanstalkd and friends).
|
81
|
+
|
82
|
+
<notextile></div><div class="toggle"></notextile>
|
83
|
+
|
28
84
|
h3. lib/wuclan/
|
85
|
+
|
86
|
+
|
87
|
+
---------------------------------------------------------------------------
|
88
|
+
|
89
|
+
<notextile><div class="toggle"></notextile>
|
90
|
+
|
91
|
+
h2. More info
|
92
|
+
|
93
|
+
There are many useful examples in the examples/ directory.
|
94
|
+
|
95
|
+
h3. Credits
|
96
|
+
|
97
|
+
wuclan was written by "Philip (flip) Kromer":http://mrflip.com (flip@infochimps.org / "@mrflip":http://twitter.com/mrflip) for the "infochimps project":http://infochimps.org
|
98
|
+
|
99
|
+
h3. Help!
|
100
|
+
|
101
|
+
Send wuclan questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code
|
102
|
+
|
103
|
+
<notextile></div></notextile>
|
@@ -5,23 +5,24 @@ require 'wukong'
|
|
5
5
|
require 'monkeyshines'
|
6
6
|
|
7
7
|
require 'wuclan/twitter'
|
8
|
-
# if you're anyone but original author this next require is useless but harmless.
|
9
|
-
require 'wuclan/twitter/scrape/old_skool_request_classes'
|
10
8
|
# un-namespace request classes.
|
11
9
|
include Wuclan::Twitter::Scrape
|
12
10
|
include Wuclan::Twitter::Model
|
11
|
+
# if you're anyone but original author this next require is useless but harmless.
|
12
|
+
require 'wuclan/twitter/scrape/old_skool_request_classes'
|
13
13
|
|
14
14
|
#
|
15
|
+
# Incoming objects are Wuclan::Twitter::Scrape requests.
|
15
16
|
#
|
16
|
-
#
|
17
|
-
#
|
18
|
-
#
|
17
|
+
# Their #parse method disgorges a stream of Wuclan::Twitter::Model objects, as
|
18
|
+
# few or as many as found. For example, a twitter_user_request will presumably
|
19
|
+
# have a twitter_user record if it is healthy, but may not have a tweet (if the
|
20
|
+
# user hasn't ever tweeted) and might not have profile or style info (if the
|
21
|
+
# user is protected).
|
19
22
|
#
|
20
23
|
class TwitterRequestParser < Wukong::Streamer::StructStreamer
|
21
|
-
|
22
24
|
def process request, *args, &block
|
23
25
|
request.parse(*args) do |obj|
|
24
|
-
next if obj.is_a? BadRecord
|
25
26
|
yield obj.to_flat(false)
|
26
27
|
end
|
27
28
|
end
|
@@ -31,29 +32,34 @@ end
|
|
31
32
|
# We want to record each individual state of the resource, with the last-seen of
|
32
33
|
# its timestamps (if there are many). So if we saw
|
33
34
|
#
|
34
|
-
# rsrc id screen_name followers_count friends_count (...
|
35
|
-
# user 23 skidoo 47 61
|
36
|
-
# user 23 skidoo 48 62
|
37
|
-
# user 23 skidoo 48 62
|
38
|
-
# user 23 skidoo 52 62
|
39
|
-
# user 23 skidoo 52
|
35
|
+
# rsrc id screen_name followers_count friends_count (...) scraped_at
|
36
|
+
# user 23 skidoo 47 61 20090608
|
37
|
+
# user 23 skidoo 48 62 20090802
|
38
|
+
# user 23 skidoo 48 62 20090901
|
39
|
+
# user 23 skidoo 52 62 20090920
|
40
|
+
# user 23 skidoo 52 62 20090922
|
41
|
+
# user 23 skidoo 52 63 20090923
|
42
|
+
#
|
43
|
+
# we would only keep
|
40
44
|
#
|
45
|
+
# user 23 skidoo 47 61 20090608
|
46
|
+
# user 23 skidoo 48 62 20090802
|
47
|
+
# user 23 skidoo 52 62 20090920
|
48
|
+
# user 23 skidoo 52 63 20090923
|
41
49
|
#
|
42
50
|
class TwitterRequestUniqer < Wukong::Streamer::UniqByLastReducer
|
43
51
|
include Wukong::Streamer::StructRecordizer
|
44
|
-
|
45
52
|
attr_accessor :uniquer_count
|
46
53
|
|
47
54
|
#
|
48
|
-
#
|
49
|
-
#
|
55
|
+
# FIXME -- move this into the models themselves.
|
50
56
|
#
|
51
57
|
# for immutable objects we can just work off their ID.
|
52
58
|
#
|
53
59
|
# for mutable objects we want to record each unique state: all the fields
|
54
60
|
# apart from the scraped_at timestamp.
|
55
61
|
#
|
56
|
-
def get_key obj
|
62
|
+
def get_key obj, *_
|
57
63
|
case obj
|
58
64
|
when Tweet
|
59
65
|
obj.id
|
@@ -71,7 +77,7 @@ class TwitterRequestUniqer < Wukong::Streamer::UniqByLastReducer
|
|
71
77
|
super *args
|
72
78
|
end
|
73
79
|
|
74
|
-
def accumulate obj
|
80
|
+
def accumulate obj, *_
|
75
81
|
self.uniquer_count += 1
|
76
82
|
self.final_value = [self.uniquer_count, obj.to_flat].flatten
|
77
83
|
end
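The uniquing rule described in the comments above (one record per distinct state, keyed on every field except @scraped_at@) can be sketched without Wukong at all. This is a minimal illustration, not the gem's implementation: @UserState@, @state_key@ and @uniq_by_state@ are hypothetical names, and the sketch assumes the last-seen timestamp is the one to keep, following the "last-seen" wording of the comment.

```ruby
# Stand-in for a flattened user record; the real objects are
# Wuclan::Twitter::Model structs (this Struct is an assumption for illustration).
UserState = Struct.new(:rsrc, :id, :screen_name,
                       :followers_count, :friends_count, :scraped_at)

# Key on every field except the trailing scraped_at timestamp.
def state_key(record)
  record.to_a[0..-2]
end

# Keep one record per distinct state, carrying its last-seen timestamp.
def uniq_by_state(records)
  latest = {}
  records.sort_by(&:scraped_at).each { |rec| latest[state_key(rec)] = rec }
  latest.values
end

observations = [
  ['user', 23, 'skidoo', 47, 61, 20090608],
  ['user', 23, 'skidoo', 48, 62, 20090802],
  ['user', 23, 'skidoo', 48, 62, 20090901],
  ['user', 23, 'skidoo', 52, 62, 20090920],
  ['user', 23, 'skidoo', 52, 62, 20090922],
  ['user', 23, 'skidoo', 52, 63, 20090923],
].map { |fields| UserState.new(*fields) }

uniq_by_state(observations).each { |rec| puts rec.to_a.join("\t") }
```

Keeping the first-seen timestamp instead would be a one-line change (@latest[state_key(rec)] ||= rec@).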
|
@@ -1,28 +1,24 @@
|
|
1
1
|
#!/usr/bin/env ruby
|
2
|
-
#$: << ENV['WUKONG_PATH']
|
3
2
|
require 'rubygems'
|
4
3
|
require 'wukong'
|
5
|
-
require '
|
6
|
-
|
7
|
-
require 'wuclan/twitter'
|
8
|
-
require 'wuclan/twitter/scrape/twitter_search_request'
|
9
|
-
require 'wuclan/twitter/parse/twitter_search_parse'
|
4
|
+
require 'wuclan/twitter';
|
5
|
+
require 'wuclan/twitter/parse';
|
10
6
|
include Wuclan::Twitter::Scrape
|
11
7
|
|
12
|
-
#
|
13
|
-
#
|
14
|
-
# Instantiate each incoming request.
|
15
|
-
# Stream out the contained classes it generates.
|
16
|
-
#
|
17
|
-
#
|
18
8
|
class TwitterRequestParser < Wukong::Streamer::StructStreamer
|
9
|
+
#
|
10
|
+
# Object: parse thyself.
|
11
|
+
#
|
19
12
|
def process request, *args, &block
|
20
13
|
request.parse(*args) do |obj|
|
21
|
-
next if obj.is_a?
|
22
|
-
yield obj
|
14
|
+
next if obj.blank? || obj.is_a?(BadRecord)
|
15
|
+
yield obj
|
23
16
|
end
|
24
17
|
end
|
25
18
|
end
|
26
19
|
|
27
|
-
#
|
28
|
-
Wukong::Script.new(
|
20
|
+
# Go, script, go!
|
21
|
+
Wukong::Script.new(
|
22
|
+
TwitterRequestParser,
|
23
|
+
nil
|
24
|
+
).run
|
@@ -0,0 +1,40 @@
|
|
1
|
+
#!/usr/bin/env ruby
|
2
|
+
require 'rubygems'
|
3
|
+
require 'wukong'
|
4
|
+
require 'monkeyshines';
|
5
|
+
require 'wuclan/twitter';
|
6
|
+
require 'wuclan/twitter/scrape/twitter_search_request';
|
7
|
+
require 'wuclan/twitter/parse/twitter_search_parse';
|
8
|
+
include Wuclan::Twitter::Scrape
|
9
|
+
|
10
|
+
|
11
|
+
|
12
|
+
#
|
13
|
+
# Twitter stream requests
|
14
|
+
#
|
15
|
+
# http://apiwiki.twitter.com/Streaming-API-Documentation
|
16
|
+
#
|
17
|
+
# Fills a file with JSON status records, one line per status.
|
18
|
+
#
|
19
|
+
# {"text":"Hey #bigdata #hadoop geeks: who's missing? @mrflip/bigdata / http://bit.ly/datatweeps","favorited":false,"geo":null,"in_reply_to_screen_name":null,"source":"web","created_at":"Thu Oct 29 09:29:32 +0000 2009","user":{"verified":false,"notifications":null,"profile_text_color":"000000","time_zone":"Central Time (US & Canada)","following":null,"profile_link_color":"0000ff","profile_image_url":"http://a3.twimg.com/profile_images/377919497/FlipCircle-2009-900-trans_normal.png","profile_background_image_url":"http://a3.twimg.com/profile_background_images/2348065/2005Mar-AustinTypeTour-075_-_Rappers_Delight_Raindrop.jpg","description":"Increasing access to free open data, building tools to Organize, Explore and Comprehend massive data sources - http://infochimps.org","location":"iPhone: 30.316122,-97.733817","profile_sidebar_fill_color":"ffffff","screen_name":"mrflip","profile_background_tile":false,"profile_sidebar_border_color":"f0edd8","statuses_count":1307,"followers_count":678,"protected":false,"url":"http://infochimps.org","created_at":"Mon Mar 19 21:08:24 +0000 2007","friends_count":514,"name":"Philip Flip Kromer","geo_enabled":false,"profile_background_color":"BCC0C8","id":1554031,"utc_offset":-21600,"favourites_count":61},"id":5254924802,"in_reply_to_user_id":null,"in_reply_to_status_id":null,"truncated":false}
|
20
|
+
#
|
21
|
+
# Try it with
|
22
|
+
# twuserpass='name:pass'
|
23
|
+
# curl -s -u $twuserpass http://stream.twitter.com/1/statuses/sample.json > /tmp/sample.json
|
24
|
+
# cat /tmp/sample.json | parse_twitter_stream_requests.rb --map
|
25
|
+
#
|
26
|
+
class TwitterRequestParser < Wukong::Streamer::RecordStreamer
|
27
|
+
def recordize *args
|
28
|
+
foo = args.first
|
29
|
+
[ TwitterStreamRequest.new(super(*args).first) ]
|
30
|
+
end
|
31
|
+
def process request, *args, &block
|
32
|
+
request.parse(*args) do |obj|
|
33
|
+
next if obj.is_a? BadRecord
|
34
|
+
yield obj.to_flat(false) # if obj.is_a?(DeleteTweet)
|
35
|
+
end
|
36
|
+
end
|
37
|
+
end
|
38
|
+
|
39
|
+
# This makes the script go.
|
40
|
+
Wukong::Script.new(TwitterRequestParser, nil).run
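For readers who want to poke at the one-JSON-status-per-line format before wiring up Wukong, here is a rough standalone sketch using only Ruby's standard @json@ library. @summarize_status@ is a hypothetical helper, not part of wuclan, and the assumption that lines without a @user@ hash (such as delete notices) should be skipped is mine.

```ruby
require 'json'

# Pull a few fields out of one line of stream output.
# Returns nil for unparseable lines and for records without a user hash.
def summarize_status(line)
  status = JSON.parse(line)
  return nil unless status.is_a?(Hash) && status['user'].is_a?(Hash)
  {
    'id'          => status['id'],
    'screen_name' => status['user']['screen_name'],
    'text'        => status['text'],
  }
rescue JSON::ParserError
  nil
end

sample = '{"text":"Hey #bigdata #hadoop geeks","id":5254924802,' \
         '"user":{"screen_name":"mrflip","id":1554031}}'
p summarize_status(sample)
```

Feeding it a file is then a matter of @File.foreach('/tmp/sample.json') { |line| p summarize_status(line) }@.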
|
@@ -0,0 +1,95 @@
|
|
1
|
+
This is actually less rickety than it seems, but you'll have to hand edit a few paths and config files. Feel free to suggest a more polite organization of it all.
|
2
|
+
|
3
|
+
h2. Initial setup
|
4
|
+
|
5
|
+
* Install prerequisites using rubygems:
|
6
|
+
|
7
|
+
<pre>
|
8
|
+
sudo gem install htmlentities extlib god
|
9
|
+
</pre>
|
10
|
+
|
11
|
+
* check out each of the "monkeyshines":http://github.com/mrflip/monkeyshines, "wukong":http://github.com/mrflip/wukong, "wuclan":http://github.com/mrflip/wuclan, and "edamame":http://github.com/mrflip/edamame repos using git, preferably as neighbors in the same directory.
|
12
|
+
|
13
|
+
* follow instructions from http://mrflip.github.com/edamame/INSTALL.html for beanstalkd and tokyo tyrant
|
14
|
+
|
15
|
+
h2. Find the scraper
|
16
|
+
|
17
|
+
Although you can install wuclan as a gem, I actually recommend installing it from git source:
|
18
|
+
|
19
|
+
<pre>
|
20
|
+
git clone git://github.com/mrflip/wuclan.git
|
21
|
+
</pre>
|
22
|
+
|
23
|
+
You'll run the scraper from
|
24
|
+
|
25
|
+
<pre>
|
26
|
+
wuclan/examples/twitter/scrape_twitter_search
|
27
|
+
</pre>
|
28
|
+
|
29
|
+
h2. Make the scrape destination
|
30
|
+
|
31
|
+
You will need to set up a landing place for the files, probably by editing the work/ symlink (sorry, this is kludgy and should be fixed).
|
32
|
+
|
33
|
+
The naming scheme I use is good for running scrapers against a lot of targets. From the wuclan/examples/twitter/scrape_twitter_search directory:
|
34
|
+
|
35
|
+
<pre>
|
36
|
+
mkdir ../../../../data/ripd/com.tw/com.twitter.search
|
37
|
+
</pre>
|
38
|
+
|
39
|
+
(this constructs a tree that is a sibling of the wuclan dir).
|
40
|
+
|
41
|
+
Wherever you put the scrape destination,
|
42
|
+
* DO NOT add it to your code's git repo
|
43
|
+
* exclude it from spotlight indexing and so forth
|
44
|
+
|
45
|
+
h2. Add your search terms to seed.tsv
|
46
|
+
|
47
|
+
To add a search job, edit seed.tsv: add each search phrase and its priority, separated by a tab (lower priority == more important). **Don't** url-encode your query terms. Spaces will be replaced by plus signs (+) and other non-alphanumerics will be url-encoded.
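To make the seed format concrete, here is a small sketch of reading one seed.tsv line and encoding the phrase. @parse_seed_line@ is a hypothetical helper, and using @URI.encode_www_form_component@ is an assumption that approximates the encoding described above; the scraper's own encoding may differ in detail.

```ruby
require 'uri'

# Parse one line of seed.tsv: "<search phrase>\t<priority>[\t...extra columns]".
# Comment lines and blank lines yield nil.
def parse_seed_line(line)
  text = line.chomp
  return nil if text.strip.empty? || text.start_with?('#')
  phrase, priority = text.split("\t")
  {
    :phrase   => phrase,
    :priority => priority.to_i,
    # Form encoding: spaces become '+', most other non-alphanumerics become %XX.
    :query    => URI.encode_www_form_component(phrase),
  }
end

p parse_seed_line("red sox\t1000")
```

Extra columns, such as those in recycled @dump_twitter_search_jobs@ output, are simply ignored by this sketch.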
|
48
|
+
|
49
|
+
h2. Start the queue daemons
|
50
|
+
|
51
|
+
Copy @edamame_global_config-template.yaml@ to your scrape destination, and name it @edamame_global_config.yaml@. Also, edit the @./twitter_search_daemons.god@ file to indicate the scrape destination.
|
52
|
+
|
53
|
+
Use god to start the daemons: @sudo god -c ./twitter_search_daemons.god@ (add the -D flag to debug)
|
54
|
+
|
55
|
+
h2. Load the search terms
|
56
|
+
|
57
|
+
Load this data with
|
58
|
+
|
59
|
+
<pre>
|
60
|
+
./load_twitter_search_jobs.rb --handle=com.twitter.search --source-filename=./seed_lim.tsv
|
61
|
+
</pre>
|
62
|
+
|
63
|
+
You can check it was loaded with
|
64
|
+
|
65
|
+
<pre>
|
66
|
+
/path/to/edamame/bin/edamame-sync --handle=com.twitter.search --store=:11241 --queue=:11240
|
67
|
+
</pre>
|
68
|
+
|
69
|
+
(This unloads all jobs from the transient queue and stuffs them back in from the database).
|
70
|
+
|
71
|
+
Empty all search queues with
|
72
|
+
|
73
|
+
<pre>
|
74
|
+
/path/to/edamame/bin/edamame-nuke --handle=com.twitter.search --store=:11241 --queue=:11240
|
75
|
+
</pre>
|
76
|
+
|
77
|
+
h2. Run the scraper
|
78
|
+
|
79
|
+
<pre>
|
80
|
+
nohup ./scrape_twitter_search.rb --handle=com.twitter.search >> work/log/twitter_search-console.log 2>&1 &
|
81
|
+
</pre>
|
82
|
+
|
83
|
+
This will run forever. Check its progress with
|
84
|
+
|
85
|
+
<pre>
|
86
|
+
tail -f work/log/twitter_search-console.log
|
87
|
+
</pre>
|
88
|
+
|
89
|
+
If you want to watch the output files,
|
90
|
+
|
91
|
+
<pre>
|
92
|
+
datename=`date "+%Y%m%d"` ; tail -f work/$datename/* | cut -c 1-2000
|
93
|
+
</pre>
|
94
|
+
|
95
|
+
Be careful dumping the output files to screen -- each line can be tens of thousands of characters long and will lock your terminal right up.
|
@@ -0,0 +1,17 @@
|
|
1
|
+
--- # -*- YAML -*-
|
2
|
+
#
|
3
|
+
# Save this file in your god dir, *then* change the settings below.
|
4
|
+
# Make sure your version control system is set to ignore the file.
|
5
|
+
#
|
6
|
+
|
7
|
+
:email:
|
8
|
+
:domain: your.domain.com
|
9
|
+
:username: robot@your.domain.com
|
10
|
+
:password: YOURPASSWORD
|
11
|
+
:to: people_who_can_fix_errors@your.domain.com
|
12
|
+
:to_name: People who can fix the scraper
|
13
|
+
|
14
|
+
# these apply to all processes
|
15
|
+
:god_process:
|
16
|
+
:flapping_notify: default
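A quick way to sanity-check an edited copy of this template is to load it from Ruby and look up a couple of keys. The sketch below is an assumption-laden illustration: the nesting of the @:email:@ and @:god_process:@ sections is guessed (indentation is not visible in this view), and it uses plain @YAML.safe_load@ rather than whatever loader edamame itself uses.

```ruby
require 'yaml'

# Inline copy of the template, with the nesting ASSUMED for illustration.
TEMPLATE = <<~YAML
  ---
  :email:
    :domain:   your.domain.com
    :username: robot@your.domain.com
    :password: YOURPASSWORD
    :to:       people_who_can_fix_errors@your.domain.com
    :to_name:  People who can fix the scraper
  :god_process:
    :flapping_notify: default
YAML

# Symbol keys must be explicitly permitted under safe loading.
config = YAML.safe_load(TEMPLATE, permitted_classes: [Symbol])

puts config[:email][:domain]
puts config[:god_process][:flapping_notify]
```

To check a real file, swap @TEMPLATE@ for @File.read('/path/to/edamame_global_config.yaml')@.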
|
17
|
+
|
@@ -34,7 +34,7 @@ loop do
|
|
34
34
|
:dest => { :type => :chunked_flat_file_store, :rootdir => WORK_DIR, :filemode => 'a' },
|
35
35
|
# :dest => { :type => :flat_file_store, :filename => WORK_DIR+"/test_output.tsv" },
|
36
36
|
# :fetcher => { :type => TwitterSearchFakeFetcher },
|
37
|
-
:sleep_time => 1 ,
|
37
|
+
:sleep_time => 1.25 ,
|
38
38
|
})
|
39
39
|
Log.info "Starting a run!"
|
40
40
|
scraper.run
|
@@ -0,0 +1,19 @@
|
|
1
|
+
# See the readme for instructions.
|
2
|
+
|
3
|
+
# To add a search job, put the search phrase and its priority, separated by a
|
4
|
+
# tab (Lower priority == more important).
|
5
|
+
|
6
|
+
red sox 1000
|
7
|
+
yankees 1000
|
8
|
+
|
9
|
+
# You can recycle the output of dump_twitter_search_jobs
|
10
|
+
hadoop 50 0.103063874053513 4985477488 3852 9700.94447529118952 20091019021751
|
11
|
+
infochimp 50 0.106963481416675 4827288395 62 9347.39116701332932 20091019022759
|
12
|
+
infochimps 50 0.102891460905350 4922555326 575 9717.31832415175086 20091019025808
|
13
|
+
semantic 50 0.103387063739869 4985327841 4400 9670.39361100528913 20091019021435
|
14
|
+
semanticweb 50 0.102956390169747 4986072386 3156 9711.01359700656940 20091019025943
|
15
|
+
|
16
|
+
# These will quickly generate a buttload of data
|
17
|
+
# RT 110000 6.977299880525690 4986408447 9628514 115.87012880821899 20091019032753
|
18
|
+
# http 110000 28.312757201646100 4986411833 70327665 6.90163844186046 20091019032825
|
19
|
+
# twitter 110000 1.319672131147540 4986376554 7432567 733.71915915527904 20091019032511
|
@@ -1,25 +1,45 @@
|
|
1
|
-
|
2
|
-
require '
|
3
|
-
|
1
|
+
require 'yaml'
|
2
|
+
require 'extlib'
|
3
|
+
require 'wukong/extensions/hash'
|
4
|
+
require "edamame/monitoring"
|
4
5
|
|
5
6
|
#
|
6
|
-
#
|
7
|
+
# You can load this file with
|
8
|
+
# sudo god -c ./twitter_search_daemons.god
|
9
|
+
# To debug, run
|
10
|
+
# sudo god -c ./twitter_search_daemons.god -D
|
7
11
|
#
|
8
|
-
|
12
|
+
|
13
|
+
#
|
14
|
+
# Change this to point to your scrape destination.
|
15
|
+
#
|
16
|
+
WORK_DIR = '/data/ripd/com.tw/com.twitter.search'
|
17
|
+
|
18
|
+
#
|
19
|
+
# Also, make a copy of edamame_global_config-template.yaml in that directory,
|
20
|
+
# but rename it edamame_global_config.yaml and edit it to suit.
|
9
21
|
#
|
10
|
-
|
22
|
+
GodProcess::GLOBAL_SITE_OPTIONS_FILES << WORK_DIR+'/edamame_global_config.yaml'
|
23
|
+
|
24
|
+
# Files will be timestamped by when god is started.
|
25
|
+
DATESTAMP = Time.now.utc.strftime("%Y%m%d")
|
26
|
+
|
27
|
+
# Uncomment for a bunch of diagnostics:
|
28
|
+
# p GodProcess.global_site_options,
|
29
|
+
# TyrantGod.site_options, TyrantGod.default_options.deep_merge(TyrantGod.site_options),
|
30
|
+
# GodProcess.site_options
|
31
|
+
|
11
32
|
#
|
12
|
-
#
|
33
|
+
# Define email notifiers and attach one by default
|
13
34
|
#
|
14
|
-
|
15
|
-
|
16
|
-
[BeanstalkdGod, { :port => 11240, :max_mem_usage => 100.megabytes, }],
|
17
|
-
[TyrantGod, { :port => 11241, :db_dirname => WORK_DIR, :db_name => "twitter_search-queue.tct" }],
|
18
|
-
#
|
19
|
-
# [TyrantGod, { :port => 11249, :db_dirname => WORK_DIR, :db_name => "twitter_search-flat.tct" }],
|
20
|
-
]
|
35
|
+
God.setup_email GodProcess.global_site_options[:email]
|
36
|
+
GodProcess::DEFAULT_OPTIONS[:flapping_notify] = 'default'
|
21
37
|
|
22
|
-
|
23
|
-
|
24
|
-
|
25
|
-
|
38
|
+
#
|
39
|
+
# Twitter Search
|
40
|
+
#
|
41
|
+
handle = 'comtwittersearch'
|
42
|
+
base_port = 11220
|
43
|
+
db_dirname = WORK_DIR+'/distdb/'+DATESTAMP
|
44
|
+
BeanstalkdGod.create :port => base_port + 0, :max_mem_usage => 100.megabytes
|
45
|
+
TyrantGod.create :port => base_port + 1, :db_name => handle+'-queue.tct', :db_dirname => db_dirname
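The port and directory arithmetic in the god file is easy to lose track of, so here is a minimal sketch that computes the same values in isolation. @daemon_layout@ is a hypothetical helper written for this illustration; it does not call god, BeanstalkdGod or TyrantGod.

```ruby
# Derive the queue/store ports and the datestamped database location
# from the same inputs the .god file uses.
def daemon_layout(work_dir, handle, base_port, now = Time.now.utc)
  datestamp = now.strftime("%Y%m%d")   # fixed when god starts, as in the file above
  {
    :queue_port => base_port + 0,      # beanstalkd
    :store_port => base_port + 1,      # tokyo tyrant
    :db_dirname => File.join(work_dir, 'distdb', datestamp),
    :db_name    => "#{handle}-queue.tct",
  }
end

layout = daemon_layout('/data/ripd/com.tw/com.twitter.search',
                       'comtwittersearch', 11220, Time.utc(2009, 10, 19))
layout.each { |key, value| puts "#{key}\t#{value}" }
```

The resulting ports are the ones to pass as @--queue@ and @--store@ to the edamame command-line tools.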
|
data/lib/wuclan/twitter.rb
CHANGED
data/lib/wuclan/twitter/model.rb
CHANGED
@@ -9,6 +9,7 @@ module Wuclan
|
|
9
9
|
autoload :TwitterUserSearchId, 'wuclan/twitter/model/twitter_user'
|
10
10
|
autoload :TwitterUserId, 'wuclan/twitter/model/twitter_user'
|
11
11
|
autoload :Tweet, 'wuclan/twitter/model/tweet'
|
12
|
+
autoload :DeleteTweet, 'wuclan/twitter/model/tweet'
|
12
13
|
autoload :SearchTweet, 'wuclan/twitter/model/tweet'
|
13
14
|
autoload :AFollowsB, 'wuclan/twitter/model/relationship'
|
14
15
|
autoload :AFavoritesB, 'wuclan/twitter/model/relationship'
|
@@ -7,6 +7,14 @@ module Wuclan::Twitter::Model
|
|
7
7
|
end
|
8
8
|
end
|
9
9
|
|
10
|
+
def status_id
|
11
|
+
tweet_id
|
12
|
+
end
|
13
|
+
|
14
|
+
def in_reply_to_status_id
|
15
|
+
in_reply_to_tweet_id
|
16
|
+
end
|
17
|
+
|
10
18
|
def self.included base
|
11
19
|
base.class_eval{ extend ClassMethods }
|
12
20
|
end
|
@@ -28,43 +36,43 @@ module Wuclan::Twitter::Model
|
|
28
36
|
class AFavoritesB < TypedStruct.new(
|
29
37
|
[:user_a_id, Integer],
|
30
38
|
[:user_b_id, Integer],
|
31
|
-
[:
|
39
|
+
[:tweet_id, Integer]
|
32
40
|
)
|
33
41
|
include ModelCommon
|
34
42
|
include RelationshipBase
|
35
|
-
# Key on user_a-user_b-
|
43
|
+
# Key on user_a-user_b-tweet_id (really just user_a-tweet_id is enough)
|
36
44
|
def num_key_fields() 3 end
|
37
|
-
def numeric_id_fields() [:user_a_id, :user_b_id, :
|
45
|
+
def numeric_id_fields() [:user_a_id, :user_b_id, :tweet_id] ; end
|
38
46
|
end
|
39
47
|
|
40
48
|
# Direct (threaded) replies: occur at the start of a tweet.
|
41
49
|
class ARepliesB < TypedStruct.new(
|
42
50
|
[:user_a_id, Integer],
|
43
51
|
[:user_b_id, Integer],
|
44
|
-
[:
|
45
|
-
[:
|
52
|
+
[:tweet_id, Integer],
|
53
|
+
[:in_reply_to_tweet_id, Integer]
|
46
54
|
)
|
47
55
|
include ModelCommon
|
48
56
|
include RelationshipBase
|
49
|
-
# Key on user_a-user_b-
|
57
|
+
# Key on user_a-user_b-tweet_id
|
50
58
|
def num_key_fields() 3 end
|
51
|
-
def numeric_id_fields() [:user_a_id, :user_b_id, :
|
59
|
+
def numeric_id_fields() [:user_a_id, :user_b_id, :tweet_id, :in_reply_to_tweet_id] ; end
|
52
60
|
end
|
53
61
|
|
54
62
|
# Direct (threaded) replies: occur at the start of a tweet.
|
55
63
|
class ARepliesBName < TypedStruct.new(
|
56
|
-
[:user_a_name,
|
57
|
-
[:user_b_name,
|
58
|
-
[:
|
59
|
-
[:
|
60
|
-
[:user_a_sid,
|
61
|
-
[:user_b_sid,
|
64
|
+
[:user_a_name, String],
|
65
|
+
[:user_b_name, String],
|
66
|
+
[:tweet_id, Integer],
|
67
|
+
[:in_reply_to_tweet_id, Integer],
|
68
|
+
[:user_a_sid, Integer],
|
69
|
+
[:user_b_sid, Integer]
|
62
70
|
)
|
63
71
|
include ModelCommon
|
64
72
|
include RelationshipBase
|
65
|
-
# Key on user_a-user_b-
|
73
|
+
# Key on user_a-user_b-tweet_id
|
66
74
|
def num_key_fields() 3 end
|
67
|
-
def numeric_id_fields() [:user_a_id, :user_b_id, :
|
75
|
+
def numeric_id_fields() [:user_a_id, :user_b_id, :tweet_id, :in_reply_to_tweet_id] ; end
|
68
76
|
end
|
69
77
|
|
70
78
|
# Atsign mentions anywhere in the tweet
|
@@ -72,13 +80,13 @@ module Wuclan::Twitter::Model
|
|
72
80
|
class AAtsignsB < TypedStruct.new(
|
73
81
|
[:user_a_id, Integer],
|
74
82
|
[:user_b_name, String],
|
75
|
-
[:
|
83
|
+
[:tweet_id, Integer]
|
76
84
|
)
|
77
85
|
include ModelCommon
|
78
86
|
include RelationshipBase
|
79
|
-
# Key on user_a-user_b-
|
87
|
+
# Key on user_a-user_b-tweet_id
|
80
88
|
def num_key_fields() 3 end
|
81
|
-
def numeric_id_fields() [:user_a_id, :
|
89
|
+
def numeric_id_fields() [:user_a_id, :tweet_id] ; end
|
82
90
|
end
|
83
91
|
|
84
92
|
# Atsign mentions anywhere in the tweet
|
@@ -86,13 +94,13 @@ module Wuclan::Twitter::Model
|
|
86
94
|
class AAtsignsBId < TypedStruct.new(
|
87
95
|
[:user_a_id, Integer],
|
88
96
|
[:user_b_id, Integer],
|
89
|
-
[:
|
97
|
+
[:tweet_id, Integer]
|
90
98
|
)
|
91
99
|
include ModelCommon
|
92
100
|
include RelationshipBase
|
93
|
-
# Key on user_a-user_b-
|
101
|
+
# Key on user_a-user_b-tweet_id
|
94
102
|
def num_key_fields() 3 end
|
95
|
-
def numeric_id_fields() [:user_a_id, :user_b_id, :
|
103
|
+
def numeric_id_fields() [:user_a_id, :user_b_id, :tweet_id] ; end
|
96
104
|
end
|
97
105
|
|
98
106
|
|
@@ -112,7 +120,7 @@ module Wuclan::Twitter::Model
|
|
112
120
|
# non-retweet-whore-requests have user_b_name set and unset respectively.)
|
113
121
|
#
|
114
122
|
# +user_a_id:+ the user who sent the re-tweet
|
115
|
-
# +
|
123
|
+
# +tweet_id:+ the id of the tweet *containing* the re-tweet (for the ID of the original tweet you're on your own.)
|
116
124
|
# +user_b_name:+ the user cited as originating: RT @user_b_name
|
117
125
|
# +please_flag:+ a 1 if the text contains 'please' or 'plz' as a stand-alone word
|
118
126
|
# +text:+ the *full* text of the tweet
|
@@ -120,7 +128,7 @@ module Wuclan::Twitter::Model
|
|
120
128
|
class ARetweetsB < TypedStruct.new(
|
121
129
|
[:user_a_id, Integer],
|
122
130
|
[:user_b_name, String],
|
123
|
-
[:
|
131
|
+
[:tweet_id, Integer],
|
124
132
|
[:please_flag, Integer],
|
125
133
|
[:text, String]
|
126
134
|
)
|
@@ -133,7 +141,7 @@ module Wuclan::Twitter::Model
|
|
133
141
|
end
|
134
142
|
# Key on retweeting_user-user-tweet_id
|
135
143
|
def num_key_fields() 3 end
|
136
|
-
def numeric_id_fields() [:user_a_id, :
|
144
|
+
def numeric_id_fields() [:user_a_id, :tweet_id] ; end
|
137
145
|
#
|
138
146
|
# If there's no user we'll assume this
|
139
147
|
# is a retweet and not an rtwhore.
|
@@ -146,7 +154,7 @@ module Wuclan::Twitter::Model
|
|
146
154
|
class ARetweetsBId < TypedStruct.new(
|
147
155
|
[:user_a_id, Integer],
|
148
156
|
[:user_b_id, Integer],
|
149
|
-
[:
|
157
|
+
[:tweet_id, Integer],
|
150
158
|
[:please_flag, Integer],
|
151
159
|
[:text, String]
|
152
160
|
)
|
@@ -160,7 +168,7 @@ module Wuclan::Twitter::Model
|
|
160
168
|
|
161
169
|
# Key on retweeting_user-user-tweet_id
|
162
170
|
def num_key_fields() 3 end
|
163
|
-
def numeric_id_fields() [:user_a_id, :user_b_id, :
|
171
|
+
def numeric_id_fields() [:user_a_id, :user_b_id, :tweet_id] ; end
|
164
172
|
|
165
173
|
#
|
166
174
|
# If there's no user we'll assume this
|
@@ -31,6 +31,13 @@ module Wuclan::Twitter::Model
|
|
31
31
|
def numeric_id_fields() [:id, :twitter_user_id, :in_reply_to_status_id, :in_reply_to_user_id] ; end
|
32
32
|
end
|
33
33
|
|
34
|
+
class DeleteTweet < TypedStruct.new(
|
35
|
+
[:id, Integer ],
|
36
|
+
[:created_at, Bignum ],
|
37
|
+
[:twitter_user_id, Integer ]
|
38
|
+
)
|
39
|
+
include ModelCommon
|
40
|
+
end
|
34
41
|
|
35
42
|
#
|
36
43
|
# SearchTweet
|
@@ -30,6 +30,8 @@ module Wuclan::Twitter::Model
|
|
30
30
|
|
31
31
|
end
|
32
32
|
|
33
|
+
|
34
|
+
|
33
35
|
#
|
34
36
|
# Fundamental information on a user.
|
35
37
|
#
|
@@ -57,6 +59,9 @@ module Wuclan::Twitter::Model
|
|
57
59
|
def tweets_per_day() tweets_count.to_i / days_since_created end
|
58
60
|
end
|
59
61
|
|
62
|
+
|
63
|
+
|
64
|
+
|
60
65
|
#
|
61
66
|
# Outside of a users/show page, when a user is mentioned
|
62
67
|
# only this subset of fields appear.
|
@@ -14,8 +14,10 @@ module Wuclan
|
|
14
14
|
autoload :TwitterFriendsIdsRequest, 'wuclan/twitter/scrape/twitter_ff_ids_request'
|
15
15
|
autoload :TwitterUserTimelineRequest, 'wuclan/twitter/scrape/twitter_timeline_request'
|
16
16
|
autoload :TwitterPublicTimelineRequest, 'wuclan/twitter/scrape/twitter_timeline_request'
|
17
|
+
autoload :TwitterStreamRequest, 'wuclan/twitter/scrape/twitter_stream_request'
|
17
18
|
autoload :JsonUserWithTweet, 'wuclan/twitter/scrape/twitter_json_response'
|
18
19
|
autoload :JsonTweetWithUser, 'wuclan/twitter/scrape/twitter_json_response'
|
20
|
+
autoload :JsonDeleteTweet, 'wuclan/twitter/scrape/twitter_json_response'
|
19
21
|
|
20
22
|
end
|
21
23
|
end
|
@@ -13,8 +13,7 @@ module Wuclan::Twitter::Scrape
|
|
13
13
|
|
14
14
|
def parse *args, &block
|
15
15
|
handle_special_cases!(*args, &block) or return
|
16
|
-
|
17
|
-
yield self
|
16
|
+
super *args
|
18
17
|
end
|
19
18
|
|
20
19
|
def handle_special_cases! *args, &block
|
@@ -26,10 +25,14 @@ module Wuclan::Twitter::Scrape
|
|
26
25
|
end
|
27
26
|
end
|
28
27
|
|
29
|
-
class
|
30
|
-
class
|
31
|
-
class
|
28
|
+
class User < TwitterUserRequest ; include OldSkoolRequest ; end
|
29
|
+
class Followers < TwitterFollowersRequest ; include OldSkoolRequest ; end
|
30
|
+
class Friends < TwitterFriendsRequest ; include OldSkoolRequest ; end
|
31
|
+
class FollowersIds < TwitterFollowersIdsRequest ; include OldSkoolRequest ; end
|
32
|
+
class FriendsIds < TwitterFriendsIdsRequest ; include OldSkoolRequest ; end
|
33
|
+
class Favorites < TwitterFavoritesRequest ; include OldSkoolRequest ; end
|
32
34
|
class UserTimeline < TwitterUserTimelineRequest ; include OldSkoolRequest ; end
|
35
|
+
|
33
36
|
class Bogus < BadRecord ;
|
34
37
|
def parse suffix=nil, *args
|
35
38
|
errors = suffix.split('-')
|
@@ -27,10 +27,11 @@ module Wuclan
|
|
27
27
|
# unpacks the raw API response, yielding all the relationships.
|
28
28
|
#
|
29
29
|
def parse *args, &block
|
30
|
+
return unless healthy?
|
30
31
|
parsed_contents.each do |user_b_id|
|
31
32
|
user_b_id = "%010d"%user_b_id.to_i
|
32
33
|
# B is a follower: B follows user.
|
33
|
-
yield AFollowsB.new(user_b_id,
|
34
|
+
yield AFollowsB.new(user_b_id, twitter_user_id)
|
34
35
|
end
|
35
36
|
end
|
36
37
|
end
|
@@ -62,10 +63,11 @@ module Wuclan
|
|
62
63
|
# unpacks the raw API response, yielding all the relationships.
|
63
64
|
#
|
64
65
|
def parse *args, &block
|
66
|
+
return unless healthy?
|
65
67
|
parsed_contents.each do |user_b_id|
|
66
68
|
user_b_id = "%010d"%user_b_id.to_i
|
67
69
|
# B is a friend: user follows B
|
68
|
-
yield AFollowsB.new(
|
70
|
+
yield AFollowsB.new(twitter_user_id, user_b_id)
|
69
71
|
end
|
70
72
|
end
|
71
73
|
end
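The direction of each edge is the part that is easy to get backwards, so here is a small standalone sketch of the two cases fixed in this hunk. @FollowEdge@ is a hypothetical stand-in for @AFollowsB@, and zero-padding both ids (rather than only @user_b_id@) is an assumption made for symmetry.

```ruby
# Hypothetical stand-in for Wuclan::Twitter::Model::AFollowsB:
# user_a follows user_b.
FollowEdge = Struct.new(:user_a_id, :user_b_id)

# Same zero-padding format string as the parse methods above.
def zeropad_id(id)
  "%010d" % id.to_i
end

# Followers response: each listed B follows the requested user.
def follower_edges(user_id, follower_ids)
  follower_ids.map { |b| FollowEdge.new(zeropad_id(b), zeropad_id(user_id)) }
end

# Friends response: the requested user follows each listed B.
def friend_edges(user_id, friend_ids)
  friend_ids.map { |b| FollowEdge.new(zeropad_id(user_id), zeropad_id(b)) }
end

p follower_edges(23, [7]).first.to_a
p friend_edges(23, [7]).first.to_a
```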
|
@@ -20,6 +20,7 @@ module Wuclan::Twitter::Scrape
     # generate all the contained TwitterXXX objects
     #
     def each
+      return unless healthy?
       if is_partial?
         yield user
       else
@@ -38,10 +39,10 @@ module Wuclan::Twitter::Scrape
     # This method tries to guess, based on the fields in the raw_user, which it has.
     #
     def is_partial?
+      p(raw) if !raw_user
       not raw_user.include?('friends_count')
     end
 
-
     def tweet
       Tweet.from_hash raw_tweet if raw_tweet
     end
@@ -66,7 +67,7 @@ module Wuclan::Twitter::Scrape
     #
     def fix_raw_user!
       return unless raw_user
-      raw_user['scraped_at'] = self.moreinfo['scraped_at']
+      raw_user['scraped_at'] = ModelCommon.flatten_date(self.moreinfo['scraped_at'])
       raw_user['created_at'] = ModelCommon.flatten_date(raw_user['created_at'])
       raw_user['id'] = ModelCommon.zeropad_id( raw_user['id'])
       raw_user['protected'] = ModelCommon.unbooleanize(raw_user['protected'])
@@ -88,7 +89,7 @@ module Wuclan::Twitter::Scrape
       raw_tweet['created_at'] = ModelCommon.flatten_date(raw_tweet['created_at'])
       raw_tweet['favorited'] = ModelCommon.unbooleanize(raw_tweet['favorited'])
       raw_tweet['truncated'] = ModelCommon.unbooleanize(raw_tweet['truncated'])
-      raw_tweet['twitter_user_id'] = ModelCommon.zeropad_id(
+      raw_tweet['twitter_user_id'] = ModelCommon.zeropad_id( raw_user['id'] )
       raw_tweet['in_reply_to_user_id'] = ModelCommon.zeropad_id( raw_tweet['in_reply_to_user_id']) unless raw_tweet['in_reply_to_user_id'].blank? || (raw_tweet['in_reply_to_user_id'].to_i == 0)
       raw_tweet['in_reply_to_status_id'] = ModelCommon.zeropad_id( raw_tweet['in_reply_to_status_id']) unless raw_tweet['in_reply_to_status_id'].blank? || (raw_tweet['in_reply_to_status_id'].to_i == 0)
       Wukong.encode_components raw_tweet, 'text', 'in_reply_to_screen_name'
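Reviewer's sketch of the kind of flattening the `fix_raw_user!` and tweet-fixing hunks rely on. The bodies of `ModelCommon.flatten_date`, `zeropad_id` and `unbooleanize` are not shown in this diff, so everything below is an assumption: `ModelCommonSketch` is a hypothetical stand-in, the 14-digit UTC date format and the 1/0 encoding are guesses at the intent, and only the ten-digit padding mirrors a format (`"%010d"`) that does appear elsewhere in the diff.

```ruby
require 'time'

# Hypothetical stand-ins for the ModelCommon helpers named in the diff.
module ModelCommonSketch
  # Assumed behavior: turn a parseable date string into a sortable
  # YYYYMMDDHHMMSS string in UTC.
  def self.flatten_date(str)
    Time.parse(str).utc.strftime("%Y%m%d%H%M%S")
  end

  # Zero-pad a numeric id to ten digits.
  def self.zeropad_id(id)
    "%010d" % id.to_i
  end

  # Assumed behavior: encode true/false as 1/0 for flat-file output.
  def self.unbooleanize(bool)
    bool ? 1 : 0
  end
end

# Normalize a made-up raw record the way the fix-up methods do, field by field.
raw_tweet = { 'id' => 1234, 'created_at' => '2009-11-02 15:04:05 UTC', 'favorited' => false }
raw_tweet['id']         = ModelCommonSketch.zeropad_id(raw_tweet['id'])
raw_tweet['created_at'] = ModelCommonSketch.flatten_date(raw_tweet['created_at'])
raw_tweet['favorited']  = ModelCommonSketch.unbooleanize(raw_tweet['favorited'])
p raw_tweet
```

Seen this way, the `scraped_at` change in the hunk above brings that field into the same flattened form as `created_at`, so the two timestamps can be compared directly.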
@@ -96,9 +97,7 @@ module Wuclan::Twitter::Scrape
     end
   end
 
-
   class JsonUserWithTweet < JsonUserTweetPair
-
     def raw_tweet
       return @raw_tweet if @raw_tweet
       @raw_tweet = raw['status']
@@ -112,7 +111,6 @@ end
 
 
 class JsonTweetWithUser < JsonUserTweetPair
-
   def raw_tweet
     @raw_tweet ||= raw
   end
@@ -122,3 +120,38 @@ class JsonTweetWithUser < JsonUserTweetPair
     @raw_user
   end
 end
+
+
+
+class JsonDeleteTweet
+  attr_accessor :raw, :moreinfo, :scraped_at
+  def initialize raw, moreinfo={}
+    self.raw = raw
+    self.moreinfo = moreinfo
+    self.scraped_at = nil # TODO -- extract this from neighbors
+  end
+
+  # Extracted JSON should be a hash
+  def healthy?()
+    raw && raw.is_a?(Hash)
+  end
+
+  def delete_tweet
+    Wuclan::Twitter::Model::DeleteTweet.new(
+      raw['delete']['status']['id'],
+      self.scraped_at,
+      raw['delete']['status']['user_id']
+      ) rescue nil
+  end
+
+  def each *args, &block
+    return unless healthy?
+    yield delete_tweet
+  end
+
+  # true if this model looks like it will parse the given JSON
+  def self.parses? hsh
+    # KLUDGE
+    hsh =~ /"delete":\{/
+  end
+end
@@ -21,6 +21,7 @@ class TwitterRequestStream < Monkeyshines::RequestStream::SimpleRequestStream
   # can be a screen_name, but we need the numeric ID for followers_request's, etc.
   def each_request twitter_user_id, *args
     user_req = TwitterUserRequest.new(twitter_user_id)
+    # this performs the request in-place: req holds the fulfilled response
     yield(user_req)
     return unless user_req.healthy?
     twitter_user_id = user_req.parsed_contents['id'].to_i if (user_req.parsed_contents['id'].to_i > 0)
@@ -0,0 +1,44 @@
+module Wuclan
+  module Twitter
+    module Scrape
+
+      class TwitterStreamRequest < Struct.new(:contents)
+        # Contents are JSON
+        include Monkeyshines::RawJsonContents
+
+        # self.hard_request_limit = 1
+        # def make_url() "http://stream.twitter.com/1/statuses/sample.json" end
+
+        # Extracted JSON should be a hash
+        def healthy?()
+          parsed_contents && parsed_contents.is_a?(Hash)
+        end
+
+        def parsed_as_delete_tweet *args, &block
+          p parsed_contents
+          json_obj = JsonDeleteTweet.new(parsed_contents)
+          json_obj.each(&block)
+        end
+
+        # Extract user and tweet
+        def parsed_as_tweet *args, &block
+          json_obj = JsonTweetWithUser.new(
+            parsed_contents, 'scraped_at' => parsed_contents['created_at'])
+          json_obj.each(&block)
+        end
+
+        #
+        # unpacks the raw API response, yielding all the interesting objects
+        # and relationships within.
+        #
+        def parse *args, &block
+          return unless healthy?
+          return parsed_as_delete_tweet(*args, &block) if JsonDeleteTweet.parses?(contents)
+          # else
+          parsed_as_tweet(*args, &block)
+        end
+      end
+
+    end
+  end
+end
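Reviewer's sketch of the dispatch that `TwitterStreamRequest#parse` performs: each stream line is either a delete notice or a tweet, and anything that does not parse to a hash is skipped. This version deliberately swaps the regexp test on the raw string (marked KLUDGE in the diff) for a key check on the parsed hash. It uses only Ruby's standard `json` library; `classify_stream_line` and its return shape are hypothetical, not wuclan API.

```ruby
require 'json'

# Classify one line of stream output.
# Returns [:delete, status_id, user_id], [:tweet, parsed_hash], or nil
# when the line is not a JSON object (the equivalent of healthy? failing).
def classify_stream_line(line)
  parsed = begin
    JSON.parse(line)
  rescue JSON::ParserError
    nil
  end
  return nil unless parsed.is_a?(Hash)

  notice = parsed['delete']
  if notice.is_a?(Hash)
    # Delete notices nest the ids under delete -> status.
    status = notice['status'] || {}
    [:delete, status['id'], status['user_id']]
  else
    [:tweet, parsed]
  end
end

# Made-up sample lines, shaped like the fields the diff reads.
lines = [
  '{"delete":{"status":{"id":1234,"user_id":3}}}',
  '{"id":99,"text":"hi","user":{"id":3}}',
  'not json'
]
lines.each { |line| p classify_stream_line(line) }
```

Checking the parsed structure costs nothing extra here, since the line has to be parsed anyway, and it avoids false matches when a tweet's text happens to contain the string `"delete":{`.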
data/wuclan.gemspec
CHANGED
@@ -5,11 +5,11 @@
 
 Gem::Specification.new do |s|
   s.name = %q{wuclan}
-  s.version = "0.2.
+  s.version = "0.2.1"
 
   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
   s.authors = ["Philip (flip) Kromer"]
-  s.date = %q{2009-
+  s.date = %q{2009-11-02}
   s.description = %q{Massive-scale social network analysis. Nothing to f with.}
   s.email = %q{flip@infochimps.org}
   s.extra_rdoc_files = [
@@ -35,6 +35,7 @@ Gem::Specification.new do |s|
     "examples/twitter/old/scrape_twitter_trending.rb",
     "examples/twitter/parse/parse_twitter_requests.rb",
     "examples/twitter/parse/parse_twitter_search_requests.rb",
+    "examples/twitter/parse/parse_twitter_stream_requests.rb",
     "examples/twitter/scrape_twitter_api/scrape_twitter_api.rb",
     "examples/twitter/scrape_twitter_api/seed.tsv",
     "examples/twitter/scrape_twitter_api/start_cache_twitter.sh",
@@ -49,9 +50,13 @@ Gem::Specification.new do |s|
     "examples/twitter/scrape_twitter_hosebird/scrape_twitter_hosebird.rb",
     "examples/twitter/scrape_twitter_hosebird/test_spewer.rb",
     "examples/twitter/scrape_twitter_hosebird/twitter_hosebird_god.yaml",
+    "examples/twitter/scrape_twitter_search/README.textile",
+    "examples/twitter/scrape_twitter_search/README.textile",
     "examples/twitter/scrape_twitter_search/dump_twitter_search_jobs.rb",
+    "examples/twitter/scrape_twitter_search/edamame_global_config-template.yaml",
     "examples/twitter/scrape_twitter_search/load_twitter_search_jobs.rb",
     "examples/twitter/scrape_twitter_search/scrape_twitter_search.rb",
+    "examples/twitter/scrape_twitter_search/seed.tsv",
     "examples/twitter/scrape_twitter_search/twitter_search_daemons.god",
     "lib/old/twitter_api.rb",
     "lib/wuclan.rb",
@@ -102,6 +107,7 @@ Gem::Specification.new do |s|
     "lib/wuclan/twitter/model/tweet/tweet_token.rb",
     "lib/wuclan/twitter/model/twitter_user.rb",
     "lib/wuclan/twitter/model/twitter_user/style/color_to_hsv.rb",
+    "lib/wuclan/twitter/parse.rb",
     "lib/wuclan/twitter/parse/ff_ids_parser.rb",
     "lib/wuclan/twitter/parse/friends_followers_parser.rb",
     "lib/wuclan/twitter/parse/generic_json_parser.rb",
@@ -123,6 +129,7 @@ Gem::Specification.new do |s|
     "lib/wuclan/twitter/scrape/twitter_search_job.rb",
     "lib/wuclan/twitter/scrape/twitter_search_request.rb",
     "lib/wuclan/twitter/scrape/twitter_search_request_stream.rb",
+    "lib/wuclan/twitter/scrape/twitter_stream_request.rb",
     "lib/wuclan/twitter/scrape/twitter_timeline_request.rb",
     "lib/wuclan/twitter/scrape/twitter_user_request.rb",
     "spec/spec_helper.rb",
@@ -151,6 +158,7 @@ Gem::Specification.new do |s|
     "examples/twitter/old/scrape_twitter_trending.rb",
     "examples/twitter/parse/parse_twitter_requests.rb",
     "examples/twitter/parse/parse_twitter_search_requests.rb",
+    "examples/twitter/parse/parse_twitter_stream_requests.rb",
     "examples/twitter/scrape_twitter_api/scrape_twitter_api.rb",
     "examples/twitter/scrape_twitter_api/support/make_request_stats.rb",
     "examples/twitter/scrape_twitter_api/support/make_requests_by_id_and_date_1.rb",
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: wuclan
 version: !ruby/object:Gem::Version
-  version: 0.2.
+  version: 0.2.1
 platform: ruby
 authors:
 - Philip (flip) Kromer
@@ -9,7 +9,7 @@ autorequire:
 bindir: bin
 cert_chain: []
 
-date: 2009-
+date: 2009-11-02 00:00:00 -06:00
 default_executable:
 dependencies:
 - !ruby/object:Gem::Dependency
@@ -70,6 +70,7 @@ files:
 - examples/twitter/old/scrape_twitter_trending.rb
 - examples/twitter/parse/parse_twitter_requests.rb
 - examples/twitter/parse/parse_twitter_search_requests.rb
+- examples/twitter/parse/parse_twitter_stream_requests.rb
 - examples/twitter/scrape_twitter_api/scrape_twitter_api.rb
 - examples/twitter/scrape_twitter_api/seed.tsv
 - examples/twitter/scrape_twitter_api/start_cache_twitter.sh
@@ -84,9 +85,12 @@ files:
 - examples/twitter/scrape_twitter_hosebird/scrape_twitter_hosebird.rb
 - examples/twitter/scrape_twitter_hosebird/test_spewer.rb
 - examples/twitter/scrape_twitter_hosebird/twitter_hosebird_god.yaml
+- examples/twitter/scrape_twitter_search/README.textile
 - examples/twitter/scrape_twitter_search/dump_twitter_search_jobs.rb
+- examples/twitter/scrape_twitter_search/edamame_global_config-template.yaml
 - examples/twitter/scrape_twitter_search/load_twitter_search_jobs.rb
 - examples/twitter/scrape_twitter_search/scrape_twitter_search.rb
+- examples/twitter/scrape_twitter_search/seed.tsv
 - examples/twitter/scrape_twitter_search/twitter_search_daemons.god
 - lib/old/twitter_api.rb
 - lib/wuclan.rb
@@ -136,6 +140,7 @@ files:
 - lib/wuclan/twitter/model/tweet/tweet_token.rb
 - lib/wuclan/twitter/model/twitter_user.rb
 - lib/wuclan/twitter/model/twitter_user/style/color_to_hsv.rb
+- lib/wuclan/twitter/parse.rb
 - lib/wuclan/twitter/parse/ff_ids_parser.rb
 - lib/wuclan/twitter/parse/friends_followers_parser.rb
 - lib/wuclan/twitter/parse/generic_json_parser.rb
@@ -157,6 +162,7 @@ files:
 - lib/wuclan/twitter/scrape/twitter_search_job.rb
 - lib/wuclan/twitter/scrape/twitter_search_request.rb
 - lib/wuclan/twitter/scrape/twitter_search_request_stream.rb
+- lib/wuclan/twitter/scrape/twitter_stream_request.rb
 - lib/wuclan/twitter/scrape/twitter_timeline_request.rb
 - lib/wuclan/twitter/scrape/twitter_user_request.rb
 - spec/spec_helper.rb
@@ -207,6 +213,7 @@ test_files:
 - examples/twitter/old/scrape_twitter_trending.rb
 - examples/twitter/parse/parse_twitter_requests.rb
 - examples/twitter/parse/parse_twitter_search_requests.rb
+- examples/twitter/parse/parse_twitter_stream_requests.rb
 - examples/twitter/scrape_twitter_api/scrape_twitter_api.rb
 - examples/twitter/scrape_twitter_api/support/make_request_stats.rb
 - examples/twitter/scrape_twitter_api/support/make_requests_by_id_and_date_1.rb