
wuclan 0.2.0 → 0.2.1


Files changed (24)

data/README.textile +81 -6
data/examples/twitter/parse/parse_twitter_requests.rb +24 -18
data/examples/twitter/parse/parse_twitter_search_requests.rb +12 -16
data/examples/twitter/parse/parse_twitter_stream_requests.rb +40 -0
data/examples/twitter/scrape_twitter_search/README.textile +95 -0
data/examples/twitter/scrape_twitter_search/edamame_global_config-template.yaml +17 -0
data/examples/twitter/scrape_twitter_search/scrape_twitter_search.rb +1 -1
data/examples/twitter/scrape_twitter_search/seed.tsv +19 -0
data/examples/twitter/scrape_twitter_search/twitter_search_daemons.god +38 -18
data/lib/wuclan/twitter.rb +2 -0
data/lib/wuclan/twitter/model.rb +1 -0
data/lib/wuclan/twitter/model/relationship.rb +34 -26
data/lib/wuclan/twitter/model/tweet.rb +7 -0
data/lib/wuclan/twitter/model/twitter_user.rb +5 -0
data/lib/wuclan/twitter/parse.rb +3 -0
data/lib/wuclan/twitter/parse/twitter_search_parse.rb +1 -0
data/lib/wuclan/twitter/scrape.rb +2 -0
data/lib/wuclan/twitter/scrape/old_skool_request_classes.rb +8 -5
data/lib/wuclan/twitter/scrape/twitter_ff_ids_request.rb +4 -2
data/lib/wuclan/twitter/scrape/twitter_json_response.rb +39 -6
data/lib/wuclan/twitter/scrape/twitter_request_stream.rb +1 -0
data/lib/wuclan/twitter/scrape/twitter_stream_request.rb +44 -0
data/wuclan.gemspec +10 -2
metadata +9 -2

data/README.textile CHANGED

@@ -1,9 +1,29 @@
-h2. Help!
+Wuclan uses "Wukong":http://mrflip.github.com/wukong (Hadoop massive-data processing made easy) and "Monkeyshines":http://mrflip.github.com/monkeyshines (massive-scale directed scraper) to grok the deep structure of social networks. It is designed to scrape in a way that is respectful of the terms and technical limits of each site while being aggressive and efficient with your resources. We use it in practice to collect and analyze social graphs as large as 50 million nodes, 1 billion edges, and 500 GB of raw data -- all of it actual data extracted in compliance with the site's terms of service.
-Send Wuclan questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code
+Currently wuclan handles:
-h3. lib/wuclan/models
+* Twitter -- API
+* Twitter -- Search
+* Twitter -- Hosebird
+* Last.fm
+* Opensocial
+<notextile><div class="toggle"></notextile>
+h2. Why?
+APIs are nice and all, but they prevent any insight into a) global properties, or b) deep structure.  You can't find global word frequency and dispersion, or average clustering coefficient, or calculate pagerank, or determine weighted-shortest-paths connections between two people through an API call.  But with a 10-machine Hadoop cluster and a good-sized collection of data, you can (and wuclan has scripts to help answer many of those questions).
+Wuclan is strictly meant for such massive-scale investigations. Unless you're planning to do your final analysis on either Hadoop or an enterprise-grade database system, it's probably not worth the hassle.
+<notextile></div><div class="toggle"></notextile>
+h2. Wuclan: Scraping
+is almost ready for public use. Check back shortly.
+h3. lib/wuclan/*/models
 Defines the Wukong objects we'll most often use
@@ -12,9 +32,7 @@ Defines the Wukong objects we'll most often use
 * TwitterUser
 * TwitterUserProfiles
-h3. lib/wuclan/request
+h3. lib/wuclan/*/request
 * Request -- the basic request metadata
@@ -25,4 +43,61 @@ h3. lib/wuclan/request
   ensures that the request is left alone while recordizing.
+<notextile></div><div class="toggle"></notextile>
+h2. Wuclan: Analysis
+Actually, most of this still lives in the imw_twitter_friends repo.
+<notextile></div><div class="toggle"></notextile>
+h2. Install
+** "Main Install and Setup Documentation":http://mrflip.github.com/edamame/INSTALL.html **
+h3. Get the code
+We're still actively developing edamame.  The newest version is available via "Git":http://git-scm.com on "github:":http://github.com/mrflip/edamame
+pre. $ git clone git://github.com/mrflip/edamame
+A gem is available from "gemcutter:":http://gemcutter.org/gems/edamame
+pre. $ sudo gem install edamame --source=http://gemcutter.org
+(don't use the gems.github.com version -- it's way out of date.)
+You can instead download this project in either "zip":http://github.com/mrflip/edamame/zipball/master or "tar":http://github.com/mrflip/edamame/tarball/master formats.
+h3. Get the Dependencies
+To finish setting up, see the "detailed setup instructions":http://mrflip.github.com/edamame/INSTALL.html and then read the "usage notes":http://mrflip.github.com/edamame/usage.html
+* "beanstalkd 1.3,":http://xph.us/dist/beanstalkd/ "libevent 1.4,":http://monkey.org/~provos/libevent/ and "beanstalk-client":http://github.com/dustin/beanstalk-client
+* "Tokyo Tyrant,":http://tokyocabinet.sourceforge.net/tyrantdoc/ "Tokyo Tyrant Ruby libs,":http://tokyocabinet.sourceforge.net/tyrantrubydoc/ "Tokyo Cabinet,":http://tokyocabinet.sourceforge.net and "Tokyo Cabinet Ruby libs":http://tokyocabinet.sourceforge.net/tyrantdoc/
+* Gems: "wukong":http://mrflip.github.com/wukong and "monkeyshines":http://mrflip.github.com/monkeyshines
+See the "Detailed install instructions":http://mrflip.github.com/edamame/INSTALL.html (it also has hints about installing Tokyo*, Beanstalkd and friends).
+<notextile></div><div class="toggle"></notextile>
 h3. lib/wuclan/
+---------------------------------------------------------------------------
+<notextile><div class="toggle"></notextile>
+h2. More info
+There are many useful examples in the examples/ directory.
+h3. Credits
+wuclan was written by "Philip (flip) Kromer":http://mrflip.com (flip@infochimps.org / "@mrflip":http://twitter.com/mrflip) for the "infochimps project":http://infochimps.org
+h3. Help!
+Send wuclan questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code
+<notextile></div></notextile>

data/examples/twitter/parse/parse_twitter_requests.rb CHANGED

@@ -5,23 +5,24 @@ require 'wukong'
 require 'monkeyshines'
 require 'wuclan/twitter'
-# if you're anyone but original author this next require is useless but harmless.
-require 'wuclan/twitter/scrape/old_skool_request_classes'
 # un-namespace request classes.
 include Wuclan::Twitter::Scrape
 include Wuclan::Twitter::Model
+# if you're anyone but original author this next require is useless but harmless.
+require 'wuclan/twitter/scrape/old_skool_request_classes'
 #
+# Incoming objects are Wuclan::Twitter::Scrape requests.
 #
-# Instantiate each incoming request.
-# Stream out the contained classes it generates.
-#
+# Their #parse method disgorges a stream of Wuclan::Twitter::Model objects, as
+# few or as many as found.  For example, a twitter_user_request will presumably
+# have a twitter_user record if it is healthy, but may not have a tweet (if the
+# user hasn't ever tweeted) and might not have profile or style info (if the
+# user is protected).
 #
 class TwitterRequestParser < Wukong::Streamer::StructStreamer
   def process request, *args, &block
     request.parse(*args) do |obj|
-      next if obj.is_a? BadRecord
       yield obj.to_flat(false)
     end
   end
@@ -31,29 +32,34 @@ end
 # We want to record each individual state of the resource, with the last-seen of
 # its timestamps (if there are many). So if we saw
 #
-#     rsrc  id   screen_name   followers_count  friends_count  (... more)
-#     user  23   skidoo        47               61
-#     user  23   skidoo        48               62
-#     user  23   skidoo        48               62
-#     user  23   skidoo        52               62
-#     user  23   skidoo        52               63
+#     rsrc  id   screen_name   followers_count  friends_count  (...) scraped_at
+#     user  23   skidoo        47               61                   20090608
+#     user  23   skidoo        48               62                   20090802
+#     user  23   skidoo        48               62                   20090901
+#     user  23   skidoo        52               62                   20090920
+#     user  23   skidoo        52               62                   20090922
+#     user  23   skidoo        52               63                   20090923
+#
+# we would only keep
 #
+#     user  23   skidoo        47               61                   20090608
+#     user  23   skidoo        48               62                   20090901
+#     user  23   skidoo        52               62                   20090922
+#     user  23   skidoo        52               63                   20090923
 #
 class TwitterRequestUniqer < Wukong::Streamer::UniqByLastReducer
   include Wukong::Streamer::StructRecordizer
   attr_accessor :uniquer_count
   #
-  #
-  #
+  # FIXME -- move this into the models themselves.
   #
   # for immutable objects we can just work off their ID.
   #
   # for mutable objects we want to record each unique state: all the fields
   # apart from the scraped_at timestamp.
   #
-  def get_key obj
+  def get_key obj, *_
     case obj
     when Tweet
       obj.id
@@ -71,7 +77,7 @@ class TwitterRequestUniqer < Wukong::Streamer::UniqByLastReducer
     super *args
   end
-  def accumulate obj
+  def accumulate obj, *_
     self.uniquer_count      += 1
     self.final_value = [self.uniquer_count, obj.to_flat].flatten
   end
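
The uniqing rule described in the comments above can be sketched in plain Ruby without Wukong. This is only an illustration under stated assumptions: `UserState`, `state_key`, and `uniq_by_last_seen` are hypothetical names, not part of wuclan, and "last-seen" is taken to mean the largest `scraped_at` observed for each distinct state.

```ruby
# Hypothetical stand-in for a flattened record; the real models are
# Wuclan::Twitter::Model structs, which this sketch does not load.
UserState = Struct.new(:rsrc, :id, :screen_name,
                       :followers_count, :friends_count, :scraped_at)

# Key for a mutable object: every field apart from the scraped_at timestamp.
def state_key(record)
  record.to_a[0...-1]
end

# Keep one record per distinct state, carrying the latest scraped_at seen for it.
def uniq_by_last_seen(records)
  latest = {}
  records.each do |record|
    key  = state_key(record)
    kept = latest[key]
    latest[key] = record if kept.nil? || record.scraped_at >= kept.scraped_at
  end
  latest.values.sort_by(&:scraped_at)
end

rows = [
  UserState.new('user', 23, 'skidoo', 47, 61, 20090608),
  UserState.new('user', 23, 'skidoo', 48, 62, 20090802),
  UserState.new('user', 23, 'skidoo', 48, 62, 20090901),
  UserState.new('user', 23, 'skidoo', 52, 62, 20090920),
  UserState.new('user', 23, 'skidoo', 52, 62, 20090922),
  UserState.new('user', 23, 'skidoo', 52, 63, 20090923),
]

uniq_by_last_seen(rows).each { |r| puts r.to_a.join("\t") }
```

Because the key is the whole state, a state that disappears and later recurs collapses into a single row in this sketch; whether that is desirable depends on the analysis.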

data/examples/twitter/parse/parse_twitter_search_requests.rb CHANGED

@@ -1,28 +1,24 @@
 #!/usr/bin/env ruby
-#$: << ENV['WUKONG_PATH']
 require 'rubygems'
 require 'wukong'
-require 'monkeyshines'
-require 'wuclan/twitter'
-require 'wuclan/twitter/scrape/twitter_search_request'
-require 'wuclan/twitter/parse/twitter_search_parse'
+require 'wuclan/twitter';
+require 'wuclan/twitter/parse';
 include Wuclan::Twitter::Scrape
-#
-#
-# Instantiate each incoming request.
-# Stream out the contained classes it generates.
-#
-#
 class TwitterRequestParser < Wukong::Streamer::StructStreamer
+  #
+  # Object: parse thyself.
+  #
   def process request, *args, &block
     request.parse(*args) do |obj|
-      next if obj.is_a? BadRecord
-      yield obj.to_flat(false)
+      next if obj.blank? || obj.is_a?(BadRecord)
+      yield obj
     end
   end
 end
-# This makes the script go.
-Wukong::Script.new(TwitterRequestParser, nil).run
+# Go, script, go!
+Wukong::Script.new(
+  TwitterRequestParser,
+  nil
+  ).run

data/examples/twitter/parse/parse_twitter_stream_requests.rb ADDED

@@ -0,0 +1,40 @@
+#!/usr/bin/env ruby
+require 'rubygems'
+require 'wukong'
+require 'monkeyshines';
+require 'wuclan/twitter';
+require 'wuclan/twitter/scrape/twitter_search_request';
+require 'wuclan/twitter/parse/twitter_search_parse';
+include Wuclan::Twitter::Scrape
+#
+# Twitter stream requests
+#
+#   http://apiwiki.twitter.com/Streaming-API-Documentation
+#
+# Fills a file with JSON status records, one line per status.
+#
+#   {"text":"Hey #bigdata #hadoop geeks: who's missing? @mrflip/bigdata / http://bit.ly/datatweeps","favorited":false,"geo":null,"in_reply_to_screen_name":null,"source":"web","created_at":"Thu Oct 29 09:29:32 +0000 2009","user":{"verified":false,"notifications":null,"profile_text_color":"000000","time_zone":"Central Time (US & Canada)","following":null,"profile_link_color":"0000ff","profile_image_url":"http://a3.twimg.com/profile_images/377919497/FlipCircle-2009-900-trans_normal.png","profile_background_image_url":"http://a3.twimg.com/profile_background_images/2348065/2005Mar-AustinTypeTour-075_-_Rappers_Delight_Raindrop.jpg","description":"Increasing access to free open data, building tools to Organize, Explore and Comprehend massive data sources - http://infochimps.org","location":"iPhone: 30.316122,-97.733817","profile_sidebar_fill_color":"ffffff","screen_name":"mrflip","profile_background_tile":false,"profile_sidebar_border_color":"f0edd8","statuses_count":1307,"followers_count":678,"protected":false,"url":"http://infochimps.org","created_at":"Mon Mar 19 21:08:24 +0000 2007","friends_count":514,"name":"Philip Flip Kromer","geo_enabled":false,"profile_background_color":"BCC0C8","id":1554031,"utc_offset":-21600,"favourites_count":61},"id":5254924802,"in_reply_to_user_id":null,"in_reply_to_status_id":null,"truncated":false}
+#
+# Try it with
+#   twuserpass='name:pass'
+#   curl -s -u $twuserpass http://stream.twitter.com/1/statuses/sample.json > /tmp/sample.json
+#   cat /tmp/sample.json | parse_twitter_stream_requests.rb --map
+#
+class TwitterRequestParser < Wukong::Streamer::RecordStreamer
+  def recordize *args
+    foo = args.first
+    [ TwitterStreamRequest.new(super(*args).first) ]
+  end
+  def process request, *args, &block
+    request.parse(*args) do |obj|
+      next if obj.is_a? BadRecord
+      yield obj.to_flat(false) # if obj.is_a?(DeleteTweet)
+    end
+  end
+end
+# This makes the script go.
+Wukong::Script.new(TwitterRequestParser, nil).run
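
For readers who want to inspect a sample file before running it through Wukong, a minimal sketch of reading one JSON status per line follows. It uses only the standard `json` library; `summarize_status_line` is a hypothetical helper and does not reproduce what `TwitterStreamRequest#parse` does.

```ruby
require 'json'

# Pull a few fields out of one line of streaming output.
# Returns nil for blank lines, unparseable lines, and records without a "user".
def summarize_status_line(line)
  line = line.to_s.strip
  return nil if line.empty?
  status = JSON.parse(line)
  return nil unless status.is_a?(Hash) && status['user'].is_a?(Hash)
  {
    :tweet_id        => status['id'],
    :twitter_user_id => status['user']['id'],
    :screen_name     => status['user']['screen_name'],
    :text            => status['text'],
  }
rescue JSON::ParserError
  nil
end

# An abbreviated version of the status record shown in the comment above.
sample = '{"text":"Hey #bigdata #hadoop geeks","id":5254924802,' \
         '"user":{"id":1554031,"screen_name":"mrflip"}}'

p summarize_status_line(sample)
```

Returning nil rather than raising keeps a single malformed line from stopping a long run; the caller can count or log the skipped lines.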

data/examples/twitter/scrape_twitter_search/README.textile ADDED

@@ -0,0 +1,95 @@
+This is actually less rickety than it seems, but you'll have to hand edit a few paths and config files.  Feel free to suggest a more polite organization of it all.
+h2. Initial setup
+* Install prerequisites using rubygems:
+<pre>
+  sudo gem install htmlentities extlib god
+</pre>
+* check out each of the "monkeyshines":http://github.com/mrflip/monkeyshines, "wukong":http://github.com/mrflip/wukong, "wuclan":http://github.com/mrflip/wuclan, and "edamame":http://github.com/mrflip/edamame repos using git, preferably as neighbors in the same directory.
+* follow instructions from http://mrflip.github.com/edamame/INSTALL.html for beanstalkd and tokyo tyrant
+h2. Find the scraper
+Although you can install wuclan as a gem, I actually recommend installing it from git source:
+<pre>
+  git clone git://github.com/mrflip/wuclan.git
+</pre>
+You'll run the scraper from
+<pre>
+  wuclan/examples/twitter/scrape_twitter_search
+</pre>
+h2. Make the scrape destination
+You will need to set up a landing place for the files, probably by editing the work/ symlink (sorry, this is kludgy and should be fixed).
+The naming scheme I use is good for running scrapers against a lot of targets.  From the wuclan/examples/twitter/scrape_twitter_search directory:
+<pre>
+  mkdir ../../../../data/ripd/com.tw/com.twitter.search
+</pre>
+(this constructs a tree that is a sibling of the wuclan dir).
+Wherever you put the scrape destination,
+* DO NOT add it to your code's git repo
+* exclude it from spotlight indexing and so forth
+h2. Add your search terms to seed.tsv
+To add a search job, edit seed.tsv: add each search phrase and its priority, separated by a tab (lower priority == more important). **Don't** url-encode your query terms.  Spaces will be replaced by plus signs (+), and other non-alphanumerics will be url-encoded.
+h2. Start the queue daemons
+Copy @edamame_global_config-template.yaml@ to your scrape destination, and name it @edamame_global_config.yaml@. Also, edit the @./twitter_search_daemons.god@ file to indicate the scrape destination.
+Use god to start the daemons: @sudo god -c ./twitter_search_daemons.god@ (add the -D flag to debug)
+h2. Load the search terms
+Load this data with
+<pre>
+  ./load_twitter_search_jobs.rb --handle=com.twitter.search --source-filename=./seed.tsv
+</pre>
+You can check it was loaded with
+<pre>
+  /path/to/edamame/bin/edamame-sync  --handle=com.twitter.search --store=:11241 --queue=:11240
+</pre>
+(This unloads all jobs from the transient queue and stuffs them back in from the database).
+Empty all search queues with
+<pre>
+  /path/to/edamame/bin/edamame-nuke  --handle=com.twitter.search --store=:11241 --queue=:11240
+</pre>
+h2. Run the scraper
+<pre>
+  nohup ./scrape_twitter_search.rb --handle=com.twitter.search >> work/log/twitter_search-console.log 2>&1 &
+</pre>
+This will run forever.  Check its progress with
+<pre>
+  tail -f work/log/twitter_search-console.log
+</pre>
+If you want to watch the output files,
+<pre>
+  datename=`date "+%Y%m%d"` ; tail -f work/$datename/* | cut -c 1-2000
+</pre>
+Be careful dumping the output files to screen -- each line can be tens of thousands of characters long and will lock your terminal right up.

data/examples/twitter/scrape_twitter_search/edamame_global_config-template.yaml ADDED

@@ -0,0 +1,17 @@
+--- # -*- YAML -*-
+#
+# Save this file in your god dir, *then* change the settings below.
+# Make sure your version control system is set to ignore the file.
+#
+:email:
+  :domain:              your.domain.com
+  :username:            robot@your.domain.com
+  :password:            YOURPASSWORD
+  :to:                  people_who_can_fix_errors@your.domain.com
+  :to_name:             People who can fix the scraper
+# these apply to all processes
+:god_process:
+  :flapping_notify:     default

data/examples/twitter/scrape_twitter_search/scrape_twitter_search.rb CHANGED

@@ -34,7 +34,7 @@ loop do
         :dest    => { :type  => :chunked_flat_file_store, :rootdir => WORK_DIR, :filemode => 'a' },
         # :dest    => { :type  => :flat_file_store, :filename => WORK_DIR+"/test_output.tsv" },
         # :fetcher => { :type => TwitterSearchFakeFetcher },
-        :sleep_time  => 1 ,
+        :sleep_time  => 1.25 ,
       })
     Log.info "Starting a run!"
     scraper.run

data/examples/twitter/scrape_twitter_search/seed.tsv ADDED

@@ -0,0 +1,19 @@
+# See the readme for instructions.
+# To add a search job, put the search phrase and its priority, separated by a
+# tab (Lower priority == more important).
+red sox            	  1000
+yankees            	  1000
+# You can recycle the output of dump_twitter_search_jobs
+hadoop             	    50	 0.103063874053513	4985477488	    3852	9700.94447529118952	20091019021751
+infochimp          	    50	 0.106963481416675	4827288395	      62	9347.39116701332932	20091019022759
+infochimps         	    50	 0.102891460905350	4922555326	     575	9717.31832415175086	20091019025808
+semantic           	    50	 0.103387063739869	4985327841	    4400	9670.39361100528913	20091019021435
+semanticweb        	    50	 0.102956390169747	4986072386	    3156	9711.01359700656940	20091019025943
+# These will quickly generate a buttload of data
+# RT               	110000	 6.977299880525690	4986408447	 9628514	 115.87012880821899	20091019032753
+# http             	110000	28.312757201646100	4986411833	70327665	   6.90163844186046	20091019032825
+# twitter          	110000	 1.319672131147540	4986376554	 7432567	 733.71915915527904	20091019032511
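
The seed format above (phrase, tab, priority, optional extra columns) and the encoding rule from the README can be sketched as follows. `parse_seed_line` and `encode_query` are hypothetical helpers, not the loader wuclan ships, and treating a leading `#` as a comment marker is an assumption read off the file itself. The standard library's form encoding happens to turn spaces into plus signs and percent-encode most other non-alphanumerics (it leaves `*`, `-`, `.` and `_` alone).

```ruby
require 'uri'

# Parse one seed.tsv line into [phrase, priority].
# Returns nil for blank lines and for lines starting with '#'.
# Any columns after the priority are ignored.
def parse_seed_line(line)
  return nil if line.strip.empty? || line.start_with?('#')
  phrase, priority, *_rest = line.chomp.split("\t").map(&:strip)
  [phrase, priority.to_i]
end

# Encode a raw search phrase for use in a query string.
def encode_query(phrase)
  URI.encode_www_form_component(phrase)
end

phrase, priority = parse_seed_line("red sox            \t  1000\n")
puts [encode_query(phrase), priority].inspect
```

Keeping the phrase unencoded in the seed file and encoding only when the request is built avoids double-encoding when jobs are dumped and reloaded.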

data/examples/twitter/scrape_twitter_search/twitter_search_daemons.god CHANGED

@@ -1,25 +1,45 @@
-$: << File.dirname(__FILE__)+'/../../../../edamame/lib'
-require 'edamame/monitoring'
-WORK_DIR = File.dirname(__FILE__)+'/work'
+require 'yaml'
+require 'extlib'
+require 'wukong/extensions/hash'
+require "edamame/monitoring"
 #
-# For debugging:
+# You can load this file with
+#   sudo god -c ./twitter_search_daemons.god
+# To debug, run
+#   sudo god -c ./twitter_search_daemons.god -D
 #
-#   sudo god -c this_file.god -D
+#
+# Change this to point to your scrape destination.
+#
+WORK_DIR = '/data/ripd/com.tw/com.twitter.search'
+#
+# Also, make a copy of edamame_global_config-template.yaml in that directory,
+# but rename it edamame_global_config.yaml and edit it to suit.
 #
-# (for production, use the etc/initc.d script in this directory)
+GodProcess::GLOBAL_SITE_OPTIONS_FILES << WORK_DIR+'/edamame_global_config.yaml'
+# Files will be timestamped by when god is started.
+DATESTAMP = Time.now.utc.strftime("%Y%m%d")
+# Uncomment for a bunch of diagnostics:
+# p GodProcess.global_site_options,
+#   TyrantGod.site_options, TyrantGod.default_options.deep_merge(TyrantGod.site_options),
+#   GodProcess.site_options
 #
-# TODO: define an EdamameDirector that lets us name these collections.
+# Define email notifiers and attach one by default
 #
-THE_FAITHFUL = [
-  # twitter_search
-  [BeanstalkdGod, { :port => 11240, :max_mem_usage => 100.megabytes,  }],
-  [TyrantGod,     { :port => 11241, :db_dirname => WORK_DIR, :db_name => "twitter_search-queue.tct" }],
-  #
-  # [TyrantGod,     { :port => 11249, :db_dirname => WORK_DIR, :db_name => "twitter_search-flat.tct" }],
-]
+God.setup_email GodProcess.global_site_options[:email]
+GodProcess::DEFAULT_OPTIONS[:flapping_notify] = 'default'
-THE_FAITHFUL.each do |klass, config|
-  proc = klass.create(config.merge :flapping_notify => 'default')
-  proc.mkdirs!
-end
+#
+# Twitter Search
+#
+handle     = 'comtwittersearch'
+base_port  = 11220
+db_dirname = WORK_DIR+'/distdb/'+DATESTAMP
+BeanstalkdGod.create :port => base_port + 0, :max_mem_usage => 100.megabytes
+TyrantGod.create     :port => base_port + 1, :db_name => handle+'-queue.tct', :db_dirname => db_dirname
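
The port and directory arithmetic in the new god file can be checked in isolation. The sketch below only mirrors the values computed above; `daemon_layout` is a hypothetical helper, and nothing here touches god, beanstalkd, or Tokyo Tyrant.

```ruby
# Compute the settings the two daemons would be started with.
# The queue takes base_port + 0 and the store takes base_port + 1;
# the store's database lives under a directory stamped with the UTC start date.
def daemon_layout(work_dir, handle, base_port, now = Time.now)
  datestamp  = now.utc.strftime("%Y%m%d")
  db_dirname = File.join(work_dir, 'distdb', datestamp)
  {
    :queue => { :port => base_port + 0 },
    :store => { :port       => base_port + 1,
                :db_name    => "#{handle}-queue.tct",
                :db_dirname => db_dirname },
  }
end

layout = daemon_layout('/data/ripd/com.tw/com.twitter.search',
                       'comtwittersearch', 11220)
puts layout.inspect
```

Because the datestamp is taken when the daemons start, restarting on a later day points the store at a new directory.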

data/lib/wuclan/twitter.rb CHANGED

@@ -1,3 +1,5 @@
+$KCODE='u' unless "1.9".respond_to?(:encoding)
 module Wuclan
   module Twitter
     autoload :Scrape, 'wuclan/twitter/scrape'

data/lib/wuclan/twitter/model.rb CHANGED

@@ -9,6 +9,7 @@ module Wuclan
       autoload :TwitterUserSearchId, 'wuclan/twitter/model/twitter_user'
       autoload :TwitterUserId,       'wuclan/twitter/model/twitter_user'
       autoload :Tweet,               'wuclan/twitter/model/tweet'
+      autoload :DeleteTweet,         'wuclan/twitter/model/tweet'
       autoload :SearchTweet,         'wuclan/twitter/model/tweet'
       autoload :AFollowsB,           'wuclan/twitter/model/relationship'
       autoload :AFavoritesB,         'wuclan/twitter/model/relationship'

data/lib/wuclan/twitter/model/relationship.rb CHANGED

@@ -7,6 +7,14 @@ module Wuclan::Twitter::Model
       end
     end
+    def status_id
+      tweet_id
+    end
+    def in_reply_to_status_id
+      in_reply_to_tweet_id
+    end
     def self.included base
       base.class_eval{ extend ClassMethods }
     end
@@ -28,43 +36,43 @@ module Wuclan::Twitter::Model
   class AFavoritesB        < TypedStruct.new(
       [:user_a_id,              Integer],
       [:user_b_id,              Integer],
-      [:status_id,              Integer]
+      [:tweet_id,              Integer]
       )
     include ModelCommon
     include RelationshipBase
-    # Key on user_a-user_b-status_id (really just user_a-status_id is enough)
+    # Key on user_a-user_b-tweet_id (really just user_a-tweet_id is enough)
     def num_key_fields()  3 end
-    def numeric_id_fields()     [:user_a_id, :user_b_id, :status_id] ; end
+    def numeric_id_fields()     [:user_a_id, :user_b_id, :tweet_id] ; end
   end
   # Direct (threaded) replies: occur at the start of a tweet.
   class ARepliesB           < TypedStruct.new(
       [:user_a_id,              Integer],
       [:user_b_id,              Integer],
-      [:status_id,              Integer],
-      [:in_reply_to_status_id,  Integer]
+      [:tweet_id,              Integer],
+      [:in_reply_to_tweet_id,  Integer]
       )
     include ModelCommon
     include RelationshipBase
-    # Key on user_a-user_b-status_id
+    # Key on user_a-user_b-tweet_id
     def num_key_fields()  3  end
-    def numeric_id_fields()     [:user_a_id, :user_b_id, :status_id, :in_reply_to_status_id] ; end
+    def numeric_id_fields()     [:user_a_id, :user_b_id, :tweet_id, :in_reply_to_tweet_id] ; end
   end
   # Direct (threaded) replies: occur at the start of a tweet.
   class ARepliesBName       < TypedStruct.new(
-      [:user_a_name,            Integer],
-      [:user_b_name,            Integer],
-      [:status_id,              Integer],
-      [:in_reply_to_status_id,  Integer],
-      [:user_a_sid,             Integer],
-      [:user_b_sid,             Integer]
+      [:user_a_name,           String],
+      [:user_b_name,           String],
+      [:tweet_id,              Integer],
+      [:in_reply_to_tweet_id,  Integer],
+      [:user_a_sid,            Integer],
+      [:user_b_sid,            Integer]
       )
     include ModelCommon
     include RelationshipBase
-    # Key on user_a-user_b-status_id
+    # Key on user_a-user_b-tweet_id
     def num_key_fields()  3  end
-    def numeric_id_fields()     [:user_a_id, :user_b_id, :status_id, :in_reply_to_status_id] ; end
+    def numeric_id_fields()     [:user_a_id, :user_b_id, :tweet_id, :in_reply_to_tweet_id] ; end
   end
   # Atsign mentions anywhere in the tweet
@@ -72,13 +80,13 @@ module Wuclan::Twitter::Model
   class AAtsignsB           < TypedStruct.new(
       [:user_a_id,              Integer],
       [:user_b_name,            String],
-      [:status_id,              Integer]
+      [:tweet_id,              Integer]
       )
     include ModelCommon
     include RelationshipBase
-    # Key on user_a-user_b-status_id
+    # Key on user_a-user_b-tweet_id
     def num_key_fields()  3 end
-    def numeric_id_fields()     [:user_a_id, :status_id] ; end
+    def numeric_id_fields()     [:user_a_id, :tweet_id] ; end
   end
   # Atsign mentions anywhere in the tweet
@@ -86,13 +94,13 @@ module Wuclan::Twitter::Model
   class AAtsignsBId         < TypedStruct.new(
       [:user_a_id,              Integer],
       [:user_b_id,              Integer],
-      [:status_id,              Integer]
+      [:tweet_id,              Integer]
       )
     include ModelCommon
     include RelationshipBase
-    # Key on user_a-user_b-status_id
+    # Key on user_a-user_b-tweet_id
     def num_key_fields()  3 end
-    def numeric_id_fields()     [:user_a_id, :user_b_id, :status_id] ; end
+    def numeric_id_fields()     [:user_a_id, :user_b_id, :tweet_id] ; end
   end
@@ -112,7 +120,7 @@ module Wuclan::Twitter::Model
   # non-retweet-whore-requests have user_b_name set and unset respectively.)
   #
   # +user_a_id:+   the user who sent the re-tweet
-  # +status_id:+   the id of the tweet *containing* the re-tweet (for the ID of the original tweet you're on your own.)
+  # +tweet_id:+   the id of the tweet *containing* the re-tweet (for the ID of the original tweet you're on your own.)
   # +user_b_name:+ the user cited as originating: RT @user_b_name
   # +please_flag:+ a 1 if the text contains 'please' or 'plz' as a stand-alone word
   # +text:+        the *full* text of the tweet
@@ -120,7 +128,7 @@ module Wuclan::Twitter::Model
   class ARetweetsB <  TypedStruct.new(
       [:user_a_id,              Integer],
       [:user_b_name,            String],
-      [:status_id,              Integer],
+      [:tweet_id,              Integer],
       [:please_flag,            Integer],
       [:text,                   String]
       )
@@ -133,7 +141,7 @@ module Wuclan::Twitter::Model
     end
     # Key on retweeting_user-user-tweet_id
     def num_key_fields()  3  end
-    def numeric_id_fields()     [:user_a_id, :status_id] ; end
+    def numeric_id_fields()     [:user_a_id, :tweet_id] ; end
     #
     # If there's no user we'll assume this
     # is a retweet and not an rtwhore.
@@ -146,7 +154,7 @@ module Wuclan::Twitter::Model
   class ARetweetsBId <  TypedStruct.new(
       [:user_a_id,              Integer],
       [:user_b_id,              Integer],
-      [:status_id,              Integer],
+      [:tweet_id,              Integer],
       [:please_flag,            Integer],
       [:text,                   String]
       )
@@ -160,7 +168,7 @@ module Wuclan::Twitter::Model
     # Key on retweeting_user-user-tweet_id
     def num_key_fields()  3  end
-    def numeric_id_fields()     [:user_a_id, :user_b_id, :status_id] ; end
+    def numeric_id_fields()     [:user_a_id, :user_b_id, :tweet_id] ; end
     #
     # If there's no user we'll assume this
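
The rename from `status_id` to `tweet_id` keeps the old names available as reader methods. A minimal sketch of that pattern, using a plain `Struct` as a hypothetical stand-in for the `TypedStruct`-based `ARepliesB`, looks like this. Each old name has to delegate to the renamed field; a method that calls its own name would recurse until the stack overflows.

```ruby
# Hypothetical stand-in for ARepliesB; the real class is built with TypedStruct.
Reply = Struct.new(:user_a_id, :user_b_id, :tweet_id, :in_reply_to_tweet_id) do
  # Old field names kept as read-only aliases for the renamed fields.
  def status_id
    tweet_id
  end

  def in_reply_to_status_id
    in_reply_to_tweet_id
  end
end

reply = Reply.new(23, 42, 5254924802, 5254900000)
puts reply.status_id
puts reply.in_reply_to_status_id
```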

data/lib/wuclan/twitter/model/tweet.rb CHANGED

@@ -31,6 +31,13 @@ module Wuclan::Twitter::Model
     def numeric_id_fields()     [:id, :twitter_user_id, :in_reply_to_status_id, :in_reply_to_user_id] ; end
   end
+  class DeleteTweet < TypedStruct.new(
+      [:id,                      Integer     ],
+      [:created_at,              Bignum      ],
+      [:twitter_user_id,         Integer     ]
+      )
+    include ModelCommon
+  end
   #
   # SearchTweet

data/lib/wuclan/twitter/model/twitter_user.rb CHANGED

@@ -30,6 +30,8 @@ module Wuclan::Twitter::Model
   end
   #
   # Fundamental information on a user.
   #
@@ -57,6 +59,9 @@ module Wuclan::Twitter::Model
     def tweets_per_day()       tweets_count.to_i    / days_since_created  end
   end
   #
   # Outside of a users/show page, when a user is mentioned
   # only this subset of fields appear.

data/lib/wuclan/twitter/parse.rb ADDED

@@ -0,0 +1,3 @@
+require 'monkeyshines';
+require 'wuclan/twitter/scrape/twitter_search_request';
+require 'wuclan/twitter/parse/twitter_search_parse';

data/lib/wuclan/twitter/parse/twitter_search_parse.rb CHANGED

@@ -15,6 +15,7 @@ module Wuclan
         # Parse
         #
         def parse *args, &block
+          return unless items
           items.each do |item|
             self.encode_and_sanitize!(item)
             tweet = tweet_from_parse(item)

data/lib/wuclan/twitter/scrape.rb CHANGED

@@ -14,8 +14,10 @@ module Wuclan
       autoload :TwitterFriendsIdsRequest,     'wuclan/twitter/scrape/twitter_ff_ids_request'
       autoload :TwitterUserTimelineRequest,   'wuclan/twitter/scrape/twitter_timeline_request'
       autoload :TwitterPublicTimelineRequest, 'wuclan/twitter/scrape/twitter_timeline_request'
+      autoload :TwitterStreamRequest,         'wuclan/twitter/scrape/twitter_stream_request'
       autoload :JsonUserWithTweet,            'wuclan/twitter/scrape/twitter_json_response'
       autoload :JsonTweetWithUser,            'wuclan/twitter/scrape/twitter_json_response'
+      autoload :JsonDeleteTweet,              'wuclan/twitter/scrape/twitter_json_response'
     end
   end

data/lib/wuclan/twitter/scrape/old_skool_request_classes.rb CHANGED

@@ -13,8 +13,7 @@ module Wuclan::Twitter::Scrape
     def parse *args, &block
       handle_special_cases!(*args, &block) or return
-      # super *args
-      yield self
+      super *args
     end
     def handle_special_cases! *args, &block
@@ -26,10 +25,14 @@ module Wuclan::Twitter::Scrape
     end
   end
-  class Followers < TwitterFollowersRequest       ; include OldSkoolRequest ; end
-  class Friends   < TwitterFriendsRequest         ; include OldSkoolRequest ; end
-  class Favorites < TwitterFavoritesRequest       ; include OldSkoolRequest ; end
+  class User         < TwitterUserRequest         ; include OldSkoolRequest ; end
+  class Followers    < TwitterFollowersRequest    ; include OldSkoolRequest ; end
+  class Friends      < TwitterFriendsRequest      ; include OldSkoolRequest ; end
+  class FollowersIds < TwitterFollowersIdsRequest ; include OldSkoolRequest ; end
+  class FriendsIds   < TwitterFriendsIdsRequest   ; include OldSkoolRequest ; end
+  class Favorites    < TwitterFavoritesRequest    ; include OldSkoolRequest ; end
   class UserTimeline < TwitterUserTimelineRequest ; include OldSkoolRequest ; end
   class Bogus < BadRecord ;
     def parse suffix=nil, *args
       errors = suffix.split('-')

data/lib/wuclan/twitter/scrape/twitter_ff_ids_request.rb CHANGED

@@ -27,10 +27,11 @@ module Wuclan
         # unpacks the raw API response, yielding all the relationships.
         #
         def parse *args, &block
+          return unless healthy?
           parsed_contents.each do |user_b_id|
             user_b_id = "%010d"%user_b_id.to_i
             # B is a follower: B follows user.
-            yield AFollowsB.new(user_b_id, user_a_id)
+            yield AFollowsB.new(user_b_id, twitter_user_id)
           end
         end
       end
@@ -62,10 +63,11 @@ module Wuclan
         # unpacks the raw API response, yielding all the relationships.
         #
         def parse *args, &block
+          return unless healthy?
           parsed_contents.each do |user_b_id|
             user_b_id = "%010d"%user_b_id.to_i
             # B is a friend: user follows B
-            yield AFollowsB.new(user_a_id, user_b_id)
+            yield AFollowsB.new(twitter_user_id, user_b_id)
           end
         end
       end
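
The two hunks above differ only in argument order: a followers response yields edges pointing at the requested user, and a friends response yields edges pointing away from it. The following is a minimal sketch of that convention, not the gem's code. The `AFollowsB` struct is a hypothetical two-field stand-in for the real model class, and `zeropad`, `follower_edges`, and `friend_edges` are illustrative names that do not appear in the diff.

```ruby
# Hypothetical stand-in for the gem's AFollowsB model: a plain two-field edge.
AFollowsB = Struct.new(:user_a_id, :user_b_id)

# Zero-pad a numeric ID to ten digits, using the same "%010d" format as the diff.
def zeropad(id)
  "%010d" % id.to_i
end

# Followers response: each listed ID follows the requested user.
def follower_edges(twitter_user_id, ids)
  ids.map { |b| AFollowsB.new(zeropad(b), twitter_user_id) }
end

# Friends response: the requested user follows each listed ID.
def friend_edges(twitter_user_id, ids)
  ids.map { |b| AFollowsB.new(twitter_user_id, zeropad(b)) }
end
```

In both cases the first field is the follower and the second is the account being followed, so edges from either endpoint can be merged into one relation.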

data/lib/wuclan/twitter/scrape/twitter_json_response.rb CHANGED

@@ -20,6 +20,7 @@ module Wuclan::Twitter::Scrape
     # generate all the contained TwitterXXX objects
     #
     def each
+      return unless healthy?
       if is_partial?
         yield user
       else
@@ -38,10 +39,10 @@ module Wuclan::Twitter::Scrape
     # This method tries to guess, based on the fields in the raw_user, which it has.
     #
     def is_partial?
+      p(raw) if !raw_user
       not raw_user.include?('friends_count')
     end
     def tweet
       Tweet.from_hash raw_tweet if raw_tweet
     end
@@ -66,7 +67,7 @@ module Wuclan::Twitter::Scrape
     #
     def fix_raw_user!
       return unless raw_user
-      raw_user['scraped_at'] = self.moreinfo['scraped_at']
+      raw_user['scraped_at'] = ModelCommon.flatten_date(self.moreinfo['scraped_at'])
       raw_user['created_at'] = ModelCommon.flatten_date(raw_user['created_at'])
       raw_user['id']         = ModelCommon.zeropad_id(  raw_user['id'])
       raw_user['protected']  = ModelCommon.unbooleanize(raw_user['protected'])
@@ -88,7 +89,7 @@ module Wuclan::Twitter::Scrape
       raw_tweet['created_at']             = ModelCommon.flatten_date(raw_tweet['created_at'])
       raw_tweet['favorited']              = ModelCommon.unbooleanize(raw_tweet['favorited'])
       raw_tweet['truncated']              = ModelCommon.unbooleanize(raw_tweet['truncated'])
-      raw_tweet['twitter_user_id']        = ModelCommon.zeropad_id(  raw_tweet['twitter_user_id'] )
+      raw_tweet['twitter_user_id']        = ModelCommon.zeropad_id(   raw_user['id'] )
       raw_tweet['in_reply_to_user_id']    = ModelCommon.zeropad_id(  raw_tweet['in_reply_to_user_id'])   unless raw_tweet['in_reply_to_user_id'].blank?   || (raw_tweet['in_reply_to_user_id'].to_i   == 0)
       raw_tweet['in_reply_to_status_id']  = ModelCommon.zeropad_id(  raw_tweet['in_reply_to_status_id']) unless raw_tweet['in_reply_to_status_id'].blank? || (raw_tweet['in_reply_to_status_id'].to_i == 0)
       Wukong.encode_components raw_tweet, 'text', 'in_reply_to_screen_name'
@@ -96,9 +97,7 @@ module Wuclan::Twitter::Scrape
   end
 end
 class JsonUserWithTweet < JsonUserTweetPair
   def raw_tweet
     return @raw_tweet if @raw_tweet
     @raw_tweet = raw['status']
@@ -112,7 +111,6 @@ end
 class JsonTweetWithUser < JsonUserTweetPair
   def raw_tweet
     @raw_tweet ||= raw
   end
@@ -122,3 +120,38 @@ class JsonTweetWithUser < JsonUserTweetPair
     @raw_user
   end
 end
+class JsonDeleteTweet
+  attr_accessor :raw, :moreinfo, :scraped_at
+  def initialize raw, moreinfo={}
+    self.raw        = raw
+    self.moreinfo   = moreinfo
+    self.scraped_at = nil # TODO -- extract this from neighbors
+  end
+  # Extracted JSON should be an array
+  def healthy?()
+    raw && raw.is_a?(Hash)
+  end
+  def delete_tweet
+    Wuclan::Twitter::Model::DeleteTweet.new(
+      raw['delete']['status']['id'],
+      self.scraped_at,
+      raw['delete']['status']['user_id']
+      ) rescue nil
+  end
+  def each *args, &block
+    return unless healthy?
+    yield delete_tweet
+  end
+  # true if this model looks like it will parse the given JSON
+  def self.parses? hsh
+    # KLUDGE
+    hsh =~ /"delete":\{/
+  end
+end
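
`JsonDeleteTweet#delete_tweet` above reads `raw['delete']['status']` and relies on a trailing `rescue nil` to absorb any missing key. As a hedged alternative, the sketch below checks each nesting level explicitly and returns `nil` when the shape is wrong. `delete_fields` is a hypothetical helper, not part of wuclan, and the hashes it is meant for are assumed to have only the keys that the diff itself reads.

```ruby
# Hypothetical helper: returns [status_id, user_id] from a delete notice,
# or nil when any expected level of nesting is missing or not a Hash.
def delete_fields(raw)
  notice = raw.is_a?(Hash) ? raw['delete'] : nil
  status = notice.is_a?(Hash) ? notice['status'] : nil
  return nil unless status.is_a?(Hash)
  [status['id'], status['user_id']]
end
```

Explicit checks keep unrelated exceptions visible, whereas a blanket `rescue nil` would hide them along with the expected missing-key case.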

data/lib/wuclan/twitter/scrape/twitter_request_stream.rb CHANGED

@@ -21,6 +21,7 @@ class TwitterRequestStream < Monkeyshines::RequestStream::SimpleRequestStream
   # can be a screen_name, but we need the numeric ID for followers_request's, etc.
   def each_request twitter_user_id, *args
     user_req = TwitterUserRequest.new(twitter_user_id)
+    # this performs the request in-place: req holds the fulfilled response
     yield(user_req)
     return unless user_req.healthy?
     twitter_user_id = user_req.parsed_contents['id'].to_i if (user_req.parsed_contents['id'].to_i > 0)

data/lib/wuclan/twitter/scrape/twitter_stream_request.rb ADDED

@@ -0,0 +1,44 @@
+module Wuclan
+  module Twitter
+    module Scrape
+      class TwitterStreamRequest < Struct.new(:contents)
+        # Contents are JSON
+        include Monkeyshines::RawJsonContents
+        # self.hard_request_limit = 1
+        # def make_url() "http://stream.twitter.com/1/statuses/sample.json"  end
+        # Extracted JSON should be an array
+        def healthy?()
+          parsed_contents && parsed_contents.is_a?(Hash)
+        end
+        def parsed_as_delete_tweet *args, &block
+          p parsed_contents
+          json_obj = JsonDeleteTweet.new(parsed_contents)
+          json_obj.each(&block)
+        end
+        # Extract user and tweet
+        def parsed_as_tweet *args, &block
+          json_obj = JsonTweetWithUser.new(
+            parsed_contents, 'scraped_at' => parsed_contents['created_at'])
+          json_obj.each(&block)
+        end
+        #
+        # unpacks the raw API response, yielding all the interesting objects
+        # and relationships within.
+        #
+        def parse *args, &block
+          return unless healthy?
+          return parsed_as_delete_tweet(*args, &block) if JsonDeleteTweet.parses?(contents)
+          # else
+          parsed_as_tweet(*args, &block)
+        end
+      end
+    end
+  end
+end
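
`TwitterStreamRequest#parse` above routes each record by testing the raw string against `/"delete":\{/`, which the source itself marks as a KLUDGE. The sketch below shows one possible alternative under stated assumptions: parse first, then look for a top-level `delete` key. `route_stream_line` and the two sample notices are hypothetical and exist only for illustration. The samples are made up, and their shape is inferred from the keys the diff reads.

```ruby
require 'json'

# The substring test used by JsonDeleteTweet.parses? in the diff.
DELETE_PATTERN = /"delete":\{/

# Made-up sample notices: identical content, different whitespace.
COMPACT_NOTICE = '{"delete":{"status":{"id":1234,"user_id":3}}}'
SPACED_NOTICE  = '{"delete": {"status": {"id": 1234, "user_id": 3}}}'

# Hypothetical router: classify one line of stream output by parsing it
# and then checking for a top-level "delete" key.
def route_stream_line(line)
  parsed = begin
    JSON.parse(line)
  rescue JSON::ParserError
    nil
  end
  return :skipped unless parsed.is_a?(Hash)
  parsed.key?('delete') ? :delete_tweet : :tweet
end
```

The substring test matches `COMPACT_NOTICE` but not `SPACED_NOTICE`, while the key check classifies both as delete notices. The trade-off is that the key check requires a full JSON parse before routing.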

data/wuclan.gemspec CHANGED

@@ -5,11 +5,11 @@
 Gem::Specification.new do |s|
   s.name = %q{wuclan}
-  s.version = "0.2.0"
+  s.version = "0.2.1"
   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
   s.authors = ["Philip (flip) Kromer"]
-  s.date = %q{2009-10-12}
+  s.date = %q{2009-11-02}
   s.description = %q{Massive-scale social network analysis. Nothing to f with.}
   s.email = %q{flip@infochimps.org}
   s.extra_rdoc_files = [
@@ -35,6 +35,7 @@ Gem::Specification.new do |s|
      "examples/twitter/old/scrape_twitter_trending.rb",
      "examples/twitter/parse/parse_twitter_requests.rb",
      "examples/twitter/parse/parse_twitter_search_requests.rb",
+     "examples/twitter/parse/parse_twitter_stream_requests.rb",
      "examples/twitter/scrape_twitter_api/scrape_twitter_api.rb",
      "examples/twitter/scrape_twitter_api/seed.tsv",
      "examples/twitter/scrape_twitter_api/start_cache_twitter.sh",
@@ -49,9 +50,13 @@ Gem::Specification.new do |s|
      "examples/twitter/scrape_twitter_hosebird/scrape_twitter_hosebird.rb",
      "examples/twitter/scrape_twitter_hosebird/test_spewer.rb",
      "examples/twitter/scrape_twitter_hosebird/twitter_hosebird_god.yaml",
+     "examples/twitter/scrape_twitter_search/README.textile",
+     "examples/twitter/scrape_twitter_search/README.textile",
      "examples/twitter/scrape_twitter_search/dump_twitter_search_jobs.rb",
+     "examples/twitter/scrape_twitter_search/edamame_global_config-template.yaml",
      "examples/twitter/scrape_twitter_search/load_twitter_search_jobs.rb",
      "examples/twitter/scrape_twitter_search/scrape_twitter_search.rb",
+     "examples/twitter/scrape_twitter_search/seed.tsv",
      "examples/twitter/scrape_twitter_search/twitter_search_daemons.god",
      "lib/old/twitter_api.rb",
      "lib/wuclan.rb",
@@ -102,6 +107,7 @@ Gem::Specification.new do |s|
      "lib/wuclan/twitter/model/tweet/tweet_token.rb",
      "lib/wuclan/twitter/model/twitter_user.rb",
      "lib/wuclan/twitter/model/twitter_user/style/color_to_hsv.rb",
+     "lib/wuclan/twitter/parse.rb",
      "lib/wuclan/twitter/parse/ff_ids_parser.rb",
      "lib/wuclan/twitter/parse/friends_followers_parser.rb",
      "lib/wuclan/twitter/parse/generic_json_parser.rb",
@@ -123,6 +129,7 @@ Gem::Specification.new do |s|
      "lib/wuclan/twitter/scrape/twitter_search_job.rb",
      "lib/wuclan/twitter/scrape/twitter_search_request.rb",
      "lib/wuclan/twitter/scrape/twitter_search_request_stream.rb",
+     "lib/wuclan/twitter/scrape/twitter_stream_request.rb",
      "lib/wuclan/twitter/scrape/twitter_timeline_request.rb",
      "lib/wuclan/twitter/scrape/twitter_user_request.rb",
      "spec/spec_helper.rb",
@@ -151,6 +158,7 @@ Gem::Specification.new do |s|
      "examples/twitter/old/scrape_twitter_trending.rb",
      "examples/twitter/parse/parse_twitter_requests.rb",
      "examples/twitter/parse/parse_twitter_search_requests.rb",
+     "examples/twitter/parse/parse_twitter_stream_requests.rb",
      "examples/twitter/scrape_twitter_api/scrape_twitter_api.rb",
      "examples/twitter/scrape_twitter_api/support/make_request_stats.rb",
      "examples/twitter/scrape_twitter_api/support/make_requests_by_id_and_date_1.rb",

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: wuclan
 version: !ruby/object:Gem::Version
-  version: 0.2.0
+  version: 0.2.1
 platform: ruby
 authors:
 - Philip (flip) Kromer
@@ -9,7 +9,7 @@ autorequire:
 bindir: bin
 cert_chain: []
-date: 2009-10-12 00:00:00 -05:00
+date: 2009-11-02 00:00:00 -06:00
 default_executable:
 dependencies:
 - !ruby/object:Gem::Dependency
@@ -70,6 +70,7 @@ files:
 - examples/twitter/old/scrape_twitter_trending.rb
 - examples/twitter/parse/parse_twitter_requests.rb
 - examples/twitter/parse/parse_twitter_search_requests.rb
+- examples/twitter/parse/parse_twitter_stream_requests.rb
 - examples/twitter/scrape_twitter_api/scrape_twitter_api.rb
 - examples/twitter/scrape_twitter_api/seed.tsv
 - examples/twitter/scrape_twitter_api/start_cache_twitter.sh
@@ -84,9 +85,12 @@ files:
 - examples/twitter/scrape_twitter_hosebird/scrape_twitter_hosebird.rb
 - examples/twitter/scrape_twitter_hosebird/test_spewer.rb
 - examples/twitter/scrape_twitter_hosebird/twitter_hosebird_god.yaml
+- examples/twitter/scrape_twitter_search/README.textile
 - examples/twitter/scrape_twitter_search/dump_twitter_search_jobs.rb
+- examples/twitter/scrape_twitter_search/edamame_global_config-template.yaml
 - examples/twitter/scrape_twitter_search/load_twitter_search_jobs.rb
 - examples/twitter/scrape_twitter_search/scrape_twitter_search.rb
+- examples/twitter/scrape_twitter_search/seed.tsv
 - examples/twitter/scrape_twitter_search/twitter_search_daemons.god
 - lib/old/twitter_api.rb
 - lib/wuclan.rb
@@ -136,6 +140,7 @@ files:
 - lib/wuclan/twitter/model/tweet/tweet_token.rb
 - lib/wuclan/twitter/model/twitter_user.rb
 - lib/wuclan/twitter/model/twitter_user/style/color_to_hsv.rb
+- lib/wuclan/twitter/parse.rb
 - lib/wuclan/twitter/parse/ff_ids_parser.rb
 - lib/wuclan/twitter/parse/friends_followers_parser.rb
 - lib/wuclan/twitter/parse/generic_json_parser.rb
@@ -157,6 +162,7 @@ files:
 - lib/wuclan/twitter/scrape/twitter_search_job.rb
 - lib/wuclan/twitter/scrape/twitter_search_request.rb
 - lib/wuclan/twitter/scrape/twitter_search_request_stream.rb
+- lib/wuclan/twitter/scrape/twitter_stream_request.rb
 - lib/wuclan/twitter/scrape/twitter_timeline_request.rb
 - lib/wuclan/twitter/scrape/twitter_user_request.rb
 - spec/spec_helper.rb
@@ -207,6 +213,7 @@ test_files:
 - examples/twitter/old/scrape_twitter_trending.rb
 - examples/twitter/parse/parse_twitter_requests.rb
 - examples/twitter/parse/parse_twitter_search_requests.rb
+- examples/twitter/parse/parse_twitter_stream_requests.rb
 - examples/twitter/scrape_twitter_api/scrape_twitter_api.rb
 - examples/twitter/scrape_twitter_api/support/make_request_stats.rb
 - examples/twitter/scrape_twitter_api/support/make_requests_by_id_and_date_1.rb