twitter_to_csv 0.1.0 → 0.1.1

Files changed (8)

data/LICENSE.txt +7 -0
data/README.markdown +10 -2
data/bin/twitter_to_csv +8 -0
data/lib/twitter_to_csv/csv_builder.rb +39 -14
data/lib/twitter_to_csv/twitter_watcher.rb +4 -2
data/lib/twitter_to_csv/version.rb +1 -1
data/spec/csv_builder_spec.rb +36 -4
metadata +7 -4

data/LICENSE.txt ADDED

@@ -0,0 +1,7 @@
+Copyright (c) 2012 Andrew Cantino, Iteration Labs, LLC
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.markdown CHANGED

@@ -49,7 +49,12 @@ separate words unless the whole thing has a single known valence.
 Once you have a recorded Twitter stream, you can rollup retweets in various ways.  Here is an example that collapses retweets into the `retweet_count` field of the original tweet, only outputs tweets with at least 1 retweet, ignores retweets that happened more than 7 days after the original tweet, and outputs retweet count columns at half an hour, 2 hours, and 2 days after the original tweet:
-    twitter_to_csv --replay-from-file out.json -c out.csv --fields retweet_count,text -e --retweet-mode rollup --retweet-threshold 1 --retweet-window 7 --retweet-counts-at 0.5,2,48
+    twitter_to_csv --replay-from-file out.json -c out.csv \
+                   --retweet-mode rollup \
+                   --retweet-threshold 1 \
+                   --retweet-window 7 \
+                   --retweet-counts-at 0.5,2,48 \
+                   --fields retweet_count,text
 Note that all of the retweet features require you to `--replay-from-file` because they parse the stream backwards.  They will not function correctly from the stream directly.
@@ -57,7 +62,10 @@ Note that all of the retweet features require you to `--replay-from-file` becaus
 To select a specific window of time in a pre-recorded stream by `created_at`, pass in `--start` and `--end`, for example:
-    twitter_to_csv --replay-from-file out.json --start "Mon Mar 07 07:42:22 +0000 2011" --end "Mon Mar 08 07:42:22 +0000 2011"
+    twitter_to_csv --replay-from-file out.json \
+                   --start "Mon Mar 07 07:42:22 +0000 2011" \
+                   --end "Mon Mar 08 07:42:22 +0000 2011" \
+                   ...
 ## Mind the Gap

data/bin/twitter_to_csv CHANGED

@@ -36,6 +36,10 @@ parser = OptionParser.new do |opts|
     options[:fields] = fields.split(/\s*,\s*/)
   end
+  opts.on("--date-fields FIELD_NAMES", "Break these fields into separate numerical columns for weekday, day, month, year, hour, minute, and second.") do |date_fields|
+    options[:date_fields] = date_fields.split(/\s*,\s*/)
+  end
   opts.on("-e", "--require-english", "Attempt to filter out non-English tweets.", "This will have both false positives and false negatives.") do |e|
     options[:require_english] = e
   end
@@ -76,6 +80,10 @@ parser = OptionParser.new do |opts|
     options[:compute_word_count] = compute_word_count
   end
+  opts.on("--normalize-source", "Return just the domain name from the Tweet source (i.e., tweetdeck, facebook)") do |normalize_source|
+    options[:normalize_source] = normalize_source
+  end
   opts.on("--start TIME", "Ignore tweets with a created_at earlier than TIME") do |start_time|
     options[:start_time] = Time.parse(start_time)
   end
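
The two options added above can be tried in isolation. The following is a minimal sketch, not the gem's actual executable: `parse_new_options` is a hypothetical helper introduced only for illustration, and the help strings are shortened. It shows that `--date-fields` splits on commas while tolerating surrounding whitespace, and that `--normalize-source` is a plain boolean switch.

```ruby
require 'optparse'

# Hypothetical helper (not part of twitter_to_csv): parses only the two
# flags introduced in 0.1.1 and returns the resulting options hash.
def parse_new_options(argv)
  options = {}
  OptionParser.new do |opts|
    opts.on("--date-fields FIELD_NAMES", "Comma-separated date fields to split into columns") do |date_fields|
      # Same split as in the diff: commas with optional whitespace around them.
      options[:date_fields] = date_fields.split(/\s*,\s*/)
    end
    opts.on("--normalize-source", "Strip markup from the tweet source") do |normalize_source|
      # A switch without an argument yields true when present.
      options[:normalize_source] = normalize_source
    end
  end.parse!(argv)
  options
end

opts = parse_new_options(["--date-fields", "created_at, user.created_at", "--normalize-source"])
p opts
```

Dotted names such as `user.created_at` are passed through unchanged here; in the diff, the lookup into nested fields happens later, in `csv_builder.rb`.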

data/lib/twitter_to_csv/csv_builder.rb CHANGED

@@ -37,10 +37,10 @@ module TwitterToCsv
     end
     def within_time_window?(status)
-      if options[:start] || options[:end]
+      if options[:start_time] || options[:end_time]
         created_at = status['created_at'].is_a?(Time) ? status['created_at'] : Time.parse(status['created_at'])
-        return false if options[:start] && created_at < options[:start]
-        return false if options[:end] && created_at >= options[:end]
+        return false if options[:start_time] && created_at < options[:start_time]
+        return false if options[:end_time] && created_at >= options[:end_time]
       end
       true
     end
@@ -56,7 +56,6 @@ module TwitterToCsv
           @retweet_counts[status['retweeted_status']['id']] ||= 0
           @retweet_counts[status['retweeted_status']['id']] = status['retweeted_status']['retweet_count'] if status['retweeted_status']['retweet_count'] > @retweet_counts[status['retweeted_status']['id']]
           if options[:retweet_counts_at]
             @retweet_hour_counts[status['retweeted_status']['id']] ||= options[:retweet_counts_at].map { 0 }
             options[:retweet_counts_at].each.with_index do |hour_mark, index|
@@ -72,7 +71,14 @@ module TwitterToCsv
         if (@retweet_counts[status['id']] || 0) >= (options[:retweet_threshold] || 0)
           if !options[:retweet_window] || created_at <= @newest_status_at - options[:retweet_window] * 60 * 60 * 24
             status['retweet_count'] = @retweet_counts[status['id']] if @retweet_counts[status['id']] && @retweet_counts[status['id']] > status['retweet_count']
-            status['_retweet_hour_counts'] = @retweet_hour_counts.delete(status['id']) if options[:retweet_counts_at]
+            if options[:retweet_counts_at]
+              retweet_hour_data = @retweet_hour_counts.delete(status['id'])
+              if !retweet_hour_data
+                puts "Encountered missing retweet_data for tweet##{status['id']}, possibly due to a repeating id or a deleted tweet."
+                return false
+              end
+              status['_retweet_hour_counts'] = retweet_hour_data
+            end
             true
           else
             false
@@ -104,6 +110,14 @@ module TwitterToCsv
       header_labels += ["average_sentiment", "sentiment_words"] if options[:compute_sentiment]
       header_labels << "word_count" if options[:compute_word_count]
+      header_labels << "normalized_source" if options[:normalize_source]
+      (options[:date_fields] || []).each do |date_field|
+        %w[week_day day month year hour minute second].each do |value|
+          header_labels << "#{date_field}_#{value}"
+        end
+      end
       options[:retweet_counts_at].each { |hours| header_labels << "retweets_at_#{hours}_hours" } if options[:retweet_counts_at]
       options[:url_columns].times { |i| header_labels << "url_#{i+1}" } if options[:url_columns] && options[:url_columns] > 0
@@ -132,6 +146,22 @@ module TwitterToCsv
       row << status["text"].split(/\s+/).length if options[:compute_word_count]
+      row << status["source"].gsub(/<[^>]+>/, '').strip if options[:normalize_source]
+      (options[:date_fields] || []).each do |date_field|
+        time = Time.parse(date_field.split(".").inject(status) { |memo, segment|
+          memo && memo[segment]
+        }.to_s)
+        row << time.strftime("%w") # week_day
+        row << time.strftime("%-d") # day
+        row << time.strftime("%-m") # month
+        row << time.strftime("%Y") # year
+        row << time.strftime("%-H") # hour
+        row << time.strftime("%M") # minute
+        row << time.strftime("%S") # second
+      end
       row += status["_retweet_hour_counts"] if options[:retweet_counts_at]
       if options[:url_columns] && options[:url_columns] > 0
@@ -244,15 +274,10 @@ module TwitterToCsv
         return false
       end
-      if status['text'] =~ /[^[:ascii:]]/
-        STDERR.puts "Skipping \"#{status['text']}\" due to non-ascii text." if options[:verbose]
-        return false
-      end
-      unless status['user']['lang'] == "en"
-        STDERR.puts "Skipping \"#{status['text']}\" due to lang of #{status['user']['lang']}." if options[:verbose]
-        return false
-      end
+      #unless status['user']['lang'] == "en"
+      #  STDERR.puts "Skipping \"#{status['text']}\" due to lang of #{status['user']['lang']}." if options[:verbose]
+      #  return false
+      #end
       unless UnsupervisedLanguageDetection.is_english_tweet?(status['text'])
         STDERR.puts "Skipping \"#{status['text']}\" due to UnsupervisedLanguageDetection guessing non-English" if options[:verbose]
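
The date-field and source-normalization logic added to `csv_builder.rb` can be exercised outside the class. This is a sketch under stated assumptions: `date_columns` and `normalized_source` are hypothetical standalone helpers that copy the expressions from the hunk above, and the sample status reuses the values from the gem's own spec. Note that, judging by the spec, `--normalize-source` yields the link text of the source (for example "Twitter for Android") rather than a domain name, despite the wording of the option's help string.

```ruby
require 'time'

# Hypothetical helper mirroring the diff: walk a dotted path such as
# "user.created_at" through nested hashes, then split the parsed time
# into the seven columns in header order.
def date_columns(status, date_field)
  raw = date_field.split(".").inject(status) { |memo, segment| memo && memo[segment] }
  time = Time.parse(raw.to_s) # raises ArgumentError if the field is missing
  [
    time.strftime("%w"),  # week_day, 0 = Sunday
    time.strftime("%-d"), # day of month, unpadded
    time.strftime("%-m"), # month, unpadded
    time.strftime("%Y"),  # four-digit year
    time.strftime("%-H"), # hour (24-hour clock), unpadded
    time.strftime("%M"),  # minute, zero-padded
    time.strftime("%S")   # second, zero-padded
  ]
end

# Hypothetical helper mirroring the diff: drop anything that looks like a tag.
def normalized_source(status)
  status["source"].gsub(/<[^>]+>/, '').strip
end

status = {
  "created_at" => "2012-06-29 13:12:09 -0700",
  "source"     => "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>"
}

p date_columns(status, "created_at") # => ["5", "29", "6", "2012", "13", "12", "09"]
p normalized_source(status)          # => "Twitter for Android"
```

The values are reported in the offset carried by the timestamp itself (here -0700), not converted to UTC, which matches the expectation in the "can return date fields" spec.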

data/lib/twitter_to_csv/twitter_watcher.rb CHANGED

@@ -1,3 +1,5 @@
+require 'cgi'
 module TwitterToCsv
   class TwitterWatcher
     attr_accessor :username, :password, :filter, :fetch_errors
@@ -20,7 +22,7 @@ module TwitterToCsv
       while true
         EventMachine::run do
           stream = Twitter::JSONStream.connect(
-            :path    => "/1/statuses/#{(filter && filter.length > 0) ? 'filter' : 'sample'}.json#{"?track=#{filter.join(",")}" if filter && filter.length > 0}",
+            :path    => "/1/statuses/#{(filter && filter.length > 0) ? 'filter' : 'sample'}.json#{"?track=#{filter.map {|f| CGI::escape(f) }.join(",")}" if filter && filter.length > 0}",
             :auth    => "#{username}:#{password}",
             :ssl     => true
           )
@@ -55,4 +57,4 @@ module TwitterToCsv
       block.call(status)
     end
   end
-end
+end
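
The effect of the `CGI::escape` change on the request path can be seen with a small sketch. `stream_path` is a hypothetical helper written for this illustration; it only rebuilds the `:path` string from the hunk above and does not connect to anything.

```ruby
require 'cgi'

# Hypothetical helper: builds the same path string as the :path option
# in the diff, escaping each track term individually before joining.
def stream_path(filter)
  if filter && filter.length > 0
    "/1/statuses/filter.json?track=#{filter.map { |f| CGI.escape(f) }.join(",")}"
  else
    "/1/statuses/sample.json"
  end
end

puts stream_path(["#ruby", "ruby on rails"])
puts stream_path([])
```

Escaping each term before joining keeps the separating commas literal while characters such as `#`, `&`, and spaces inside a term are encoded, so a term can no longer truncate or corrupt the query string.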

data/lib/twitter_to_csv/version.rb CHANGED

@@ -1,3 +1,3 @@
 module TwitterToCsv
-  VERSION = "0.1.0"
+  VERSION = "0.1.1"
 end

data/spec/csv_builder_spec.rb CHANGED

@@ -9,10 +9,8 @@ describe TwitterToCsv::CsvBuilder do
         string_io = StringIO.new
         csv_builder = TwitterToCsv::CsvBuilder.new(:require_english => true, :csv => string_io, :fields => %w[text])
         csv_builder.handle_status('text' => "This is English", 'user' =>  { 'lang' => 'en' })
-        csv_builder.handle_status('text' => "هذه الجملة باللغة الإنجليزية.", 'user' =>  { 'lang' => 'en' })
         csv_builder.handle_status('text' => "Esta frase se encuentra en Ingles.", 'user' =>  { 'lang' => 'en' })
         csv_builder.handle_status('text' => "This is still English", 'user' =>  { 'lang' => 'en' })
-        csv_builder.handle_status('text' => "The lang code can lie, but we trust it for now.", 'user' =>  { 'lang' => 'fr' })
         string_io.rewind
         string_io.read.should == "\"This is English\"\n\"This is still English\"\n"
       end
@@ -20,8 +18,8 @@ describe TwitterToCsv::CsvBuilder do
       it "honors start_time and end_time" do
         string_io = StringIO.new
         csv_builder = TwitterToCsv::CsvBuilder.new(:csv => string_io, :fields => %w[text],
-                                                   :start => Time.parse("Mon Mar 07 07:42:22 +0000 2011"),
-                                                   :end   => Time.parse("Mon Mar 08 02:00:00 +0000 2011"))
+                                                   :start_time => Time.parse("Mon Mar 07 07:42:22 +0000 2011"),
+                                                   :end_time   => Time.parse("Mon Mar 08 02:00:00 +0000 2011"))
         # Order shouldn't matter
         csv_builder.handle_status('text' => "1", 'created_at' => 'Mon Mar 07 07:41:22 +0000 2011')
@@ -52,6 +50,14 @@ describe TwitterToCsv::CsvBuilder do
         string_io.read.should == '"something","url_1","url_2"' + "\n"
       end
+      it "includes date fields if requested" do
+        string_io = StringIO.new
+        csv_builder = TwitterToCsv::CsvBuilder.new(:csv => string_io, :fields => %w[something], :date_fields => %w[created_at])
+        csv_builder.log_csv_header
+        string_io.rewind
+        string_io.read.should == '"something","created_at_week_day","created_at_day","created_at_month","created_at_year","created_at_hour","created_at_minute","created_at_second"' + "\n"
+      end
       it "includes columns for the retweet_counts_at entries, if present" do
         string_io = StringIO.new
         csv_builder = TwitterToCsv::CsvBuilder.new(:csv => string_io,
@@ -162,6 +168,32 @@ describe TwitterToCsv::CsvBuilder do
         string_io.read.should == "\"hello1\",\"3\"\n" +
                                  "\"hello2\",\"2\"\n"
       end
+      it "can return date fields" do
+        string_io = StringIO.new
+        csv_builder = TwitterToCsv::CsvBuilder.new(:csv => string_io, :fields => %w[something], :date_fields => %w[created_at])
+        csv_builder.handle_status({
+            'something' => "hello1",
+            'text' => 'i love cheese',
+            'created_at' => "2012-06-29 13:12:09 -0700"
+        })
+        string_io.rewind
+        string_io.read.should == "\"hello1\",\"5\",\"29\",\"6\",\"2012\",\"13\",\"12\",\"09\"\n"
+      end
+      it "can return a normalized source" do
+        string_io = StringIO.new
+        csv_builder = TwitterToCsv::CsvBuilder.new(:csv => string_io, :fields => %w[something], :normalize_source => true)
+        csv_builder.handle_status({
+            'something' => "hello1",
+            'text' => 'i love cheese',
+            'source' => "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>"
+        })
+        string_io.rewind
+        string_io.read.should == "\"hello1\",\"Twitter for Android\"\n"
+      end
     end
     describe "retweet handling" do
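
The renamed keys in the "honors start_time and end_time" spec correspond to the fix in `within_time_window?`: the executable stores `options[:start_time]` (and, presumably, `options[:end_time]`), while 0.1.0 read `options[:start]` and `options[:end]`. The sketch below is a hypothetical standalone copy of the method, taking `options` as a parameter instead of reading it from the builder, to show the half-open window and what happens when the old keys are passed.

```ruby
require 'time'

# Hypothetical standalone copy of the 0.1.1 check: the start bound is
# inclusive, the end bound is exclusive.
def within_time_window?(status, options)
  if options[:start_time] || options[:end_time]
    created_at = status['created_at'].is_a?(Time) ? status['created_at'] : Time.parse(status['created_at'])
    return false if options[:start_time] && created_at < options[:start_time]
    return false if options[:end_time] && created_at >= options[:end_time]
  end
  true
end

window = {
  :start_time => Time.parse("2011-03-07 07:42:22 +0000"),
  :end_time   => Time.parse("2011-03-08 02:00:00 +0000")
}

before   = { 'created_at' => "2011-03-07 07:41:22 +0000" }
at_start = { 'created_at' => "2011-03-07 07:42:22 +0000" }
at_end   = { 'created_at' => "2011-03-08 02:00:00 +0000" }

p within_time_window?(before, window)   # => false
p within_time_window?(at_start, window) # => true
p within_time_window?(at_end, window)   # => false

# With the 0.1.0 key names the window is silently ignored:
old_keys = { :start => window[:start_time], :end => window[:end_time] }
p within_time_window?(before, old_keys) # => true
```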

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: twitter_to_csv
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 0.1.1
   prerelease:
 platform: ruby
 authors:
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-07-04 00:00:00.000000000 Z
+date: 2012-08-07 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rspec
@@ -103,6 +103,7 @@ files:
 - .rspec
 - .rvmrc
 - Gemfile
+- LICENSE.txt
 - README.markdown
 - Rakefile
 - bin/twitter_to_csv
@@ -136,9 +137,11 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
 requirements: []
 rubyforge_project: twitter_to_csv
-rubygems_version: 1.8.24
+rubygems_version: 1.8.21
 signing_key:
 specification_version: 3
 summary: Dump the Twitter streaming API to a CSV or JSON file and then filter, handle
   retweets, apply sentiment analysis, and more.
-test_files: []
+test_files:
+- spec/csv_builder_spec.rb
+- spec/spec_helper.rb