bad_pigeon 0.1.0 → 0.1.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 383a6c54b644b03f93d3b6c67ca811045938f8bf0de44feaf3c5f7344838eaf8
-  data.tar.gz: b69a8b24eeb45f7e3761309aa993620ed845a6625ad8fbaae1114c398578e17c
+  metadata.gz: 342303be8c8d86ebacfe718abecf55e553e1e9528540bac71d82d7f5729e6f84
+  data.tar.gz: cae73cef84dd9267c5cd79946afc129ccd93c8dcae62127e8c24b4ee88a8c6e6
 SHA512:
-  metadata.gz: 35ccd0f949bca44ce2b0374f1ba185f7addedb35c2b574e3bedfbccf9c7487ca80a3d418f0e9cd27c8501ee25c874d564d77f1959e1a8e9a4925707673861be9
-  data.tar.gz: 1af6038dafd3046b4f3af1cfcb739dd1df4428d527b0eea1ad986ef913711e0b67d9ce1ea8148d3f4d58b0d73b940e93409887f604c526c8c49c655d81608b5a
+  metadata.gz: d648e2cafc7d9f6b71650dfe283a57dc2a569d9b6da42938b38050559b8c0af0a53917624eb200febdc7a2522bcdfe1242b2645b11532dfcfddf81f408f3ba70
+  data.tar.gz: f30740c2a5422ada1760dbb46ab66868095546781843d75d3c3f4b1d232994c928835ca78e8812451f71e822503cd0dd2f96cd47f4045c655ee2539d35c242de
data/CHANGELOG.md CHANGED
@@ -1,5 +1,14 @@
-## [Unreleased]
+## [0.1.1] - 2023-06-20
 
-## [0.1.0] - 2023-06-18
+- timeline parsing improvements
+- added some helper methods
+- fix for HAR exports from Firefox/Chrome
 
-- Initial release
+## [0.1.0] - 2023-06-19
+
+First working proof of concept version:
+
+- parsing a HAR archive
+- parsing user, list and home timelines from archive requests
+- exporting tweet JSON data
+- command line script
data/README.md CHANGED
@@ -1,31 +1,93 @@
 # BadPigeon
 
-TODO: Delete this and the text below, and describe your gem
+A tool for exporting tweet data from Twitter by parsing GraphQL fetch requests made by the Twitter website.
 
-Welcome to your new gem! In this directory, you'll find the files you need to be able to package up your Ruby library into a gem. Put your Ruby code in the file `lib/bad_pigeon`. To experiment with that code, run `bin/console` for an interactive prompt.
+<p>
+<img src="https://github.com/mackuba/bad_pigeon/assets/28465/99c3eee1-1fab-41be-a909-6b53d141b7db" width="600"><br>
+<i>Photo by Martin Vorel, <a href="https://libreshot.com">libreshot.com</a></i>
+</p>
 
-## Installation
 
-TODO: Replace `UPDATE_WITH_YOUR_GEM_NAME_PRIOR_TO_RELEASE_TO_RUBYGEMS_ORG` with your gem name right after releasing it to RubyGems.org. Please do not do it earlier due to security reasons. Alternatively, replace this section with instructions to install your gem from git if you don't plan to release to RubyGems.org.
+## What is this about?
 
-Install the gem and add to the application's Gemfile by executing:
+**Problem:** You were running some kind of project that used Twitter API to load tweets from some number of feeds and process them in some way - for archiving, research, statistics, whatever. Now the free API access has been shut down, all your API keys have been revoked and your project doesn't work anymore ☹️
 
-    $ bundle add UPDATE_WITH_YOUR_GEM_NAME_PRIOR_TO_RELEASE_TO_RUBYGEMS_ORG
+**Solution 1:** sign up for paid access and pay more than all your streaming, media, internet, mobile and app subscriptions combined every month just to fetch some tweets 🤑💰💰💰
 
-If bundler is not being used to manage dependencies, install the gem by executing:
+**Solution 2:** go the Chad Scraper route and scrape the data from the website with some scripts, playing a cat and mouse game and worrying that your account and/or IP will be blocked 😬
 
-    $ gem install UPDATE_WITH_YOUR_GEM_NAME_PRIOR_TO_RELEASE_TO_RUBYGEMS_ORG
+**Solution 3:** passively record the requests that the Twitter frontend is making to the API using Safari Web Inspector, then use some Ruby code to extract any data you want from the saved JSON responses 🤔
 
-## Usage
 
-TODO: Write usage instructions here
+## How it works:
 
-## Development
+1. Open the Twitter website in a browser (preferably Safari or Firefox).
+2. Open the Web Inspector / Developer Tools on the Network tab.
+    - in Safari, make sure the "Export" button is not grayed out; if it is, reload the page first
+3. Scroll through some timelines (home, lists etc.) to make sure everything you want to save has been loaded.
+4. In the Network tab list, type "graphql" to the filter bar - only those requests are parsed, so no point making the archive larger than necessary.
+    - it seems that Chrome-based browsers always export all requests to the archive, so the file size gets into tens of megabytes very quickly - so it's better to use Safari or Firefox, which only export requests matching the filter
+5. Click "Export" and save the requests to a "HAR" archive file.
+    - in Safari, the button is in the top-right corner of the Network tab
+    - in Firefox, click the "gear" button in the top-right corner and choose "Save All As HAR"
+    - in Chrome, click the down arrow button at the end of the top toolbar
+6. Feed the archive file to the Bad Pigeon (the Ruby code or the command line tool).
+7. Profit 👍
 
-After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+Note: one obvious drawback of this method is that the request recording part is somewhat manual, so it's (probably) not possible to completely automate it so that it runs on a server somewhere, unattended. However, it should be enough if you're ok with having to remember to periodically browse through a few timelines, save the export and run a script on it.
 
-To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and the created tag, and push the `.gem` file to [rubygems.org](https://rubygems.org).
 
-## Contributing
+## Stability warning ⚠️
 
-Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/bad_pigeon.
+This is a very early version of this tool. The API \*will\* change between versions, possibly even between point releases. Don't be surprised if something breaks.
+
+
+## How to use:
+
+To install the tool, run:
+
+```
+gem install bad_pigeon
+```
+
+The [`TweetExtractor`](https://github.com/mackuba/bad_pigeon/blob/docs/lib/bad_pigeon/tweet_extractor.rb) class is the entry point. Pass the contents of the `.har` file to the `#get_tweets_from_har` method to get an array of `Tweet` objects parsed from the whole archive:
+
+```rb
+require 'bad_pigeon'
+
+data = File.read(path_to_har)
+extractor = BadPigeon::TweetExtractor.new
+tweets = extractor.get_tweets_from_har(data)
+
+tweets.sort_by(&:created_at).reverse.each do |tweet|
+  puts "#{tweet.created_at} @#{tweet.user.screen_name}: \"#{tweet.text}\""
+end
+```
+
+The `Tweet` class is meant to be API compatible with the one from the popular [twitter gem](https://github.com/sferik/twitter/), so you should be able to use it as a drop-in replacement if your project used that library (although only some subset of properties will work right now - please [report issues](https://github.com/mackuba/bad_pigeon/issues) for any missing ones).
+
+
+### Command line
+
+The gem also installs a command-line script `pigeon`. You can pass it the archive file and get a JSON array of tweet data on the output:
+
+```
+pigeon < tweets.har > tweets.json
+```
+
+At the moment this is the only thing it does. There will be some options in the future to e.g. filter the tweets only from some sources and so on. The format that it exports the tweets in is also meant to match the hashes returned from the `#attrs` method in the `Tweet` class in the [twitter gem](https://github.com/sferik/twitter/).
+
+
+## Credits
+
+Copyright © 2023 Kuba Suder ([@mackuba.eu](https://bsky.app/profile/mackuba.eu)).
+
+The code is available under the terms of the [zlib license](https://choosealicense.com/licenses/zlib/) (permissive, similar to MIT).
+
+Bug reports and pull requests are welcome 😎 (note: if you're having problems parsing some tweets, please send me links to some examples of specific tweets that are making it fail).
+
+---
+
+#### Why *bad* pigeon?
+
+Because pigeons are generally bad :<
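
The HAR file saved in step 5 of the README is plain JSON, so the request filtering described there can be sketched with the standard library alone. The sketch below assumes the HAR 1.2 layout (`log.entries[]`, each with a `request` and a `response` object); the method name `graphql_responses` and the sample data are hypothetical and not part of the gem, and base64-encoded response bodies are not handled.

```ruby
require 'json'

# URL prefixes taken from the gem's graphql_endpoint? check shown in this diff.
GRAPHQL_PREFIXES = [
  'https://api.twitter.com/graphql/',
  'https://twitter.com/i/api/graphql/'
].freeze

# Hypothetical helper: returns the parsed JSON bodies of successful
# GraphQL GET requests found in a HAR document (given as a string).
def graphql_responses(har_text)
  entries = JSON.parse(har_text).dig('log', 'entries') || []

  selected = entries.select do |entry|
    url = entry.dig('request', 'url').to_s
    mime = entry.dig('response', 'content', 'mimeType').to_s

    GRAPHQL_PREFIXES.any? { |prefix| url.start_with?(prefix) } &&
      entry.dig('request', 'method') == 'GET' &&
      entry.dig('response', 'status') == 200 &&
      mime.start_with?('application/json')
  end

  # Entries without a recorded body are skipped.
  selected.map { |entry| entry.dig('response', 'content', 'text') }
          .compact
          .map { |text| JSON.parse(text) }
end

# Made-up two-entry archive: one GraphQL response, one HTML page.
sample_har = JSON.generate(
  'log' => {
    'entries' => [
      {
        'request' => { 'method' => 'GET', 'url' => 'https://twitter.com/i/api/graphql/abc/HomeTimeline' },
        'response' => { 'status' => 200,
                        'content' => { 'mimeType' => 'application/json; charset=utf-8', 'text' => '{"data":{}}' } }
      },
      {
        'request' => { 'method' => 'GET', 'url' => 'https://twitter.com/home' },
        'response' => { 'status' => 200, 'content' => { 'mimeType' => 'text/html', 'text' => '<html></html>' } }
      }
    ]
  }
)

puts graphql_responses(sample_har).length
```

Only the first entry passes all four checks, which is also why filtering on "graphql" before exporting keeps the archive small without losing anything the parser would use.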
@@ -27,12 +27,16 @@ module BadPigeon
       @json['clientEventInfo'] && @json['clientEventInfo']['component']
     end
 
+    def all_tweets
+      items.map(&:tweet).compact
+    end
+
     def items
       case self.type
       when Type::ITEM
-        item_from_content(@json['itemContent'])
+        [item_from_content(@json['itemContent'])].compact
       when Type::MODULE
-        @json['items'].map { |i| item_from_content(i['item']['itemContent']) }
+        @json['items'].map { |i| item_from_content(i['item']['itemContent']) }.compact
       when Type::CURSOR
         []
       else
@@ -44,12 +48,12 @@ module BadPigeon
     def item_from_content(item_content)
       case item_content['itemType']
       when 'TimelineTweet'
-        [TimelineTweet.new(item_content)]
+        TimelineTweet.new(item_content)
       when 'TimelineUser'
-        []
+        nil
       else
         assert("Unknown itemContent type: #{item_content['itemType']}")
-        []
+        nil
       end
     end
   end
@@ -28,7 +28,7 @@ module BadPigeon
     end
 
     def tweet
-      Tweet.new(tweet_data)
+      tweet_data && Tweet.new(tweet_data)
     end
   end
 end
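
The two hunks above move from "every helper returns an array" to "helpers return one object or `nil`, and the caller compacts". A minimal sketch of that pattern follows; `Tweet`, `TimelineItem` and the sample hashes are hypothetical stand-ins for illustration, not the gem's actual classes or data shapes.

```ruby
# Hypothetical stand-ins for the gem's classes, for illustration only.
Tweet = Struct.new(:text)

class TimelineItem
  def initialize(content)
    @content = content
  end

  # Same shape as `tweet_data && Tweet.new(tweet_data)`:
  # returns nil when the item carries no tweet data.
  def tweet
    data = @content['tweet']
    data && Tweet.new(data)
  end
end

# Returns a single item or nil, never an array.
def item_from_content(content)
  case content['itemType']
  when 'TimelineTweet' then TimelineItem.new(content)
  end
end

# The caller drops the nils, so the result is always a flat array of items.
def items(contents)
  contents.map { |c| item_from_content(c) }.compact
end

def all_tweets(contents)
  items(contents).map(&:tweet).compact
end

contents = [
  { 'itemType' => 'TimelineTweet', 'tweet' => 'hello' },
  { 'itemType' => 'TimelineUser' },
  { 'itemType' => 'TimelineTweet' } # tweet data missing
]

puts all_tweets(contents).map(&:text).inspect
```

The design choice is that "nothing here" is expressed once as `nil` and filtered at one level, instead of each branch having to remember to wrap its result in an array.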
@@ -7,8 +7,12 @@ module BadPigeon
       @json = JSON.parse(data)
     end
 
-    def entries
+    def requests
       @json['log']['entries'].map { |j| HARRequest.new(j) }
     end
+
+    def inspect
+      to_s
+    end
   end
 end
@@ -1,3 +1,4 @@
+require 'addressable/uri'
 require 'json'
 
 module BadPigeon
@@ -18,6 +19,19 @@ module BadPigeon
       url.start_with?('https://api.twitter.com/graphql/') || url.start_with?('https://twitter.com/i/api/graphql/')
     end
 
+    def includes_tweet_data?
+      graphql_endpoint? && method == :get && status == 200 && has_json_response?
+    end
+
+    def endpoint_name
+      Addressable::URI.parse(url).path.split('/').last
+    end
+
+    def params
+      vars = Addressable::URI.parse(url).query_values['variables']
+      vars && JSON.parse(vars) || {}
+    end
+
     def status
       @json['response']['status']
     end
@@ -27,7 +41,7 @@ module BadPigeon
     end
 
     def has_json_response?
-      mime_type == 'application/json'
+      mime_type.gsub(/;.*/, '').strip == 'application/json'
     end
 
     def response_body
@@ -37,5 +51,11 @@ module BadPigeon
     def response_json
       response_body && JSON.parse(response_body)
     end
+
+    def inspect
+      keys = [:method, :url, :status]
+      vars = keys.map { |k| "#{k}=#{self.send(k).inspect}" }.join(", ")
+      "#<#{self.class}:0x#{object_id} #{vars}>"
+    end
   end
 end
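
The new request helpers do three small things: take the last path segment as the endpoint name, decode the JSON stored in the `variables` query parameter, and ignore MIME type parameters such as `; charset=utf-8`. The sketch below reproduces that behaviour with Ruby's standard `uri` library instead of Addressable, which is a substitution made here only to keep the example dependency-free. The free-standing method names and the sample URL are hypothetical, and unlike the gem's `params`, this version also returns an empty hash when the URL has no query string at all.

```ruby
require 'json'
require 'uri'

# Last path segment of a GraphQL URL, e.g. "UserTweets".
def endpoint_name(url)
  URI.parse(url).path.split('/').last
end

# Decodes the JSON object passed in the "variables" query parameter.
def graphql_variables(url)
  query = URI.parse(url).query
  return {} if query.nil?

  vars = URI.decode_www_form(query).to_h['variables']
  vars ? JSON.parse(vars) : {}
end

# Compares only the media type, ignoring parameters like "; charset=utf-8".
def json_mime_type?(mime_type)
  mime_type.gsub(/;.*/, '').strip == 'application/json'
end

# Made-up URL in the shape of the endpoints matched by graphql_endpoint?.
query = URI.encode_www_form('variables' => JSON.generate('count' => 20))
url = "https://twitter.com/i/api/graphql/abc123/UserTweets?#{query}"

puts endpoint_name(url)
puts graphql_variables(url).inspect
puts json_mime_type?('application/json; charset=utf-8')
```

The MIME comparison is presumably what the changelog entry about Firefox/Chrome HAR exports refers to, since a strict `==` check fails as soon as a browser records the charset alongside the type.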
@@ -73,17 +73,27 @@ module BadPigeon
     def attrs
       user = json['core']['user_results']['result']
 
-      fields = legacy.merge({
+      fields = {
         id: id,
         source: json['source'],
         text: text,
         truncated: false,
-      }).reject { |k, v| ['retweeted_status_result', 'quoted_status_result'].include?(k) }
+      }
 
-      user_fields = user['legacy'].merge({
+      legacy.each do |k, v|
+        next if ['retweeted_status_result', 'quoted_status_result'].include?(k)
+        fields[k.to_sym] = v
+      end
+
+      user_fields = {
         id: user['rest_id'].to_i,
         id_str: user['rest_id'],
-      }).reject { |k, v| k =~ /^profile_\w+_extensions/ }
+      }
+
+      user['legacy'].each do |k, v|
+        next if k =~ /^profile_\w+_extensions/
+        user_fields[k.to_sym] = v
+      end
 
       fields[:user] = StrictHash[user_fields]
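
The rewrite of `attrs` above replaces `merge` plus `reject` with an explicit loop that symbolizes each key. A small sketch of why that matters: merging a string-keyed hash (parsed JSON) with a symbol-keyed literal leaves both key types in one hash. The helper name `build_fields` and the sample values are hypothetical; only the skipped key names come from the diff.

```ruby
# Keys the diff excludes from the exported tweet hash.
SKIPPED_KEYS = %w[retweeted_status_result quoted_status_result].freeze

# Hypothetical helper: copies string-keyed legacy data into a
# symbol-keyed hash, skipping the nested status results.
def build_fields(explicit, legacy)
  fields = explicit.dup

  legacy.each do |key, value|
    next if SKIPPED_KEYS.include?(key)
    fields[key.to_sym] = value
  end

  fields
end

legacy = { 'favorite_count' => 3, 'quoted_status_result' => { 'result' => {} } }

mixed = legacy.merge({ id: 1 })          # old approach: string and symbol keys together
uniform = build_fields({ id: 1 }, legacy) # new approach: symbol keys only

puts mixed.keys.inspect   # ["favorite_count", "quoted_status_result", :id]
puts uniform.keys.inspect # [:id, :favorite_count]
```

One side effect worth keeping in mind: because legacy values are written after the explicit ones, a legacy key with the same name as an explicit field would overwrite it in this form.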
 
@@ -3,8 +3,25 @@ Dir[File.join(__dir__, 'timelines', '*.rb')].each { |f| require(f) }
 module BadPigeon
   TIMELINE_TYPES = {
     'UserTweets' => UserTimeline,
+    'UserMedia' => UserTimeline,
     'HomeLatestTimeline' => HomeTimeline,
     'HomeTimeline' => HomeTimeline,
-    'ListLatestTweetsTimeline' => ListTimeline
+    'ListLatestTweetsTimeline' => ListTimeline,
+
+    # ignored requests:
+    'AudioSpaceById' => nil,
+    'CommunitiesTabBarItemQuery' => nil,
+    'DataSaverMode' => nil,
+    'GetUserClaims' => nil,
+    'ListByRestId' => nil,
+    'ListMembers' => nil,
+    'ListPins' => nil,
+    'ListSubscribers' => nil,
+    'ListsManagementPageTimeline' => nil,
+    'ProfileSpotlightsQuery' => nil,
+    'UserByRestId' => nil,
+    'UserByScreenName' => nil,
+    'Viewer' => nil,
+    'getAltTextPromptPreference' => nil,
   }
 end
@@ -15,23 +15,21 @@ module BadPigeon
 
     def get_tweets_from_har(har_data)
       archive = HARArchive.new(har_data)
+      requests = archive.requests.select(&:includes_tweet_data?)
 
-      requests = archive.entries.select { |e|
-        e.graphql_endpoint? && e.method == :get && e.status == 200 && e.has_json_response?
-      }
-
-      entries = requests.map { |e|
-        endpoint = URI(e.url).path.split('/').last
+      timeline_entries = requests.map { |e| timeline_entries_from_request(e) }.flatten
+      timeline_entries.select { |e| @filter.include_entry?(e) }.map(&:all_tweets).flatten
+    end
 
-        if timeline_class = TIMELINE_TYPES[endpoint]
-          timeline_class.new(e.response_json).instructions.map(&:entries)
-        elsif !TIMELINE_TYPES.has_key?(endpoint)
-          debug "Unknown endpoint: #{endpoint}"
-          []
-        end
-      }.flatten
+    def timeline_entries_from_request(request)
+      endpoint = request.endpoint_name
 
-      entries.select { |e| @filter.include_entry?(e) }.map(&:items).flatten.map(&:tweet).compact
+      if timeline_class = TIMELINE_TYPES[endpoint]
+        timeline_class.new(request.response_json).instructions.map(&:entries)
+      else
+        debug "Unknown endpoint: #{endpoint}" unless TIMELINE_TYPES.has_key?(endpoint)
+        []
+      end
     end
   end
 end
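
The `TIMELINE_TYPES` table now maps ignored endpoints to `nil`, so a plain lookup can no longer tell "known but ignored" from "never seen". The extractor separates the two with `has_key?`. A minimal sketch of that three-way split follows; the table contents, the symbols standing in for timeline classes and the `entries_for` helper are all hypothetical.

```ruby
# Hypothetical table: symbols stand in for the gem's timeline classes.
TIMELINE_HANDLERS = {
  'UserTweets' => :user_timeline,
  'HomeTimeline' => :home_timeline,
  'Viewer' => nil # known endpoint, deliberately ignored
}.freeze

# Always returns an array, and warns only for endpoints that are
# missing from the table entirely.
def entries_for(endpoint, warnings)
  if (handler = TIMELINE_HANDLERS[endpoint])
    [[handler, endpoint]]
  else
    warnings << "Unknown endpoint: #{endpoint}" unless TIMELINE_HANDLERS.key?(endpoint)
    []
  end
end

warnings = []
entries = %w[UserTweets Viewer SomethingNew].map { |e| entries_for(e, warnings) }.flatten(1)

puts entries.inspect
puts warnings.inspect

# Why the else branch has to return []: flatten keeps nil values.
puts [[1], nil, []].flatten.inspect # [1, nil]
```

With an `if`/`elsif` that has no `else`, a known-but-ignored endpoint would make the block evaluate to `nil`, and that `nil` would survive `flatten` and reach the entry filter. Returning `[]` from a single `else` branch avoids that case.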
@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 
 module BadPigeon
-  VERSION = "0.1.0"
+  VERSION = "0.1.1"
 end
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: bad_pigeon
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 0.1.1
 platform: ruby
 authors:
 - Kuba Suder
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2023-06-19 00:00:00.000000000 Z
+date: 2023-06-20 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: addressable
@@ -25,10 +25,10 @@ dependencies:
       - !ruby/object:Gem::Version
         version: '2.8'
 description: "\n BadPigeon is a Ruby gem that allows you to extract tweet data
-  from the XHR requests that the Twitter.com frontend\n website does in user's
-  browser. The requests need to be saved into a \"HAR\" archive file from the browser's
-  web\n inspector tool and then that file is fed into either the appropriate Ruby
-  class or the `pigeon` command line tool.\n \n The tool intents to be API compatible
+  from the XHR requests that the Twitter frontend\n website does in user's browser.
+  The requests need to be saved into a \"HAR\" archive file from the browser's web\n
+  \ inspector tool and then that file is fed into either the appropriate Ruby class
+  or the `pigeon` command line tool.\n\n The tool intents to be API compatible
   with the popular `twitter` gem and generate the same kind of tweet JSON\n structure
   as is read and exported by that library.\n "
 email: