bad_pigeon 0.1.0 → 0.1.2
- checksums.yaml +4 -4
- data/CHANGELOG.md +16 -3
- data/README.md +77 -15
- data/lib/bad_pigeon/elements/timeline_entry.rb +9 -5
- data/lib/bad_pigeon/elements/timeline_tweet.rb +1 -1
- data/lib/bad_pigeon/har/har_archive.rb +5 -1
- data/lib/bad_pigeon/har/har_request.rb +21 -1
- data/lib/bad_pigeon/models/tweet.rb +14 -4
- data/lib/bad_pigeon/timelines.rb +18 -1
- data/lib/bad_pigeon/tweet_extractor.rb +12 -14
- data/lib/bad_pigeon/version.rb +1 -1
- metadata +6 -6
checksums.yaml CHANGED

```diff
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c3f86a627c1ab2f039167161fe3ecfbeb3db6d96a925673c23165cf74a3cc54b
+  data.tar.gz: f43ac43250395cc7f9200c5cc8bcc2bb2596827aac61f88110e078ae0210fa64
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 7c02d9d1fe34e0c97a05d0135db11fef8dc401261fd1f2c1fa7a3cd961d191e547e2e462b7792728ac602a5567db8c756f2f2f6906ff9308cdbd8a2cdf53be34
+  data.tar.gz: 8eb675eb61e1c2bfb9ace4e21b2aa2526fd878933eedcb48798b892d9261f80d82ce47b765a7b25a1911c9c6586cc899e16a8160ccb18d1d12cfb2b93d43cb96
```
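Published checksums like the SHA256 values above can be verified against a downloaded gem file with Ruby's standard `digest` library; a minimal sketch (the helper name and file path are illustrative, not part of the gem):

```ruby
require 'digest'

# Compare the SHA256 of a downloaded file with the checksum published
# for the release (the path argument is an example).
def checksum_matches?(path, expected_sha256)
  Digest::SHA256.file(path).hexdigest == expected_sha256
end

# The same digest API works on in-memory strings too:
digest = Digest::SHA256.hexdigest("hello")
# => "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"
```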
data/CHANGELOG.md CHANGED

```diff
@@ -1,5 +1,18 @@
-## [
+## [0.1.2] - 2023-07-11
 
-
+- fixed issue with some requests being ignored because they use POST method
 
--
+## [0.1.1] - 2023-06-20
+
+- timeline parsing improvements
+- added some helper methods
+- fix for HAR exports from Firefox/Chrome
+
+## [0.1.0] - 2023-06-19
+
+First working proof of concept version:
+
+- parsing a HAR archive
+- parsing user, list and home timelines from archive requests
+- exporting tweet JSON data
+- command line script
```
data/README.md CHANGED

````diff
@@ -1,31 +1,93 @@
 # BadPigeon
 
-
+A tool for exporting tweet data from Twitter by parsing GraphQL fetch requests made by the Twitter website.
 
-
+<p>
+<img src="https://github.com/mackuba/bad_pigeon/assets/28465/99c3eee1-1fab-41be-a909-6b53d141b7db" width="600"><br>
+<i>Photo by Martin Vorel, <a href="https://libreshot.com">libreshot.com</a></i>
+</p>
 
-## Installation
 
-
+## What is this about?
 
-
+**Problem:** You were running some kind of project that used Twitter API to load tweets from some number of feeds and process them in some way - for archiving, research, statistics, whatever. Now the free API access has been shut down, all your API keys have been revoked and your project doesn't work anymore ☹️
 
-
+**Solution 1:** sign up for paid access and pay more than all your streaming, media, internet, mobile and app subscriptions combined every month just to fetch some tweets 🤑💰💰💰
 
-
+**Solution 2:** go the Chad Scraper route and scrape the data from the website with some scripts, playing a cat and mouse game and worrying that your account and/or IP will be blocked 😬
 
-
+**Solution 3:** passively record the requests that the Twitter frontend is making to the API using Safari Web Inspector, then use some Ruby code to extract any data you want from the saved JSON responses 🤔
 
-## Usage
 
-
+## How it works:
 
-
+1. Open the Twitter website in a browser (preferably Safari or Firefox).
+2. Open the Web Inspector / Developer Tools on the Network tab.
+   - in Safari, make sure the "Export" button is not grayed out; if it is, reload the page first
+3. Scroll through some timelines (home, lists etc.) to make sure everything you want to save has been loaded.
+4. In the Network tab list, type "graphql" to the filter bar - only those requests are parsed, so no point making the archive larger than necessary.
+   - it seems that Chrome-based browsers always export all requests to the archive, so the file size gets into tens of megabytes very quickly - so it's better to use Safari or Firefox, which only export requests matching the filter
+5. Click "Export" and save the requests to a "HAR" archive file.
+   - in Safari, the button is in the top-right corner of the Network tab
+   - in Firefox, click the "gear" button in the top-right corner and choose "Save All As HAR"
+   - in Chrome, click the down arrow button at the end of the top toolbar
+6. Feed the archive file to the Bad Pigeon (the Ruby code or the command line tool).
+7. Profit 👍
 
-
+Note: one obvious drawback of this method is that the request recording part is somewhat manual, so it's (probably) not possible to completely automate it so that it runs on a server somewhere, unattended. However, it should be enough if you're ok with having to remember to periodically browse through a few timelines, save the export and run a script on it.
 
-To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and the created tag, and push the `.gem` file to [rubygems.org](https://rubygems.org).
 
-##
+## Stability warning ⚠️
 
-
+This is a very early version of this tool. The API \*will\* change between versions, possibly even between point releases. Don't be surprised if something breaks.
+
+
+## How to use:
+
+To install the tool, run:
+
+```
+gem install bad_pigeon
+```
+
+The [`TweetExtractor`](https://github.com/mackuba/bad_pigeon/blob/docs/lib/bad_pigeon/tweet_extractor.rb) class is the entry point. Pass the contents of the `.har` file to the `#get_tweets_from_har` method to get an array of `Tweet` objects parsed from the whole archive:
+
+```rb
+require 'bad_pigeon'
+
+data = File.read(path_to_har)
+extractor = BadPigeon::TweetExtractor.new
+tweets = extractor.get_tweets_from_har(data)
+
+tweets.sort_by(&:created_at).reverse.each do |tweet|
+  puts "#{tweet.created_at} @#{tweet.user.screen_name}: \"#{tweet.text}\""
+end
+```
+
+The `Tweet` class is meant to be API compatible with the one from the popular [twitter gem](https://github.com/sferik/twitter/), so you should be able to use it as a drop-in replacement if your project used that library (although only some subset of properties will work right now - please [report issues](https://github.com/mackuba/bad_pigeon/issues) for any missing ones).
+
+
+### Command line
+
+The gem also installs a command-line script `pigeon`. You can pass it the archive file and get a JSON array of tweet data on the output:
+
+```
+pigeon < tweets.har > tweets.json
+```
+
+At the moment this is the only thing it does. There will be some options in the future to e.g. filter the tweets only from some sources and so on. The format that it exports the tweets in is also meant to match the hashes returned from the `#attrs` method in the `Tweet` class in the [twitter gem](https://github.com/sferik/twitter/).
+
+
+## Credits
+
+Copyright © 2023 Kuba Suder ([@mackuba.eu](https://bsky.app/profile/mackuba.eu)).
+
+The code is available under the terms of the [zlib license](https://choosealicense.com/licenses/zlib/) (permissive, similar to MIT).
+
+Bug reports and pull requests are welcome 😎 (note: if you're having problems parsing some tweets, please send me links to some examples of specific tweets that are making it fail).
+
+---
+
+#### Why *bad* pigeon?
+
+Because pigeons are generally bad :<
````
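The HAR file produced by the README's export steps is plain JSON, so the structure the gem consumes can be inspected without any dependencies; a minimal sketch with a hand-made two-entry archive (the field names follow the HAR format, the URLs are illustrative):

```ruby
require 'json'

# A tiny hand-made HAR-shaped document; real exports have many more fields.
har = {
  'log' => {
    'entries' => [
      { 'request' => { 'url' => 'https://twitter.com/i/api/graphql/abc/HomeTimeline' } },
      { 'request' => { 'url' => 'https://twitter.com/home' } }
    ]
  }
}.to_json

# List only the GraphQL requests, like the "graphql" filter in the Network tab.
urls = JSON.parse(har)['log']['entries']
  .map { |e| e['request']['url'] }
  .select { |u| u.include?('/graphql/') }
# urls == ["https://twitter.com/i/api/graphql/abc/HomeTimeline"]
```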
data/lib/bad_pigeon/elements/timeline_entry.rb CHANGED

```diff
@@ -27,12 +27,16 @@ module BadPigeon
       @json['clientEventInfo'] && @json['clientEventInfo']['component']
     end
 
+    def all_tweets
+      items.map(&:tweet).compact
+    end
+
     def items
       case self.type
       when Type::ITEM
-        item_from_content(@json['itemContent'])
+        [item_from_content(@json['itemContent'])].compact
       when Type::MODULE
-        @json['items'].map { |i| item_from_content(i['item']['itemContent']) }
+        @json['items'].map { |i| item_from_content(i['item']['itemContent']) }.compact
       when Type::CURSOR
         []
       else
@@ -44,12 +48,12 @@ module BadPigeon
     def item_from_content(item_content)
       case item_content['itemType']
       when 'TimelineTweet'
-
+        TimelineTweet.new(item_content)
       when 'TimelineUser'
-
+        nil
       else
         assert("Unknown itemContent type: #{item_content['itemType']}")
-
+        nil
       end
     end
   end
```
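The `.compact` additions above follow a common Ruby pattern: wrap a maybe-nil result in an array and strip the nils, so callers always get a flat, nil-free list. A standalone sketch of the same idea (the `item_from` stand-in is hypothetical, not the gem's method):

```ruby
# Stand-in for item_from_content: returns nil for item types we don't handle.
def item_from(content)
  content['itemType'] == 'TimelineTweet' ? content['id'] : nil
end

items = [
  { 'itemType' => 'TimelineTweet', 'id' => 1 },
  { 'itemType' => 'TimelineUser',  'id' => 2 },
  { 'itemType' => 'TimelineTweet', 'id' => 3 }
]

# Without .compact the mapped array would contain a nil for the user item;
# compact drops it, so downstream code can iterate safely.
result = items.map { |i| item_from(i) }.compact
# result == [1, 3]
```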
data/lib/bad_pigeon/har/har_request.rb CHANGED

```diff
@@ -1,3 +1,4 @@
+require 'addressable/uri'
 require 'json'
 
 module BadPigeon
@@ -18,6 +19,19 @@ module BadPigeon
       url.start_with?('https://api.twitter.com/graphql/') || url.start_with?('https://twitter.com/i/api/graphql/')
     end
 
+    def includes_tweet_data?
+      graphql_endpoint? && status == 200 && has_json_response?
+    end
+
+    def endpoint_name
+      Addressable::URI.parse(url).path.split('/').last
+    end
+
+    def params
+      vars = Addressable::URI.parse(url).query_values['variables']
+      vars && JSON.parse(vars) || {}
+    end
+
     def status
       @json['response']['status']
     end
@@ -27,7 +41,7 @@ module BadPigeon
     end
 
     def has_json_response?
-      mime_type == 'application/json'
+      mime_type.gsub(/;.*/, '').strip == 'application/json'
     end
 
     def response_body
@@ -37,5 +51,11 @@ module BadPigeon
     def response_json
       response_body && JSON.parse(response_body)
     end
+
+    def inspect
+      keys = [:method, :url, :status]
+      vars = keys.map { |k| "#{k}=#{self.send(k).inspect}" }.join(", ")
+      "#<#{self.class}:0x#{object_id} #{vars}>"
+    end
   end
 end
```
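Two of the changes above are easy to check in isolation: the `has_json_response?` fix strips MIME parameters such as `charset` before comparing, and `params` decodes the JSON-encoded `variables` query parameter that Twitter's GraphQL GET requests carry. A sketch using only the standard library (`URI` here instead of the addressable gem the code uses; the URL and values are illustrative):

```ruby
require 'json'
require 'uri'

# HAR exports from some browsers report "application/json; charset=utf-8",
# so the type must be normalized before comparing.
mime = 'application/json; charset=utf-8'
base = mime.gsub(/;.*/, '').strip
# base == "application/json"

# GraphQL GET requests put their arguments in a JSON-encoded "variables" param.
url = 'https://twitter.com/i/api/graphql/abc/UserTweets?variables=' +
      URI.encode_www_form_component('{"userId":"12","count":40}')
query = URI.decode_www_form(URI.parse(url).query).to_h
vars = JSON.parse(query['variables'])
# vars == { "userId" => "12", "count" => 40 }
```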
data/lib/bad_pigeon/models/tweet.rb CHANGED

```diff
@@ -73,17 +73,27 @@ module BadPigeon
     def attrs
       user = json['core']['user_results']['result']
 
-      fields =
+      fields = {
         id: id,
         source: json['source'],
         text: text,
         truncated: false,
-      }
+      }
 
-
+      legacy.each do |k, v|
+        next if ['retweeted_status_result', 'quoted_status_result'].include?(k)
+        fields[k.to_sym] = v
+      end
+
+      user_fields = {
         id: user['rest_id'].to_i,
         id_str: user['rest_id'],
-      }
+      }
+
+      user['legacy'].each do |k, v|
+        next if k =~ /^profile_\w+_extensions/
+        user_fields[k.to_sym] = v
+      end
 
       fields[:user] = StrictHash[user_fields]
```
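The merging loops added above implement a simple pattern: copy a string-keyed hash into a symbol-keyed one while skipping a denylist of keys. A standalone sketch with a made-up `legacy` hash (not real tweet data):

```ruby
legacy = {
  'full_text' => 'hello world',
  'favorite_count' => 3,
  'retweeted_status_result' => { 'big' => 'nested blob' }
}

fields = { id: 123, truncated: false }

# Copy everything except nested result blobs, converting keys to symbols.
legacy.each do |k, v|
  next if ['retweeted_status_result', 'quoted_status_result'].include?(k)
  fields[k.to_sym] = v
end
# fields.keys == [:id, :truncated, :full_text, :favorite_count]
```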
data/lib/bad_pigeon/timelines.rb CHANGED

```diff
@@ -3,8 +3,25 @@ Dir[File.join(__dir__, 'timelines', '*.rb')].each { |f| require(f) }
 
 module BadPigeon
   TIMELINE_TYPES = {
     'UserTweets' => UserTimeline,
+    'UserMedia' => UserTimeline,
     'HomeLatestTimeline' => HomeTimeline,
     'HomeTimeline' => HomeTimeline,
-    'ListLatestTweetsTimeline' => ListTimeline
+    'ListLatestTweetsTimeline' => ListTimeline,
+
+    # ignored requests:
+    'AudioSpaceById' => nil,
+    'CommunitiesTabBarItemQuery' => nil,
+    'DataSaverMode' => nil,
+    'GetUserClaims' => nil,
+    'ListByRestId' => nil,
+    'ListMembers' => nil,
+    'ListPins' => nil,
+    'ListSubscribers' => nil,
+    'ListsManagementPageTimeline' => nil,
+    'ProfileSpotlightsQuery' => nil,
+    'UserByRestId' => nil,
+    'UserByScreenName' => nil,
+    'Viewer' => nil,
+    'getAltTextPromptPreference' => nil,
   }
 end
```
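Mapping known-but-ignored endpoints to `nil` lets the extractor distinguish "known, skip silently" from "unknown, log a warning" via `has_key?`. A small sketch of that three-way lookup (with a trimmed-down stand-in for the table):

```ruby
TYPES = {
  'UserTweets' => :user_timeline,  # handled
  'Viewer' => nil                  # known but deliberately ignored
}

def classify(endpoint, types)
  if types[endpoint]
    :parse          # mapped to a timeline class
  elsif types.has_key?(endpoint)
    :ignore         # explicitly mapped to nil
  else
    :warn_unknown   # not in the table at all
  end
end

# classify('UserTweets', TYPES)   == :parse
# classify('Viewer', TYPES)       == :ignore
# classify('SomeNewQuery', TYPES) == :warn_unknown
```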
data/lib/bad_pigeon/tweet_extractor.rb CHANGED

```diff
@@ -15,23 +15,21 @@ module BadPigeon
 
     def get_tweets_from_har(har_data)
       archive = HARArchive.new(har_data)
+      requests = archive.requests.select(&:includes_tweet_data?)
 
-
-
-
-
-      entries = requests.map { |e|
-        endpoint = URI(e.url).path.split('/').last
+      timeline_entries = requests.map { |e| timeline_entries_from_request(e) }.flatten
+      timeline_entries.select { |e| @filter.include_entry?(e) }.map(&:all_tweets).flatten
+    end
 
-
-
-        elsif !TIMELINE_TYPES.has_key?(endpoint)
-          debug "Unknown endpoint: #{endpoint}"
-          []
-        end
-      }.flatten
+    def timeline_entries_from_request(request)
+      endpoint = request.endpoint_name
 
-
+      if timeline_class = TIMELINE_TYPES[endpoint]
+        timeline_class.new(request.response_json).instructions.map(&:entries).flatten
+      else
+        debug "Unknown endpoint: #{endpoint}" unless TIMELINE_TYPES.has_key?(endpoint)
+        []
+      end
     end
   end
 end
```
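The refactored `get_tweets_from_har` is a select → map → flatten pipeline; the same shape can be exercised with plain arrays (the `Request` struct here is a hypothetical stand-in, not the gem's class):

```ruby
# Stand-in request: a flag plus nested per-entry tweet lists.
Request = Struct.new(:ok, :entries) do
  def includes_tweet_data?
    ok
  end
end

requests = [
  Request.new(true,  [[:t1, :t2], [:t3]]),
  Request.new(false, [[:skipped]]),
  Request.new(true,  [[:t4]])
]

# Keep only usable requests, expand each into its entries,
# then flatten the nested tweet lists into one array.
tweets = requests
  .select(&:includes_tweet_data?)
  .map(&:entries)
  .flatten
# tweets == [:t1, :t2, :t3, :t4]
```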
data/lib/bad_pigeon/version.rb
CHANGED
metadata CHANGED

```diff
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: bad_pigeon
 version: !ruby/object:Gem::Version
-  version: 0.1.
+  version: 0.1.2
 platform: ruby
 authors:
 - Kuba Suder
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2023-
+date: 2023-07-11 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: addressable
@@ -25,10 +25,10 @@ dependencies:
 - !ruby/object:Gem::Version
   version: '2.8'
 description: "\n    BadPigeon is a Ruby gem that allows you to extract tweet data
-  from the XHR requests that the Twitter
-
-
-
+  from the XHR requests that the Twitter frontend\n    website does in user's browser.
+  The requests need to be saved into a \"HAR\" archive file from the browser's web\n
+  \   inspector tool and then that file is fed into either the appropriate Ruby class
+  or the `pigeon` command line tool.\n\n    The tool intents to be API compatible
   with the popular `twitter` gem and generate the same kind of tweet JSON\n    structure
   as is read and exported by that library.\n    "
 email:
```