twitterscraper-ruby 0.6.0 → 0.11.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: fe6db831d59218f3e701e0211487d79ef2354524610f8338a1a17e4cc426e437
4
- data.tar.gz: 34cd890b8d2837bcacb3ab7b03fb43845294eecb9d7b0c4891c048eacedbe233
3
+ metadata.gz: f4382801b03a5384095aad6a955caea438787fa2eed96e3e001237df368925a2
4
+ data.tar.gz: 6722b4edce7242b3006e5c097dd78847f36e2da7edea009e2d7b89b09f5b25ff
5
5
  SHA512:
6
- metadata.gz: 6c16c89ca290cc3c9ed5fd245c5aa26e5386c95011cfa14277e774e860359495cafec1624fba0af55de98ebbb34abb599e75e210bdbb18b3b11e49bc1527b643
7
- data.tar.gz: d54e25e0294eddf8226c0e27a1d46c6128e9066c9d04edc429b382498c0de1af7ccf2e5c1333ad2031210bbefd7f9b7edc73b9a0e79ab2bd3673674b2e648f3c
6
+ metadata.gz: 4ca72a0bbce553c38061e0362f755a5e82b47a5288108508410c19a7eef9a2514b58682e88ed1bf89654d5b89c84c41edd8a5fa34fd7d1e5fbf92b267402884a
7
+ data.tar.gz: 8853b015cb37180d6814710d971a757d08aa4ddd4579af4131e204e34bb10c80ef3139c082f17be92303d9efc2e3f8eb4ba0d15bdf4f264fb4fba0cf87ed42d7
data/.gitignore CHANGED
@@ -6,5 +6,5 @@
6
6
  /pkg/
7
7
  /spec/reports/
8
8
  /tmp/
9
-
9
+ /cache
10
10
  /.idea
@@ -1,7 +1,7 @@
1
1
  PATH
2
2
  remote: .
3
3
  specs:
4
- twitterscraper-ruby (0.6.0)
4
+ twitterscraper-ruby (0.11.0)
5
5
  nokogiri
6
6
  parallel
7
7
 
data/README.md CHANGED
@@ -1,46 +1,164 @@
1
1
  # twitterscraper-ruby
2
2
 
3
- Welcome to your new gem! In this directory, you'll find the files you need to be able to package up your Ruby library into a gem. Put your Ruby code in the file `lib/twitterscraper/ruby`. To experiment with that code, run `bin/console` for an interactive prompt.
3
+ [![Gem Version](https://badge.fury.io/rb/twitterscraper-ruby.svg)](https://badge.fury.io/rb/twitterscraper-ruby)
4
4
 
5
- TODO: Delete this and the text above, and describe your gem
5
+ A gem to scrape https://twitter.com/search. This gem is inspired by [taspinar/twitterscraper](https://github.com/taspinar/twitterscraper).
6
6
 
7
- ## Installation
8
7
 
9
- Add this line to your application's Gemfile:
8
+ ## Twitter Search API vs. twitterscraper-ruby
10
9
 
11
- ```ruby
12
- gem 'twitterscraper-ruby'
13
- ```
10
+ ### Twitter Search API
14
11
 
15
- And then execute:
12
+ - The number of tweets: 180 - 450 requests/15 minutes (18,000 - 45,000 tweets/15 minutes)
13
+ - The time window: the past 7 days
16
14
 
17
- $ bundle install
15
+ ### twitterscraper-ruby
18
16
 
19
- Or install it yourself as:
17
+ - The number of tweets: Unlimited
18
+ - The time window: from 2006-03-21 to today
20
19
 
21
- $ gem install twitterscraper-ruby
20
+
21
+ ## Installation
22
+
23
+ First install the library:
24
+
25
+ ```shell script
26
+ $ gem install twitterscraper-ruby
27
+ ```
28
+
22
29
 
23
30
  ## Usage
24
31
 
32
+ Command-line interface:
33
+
34
+ ```shell script
35
+ $ twitterscraper --query KEYWORD --start_date 2020-06-01 --end_date 2020-06-30 --lang ja \
36
+ --limit 100 --threads 10 --proxy --cache --output output.json
37
+ ```
38
+
39
+ From within Ruby:
40
+
25
41
  ```ruby
26
42
  require 'twitterscraper'
43
+
44
+ options = {
45
+ start_date: '2020-06-01',
46
+ end_date: '2020-06-30',
47
+ lang: 'ja',
48
+ limit: 100,
49
+ threads: 10,
50
+ proxy: true
51
+ }
52
+
53
+ client = Twitterscraper::Client.new
54
+ tweets = client.query_tweets(KEYWORD, options)
55
+
56
+ tweets.each do |tweet|
57
+ puts tweet.tweet_id
58
+ puts tweet.text
59
+ puts tweet.tweet_url
60
+ puts tweet.created_at
61
+
62
+ hash = tweet.attrs
63
+ puts hash.keys
64
+ end
27
65
  ```
28
66
 
29
- ## Development
30
67
 
31
- After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake test` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
68
+ ## Attributes
69
+
70
+ ### Tweet
71
+
72
+ - screen_name
73
+ - name
74
+ - user_id
75
+ - tweet_id
76
+ - text
77
+ - links
78
+ - hashtags
79
+ - image_urls
80
+ - video_url
81
+ - has_media
82
+ - likes
83
+ - retweets
84
+ - replies
85
+ - is_replied
86
+ - is_reply_to
87
+ - parent_tweet_id
88
+ - reply_to_users
89
+ - tweet_url
90
+ - created_at
91
+
92
+
93
+ ## Search operators
94
+
95
+ | Operator | Finds Tweets... |
96
+ | ------------- | ------------- |
97
+ | watching now | containing both "watching" and "now". This is the default operator. |
98
+ | "happy hour" | containing the exact phrase "happy hour". |
99
+ | love OR hate | containing either "love" or "hate" (or both). |
100
+ | beer -root | containing "beer" but not "root". |
101
+ | #haiku | containing the hashtag "haiku". |
102
+ | from:interior | sent from Twitter account "interior". |
103
+ | to:NASA | a Tweet authored in reply to Twitter account "NASA". |
104
+ | @NASA | mentioning Twitter account "NASA". |
105
+ | puppy filter:media | containing "puppy" and an image or video. |
106
+ | puppy -filter:retweets | containing "puppy", filtering out retweets. |
107
+ | superhero since:2015-12-21 | containing "superhero" and sent since date "2015-12-21" (year-month-day). |
108
+ | puppy until:2015-12-21 | containing "puppy" and sent before the date "2015-12-21". |
109
+
110
+ Search operators are documented in [Standard search operators](https://developer.twitter.com/en/docs/tweets/rules-and-filtering/overview/standard-operators).
111
+
112
+
113
+ ## Examples
114
+
115
+ ```shell script
116
+ $ twitterscraper --query twitter --limit 1000
117
+ $ cat tweets.json | jq . | less
118
+ ```
119
+
120
+ ```json
121
+ [
122
+ {
123
+ "screen_name": "@screenname",
124
+ "name": "name",
125
+ "user_id": 1194529546483000000,
126
+ "tweet_id": 1282659891992000000,
127
+ "tweet_url": "https://twitter.com/screenname/status/1282659891992000000",
128
+ "created_at": "2020-07-13 12:00:00 +0000",
129
+ "text": "Thanks Twitter!"
130
+ }
131
+ ]
132
+ ```
133
+
134
+ ## CLI Options
135
+
136
+ | Option | Description | Default |
137
+ | ------------- | ------------- | ------------- |
138
+ | `-h`, `--help` | This option displays a summary of twitterscraper. | |
139
+ | `--query` | Specify a keyword used during the search. | |
140
+ | `--start_date` | Set the date from which twitterscraper-ruby should start scraping for your query. | |
141
+ | `--end_date` | Set the end date at which twitterscraper-ruby should stop scraping for your query. | |
142
+ | `--lang` | Retrieve tweets written in a specific language. | |
143
+ | `--limit` | Stop scraping once *at least* this many tweets have been scraped. | 100 |
144
+ | `--threads` | Set the number of threads twitterscraper-ruby should initiate while scraping for your query. | 2 |
145
+ | `--proxy` | Scrape https://twitter.com/search via proxies. | false |
146
+ | `--cache` | Enable caching. | false |
147
+ | `--format` | The format of the output. | json |
148
+ | `--output` | The name of the output file. | tweets.json |
149
+ | `--verbose` | Print debug messages. | false |
32
150
 
33
- To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
34
151
 
35
152
  ## Contributing
36
153
 
37
- Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/twitterscraper-ruby. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/[USERNAME]/twitterscraper-ruby/blob/master/CODE_OF_CONDUCT.md).
154
+ Bug reports and pull requests are welcome on GitHub at https://github.com/ts-3156/twitterscraper-ruby. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/ts-3156/twitterscraper-ruby/blob/master/CODE_OF_CONDUCT.md).
38
155
 
39
156
 
40
157
  ## License
41
158
 
42
159
  The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
43
160
 
161
+
44
162
  ## Code of Conduct
45
163
 
46
- Everyone interacting in the Twitterscraper::Ruby project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/[USERNAME]/twitterscraper-ruby/blob/master/CODE_OF_CONDUCT.md).
164
+ Everyone interacting in the twitterscraper-ruby project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/ts-3156/twitterscraper-ruby/blob/master/CODE_OF_CONDUCT.md).
@@ -7,7 +7,7 @@ begin
7
7
  cli.parse
8
8
  cli.run
9
9
  rescue => e
10
- STDERR.puts e.message
10
+ STDERR.puts e.inspect
11
11
  STDERR.puts e.backtrace.join("\n")
12
12
  exit 1
13
13
  end
@@ -2,9 +2,11 @@ require 'twitterscraper/logger'
2
2
  require 'twitterscraper/proxy'
3
3
  require 'twitterscraper/http'
4
4
  require 'twitterscraper/lang'
5
+ require 'twitterscraper/cache'
5
6
  require 'twitterscraper/query'
6
7
  require 'twitterscraper/client'
7
8
  require 'twitterscraper/tweet'
9
+ require 'twitterscraper/template'
8
10
  require 'version'
9
11
 
10
12
  module Twitterscraper
@@ -0,0 +1,69 @@
1
+ require 'base64'
2
+ require 'digest/md5'
3
+
4
+ module Twitterscraper
5
+ class Cache
6
+ def initialize()
7
+ @ttl = 3600 # 1 hour
8
+ @dir = 'cache'
9
+ Dir.mkdir(@dir) unless File.exist?(@dir)
10
+ end
11
+
12
+ def read(key)
13
+ key = cache_key(key)
14
+ file = File.join(@dir, key)
15
+ entry = Entry.from_json(File.read(file))
16
+ entry.value if entry.time > Time.now - @ttl
17
+ rescue Errno::ENOENT => e
18
+ nil
19
+ end
20
+
21
+ def write(key, value)
22
+ key = cache_key(key)
23
+ entry = Entry.new(key, value, Time.now)
24
+ file = File.join(@dir, key)
25
+ File.write(file, entry.to_json)
26
+ end
27
+
28
+ def fetch(key, &block)
29
+ if (value = read(key))
30
+ value
31
+ else
32
+ yield.tap { |v| write(key, v) }
33
+ end
34
+ end
35
+
36
+ def cache_key(key)
37
+ value = key.gsub(':', '%3A').gsub('/', '%2F').gsub('?', '%3F').gsub('=', '%3D').gsub('&', '%26')
38
+ value = Digest::MD5.hexdigest(value) if value.length >= 100
39
+ value
40
+ end
41
+
42
+ class Entry < Hash
43
+ attr_reader :key, :value, :time
44
+
45
+ def initialize(key, value, time)
46
+ @key = key
47
+ @value = value
48
+ @time = time
49
+ end
50
+
51
+ def attrs
52
+ {key: @key, value: @value, time: @time}
53
+ end
54
+
55
+ def to_json
56
+ hash = attrs
57
+ hash[:value] = Base64.encode64(hash[:value])
58
+ hash.to_json
59
+ end
60
+
61
+ class << self
62
+ def from_json(text)
63
+ json = JSON.parse(text)
64
+ new(json['key'], Base64.decode64(json['value']), Time.parse(json['time']))
65
+ end
66
+ end
67
+ end
68
+ end
69
+ end
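The new `Cache` class names each cache file after a percent-encoded request URL, falling back to an MD5 digest for long keys. A standalone sketch of that key derivation (the `cache_key` name mirrors the method above):

```ruby
require 'digest/md5'

# Mirror of Cache#cache_key: percent-encode characters that are unsafe in
# file names, then hash any key that would produce an overly long name.
def cache_key(key)
  value = key.gsub(':', '%3A').gsub('/', '%2F').gsub('?', '%3F').gsub('=', '%3D').gsub('&', '%26')
  value = Digest::MD5.hexdigest(value) if value.length >= 100
  value
end

short = cache_key('https://twitter.com/search?q=ruby')
long  = cache_key('https://twitter.com/search?q=' + 'a' * 200)

puts short        # file-name-safe key
puts long.length  # MD5 digests are 32 hex characters
```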
@@ -15,7 +15,6 @@ module Twitterscraper
15
15
  print_help || return if print_help?
16
16
  print_version || return if print_version?
17
17
 
18
- client = Twitterscraper::Client.new
19
18
  query_options = {
20
19
  start_date: options['start_date'],
21
20
  end_date: options['end_date'],
@@ -24,8 +23,21 @@ module Twitterscraper
24
23
  threads: options['threads'],
25
24
  proxy: options['proxy']
26
25
  }
26
+ client = Twitterscraper::Client.new(cache: options['cache'])
27
27
  tweets = client.query_tweets(options['query'], query_options)
28
- File.write(options['output'], generate_json(tweets))
28
+ export(tweets) unless tweets.empty?
29
+ end
30
+
31
+ def export(tweets)
32
+ write_json = lambda { File.write(options['output'], generate_json(tweets)) }
33
+
34
+ if options['format'] == 'json'
35
+ write_json.call
36
+ elsif options['format'] == 'html'
37
+ File.write('tweets.html', Template.tweets_embedded_html(tweets))
38
+ else
39
+ write_json.call
40
+ end
29
41
  end
30
42
 
31
43
  def generate_json(tweets)
@@ -53,6 +65,8 @@ module Twitterscraper
53
65
  'limit:',
54
66
  'threads:',
55
67
  'output:',
68
+ 'format:',
69
+ 'cache',
56
70
  'proxy',
57
71
  'pretty',
58
72
  'verbose',
@@ -61,7 +75,8 @@ module Twitterscraper
61
75
  options['lang'] ||= ''
62
76
  options['limit'] = (options['limit'] || 100).to_i
63
77
  options['threads'] = (options['threads'] || 2).to_i
64
- options['output'] ||= 'tweets.json'
78
+ options['format'] ||= 'json'
79
+ options['output'] ||= "tweets.#{options['format']}"
65
80
 
66
81
  options
67
82
  end
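The new option defaults interact: `--format` falls back to `json`, and `--output` falls back to `tweets.<format>`. A sketch of that defaulting logic (the hash keys are illustrative):

```ruby
# Mirrors the option defaulting above: format first, then an output
# file name derived from the chosen format.
def apply_defaults(options)
  options['format'] ||= 'json'
  options['output'] ||= "tweets.#{options['format']}"
  options
end

puts apply_defaults({})                     # format and output both defaulted
puts apply_defaults({ 'format' => 'html' }) # output follows the format
```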
@@ -1,5 +1,13 @@
1
1
  module Twitterscraper
2
2
  class Client
3
3
  include Query
4
+
5
+ def initialize(cache:)
6
+ @cache = cache
7
+ end
8
+
9
+ def cache_enabled?
10
+ @cache
11
+ end
4
12
  end
5
13
  end
@@ -3,6 +3,7 @@ require 'net/http'
3
3
  require 'nokogiri'
4
4
  require 'date'
5
5
  require 'json'
6
+ require 'erb'
6
7
  require 'parallel'
7
8
 
8
9
  module Twitterscraper
@@ -41,7 +42,8 @@ module Twitterscraper
41
42
  end
42
43
  end
43
44
 
44
- def get_single_page(url, headers, proxies, timeout = 10, retries = 30)
45
+ def get_single_page(url, headers, proxies, timeout = 6, retries = 30)
46
+ return nil if stop_requested?
45
47
  Twitterscraper::Http.get(url, headers, proxies.sample, timeout)
46
48
  rescue => e
47
49
  logger.debug "query_single_page: #{e.inspect}"
@@ -54,6 +56,8 @@ module Twitterscraper
54
56
  end
55
57
 
56
58
  def parse_single_page(text, html = true)
59
+ return [nil, nil] if text.nil? || text == ''
60
+
57
61
  if html
58
62
  json_resp = nil
59
63
  items_html = text
@@ -68,12 +72,27 @@ module Twitterscraper
68
72
 
69
73
  def query_single_page(query, lang, pos, from_user = false, headers: [], proxies: [])
70
74
  logger.info("Querying #{query}")
71
- query = query.gsub(' ', '%20').gsub('#', '%23').gsub(':', '%3A').gsub('&', '%26')
75
+ query = ERB::Util.url_encode(query)
72
76
 
73
77
  url = build_query_url(query, lang, pos, from_user)
74
- logger.debug("Scraping tweets from #{url}")
78
+ http_request = lambda do
79
+ logger.debug("Scraping tweets from #{url}")
80
+ get_single_page(url, headers, proxies)
81
+ end
82
+
83
+ if cache_enabled?
84
+ client = Cache.new
85
+ if (response = client.read(url))
86
+ logger.debug('Fetching tweets from cache')
87
+ else
88
+ response = http_request.call
89
+ client.write(url, response)
90
+ end
91
+ else
92
+ response = http_request.call
93
+ end
94
+ return [], nil if response.nil?
75
95
 
76
- response = get_single_page(url, headers, proxies)
77
96
  html, json_resp = parse_single_page(response, pos.nil?)
78
97
 
79
98
  tweets = Tweet.from_html(html)
@@ -91,55 +110,112 @@ module Twitterscraper
91
110
  end
92
111
  end
93
112
 
94
- def query_tweets(query, start_date: nil, end_date: nil, lang: '', limit: 100, threads: 2, proxy: false)
95
- start_date = start_date ? Date.parse(start_date) : Date.parse('2006-3-21')
96
- end_date = end_date ? Date.parse(end_date) : Date.today
97
- if start_date == end_date
98
- raise 'Please specify different values for :start_date and :end_date.'
99
- elsif start_date > end_date
100
- raise ':start_date must occur before :end_date.'
113
+ OLDEST_DATE = Date.parse('2006-03-21')
114
+
115
+ def validate_options!(query, start_date:, end_date:, lang:, limit:, threads:, proxy:)
116
+ if query.nil? || query == ''
117
+ raise 'Please specify a search query.'
101
118
  end
102
119
 
103
- proxies = proxy ? Twitterscraper::Proxy::Pool.new : []
120
+ if ERB::Util.url_encode(query).length >= 500
121
+ raise ':query must be a UTF-8, URL-encoded search query of 500 characters maximum, including operators.'
122
+ end
104
123
 
105
- date_range = start_date.upto(end_date - 1)
106
- queries = date_range.map { |date| query + " since:#{date} until:#{date + 1}" }
107
- threads = queries.size if threads > queries.size
108
- logger.info("Threads #{threads}")
124
+ if start_date && end_date
125
+ if start_date == end_date
126
+ raise 'Please specify different values for :start_date and :end_date.'
127
+ elsif start_date > end_date
128
+ raise ':start_date must occur before :end_date.'
129
+ end
130
+ end
109
131
 
110
- headers = {'User-Agent': USER_AGENT_LIST.sample, 'X-Requested-With': 'XMLHttpRequest'}
111
- logger.info("Headers #{headers}")
132
+ if start_date
133
+ if start_date < OLDEST_DATE
134
+ raise ":start_date must be greater than or equal to #{OLDEST_DATE}"
135
+ end
136
+ end
112
137
 
113
- all_tweets = []
114
- mutex = Mutex.new
138
+ if end_date
139
+ today = Date.today
140
+ if end_date > Date.today
141
+ raise ":end_date must be less than or equal to today(#{today})"
142
+ end
143
+ end
144
+ end
115
145
 
116
- Parallel.each(queries, in_threads: threads) do |query|
146
+ def build_queries(query, start_date, end_date)
147
+ if start_date && end_date
148
+ date_range = start_date.upto(end_date - 1)
149
+ date_range.map { |date| query + " since:#{date} until:#{date + 1}" }
150
+ elsif start_date
151
+ [query + " since:#{start_date}"]
152
+ elsif end_date
153
+ [query + " until:#{end_date}"]
154
+ else
155
+ [query]
156
+ end
157
+ end
117
158
 
118
- pos = nil
159
+ def main_loop(query, lang, limit, headers, proxies)
160
+ pos = nil
119
161
 
120
- while true
121
- new_tweets, new_pos = query_single_page(query, lang, pos, headers: headers, proxies: proxies)
122
- unless new_tweets.empty?
123
- mutex.synchronize {
124
- all_tweets.concat(new_tweets)
125
- all_tweets.uniq! { |t| t.tweet_id }
126
- }
127
- end
128
- logger.info("Got #{new_tweets.size} tweets (total #{all_tweets.size}) worker=#{Parallel.worker_number}")
162
+ while true
163
+ new_tweets, new_pos = query_single_page(query, lang, pos, headers: headers, proxies: proxies)
164
+ unless new_tweets.empty?
165
+ @mutex.synchronize {
166
+ @all_tweets.concat(new_tweets)
167
+ @all_tweets.uniq! { |t| t.tweet_id }
168
+ }
169
+ end
170
+ logger.info("Got #{new_tweets.size} tweets (total #{@all_tweets.size})")
129
171
 
130
- break unless new_pos
131
- break if all_tweets.size >= limit
172
+ break unless new_pos
173
+ break if @all_tweets.size >= limit
132
174
 
133
- pos = new_pos
134
- end
175
+ pos = new_pos
176
+ end
135
177
 
136
- if all_tweets.size >= limit
137
- logger.info("Reached limit #{all_tweets.size}")
138
- raise Parallel::Break
178
+ if @all_tweets.size >= limit
179
+ logger.info("Limit reached #{@all_tweets.size}")
180
+ @stop_requested = true
181
+ end
182
+ end
183
+
184
+ def stop_requested?
185
+ @stop_requested
186
+ end
187
+
188
+ def query_tweets(query, start_date: nil, end_date: nil, lang: '', limit: 100, threads: 2, proxy: false)
189
+ start_date = Date.parse(start_date) if start_date && start_date.is_a?(String)
190
+ end_date = Date.parse(end_date) if end_date && end_date.is_a?(String)
191
+ queries = build_queries(query, start_date, end_date)
192
+ threads = queries.size if threads > queries.size
193
+ proxies = proxy ? Twitterscraper::Proxy::Pool.new : []
194
+
195
+ validate_options!(queries[0], start_date: start_date, end_date: end_date, lang: lang, limit: limit, threads: threads, proxy: proxy)
196
+
197
+ logger.info("The number of threads #{threads}")
198
+
199
+ headers = {'User-Agent': USER_AGENT_LIST.sample, 'X-Requested-With': 'XMLHttpRequest'}
200
+ logger.info("Headers #{headers}")
201
+
202
+ @all_tweets = []
203
+ @mutex = Mutex.new
204
+ @stop_requested = false
205
+
206
+ if threads > 1
207
+ Parallel.each(queries, in_threads: threads) do |query|
208
+ main_loop(query, lang, limit, headers, proxies)
209
+ raise Parallel::Break if stop_requested?
210
+ end
211
+ else
212
+ queries.each do |query|
213
+ main_loop(query, lang, limit, headers, proxies)
214
+ break if stop_requested?
139
215
  end
140
216
  end
141
217
 
142
- all_tweets.sort_by { |tweet| -tweet.created_at.to_i }
218
+ @all_tweets.sort_by { |tweet| -tweet.created_at.to_i }
143
219
  end
144
220
  end
145
221
  end
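Two of the changes in this hunk are easy to demonstrate in isolation: the ad-hoc `gsub` escaping is replaced with `ERB::Util.url_encode`, and `build_queries` slices a date range into one-day `since:`/`until:` windows. A sketch under those assumptions (the query string is illustrative):

```ruby
require 'erb'
require 'date'

# ERB::Util.url_encode percent-encodes every reserved character,
# not just the handful the old gsub chain covered.
encoded = ERB::Util.url_encode('puppy filter:media')
puts encoded

# Same date-slicing idea as build_queries: one query per day.
start_date = Date.parse('2020-06-01')
end_date   = Date.parse('2020-06-04')
queries = start_date.upto(end_date - 1).map { |d| "puppy since:#{d} until:#{d + 1}" }
puts queries
```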
@@ -0,0 +1,48 @@
1
+ module Twitterscraper
2
+ module Template
3
+ module_function
4
+
5
+ def tweets_embedded_html(tweets)
6
+ tweets_html = tweets.map { |t| EMBED_TWEET_HTML.sub('__TWEET_URL__', t.tweet_url) }
7
+ EMBED_TWEETS_HTML.sub('__TWEETS__', tweets_html.join)
8
+ end
9
+
10
+ EMBED_TWEET_HTML = <<~'HTML'
11
+ <blockquote class="twitter-tweet">
12
+ <a href="__TWEET_URL__"></a>
13
+ </blockquote>
14
+ HTML
15
+
16
+ EMBED_TWEETS_HTML = <<~'HTML'
17
+ <html>
18
+ <head>
19
+ <style type=text/css>
20
+ .twitter-tweet {
21
+ margin: 30px auto 0 auto !important;
22
+ }
23
+ </style>
24
+ <script>
25
+ window.twttr = (function(d, s, id) {
26
+ var js, fjs = d.getElementsByTagName(s)[0], t = window.twttr || {};
27
+ if (d.getElementById(id)) return t;
28
+ js = d.createElement(s);
29
+ js.id = id;
30
+ js.src = "https://platform.twitter.com/widgets.js";
31
+ fjs.parentNode.insertBefore(js, fjs);
32
+
33
+ t._e = [];
34
+ t.ready = function(f) {
35
+ t._e.push(f);
36
+ };
37
+
38
+ return t;
39
+ }(document, "script", "twitter-wjs"));
40
+ </script>
41
+ </head>
42
+ <body>
43
+ __TWEETS__
44
+ </body>
45
+ </html>
46
+ HTML
47
+ end
48
+ end
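The new `Template` module is plain string templating: each tweet URL is substituted into a blockquote placeholder, and the joined result into the page shell. A minimal sketch of the same idea (placeholder name copied from above, URL from the README example):

```ruby
# Same placeholder substitution as Template.tweets_embedded_html.
EMBED_TWEET_HTML = <<~'HTML'
  <blockquote class="twitter-tweet">
    <a href="__TWEET_URL__"></a>
  </blockquote>
HTML

urls = ['https://twitter.com/screenname/status/1282659891992000000']
tweets_html = urls.map { |u| EMBED_TWEET_HTML.sub('__TWEET_URL__', u) }
page = "<body>\n#{tweets_html.join}</body>"
puts page
```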
@@ -2,7 +2,28 @@ require 'time'
2
2
 
3
3
  module Twitterscraper
4
4
  class Tweet
5
- KEYS = [:screen_name, :name, :user_id, :tweet_id, :tweet_url, :created_at, :text]
5
+ KEYS = [
6
+ :screen_name,
7
+ :name,
8
+ :user_id,
9
+ :tweet_id,
10
+ :text,
11
+ :links,
12
+ :hashtags,
13
+ :image_urls,
14
+ :video_url,
15
+ :has_media,
16
+ :likes,
17
+ :retweets,
18
+ :replies,
19
+ :is_replied,
20
+ :is_reply_to,
21
+ :parent_tweet_id,
22
+ :reply_to_users,
23
+ :tweet_url,
24
+ :timestamp,
25
+ :created_at,
26
+ ]
6
27
  attr_reader *KEYS
7
28
 
8
29
  def initialize(attrs)
@@ -11,13 +32,25 @@ module Twitterscraper
11
32
  end
12
33
  end
13
34
 
14
- def to_json(options = {})
35
+ def attrs
15
36
  KEYS.map do |key|
16
37
  [key, send(key)]
17
- end.to_h.to_json
38
+ end.to_h
39
+ end
40
+
41
+ def to_json(options = {})
42
+ attrs.to_json
18
43
  end
19
44
 
20
45
  class << self
46
+ def from_json(text)
47
+ json = JSON.parse(text)
48
+ json.map do |tweet|
49
+ tweet['created_at'] = Time.parse(tweet['created_at'])
50
+ new(tweet)
51
+ end
52
+ end
53
+
21
54
  def from_html(text)
22
55
  html = Nokogiri::HTML(text)
23
56
  from_tweets_html(html.xpath("//li[@class[contains(., 'js-stream-item')]]/div[@class[contains(., 'js-stream-tweet')]]"))
@@ -31,15 +64,51 @@ module Twitterscraper
31
64
 
32
65
  def from_tweet_html(html)
33
66
  inner_html = Nokogiri::HTML(html.inner_html)
67
+ tweet_id = html.attr('data-tweet-id').to_i
68
+ text = inner_html.xpath("//div[@class[contains(., 'js-tweet-text-container')]]/p[@class[contains(., 'js-tweet-text')]]").first.text
69
+ links = inner_html.xpath("//a[@class[contains(., 'twitter-timeline-link')]]").map { |elem| elem.attr('data-expanded-url') }.select { |link| link && !link.include?('pic.twitter') }
70
+ image_urls = inner_html.xpath("//div[@class[contains(., 'AdaptiveMedia-photoContainer')]]").map { |elem| elem.attr('data-image-url') }
71
+ video_url = inner_html.xpath("//div[@class[contains(., 'PlayableMedia-container')]]/a").map { |elem| elem.attr('href') }[0]
72
+ has_media = !image_urls.empty? || (video_url && !video_url.empty?)
73
+
74
+ actions = inner_html.xpath("//div[@class[contains(., 'ProfileTweet-actionCountList')]]")
75
+ likes = actions.xpath("//span[@class[contains(., 'ProfileTweet-action--favorite')]]/span[@class[contains(., 'ProfileTweet-actionCount')]]").first.attr('data-tweet-stat-count').to_i || 0
76
+ retweets = actions.xpath("//span[@class[contains(., 'ProfileTweet-action--retweet')]]/span[@class[contains(., 'ProfileTweet-actionCount')]]").first.attr('data-tweet-stat-count').to_i || 0
77
+ replies = actions.xpath("//span[@class[contains(., 'ProfileTweet-action--reply u-hiddenVisually')]]/span[@class[contains(., 'ProfileTweet-actionCount')]]").first.attr('data-tweet-stat-count').to_i || 0
78
+ is_replied = replies != 0
79
+
80
+ parent_tweet_id = inner_html.xpath('//*[@data-conversation-id]').first.attr('data-conversation-id').to_i
81
+ if tweet_id == parent_tweet_id
82
+ is_reply_to = false
83
+ parent_tweet_id = nil
84
+ reply_to_users = []
85
+ else
86
+ is_reply_to = true
87
+ reply_to_users = inner_html.xpath("//div[@class[contains(., 'ReplyingToContextBelowAuthor')]]/a").map { |user| {screen_name: user.text.delete_prefix('@'), user_id: user.attr('data-user-id')} }
88
+ end
89
+
34
90
  timestamp = inner_html.xpath("//span[@class[contains(., 'js-short-timestamp')]]").first.attr('data-time').to_i
35
91
  new(
36
92
  screen_name: html.attr('data-screen-name'),
37
93
  name: html.attr('data-name'),
38
94
  user_id: html.attr('data-user-id').to_i,
39
- tweet_id: html.attr('data-tweet-id').to_i,
95
+ tweet_id: tweet_id,
96
+ text: text,
97
+ links: links,
98
+ hashtags: text.scan(/#\w+/).map { |tag| tag.delete_prefix('#') },
99
+ image_urls: image_urls,
100
+ video_url: video_url,
101
+ has_media: has_media,
102
+ likes: likes,
103
+ retweets: retweets,
104
+ replies: replies,
105
+ is_replied: is_replied,
106
+ is_reply_to: is_reply_to,
107
+ parent_tweet_id: parent_tweet_id,
108
+ reply_to_users: reply_to_users,
40
109
  tweet_url: 'https://twitter.com' + html.attr('data-permalink-path'),
110
+ timestamp: timestamp,
41
111
  created_at: Time.at(timestamp, in: '+00:00'),
42
- text: inner_html.xpath("//div[@class[contains(., 'js-tweet-text-container')]]/p[@class[contains(., 'js-tweet-text')]]").first.text,
43
112
  )
44
113
  end
45
114
  end
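The new `Tweet.from_json` simply reverses `#to_json`, re-parsing `created_at` back into a `Time`. The round-trip can be sketched without the gem (field values taken from the README example):

```ruby
require 'json'
require 'time'

# A tweet hash as it appears in the JSON output file.
dumped = [{ 'tweet_id' => 1282659891992000000, 'text' => 'Thanks Twitter!',
            'created_at' => '2020-07-13 12:00:00 +0000' }].to_json

# Restore it the way Tweet.from_json does: parse created_at into a Time.
restored = JSON.parse(dumped).map do |t|
  t['created_at'] = Time.parse(t['created_at'])
  t
end

puts restored.first['created_at'].class # Time
```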
@@ -1,3 +1,3 @@
1
1
  module Twitterscraper
2
- VERSION = '0.6.0'
2
+ VERSION = '0.11.0'
3
3
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: twitterscraper-ruby
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.6.0
4
+ version: 0.11.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - ts-3156
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2020-07-13 00:00:00.000000000 Z
11
+ date: 2020-07-15 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: nokogiri
@@ -61,6 +61,7 @@ files:
61
61
  - bin/twitterscraper
62
62
  - lib/twitterscraper-ruby.rb
63
63
  - lib/twitterscraper.rb
64
+ - lib/twitterscraper/cache.rb
64
65
  - lib/twitterscraper/cli.rb
65
66
  - lib/twitterscraper/client.rb
66
67
  - lib/twitterscraper/http.rb
@@ -68,6 +69,7 @@ files:
68
69
  - lib/twitterscraper/logger.rb
69
70
  - lib/twitterscraper/proxy.rb
70
71
  - lib/twitterscraper/query.rb
72
+ - lib/twitterscraper/template.rb
71
73
  - lib/twitterscraper/tweet.rb
72
74
  - lib/version.rb
73
75
  - twitterscraper-ruby.gemspec