twitterscraper-ruby 0.6.0 → 0.11.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: fe6db831d59218f3e701e0211487d79ef2354524610f8338a1a17e4cc426e437
-   data.tar.gz: 34cd890b8d2837bcacb3ab7b03fb43845294eecb9d7b0c4891c048eacedbe233
+   metadata.gz: f4382801b03a5384095aad6a955caea438787fa2eed96e3e001237df368925a2
+   data.tar.gz: 6722b4edce7242b3006e5c097dd78847f36e2da7edea009e2d7b89b09f5b25ff
  SHA512:
-   metadata.gz: 6c16c89ca290cc3c9ed5fd245c5aa26e5386c95011cfa14277e774e860359495cafec1624fba0af55de98ebbb34abb599e75e210bdbb18b3b11e49bc1527b643
-   data.tar.gz: d54e25e0294eddf8226c0e27a1d46c6128e9066c9d04edc429b382498c0de1af7ccf2e5c1333ad2031210bbefd7f9b7edc73b9a0e79ab2bd3673674b2e648f3c
+   metadata.gz: 4ca72a0bbce553c38061e0362f755a5e82b47a5288108508410c19a7eef9a2514b58682e88ed1bf89654d5b89c84c41edd8a5fa34fd7d1e5fbf92b267402884a
+   data.tar.gz: 8853b015cb37180d6814710d971a757d08aa4ddd4579af4131e204e34bb10c80ef3139c082f17be92303d9efc2e3f8eb4ba0d15bdf4f264fb4fba0cf87ed42d7
data/.gitignore CHANGED
@@ -6,5 +6,5 @@
  /pkg/
  /spec/reports/
  /tmp/
-
+ /cache
  /.idea
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
  PATH
    remote: .
    specs:
-     twitterscraper-ruby (0.6.0)
+     twitterscraper-ruby (0.11.0)
        nokogiri
        parallel
data/README.md CHANGED
@@ -1,46 +1,164 @@
  # twitterscraper-ruby

- Welcome to your new gem! In this directory, you'll find the files you need to be able to package up your Ruby library into a gem. Put your Ruby code in the file `lib/twitterscraper/ruby`. To experiment with that code, run `bin/console` for an interactive prompt.
+ [![Gem Version](https://badge.fury.io/rb/twitterscraper-ruby.svg)](https://badge.fury.io/rb/twitterscraper-ruby)

- TODO: Delete this and the text above, and describe your gem
+ A gem to scrape https://twitter.com/search. This gem is inspired by [taspinar/twitterscraper](https://github.com/taspinar/twitterscraper).

- ## Installation
-
- Add this line to your application's Gemfile:
+ ## Twitter Search API vs. twitterscraper-ruby

- ```ruby
- gem 'twitterscraper-ruby'
- ```
+ ### Twitter Search API

- And then execute:
+ - The number of tweets: 180 - 450 requests/15 minutes (18,000 - 45,000 tweets/15 minutes)
+ - The time window: the past 7 days

-     $ bundle install
+ ### twitterscraper-ruby

- Or install it yourself as:
+ - The number of tweets: Unlimited
+ - The time window: from 2006-03-21 to today

-     $ gem install twitterscraper-ruby
+
+ ## Installation
+
+ First install the library:
+
+ ```shell script
+ $ gem install twitterscraper-ruby
+ ```
+

  ## Usage

+ Command-line interface:
+
+ ```shell script
+ $ twitterscraper --query KEYWORD --start_date 2020-06-01 --end_date 2020-06-30 --lang ja \
+     --limit 100 --threads 10 --proxy --cache --output output.json
+ ```
+
+ From within Ruby:
+
  ```ruby
  require 'twitterscraper'
+
+ options = {
+   start_date: '2020-06-01',
+   end_date: '2020-06-30',
+   lang: 'ja',
+   limit: 100,
+   threads: 10,
+   proxy: true
+ }
+
+ client = Twitterscraper::Client.new
+ tweets = client.query_tweets(KEYWORD, options)
+
+ tweets.each do |tweet|
+   puts tweet.tweet_id
+   puts tweet.text
+   puts tweet.tweet_url
+   puts tweet.created_at
+
+   hash = tweet.attrs
+   puts hash.keys
+ end
  ```

- ## Development

- After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake test` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+ ## Attributes
+
+ ### Tweet
+
+ - screen_name
+ - name
+ - user_id
+ - tweet_id
+ - text
+ - links
+ - hashtags
+ - image_urls
+ - video_url
+ - has_media
+ - likes
+ - retweets
+ - replies
+ - is_replied
+ - is_reply_to
+ - parent_tweet_id
+ - reply_to_users
+ - tweet_url
+ - created_at
+
+
+ ## Search operators
+
+ | Operator | Finds Tweets... |
+ | ------------- | ------------- |
+ | watching now | containing both "watching" and "now". This is the default operator. |
+ | "happy hour" | containing the exact phrase "happy hour". |
+ | love OR hate | containing either "love" or "hate" (or both). |
+ | beer -root | containing "beer" but not "root". |
+ | #haiku | containing the hashtag "haiku". |
+ | from:interior | sent from Twitter account "interior". |
+ | to:NASA | a Tweet authored in reply to Twitter account "NASA". |
+ | @NASA | mentioning Twitter account "NASA". |
+ | puppy filter:media | containing "puppy" and an image or video. |
+ | puppy -filter:retweets | containing "puppy", filtering out retweets. |
+ | superhero since:2015-12-21 | containing "superhero" and sent since date "2015-12-21" (year-month-day). |
+ | puppy until:2015-12-21 | containing "puppy" and sent before the date "2015-12-21". |
+
+ Search operators are documented in [Standard search operators](https://developer.twitter.com/en/docs/tweets/rules-and-filtering/overview/standard-operators).
+
+
+ ## Examples
+
+ ```shell script
+ $ twitterscraper --query twitter --limit 1000
+ $ cat tweets.json | jq . | less
+ ```
+
+ ```json
+ [
+   {
+     "screen_name": "@screenname",
+     "name": "name",
+     "user_id": 1194529546483000000,
+     "tweet_id": 1282659891992000000,
+     "tweet_url": "https://twitter.com/screenname/status/1282659891992000000",
+     "created_at": "2020-07-13 12:00:00 +0000",
+     "text": "Thanks Twitter!"
+   }
+ ]
+ ```
+
+ ## CLI Options
+
+ | Option | Description | Default |
+ | ------------- | ------------- | ------------- |
+ | `-h`, `--help` | This option displays a summary of twitterscraper. | |
+ | `--query` | Specify a keyword used during the search. | |
+ | `--start_date` | Set the date from which twitterscraper-ruby should start scraping for your query. | |
+ | `--end_date` | Set the end date at which twitterscraper-ruby should stop scraping for your query. | |
+ | `--lang` | Retrieve tweets written in a specific language. | |
+ | `--limit` | Stop scraping when *at least* the number of tweets indicated with --limit is scraped. | 100 |
+ | `--threads` | Set the number of threads twitterscraper-ruby should initiate while scraping for your query. | 2 |
+ | `--proxy` | Scrape https://twitter.com/search via proxies. | false |
+ | `--cache` | Enable caching. | false |
+ | `--format` | The format of the output. | json |
+ | `--output` | The name of the output file. | tweets.json |
+ | `--verbose` | Print debug messages. | false |


- To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).

  ## Contributing

- Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/twitterscraper-ruby. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/[USERNAME]/twitterscraper-ruby/blob/master/CODE_OF_CONDUCT.md).
+ Bug reports and pull requests are welcome on GitHub at https://github.com/ts-3156/twitterscraper-ruby. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/ts-3156/twitterscraper-ruby/blob/master/CODE_OF_CONDUCT.md).


  ## License

  The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).

+
  ## Code of Conduct

- Everyone interacting in the Twitterscraper::Ruby project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/[USERNAME]/twitterscraper-ruby/blob/master/CODE_OF_CONDUCT.md).
+ Everyone interacting in the twitterscraper-ruby project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/ts-3156/twitterscraper-ruby/blob/master/CODE_OF_CONDUCT.md).
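
Note: in this release `Client#initialize` takes a required `cache:` keyword (see `lib/twitterscraper/client.rb` below), so the README's bare `Twitterscraper::Client.new` raises an `ArgumentError`. A working version of the README example, with an illustrative query standing in for the `KEYWORD` placeholder:

```ruby
require 'twitterscraper'

# cache: is a required keyword in 0.11.0; 'ruby' stands in for KEYWORD.
client = Twitterscraper::Client.new(cache: true)
tweets = client.query_tweets('ruby', start_date: '2020-06-01', end_date: '2020-06-30',
                             lang: 'ja', limit: 100, threads: 10, proxy: true)

tweets.each do |tweet|
  puts tweet.tweet_url
end
```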
data/bin/twitterscraper CHANGED
@@ -7,7 +7,7 @@ begin
    cli.parse
    cli.run
  rescue => e
-   STDERR.puts e.message
+   STDERR.puts e.inspect
    STDERR.puts e.backtrace.join("\n")
    exit 1
  end
data/lib/twitterscraper.rb CHANGED
@@ -2,9 +2,11 @@ require 'twitterscraper/logger'
  require 'twitterscraper/proxy'
  require 'twitterscraper/http'
  require 'twitterscraper/lang'
+ require 'twitterscraper/cache'
  require 'twitterscraper/query'
  require 'twitterscraper/client'
  require 'twitterscraper/tweet'
+ require 'twitterscraper/template'
  require 'version'

  module Twitterscraper
data/lib/twitterscraper/cache.rb ADDED
@@ -0,0 +1,69 @@
+ require 'base64'
+ require 'digest/md5'
+
+ module Twitterscraper
+   class Cache
+     def initialize()
+       @ttl = 3600 # 1 hour
+       @dir = 'cache'
+       Dir.mkdir(@dir) unless File.exist?(@dir)
+     end
+
+     def read(key)
+       key = cache_key(key)
+       file = File.join(@dir, key)
+       entry = Entry.from_json(File.read(file))
+       entry.value if entry.time > Time.now - @ttl
+     rescue Errno::ENOENT => e
+       nil
+     end
+
+     def write(key, value)
+       key = cache_key(key)
+       entry = Entry.new(key, value, Time.now)
+       file = File.join(@dir, key)
+       File.write(file, entry.to_json)
+     end
+
+     def fetch(key, &block)
+       if (value = read(key))
+         value
+       else
+         yield.tap { |v| write(key, v) }
+       end
+     end
+
+     def cache_key(key)
+       value = key.gsub(':', '%3A').gsub('/', '%2F').gsub('?', '%3F').gsub('=', '%3D').gsub('&', '%26')
+       value = Digest::MD5.hexdigest(value) if value.length >= 100
+       value
+     end
+
+     class Entry < Hash
+       attr_reader :key, :value, :time
+
+       def initialize(key, value, time)
+         @key = key
+         @value = value
+         @time = time
+       end
+
+       def attrs
+         {key: @key, value: @value, time: @time}
+       end
+
+       def to_json
+         hash = attrs
+         hash[:value] = Base64.encode64(hash[:value])
+         hash.to_json
+       end
+
+       class << self
+         def from_json(text)
+           json = JSON.parse(text)
+           new(json['key'], Base64.decode64(json['value']), Time.parse(json['time']))
+         end
+       end
+     end
+   end
+ end
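
Note: cache entries are keyed on the percent-escaped request URL (MD5-hashed when long) and expire after the hard-coded one-hour TTL. A minimal sketch of the read-through `fetch` helper; the URL and body are illustrative stand-ins for a real request:

```ruby
require 'twitterscraper'

cache = Twitterscraper::Cache.new

# First call misses, runs the block, and persists the value under cache/.
# A second fetch within the one-hour TTL returns the stored value instead.
body = cache.fetch('https://twitter.com/i/search/timeline?q=ruby') do
  '<html>response body</html>' # stand-in for an HTTP response body
end
puts body
```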
data/lib/twitterscraper/cli.rb CHANGED
@@ -15,7 +15,6 @@ module Twitterscraper
      print_help || return if print_help?
      print_version || return if print_version?

-     client = Twitterscraper::Client.new
      query_options = {
        start_date: options['start_date'],
        end_date: options['end_date'],
@@ -24,8 +23,21 @@ module Twitterscraper
        threads: options['threads'],
        proxy: options['proxy']
      }
+     client = Twitterscraper::Client.new(cache: options['cache'])
      tweets = client.query_tweets(options['query'], query_options)
-     File.write(options['output'], generate_json(tweets))
+     export(tweets) unless tweets.empty?
+   end
+
+   def export(tweets)
+     write_json = lambda { File.write(options['output'], generate_json(tweets)) }
+
+     if options['format'] == 'json'
+       write_json.call
+     elsif options['format'] == 'html'
+       File.write('tweets.html', Template.tweets_embedded_html(tweets))
+     else
+       write_json.call
+     end
    end

    def generate_json(tweets)
@@ -53,6 +65,8 @@ module Twitterscraper
        'limit:',
        'threads:',
        'output:',
+       'format:',
+       'cache',
        'proxy',
        'pretty',
        'verbose',
@@ -61,7 +75,8 @@ module Twitterscraper
      options['lang'] ||= ''
      options['limit'] = (options['limit'] || 100).to_i
      options['threads'] = (options['threads'] || 2).to_i
-     options['output'] ||= 'tweets.json'
+     options['format'] ||= 'json'
+     options['output'] ||= "tweets.#{options['format']}"

      options
    end
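
Note: with the `--format` and `--cache` options wired in above, an HTML export writes the embedded-tweets page to `tweets.html` (the HTML branch ignores `--output`). An illustrative invocation:

```shell script
$ twitterscraper --query ruby --limit 50 --cache --format html
$ open tweets.html  # or xdg-open on Linux
```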
data/lib/twitterscraper/client.rb CHANGED
@@ -1,5 +1,13 @@
  module Twitterscraper
    class Client
      include Query
+
+     def initialize(cache:)
+       @cache = cache
+     end
+
+     def cache_enabled?
+       @cache
+     end
    end
  end
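
Note: `Query#query_single_page` checks `cache_enabled?` before each page request, so caching is fixed per client instance, and the keyword has no default:

```ruby
require 'twitterscraper'

client = Twitterscraper::Client.new(cache: true)
client.cache_enabled? # => true

# Twitterscraper::Client.new  # raises ArgumentError: missing keyword :cache
```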
data/lib/twitterscraper/query.rb CHANGED
@@ -3,6 +3,7 @@ require 'net/http'
  require 'nokogiri'
  require 'date'
  require 'json'
+ require 'erb'
  require 'parallel'

  module Twitterscraper
@@ -41,7 +42,8 @@ module Twitterscraper
      end
    end

-   def get_single_page(url, headers, proxies, timeout = 10, retries = 30)
+   def get_single_page(url, headers, proxies, timeout = 6, retries = 30)
+     return nil if stop_requested?
      Twitterscraper::Http.get(url, headers, proxies.sample, timeout)
    rescue => e
      logger.debug "query_single_page: #{e.inspect}"
@@ -54,6 +56,8 @@ module Twitterscraper
    end

    def parse_single_page(text, html = true)
+     return [nil, nil] if text.nil? || text == ''
+
      if html
        json_resp = nil
        items_html = text
@@ -68,12 +72,27 @@ module Twitterscraper

    def query_single_page(query, lang, pos, from_user = false, headers: [], proxies: [])
      logger.info("Querying #{query}")
-     query = query.gsub(' ', '%20').gsub('#', '%23').gsub(':', '%3A').gsub('&', '%26')
+     query = ERB::Util.url_encode(query)

      url = build_query_url(query, lang, pos, from_user)
-     logger.debug("Scraping tweets from #{url}")
+     http_request = lambda do
+       logger.debug("Scraping tweets from #{url}")
+       get_single_page(url, headers, proxies)
+     end
+
+     if cache_enabled?
+       client = Cache.new
+       if (response = client.read(url))
+         logger.debug('Fetching tweets from cache')
+       else
+         response = http_request.call
+         client.write(url, response)
+       end
+     else
+       response = http_request.call
+     end
+     return [], nil if response.nil?

-     response = get_single_page(url, headers, proxies)
      html, json_resp = parse_single_page(response, pos.nil?)

      tweets = Tweet.from_html(html)
@@ -91,55 +110,112 @@ module Twitterscraper
      end
    end

-   def query_tweets(query, start_date: nil, end_date: nil, lang: '', limit: 100, threads: 2, proxy: false)
-     start_date = start_date ? Date.parse(start_date) : Date.parse('2006-3-21')
-     end_date = end_date ? Date.parse(end_date) : Date.today
-     if start_date == end_date
-       raise 'Please specify different values for :start_date and :end_date.'
-     elsif start_date > end_date
-       raise ':start_date must occur before :end_date.'
+   OLDEST_DATE = Date.parse('2006-03-21')
+
+   def validate_options!(query, start_date:, end_date:, lang:, limit:, threads:, proxy:)
+     if query.nil? || query == ''
+       raise 'Please specify a search query.'
      end

-     proxies = proxy ? Twitterscraper::Proxy::Pool.new : []
+     if ERB::Util.url_encode(query).length >= 500
+       raise ':query must be a UTF-8, URL-encoded search query of 500 characters maximum, including operators.'
+     end

-     date_range = start_date.upto(end_date - 1)
-     queries = date_range.map { |date| query + " since:#{date} until:#{date + 1}" }
-     threads = queries.size if threads > queries.size
-     logger.info("Threads #{threads}")
+     if start_date && end_date
+       if start_date == end_date
+         raise 'Please specify different values for :start_date and :end_date.'
+       elsif start_date > end_date
+         raise ':start_date must occur before :end_date.'
+       end
+     end

-     headers = {'User-Agent': USER_AGENT_LIST.sample, 'X-Requested-With': 'XMLHttpRequest'}
-     logger.info("Headers #{headers}")
+     if start_date
+       if start_date < OLDEST_DATE
+         raise ":start_date must be greater than or equal to #{OLDEST_DATE}"
+       end
+     end

-     all_tweets = []
-     mutex = Mutex.new
+     if end_date
+       today = Date.today
+       if end_date > Date.today
+         raise ":end_date must be less than or equal to today(#{today})"
+       end
+     end
+   end

-     Parallel.each(queries, in_threads: threads) do |query|
+   def build_queries(query, start_date, end_date)
+     if start_date && end_date
+       date_range = start_date.upto(end_date - 1)
+       date_range.map { |date| query + " since:#{date} until:#{date + 1}" }
+     elsif start_date
+       [query + " since:#{start_date}"]
+     elsif end_date
+       [query + " until:#{end_date}"]
+     else
+       [query]
+     end
+   end

-       pos = nil
+   def main_loop(query, lang, limit, headers, proxies)
+     pos = nil

-       while true
-         new_tweets, new_pos = query_single_page(query, lang, pos, headers: headers, proxies: proxies)
-         unless new_tweets.empty?
-           mutex.synchronize {
-             all_tweets.concat(new_tweets)
-             all_tweets.uniq! { |t| t.tweet_id }
-           }
-         end
-         logger.info("Got #{new_tweets.size} tweets (total #{all_tweets.size}) worker=#{Parallel.worker_number}")
+     while true
+       new_tweets, new_pos = query_single_page(query, lang, pos, headers: headers, proxies: proxies)
+       unless new_tweets.empty?
+         @mutex.synchronize {
+           @all_tweets.concat(new_tweets)
+           @all_tweets.uniq! { |t| t.tweet_id }
+         }
+       end
+       logger.info("Got #{new_tweets.size} tweets (total #{@all_tweets.size})")

-         break unless new_pos
-         break if all_tweets.size >= limit
+       break unless new_pos
+       break if @all_tweets.size >= limit

-         pos = new_pos
-       end
+       pos = new_pos
+     end

-       if all_tweets.size >= limit
-         logger.info("Reached limit #{all_tweets.size}")
-         raise Parallel::Break
+     if @all_tweets.size >= limit
+       logger.info("Limit reached #{@all_tweets.size}")
+       @stop_requested = true
+     end
+   end
+
+   def stop_requested?
+     @stop_requested
+   end
+
+   def query_tweets(query, start_date: nil, end_date: nil, lang: '', limit: 100, threads: 2, proxy: false)
+     start_date = Date.parse(start_date) if start_date && start_date.is_a?(String)
+     end_date = Date.parse(end_date) if end_date && end_date.is_a?(String)
+     queries = build_queries(query, start_date, end_date)
+     threads = queries.size if threads > queries.size
+     proxies = proxy ? Twitterscraper::Proxy::Pool.new : []
+
+     validate_options!(queries[0], start_date: start_date, end_date: end_date, lang: lang, limit: limit, threads: threads, proxy: proxy)
+
+     logger.info("The number of threads #{threads}")
+
+     headers = {'User-Agent': USER_AGENT_LIST.sample, 'X-Requested-With': 'XMLHttpRequest'}
+     logger.info("Headers #{headers}")
+
+     @all_tweets = []
+     @mutex = Mutex.new
+     @stop_requested = false
+
+     if threads > 1
+       Parallel.each(queries, in_threads: threads) do |query|
+         main_loop(query, lang, limit, headers, proxies)
+         raise Parallel::Break if stop_requested?
+       end
+     else
+       queries.each do |query|
+         main_loop(query, lang, limit, headers, proxies)
+         break if stop_requested?
        end
      end

-     all_tweets.sort_by { |tweet| -tweet.created_at.to_i }
+     @all_tweets.sort_by { |tweet| -tweet.created_at.to_i }
    end
  end
  end
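
Note: `build_queries` now handles open-ended ranges: two dates produce one day-sliced query per day, a single date produces one `since:` or `until:` query, and no dates fall through to the bare query. A quick check (the query string is illustrative):

```ruby
require 'twitterscraper'
require 'date'

client = Twitterscraper::Client.new(cache: false)

client.build_queries('ruby', Date.parse('2020-06-01'), Date.parse('2020-06-03'))
# => ["ruby since:2020-06-01 until:2020-06-02", "ruby since:2020-06-02 until:2020-06-03"]

client.build_queries('ruby', Date.parse('2020-06-01'), nil)
# => ["ruby since:2020-06-01"]
```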
data/lib/twitterscraper/template.rb ADDED
@@ -0,0 +1,48 @@
+ module Twitterscraper
+   module Template
+     module_function
+
+     def tweets_embedded_html(tweets)
+       tweets_html = tweets.map { |t| EMBED_TWEET_HTML.sub('__TWEET_URL__', t.tweet_url) }
+       EMBED_TWEETS_HTML.sub('__TWEETS__', tweets_html.join)
+     end
+
+     EMBED_TWEET_HTML = <<~'HTML'
+       <blockquote class="twitter-tweet">
+         <a href="__TWEET_URL__"></a>
+       </blockquote>
+     HTML
+
+     EMBED_TWEETS_HTML = <<~'HTML'
+       <html>
+         <head>
+           <style type=text/css>
+             .twitter-tweet {
+               margin: 30px auto 0 auto !important;
+             }
+           </style>
+           <script>
+             window.twttr = (function(d, s, id) {
+               var js, fjs = d.getElementsByTagName(s)[0], t = window.twttr || {};
+               if (d.getElementById(id)) return t;
+               js = d.createElement(s);
+               js.id = id;
+               js.src = "https://platform.twitter.com/widgets.js";
+               fjs.parentNode.insertBefore(js, fjs);
+
+               t._e = [];
+               t.ready = function(f) {
+                 t._e.push(f);
+               };
+
+               return t;
+             }(document, "script", "twitter-wjs"));
+           </script>
+         </head>
+         <body>
+           __TWEETS__
+         </body>
+       </html>
+     HTML
+   end
+ end
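
Note: `module_function` makes `tweets_embedded_html` callable directly on the module, so the HTML export can also be driven from Ruby. A sketch with an illustrative query:

```ruby
require 'twitterscraper'

client = Twitterscraper::Client.new(cache: true)
tweets = client.query_tweets('ruby', limit: 10)

# Each tweet becomes an embedded-tweet blockquote; widgets.js renders them.
File.write('tweets.html', Twitterscraper::Template.tweets_embedded_html(tweets))
```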
data/lib/twitterscraper/tweet.rb CHANGED
@@ -2,7 +2,28 @@ require 'time'

  module Twitterscraper
    class Tweet
-     KEYS = [:screen_name, :name, :user_id, :tweet_id, :tweet_url, :created_at, :text]
+     KEYS = [
+       :screen_name,
+       :name,
+       :user_id,
+       :tweet_id,
+       :text,
+       :links,
+       :hashtags,
+       :image_urls,
+       :video_url,
+       :has_media,
+       :likes,
+       :retweets,
+       :replies,
+       :is_replied,
+       :is_reply_to,
+       :parent_tweet_id,
+       :reply_to_users,
+       :tweet_url,
+       :timestamp,
+       :created_at,
+     ]
      attr_reader *KEYS

      def initialize(attrs)
@@ -11,13 +32,25 @@ module Twitterscraper
        end
      end

-     def to_json(options = {})
+     def attrs
        KEYS.map do |key|
          [key, send(key)]
-       end.to_h.to_json
+       end.to_h
+     end
+
+     def to_json(options = {})
+       attrs.to_json
      end

      class << self
+       def from_json(text)
+         json = JSON.parse(text)
+         json.map do |tweet|
+           tweet['created_at'] = Time.parse(tweet['created_at'])
+           new(tweet)
+         end
+       end
+
        def from_html(text)
          html = Nokogiri::HTML(text)
          from_tweets_html(html.xpath("//li[@class[contains(., 'js-stream-item')]]/div[@class[contains(., 'js-stream-tweet')]]"))
@@ -31,15 +64,51 @@ module Twitterscraper

        def from_tweet_html(html)
          inner_html = Nokogiri::HTML(html.inner_html)
+         tweet_id = html.attr('data-tweet-id').to_i
+         text = inner_html.xpath("//div[@class[contains(., 'js-tweet-text-container')]]/p[@class[contains(., 'js-tweet-text')]]").first.text
+         links = inner_html.xpath("//a[@class[contains(., 'twitter-timeline-link')]]").map { |elem| elem.attr('data-expanded-url') }.select { |link| link && !link.include?('pic.twitter') }
+         image_urls = inner_html.xpath("//div[@class[contains(., 'AdaptiveMedia-photoContainer')]]").map { |elem| elem.attr('data-image-url') }
+         video_url = inner_html.xpath("//div[@class[contains(., 'PlayableMedia-container')]]/a").map { |elem| elem.attr('href') }[0]
+         has_media = !image_urls.empty? || (video_url && !video_url.empty?)
+
+         actions = inner_html.xpath("//div[@class[contains(., 'ProfileTweet-actionCountList')]]")
+         likes = actions.xpath("//span[@class[contains(., 'ProfileTweet-action--favorite')]]/span[@class[contains(., 'ProfileTweet-actionCount')]]").first.attr('data-tweet-stat-count').to_i || 0
+         retweets = actions.xpath("//span[@class[contains(., 'ProfileTweet-action--retweet')]]/span[@class[contains(., 'ProfileTweet-actionCount')]]").first.attr('data-tweet-stat-count').to_i || 0
+         replies = actions.xpath("//span[@class[contains(., 'ProfileTweet-action--reply u-hiddenVisually')]]/span[@class[contains(., 'ProfileTweet-actionCount')]]").first.attr('data-tweet-stat-count').to_i || 0
+         is_replied = replies != 0
+
+         parent_tweet_id = inner_html.xpath('//*[@data-conversation-id]').first.attr('data-conversation-id').to_i
+         if tweet_id == parent_tweet_id
+           is_reply_to = false
+           parent_tweet_id = nil
+           reply_to_users = []
+         else
+           is_reply_to = true
+           reply_to_users = inner_html.xpath("//div[@class[contains(., 'ReplyingToContextBelowAuthor')]]/a").map { |user| {screen_name: user.text.delete_prefix('@'), user_id: user.attr('data-user-id')} }
+         end
+
          timestamp = inner_html.xpath("//span[@class[contains(., 'js-short-timestamp')]]").first.attr('data-time').to_i
          new(
            screen_name: html.attr('data-screen-name'),
            name: html.attr('data-name'),
            user_id: html.attr('data-user-id').to_i,
-           tweet_id: html.attr('data-tweet-id').to_i,
+           tweet_id: tweet_id,
+           text: text,
+           links: links,
+           hashtags: text.scan(/#\w+/).map { |tag| tag.delete_prefix('#') },
+           image_urls: image_urls,
+           video_url: video_url,
+           has_media: has_media,
+           likes: likes,
+           retweets: retweets,
+           replies: replies,
+           is_replied: is_replied,
+           is_reply_to: is_reply_to,
+           parent_tweet_id: parent_tweet_id,
+           reply_to_users: reply_to_users,
            tweet_url: 'https://twitter.com' + html.attr('data-permalink-path'),
+           timestamp: timestamp,
            created_at: Time.at(timestamp, in: '+00:00'),
-           text: inner_html.xpath("//div[@class[contains(., 'js-tweet-text-container')]]/p[@class[contains(., 'js-tweet-text')]]").first.text,
          )
        end
      end
data/lib/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Twitterscraper
-   VERSION = '0.6.0'
+   VERSION = '0.11.0'
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: twitterscraper-ruby
  version: !ruby/object:Gem::Version
-   version: 0.6.0
+   version: 0.11.0
  platform: ruby
  authors:
  - ts-3156
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2020-07-13 00:00:00.000000000 Z
+ date: 2020-07-15 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: nokogiri
@@ -61,6 +61,7 @@ files:
  - bin/twitterscraper
  - lib/twitterscraper-ruby.rb
  - lib/twitterscraper.rb
+ - lib/twitterscraper/cache.rb
  - lib/twitterscraper/cli.rb
  - lib/twitterscraper/client.rb
  - lib/twitterscraper/http.rb
@@ -68,6 +69,7 @@ files:
  - lib/twitterscraper/logger.rb
  - lib/twitterscraper/proxy.rb
  - lib/twitterscraper/query.rb
+ - lib/twitterscraper/template.rb
  - lib/twitterscraper/tweet.rb
  - lib/version.rb
  - twitterscraper-ruby.gemspec