twitterscraper-ruby 0.15.0 → 0.18.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: a950fb24329aaa1020441e258a8a2144100d732142b6c227bb9b026b8bb73996
- data.tar.gz: 1f64f31e43189e2ee439f5ef6f6d54bc6ea58895adbed67cb8ddbe91af07681a
+ metadata.gz: 8e9bdefe1c4d10e6d9f1d12aeb279b2a3751c570e96e05daaf849dd423bb03bf
+ data.tar.gz: 7de97de19daeecce2837fe8e5999b6c9490ab49a18a2ab9e603bf4d039abc4b9
  SHA512:
- metadata.gz: 8573affbc9a5faa05e5e489364bb2ba0da1aa4f12af35445e5de8b1f8c399eb0575cc9f408b2ba96c3d7fd8b2a74b7dd703229053a33c1f8a883856818033cb9
- data.tar.gz: 2b2b3ad0b2dd9d089a7b6127ed1b0db21e7f4fa5f0c31e6b366d9b5ae444e2244d4200c813b7a3257f43702d2caa9f264515e701602c24f4482a746b89d41328
+ metadata.gz: 55b7e0b52b2ce44418305798ed27a677405244a48f5ad0a797e3abf7958b0581a313ebd33f3f69b891ba7454f8f5c9c0db845c9ca8be321cd27212932821776e
+ data.tar.gz: 8fe97a0dc164fc0108b8e6a35843fba19ade5fbaf4f1ee2b4a400afbd3bdbb220a49dfbef4fceb1d8ecc43df3b4f4b7bad0ee5ea94c0aac464c0477e42efb866
data/.gitignore CHANGED
@@ -8,3 +8,4 @@
  /tmp/
  /cache
  /.idea
+ .DS_Store
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
  PATH
    remote: .
    specs:
-     twitterscraper-ruby (0.15.0)
+     twitterscraper-ruby (0.18.0)
        nokogiri
        parallel
 
data/README.md CHANGED
@@ -5,15 +5,17 @@
 
  A gem to scrape https://twitter.com/search. This gem is inspired by [taspinar/twitterscraper](https://github.com/taspinar/twitterscraper).
 
+ Please feel free to ask [@ts_3156](https://twitter.com/ts_3156) if you have any questions.
+
 
  ## Twitter Search API vs. twitterscraper-ruby
 
- ### Twitter Search API
+ #### Twitter Search API
 
  - The number of tweets: 180 - 450 requests/15 minutes (18,000 - 45,000 tweets/15 minutes)
  - The time window: the past 7 days
 
- ### twitterscraper-ruby
+ #### twitterscraper-ruby
 
  - The number of tweets: Unlimited
  - The time window: from 2006-3-21 to today
@@ -30,37 +32,49 @@ $ gem install twitterscraper-ruby
 
  ## Usage
 
- Command-line interface:
+ #### Command-line interface:
+
+ Returns a collection of relevant tweets matching a specified query.
 
  ```shell script
- # Returns a collection of relevant tweets matching a specified query.
  $ twitterscraper --type search --query KEYWORD --start_date 2020-06-01 --end_date 2020-06-30 --lang ja \
  --limit 100 --threads 10 --output tweets.json
  ```
 
+ Returns a collection of the most recent tweets posted by the user indicated by the screen_name.
+
  ```shell script
- # Returns a collection of the most recent tweets posted by the user indicated by the screen_name
  $ twitterscraper --type user --query SCREEN_NAME --limit 100 --output tweets.json
  ```
 
- From Within Ruby:
+ #### From Within Ruby:
 
  ```ruby
  require 'twitterscraper'
  client = Twitterscraper::Client.new(cache: true, proxy: true)
  ```
 
+ Returns a collection of relevant tweets matching a specified query.
+
  ```ruby
- # Returns a collection of relevant tweets matching a specified query.
  tweets = client.search(KEYWORD, start_date: '2020-06-01', end_date: '2020-06-30', lang: 'ja', limit: 100, threads: 10)
  ```
 
+ Returns a collection of the most recent tweets posted by the user indicated by the screen_name.
+
  ```ruby
- # Returns a collection of the most recent tweets posted by the user indicated by the screen_name
  tweets = client.user_timeline(SCREEN_NAME, limit: 100)
  ```
 
 
+ ## Examples
+
+ ```shell script
+ $ twitterscraper --query twitter --limit 1000
+ $ cat tweets.json | jq . | less
+ ```
+
+
  ## Attributes
 
  ### Tweet
@@ -72,14 +86,44 @@ tweets.each do |tweet|
    puts tweet.tweet_url
    puts tweet.created_at
 
    hash = tweet.attrs
- puts hash.keys
+   attr_names = hash.keys
+   json = tweet.to_json
  end
  ```
 
+ ```json
+ [
+   {
+     "screen_name": "@name",
+     "name": "Name",
+     "user_id": 12340000,
+     "profile_image_url": "https://pbs.twimg.com/profile_images/1826000000/0000.png",
+     "tweet_id": 1234000000000000,
+     "text": "Thanks Twitter!",
+     "links": [],
+     "hashtags": [],
+     "image_urls": [],
+     "video_url": null,
+     "has_media": null,
+     "likes": 10,
+     "retweets": 20,
+     "replies": 0,
+     "is_replied": false,
+     "is_reply_to": false,
+     "parent_tweet_id": null,
+     "reply_to_users": [],
+     "tweet_url": "https://twitter.com/name/status/1234000000000000",
+     "timestamp": 1594793000,
+     "created_at": "2020-07-15 00:00:00 +0000"
+   }
+ ]
+ ```
+
  - screen_name
  - name
  - user_id
+ - profile_image_url
  - tweet_id
  - text
  - links
@@ -118,45 +162,25 @@ end
  Search operators documentation is in [Standard search operators](https://developer.twitter.com/en/docs/tweets/rules-and-filtering/overview/standard-operators).
 
 
- ## Examples
-
- ```shell script
- $ twitterscraper --query twitter --limit 1000
- $ cat tweets.json | jq . | less
- ```
-
- ```json
- [
-   {
-     "screen_name": "@screenname",
-     "name": "name",
-     "user_id": 1194529546483000000,
-     "tweet_id": 1282659891992000000,
-     "tweet_url": "https://twitter.com/screenname/status/1282659891992000000",
-     "created_at": "2020-07-13 12:00:00 +0000",
-     "text": "Thanks Twitter!"
-   }
- ]
- ```
-
  ## CLI Options
 
- | Option | Description | Default |
- | ------------- | ------------- | ------------- |
- | `-h`, `--help` | This option displays a summary of twitterscraper. | |
- | `--type` | Specify a search type. | search |
- | `--query` | Specify a keyword used during the search. | |
- | `--start_date` | Used as "since:yyyy-mm-dd for your query. This means "since the date". | |
- | `--end_date` | Used as "until:yyyy-mm-dd for your query. This means "before the date". | |
- | `--lang` | Retrieve tweets written in a specific language. | |
- | `--limit` | Stop scraping when *at least* the number of tweets indicated with --limit is scraped. | 100 |
- | `--order` | Sort order of the results. | desc |
- | `--threads` | Set the number of threads twitterscraper-ruby should initiate while scraping for your query. | 2 |
- | `--proxy` | Scrape https://twitter.com/search via proxies. | true |
- | `--cache` | Enable caching. | true |
- | `--format` | The format of the output. | json |
- | `--output` | The name of the output file. | tweets.json |
- | `--verbose` | Print debug messages. | tweets.json |
+ | Option | Type | Description | Value |
+ | ------------- | ------------- | ------------- | ------------- |
+ | `--help` | | This option displays a summary of twitterscraper. | |
+ | `--type` | string | Specify a search type. | search(default) or user |
+ | `--query` | string | Specify a keyword used during the search. | |
+ | `--start_date` | string | Used as "since:yyyy-mm-dd" for your query. This means "since the date". | |
+ | `--end_date` | string | Used as "until:yyyy-mm-dd" for your query. This means "before the date". | |
+ | `--lang` | string | Retrieve tweets written in a specific language. | |
+ | `--limit` | integer | Stop scraping when *at least* the number of tweets indicated with --limit is scraped. | 100 |
+ | `--order` | string | Sort order of the results. | desc(default) or asc |
+ | `--threads` | integer | Set the number of threads twitterscraper-ruby should initiate while scraping for your query. | 10 |
+ | `--threads_granularity` | string | Split the search period into per-day or per-hour queries for threading. auto picks day for periods of 28 days or more, hour otherwise. | auto(default), day or hour |
+ | `--proxy` | boolean | Scrape https://twitter.com/search via proxies. | true(default) or false |
+ | `--cache` | boolean | Enable caching. | true(default) or false |
+ | `--format` | string | The format of the output. | json(default) or html |
+ | `--output` | string | The name of the output file. | out/TYPE_tweets_DATES_QUERY.FORMAT |
+ | `--verbose` | | Print debug messages. | |
 
 
  ## Contributing
data/lib/twitterscraper.rb CHANGED
@@ -4,6 +4,7 @@ require 'twitterscraper/http'
  require 'twitterscraper/lang'
  require 'twitterscraper/cache'
  require 'twitterscraper/query'
+ require 'twitterscraper/type'
  require 'twitterscraper/client'
  require 'twitterscraper/tweet'
  require 'twitterscraper/template'
data/lib/twitterscraper/cache.rb CHANGED
@@ -4,7 +4,7 @@ require 'digest/md5'
  module Twitterscraper
    class Cache
      def initialize()
-       @ttl = 3600 # 1 hour
+       @ttl = 86400 # 1 day
        @dir = 'cache'
        Dir.mkdir(@dir) unless File.exist?(@dir)
      end
@@ -25,6 +25,12 @@ module Twitterscraper
        File.write(file, entry.to_json)
      end
 
+     def delete(key)
+       key = cache_key(key)
+       file = File.join(@dir, key)
+       File.delete(file) if File.exist?(file)
+     end
+
      def fetch(key, &block)
        if (value = read(key))
          value
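For orientation, a minimal sketch of how this cache is used elsewhere in the gem. The URL key and payload are illustrative, and it is assumed that `read` returns nil for entries older than `@ttl`, as in previous versions:

```ruby
require 'twitterscraper'

cache = Twitterscraper::Cache.new
url = 'https://twitter.com/i/search/timeline?q=ruby'  # hypothetical cache key
response = '{"items_html": "..."}'                    # hypothetical cached payload

cache.write(url, response)  # entries now live for 86400 seconds (1 day) instead of 1 hour
cache.read(url)             # the cached payload, or nil once the TTL has expired
cache.delete(url)           # new in this release: drop a single entry by key
```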
data/lib/twitterscraper/cli.rb CHANGED
@@ -24,19 +24,22 @@ module Twitterscraper
          daily_limit: options['daily_limit'],
          order: options['order'],
          threads: options['threads'],
+         threads_granularity: options['threads_granularity'],
        }
        client = Twitterscraper::Client.new(cache: options['cache'], proxy: options['proxy'])
        tweets = client.query_tweets(options['query'], query_options)
-       export(tweets) unless tweets.empty?
+       export(options['query'], tweets) unless tweets.empty?
      end
 
-     def export(tweets)
-       write_json = lambda { File.write(options['output'], generate_json(tweets)) }
+     def export(name, tweets)
+       filepath = options['output']
+       Dir.mkdir(File.dirname(filepath)) unless File.exist?(File.dirname(filepath))
+       write_json = lambda { File.write(filepath, generate_json(tweets)) }
 
        if options['format'] == 'json'
          write_json.call
        elsif options['format'] == 'html'
-         File.write('tweets.html', Template.tweets_embedded_html(tweets))
+         File.write(filepath, Template.new.tweets_embedded_html(name, tweets, options))
        else
          write_json.call
        end
@@ -69,6 +72,7 @@ module Twitterscraper
        'daily_limit:',
        'order:',
        'threads:',
+       'threads_granularity:',
        'output:',
        'format:',
        'cache:',
@@ -82,10 +86,11 @@ module Twitterscraper
        options['lang'] ||= ''
        options['limit'] = (options['limit'] || 100).to_i
        options['daily_limit'] = options['daily_limit'].to_i if options['daily_limit']
-       options['threads'] = (options['threads'] || 2).to_i
+       options['threads'] = (options['threads'] || 10).to_i
+       options['threads_granularity'] ||= 'auto'
        options['format'] ||= 'json'
        options['order'] ||= 'desc'
-       options['output'] ||= "tweets.#{options['format']}"
+       options['output'] ||= build_output_name(options)
 
        options['cache'] = options['cache'] != 'false'
        options['proxy'] = options['proxy'] != 'false'
@@ -93,6 +98,12 @@ module Twitterscraper
        options
      end
 
+     def build_output_name(options)
+       query = options['query'].gsub(/[ :?#&]/, '_')
+       date = [options['start_date'], options['end_date']].select { |val| val && !val.empty? }.join('_')
+       File.join('out', [options['type'], 'tweets', date, query].compact.join('_') + '.' + options['format'])
+     end
+
      def initialize_logger
        Twitterscraper.logger.level = ::Logger::DEBUG if options['verbose']
      end
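A worked example of the new default output name (the option values are illustrative; `build_output_name` is an internal CLI helper):

```ruby
options = {
  'type' => 'search', 'query' => 'twitter lang:ja', 'format' => 'json',
  'start_date' => '2020-06-01', 'end_date' => '2020-06-30',
}

# spaces and the characters :?#& in the query become underscores,
# and the pieces are joined into a path under out/
build_output_name(options)
#=> "out/search_tweets_2020-06-01_2020-06-30_twitter_lang_ja.json"
```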
data/lib/twitterscraper/query.rb CHANGED
@@ -27,8 +27,8 @@ module Twitterscraper
        'include_available_features=1&include_entities=1&' +
        'max_position=__POS__&reset_error_state=false'
 
-   def build_query_url(query, lang, from_user, pos)
-     if from_user
+   def build_query_url(query, lang, type, pos)
+     if type.user?
        if pos
          RELOAD_URL_USER.sub('__USER__', query).sub('__POS__', pos.to_s)
        else
@@ -51,7 +51,7 @@ module Twitterscraper
      end
      Http.get(url, headers, proxy, timeout)
    rescue => e
-     logger.debug "query_single_page: #{e.inspect}"
+     logger.debug "get_single_page: #{e.inspect}"
      if (retries -= 1) > 0
        logger.info "Retrying... (Attempts left: #{retries - 1})"
        retry
@@ -69,7 +69,6 @@ module Twitterscraper
      else
        json_resp = JSON.parse(text)
        items_html = json_resp['items_html'] || ''
-       logger.warn json_resp['message'] if json_resp['message'] # Sorry, you are rate limited.
      end
 
      [items_html, json_resp]
@@ -77,22 +76,26 @@
    def query_single_page(query, lang, type, pos, headers: [], proxies: [])
      logger.info "Querying #{query}"
-     query = ERB::Util.url_encode(query)
+     encoded_query = ERB::Util.url_encode(query)
 
-     url = build_query_url(query, lang, type == 'user', pos)
+     url = build_query_url(encoded_query, lang, type, pos)
      http_request = lambda do
-       logger.debug "Scraping tweets from #{url}"
+       logger.debug "Scraping tweets from url=#{url}"
        get_single_page(url, headers, proxies)
      end
 
      if cache_enabled?
        client = Cache.new
        if (response = client.read(url))
-         logger.debug 'Fetching tweets from cache'
+         logger.debug "Fetching tweets from cache url=#{url}"
        else
          response = http_request.call
          client.write(url, response) unless stop_requested?
        end
+       if @queries && query == @queries.last && pos.nil?
+         logger.debug "Delete a cache query=#{query}"
+         client.delete(url)
+       end
      else
        response = http_request.call
      end
@@ -100,6 +103,12 @@
 
      html, json_resp = parse_single_page(response, pos.nil?)
 
+     if json_resp && json_resp['message']
+       logger.warn json_resp['message'] # Sorry, you are rate limited.
+       @stop_requested = true
+       Cache.new.delete(url) if cache_enabled?
+     end
+
      tweets = Tweet.from_html(html)
 
      if tweets.empty?
@@ -108,7 +117,7 @@ module Twitterscraper
 
      if json_resp
        [tweets, json_resp['min_position']]
-     elsif type
+     elsif type.user?
        [tweets, tweets[-1].tweet_id]
      else
        [tweets, "TWEET-#{tweets[-1].tweet_id}-#{tweets[0].tweet_id}"]
@@ -140,19 +149,33 @@ module Twitterscraper
          raise Error.new(":start_date must be greater than or equal to #{OLDEST_DATE}")
        end
      end
-
-     if end_date
-       today = Date.today
-       if end_date > Date.today
-         raise Error.new(":end_date must be less than or equal to today(#{today})")
-       end
-     end
    end
 
-   def build_queries(query, start_date, end_date)
+   def build_queries(query, start_date, end_date, threads_granularity)
      if start_date && end_date
-       date_range = start_date.upto(end_date - 1)
-       date_range.map { |date| query + " since:#{date} until:#{date + 1}" }
+       if threads_granularity == 'auto'
+         threads_granularity = start_date.upto(end_date - 1).to_a.size >= 28 ? 'day' : 'hour'
+       end
+
+       if threads_granularity == 'day'
+         date_range = start_date.upto(end_date - 1)
+         queries = date_range.map { |date| query + " since:#{date} until:#{date + 1}" }
+       else
+         time = Time.utc(start_date.year, start_date.month, start_date.day, 0, 0, 0)
+         end_time = Time.utc(end_date.year, end_date.month, end_date.day, 0, 0, 0)
+         queries = []
+
+         while true
+           if time < Time.now.utc
+             queries << (query + " since:#{time.strftime('%Y-%m-%d_%H:00:00')}_UTC until:#{(time + 3600).strftime('%Y-%m-%d_%H:00:00')}_UTC")
+           end
+           time += 3600
+           break if time >= end_time
+         end
+       end
+
+       @queries = queries
+
      elsif start_date
        [query + " since:#{start_date}"]
      elsif end_date
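To illustrate the new `threads_granularity` behavior with made-up dates: a range shorter than 28 days is split into hourly sub-queries, a longer one into daily sub-queries:

```ruby
# 'auto' resolves to 'hour' here (1 day < 28 days), yielding 24 sub-queries:
build_queries('ruby', Date.new(2020, 6, 1), Date.new(2020, 6, 2), 'auto')
#=> ["ruby since:2020-06-01_00:00:00_UTC until:2020-06-01_01:00:00_UTC",
#    "ruby since:2020-06-01_01:00:00_UTC until:2020-06-01_02:00:00_UTC",
#    ...]

# 'auto' resolves to 'day' here (29 days >= 28), yielding one query per day:
build_queries('ruby', Date.new(2020, 6, 1), Date.new(2020, 6, 30), 'auto')
#=> ["ruby since:2020-06-01 until:2020-06-02", ...]
```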
@@ -196,12 +219,18 @@ module Twitterscraper
      @stop_requested
    end
 
-   def query_tweets(query, type: 'search', start_date: nil, end_date: nil, lang: nil, limit: 100, daily_limit: nil, order: 'desc', threads: 2)
-     start_date = Date.parse(start_date) if start_date && start_date.is_a?(String)
-     end_date = Date.parse(end_date) if end_date && end_date.is_a?(String)
-     queries = build_queries(query, start_date, end_date)
+   def query_tweets(query, type: 'search', start_date: nil, end_date: nil, lang: nil, limit: 100, daily_limit: nil, order: 'desc', threads: 10, threads_granularity: 'auto')
+     type = Type.new(type)
+     if type.search?
+       start_date = Date.parse(start_date) if start_date && start_date.is_a?(String)
+       end_date = Date.parse(end_date) if end_date && end_date.is_a?(String)
+     elsif type.user?
+       start_date = nil
+       end_date = nil
+     end
+
+     queries = build_queries(query, start_date, end_date, threads_granularity)
      if threads > queries.size
-       logger.warn 'The maximum number of :threads is the number of dates between :start_date and :end_date.'
        threads = queries.size
      end
      if proxy_enabled?
@@ -213,9 +242,9 @@ module Twitterscraper
      end
      logger.debug "Cache #{cache_enabled? ? 'enabled' : 'disabled'}"
 
-
      validate_options!(queries, type: type, start_date: start_date, end_date: end_date, lang: lang, limit: limit, threads: threads)
 
+     logger.info "The number of queries #{queries.size}"
      logger.info "The number of threads #{threads}"
 
      headers = {'User-Agent': USER_AGENT_LIST.sample, 'X-Requested-With': 'XMLHttpRequest'}
@@ -240,15 +269,17 @@ module Twitterscraper
        end
      end
 
+     logger.info "Return #{@all_tweets.size} tweets"
+
      @all_tweets.sort_by { |tweet| (order == 'desc' ? -1 : 1) * tweet.created_at.to_i }
    end
 
-   def search(query, start_date: nil, end_date: nil, lang: '', limit: 100, daily_limit: nil, order: 'desc', threads: 2)
-     query_tweets(query, type: 'search', start_date: start_date, end_date: end_date, lang: lang, limit: limit, daily_limit: daily_limit, order: order, threads: threads)
+   def search(query, start_date: nil, end_date: nil, lang: '', limit: 100, daily_limit: nil, order: 'desc', threads: 10, threads_granularity: 'auto')
+     query_tweets(query, type: 'search', start_date: start_date, end_date: end_date, lang: lang, limit: limit, daily_limit: daily_limit, order: order, threads: threads, threads_granularity: threads_granularity)
    end
 
    def user_timeline(screen_name, limit: 100, order: 'desc')
-     query_tweets(screen_name, type: 'user', start_date: nil, end_date: nil, lang: nil, limit: limit, daily_limit: nil, order: order, threads: 1)
+     query_tweets(screen_name, type: 'user', start_date: nil, end_date: nil, lang: nil, limit: limit, daily_limit: nil, order: order, threads: 1, threads_granularity: nil)
    end
  end
  end
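For reference, the new keyword arguments surface in the public API like this (a sketch based on the signatures above; the query values are placeholders):

```ruby
require 'twitterscraper'

client = Twitterscraper::Client.new(cache: true, proxy: true)

# 'day' builds one sub-query per day, 'hour' one per hour,
# and 'auto' (the default) chooses based on the length of the period.
tweets = client.search('KEYWORD',
                       start_date: '2020-06-01', end_date: '2020-06-30',
                       threads: 10, threads_granularity: 'hour')

# user timelines are paged through a single thread, so granularity does not apply
tweets = client.user_timeline('SCREEN_NAME', limit: 100)
```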
data/lib/twitterscraper/template.rb CHANGED
@@ -1,48 +1,59 @@
  module Twitterscraper
-   module Template
-     module_function
+   class Template
+     def tweets_embedded_html(name, tweets, options)
+       path = File.join(File.dirname(__FILE__), 'template/tweets.html.erb')
+       template = ERB.new(File.read(path))
 
-     def tweets_embedded_html(tweets)
-       tweets_html = tweets.map { |t| EMBED_TWEET_HTML.sub('__TWEET_URL__', t.tweet_url) }
-       EMBED_TWEETS_HTML.sub('__TWEETS__', tweets_html.join)
+       tweets = tweets.sort_by { |t| t.created_at.to_i }
+
+       template.result_with_hash(
+         chart_name: name,
+         chart_data: chart_data(tweets).to_json,
+         first_tweet: tweets[0],
+         last_tweet: tweets[-1],
+         tweets: tweets,
+         convert_limit: 30,
+       )
      end
 
-     EMBED_TWEET_HTML = <<~'HTML'
-       <blockquote class="twitter-tweet">
-         <a href="__TWEET_URL__"></a>
-       </blockquote>
-     HTML
-
-     EMBED_TWEETS_HTML = <<~'HTML'
-       <html>
-         <head>
-           <style type=text/css>
-             .twitter-tweet {
-               margin: 30px auto 0 auto !important;
-             }
-           </style>
-           <script>
-             window.twttr = (function(d, s, id) {
-               var js, fjs = d.getElementsByTagName(s)[0], t = window.twttr || {};
-               if (d.getElementById(id)) return t;
-               js = d.createElement(s);
-               js.id = id;
-               js.src = "https://platform.twitter.com/widgets.js";
-               fjs.parentNode.insertBefore(js, fjs);
-
-               t._e = [];
-               t.ready = function(f) {
-                 t._e.push(f);
-               };
-
-               return t;
-             }(document, "script", "twitter-wjs"));
-           </script>
-         </head>
-         <body>
-           __TWEETS__
-         </body>
-       </html>
-     HTML
+     def chart_data(tweets, trimming: true, smoothing: true)
+       min_interval = 5
+
+       data = tweets.each_with_object(Hash.new(0)) do |tweet, memo|
+         t = tweet.created_at
+         min = (t.min.to_f / min_interval).floor * min_interval
+         time = Time.new(t.year, t.month, t.day, t.hour, min, 0, '+00:00')
+         memo[time.to_i] += 1
+       end
+
+       if false && trimming
+         data.keys.sort.each.with_index do |timestamp, i|
+           break if data.size - 1 == i
+           if data[i] == 0 && data[i + 1] == 0
+             data.delete(timestamp)
+           end
+         end
+       end
+
+       if false && smoothing
+         time = data.keys.min
+         max_time = data.keys.max
+         sec_interval = 60 * min_interval
+
+         while true
+           next_time = time + sec_interval
+           break if next_time + sec_interval > max_time
+
+           unless data.has_key?(next_time)
+             data[next_time] = (data[time] + data[next_time + sec_interval]) / 2
+           end
+           time = next_time
+         end
+       end
+
+       data.sort_by { |k, _| k }.map do |timestamp, count|
+         [timestamp * 1000, count]
+       end
+     end
    end
  end
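A sketch of what `chart_data` produces; the struct below is a hypothetical stand-in for `Twitterscraper::Tweet`, since only `created_at` is read. Note the `trimming` and `smoothing` branches above are currently dead code behind the `if false && ...` guards:

```ruby
FakeTweet = Struct.new(:created_at)
tweets = [
  FakeTweet.new(Time.utc(2020, 7, 15, 12, 3)),
  FakeTweet.new(Time.utc(2020, 7, 15, 12, 4)),
]

# tweets are counted into 5-minute buckets and emitted as
# [epoch_milliseconds, count] pairs for the Highcharts series
Twitterscraper::Template.new.chart_data(tweets)
#=> [[1594814400000, 2]]  # both tweets land in the 12:00 UTC bucket
```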
data/lib/twitterscraper/template/tweets.html.erb ADDED
@@ -0,0 +1,109 @@
+ <html>
+   <head>
+     <script src="https://cdnjs.cloudflare.com/ajax/libs/moment.js/2.27.0/moment.min.js" integrity="sha512-rmZcZsyhe0/MAjquhTgiUcb4d9knaFc7b5xAfju483gbEXTkeJRUMIPk6s3ySZMYUHEcjKbjLjyddGWMrNEvZg==" crossorigin="anonymous"></script>
+     <script src="https://cdnjs.cloudflare.com/ajax/libs/moment-timezone/0.5.31/moment-timezone-with-data.min.js" integrity="sha512-HZcf3uHWA+Y2P5KNv+F/xa87/flKVP92kUTe/KXjU8URPshczF1Dx+cL5bw0VBGhmqWAK0UbhcqxBbyiNtAnWQ==" crossorigin="anonymous"></script>
+     <script src="https://code.highcharts.com/stock/highstock.js"></script>
+     <script>
+       function updateTweets() {
+         window.twttr = (function (d, s, id) {
+           var js, fjs = d.getElementsByTagName(s)[0], t = window.twttr || {};
+           if (d.getElementById(id)) return t;
+           js = d.createElement(s);
+           js.id = id;
+           js.src = "https://platform.twitter.com/widgets.js";
+           fjs.parentNode.insertBefore(js, fjs);
+
+           t._e = [];
+           t.ready = function (f) {
+             t._e.push(f);
+           };
+
+           return t;
+         }(document, "script", "twitter-wjs"));
+       }
+
+       function drawChart() {
+         Highcharts.setOptions({
+           time: {
+             timezone: moment.tz.guess()
+           }
+         });
+
+         var data = <%= chart_data %>;
+         var config = {
+           title: {
+             text: '<%= tweets.size %> tweets of <%= chart_name %>'
+           },
+           subtitle: {
+             text: 'since:<%= first_tweet.created_at.localtime.strftime('%Y-%m-%d %H:%M') %> until:<%= last_tweet.created_at.localtime.strftime('%Y-%m-%d %H:%M') %>'
+           },
+           series: [{
+             data: data
+           }],
+           rangeSelector: {enabled: false},
+           scrollbar: {enabled: false},
+           navigator: {enabled: false},
+           exporting: {enabled: false},
+           credits: {enabled: false}
+         };
+
+         Highcharts.stockChart('chart-container', config);
+       }
+
+       document.addEventListener("DOMContentLoaded", function () {
+         drawChart();
+         updateTweets();
+       });
+     </script>
+
+     <style type=text/css>
+       #chart-container {
+         max-width: 1200px;
+         height: 675px;
+         margin: 0 auto;
+         border: 1px solid rgb(204, 214, 221);
+         display: flex;
+         justify-content: center;
+         align-items: center;
+       }
+       .tweets-container {
+         max-width: 550px;
+         margin: 0 auto 0 auto;
+       }
+
+       .twitter-tweet {
+         margin: 15px 0 15px 0 !important;
+       }
+     </style>
+   </head>
+   <body>
+     <div id="chart-container"><div style="color: gray;">Loading...</div></div>
+
+     <div class="tweets-container">
+       <% tweets.sort_by { |t| -t.created_at.to_i }.each.with_index do |tweet, i| %>
+         <% tweet_time = tweet.created_at.localtime.strftime('%Y-%m-%d %H:%M') %>
+         <% if i < convert_limit %>
+           <blockquote class="twitter-tweet">
+         <% else %>
+           <div class="twitter-tweet" style="border: 1px solid rgb(204, 214, 221);">
+         <% end %>
+
+         <div style="display: grid; grid-template-rows: 24px 24px; grid-template-columns: 48px 1fr;">
+           <div style="grid-row: 1/3; grid-column: 1/2;"><img src="<%= tweet.profile_image_url %>" width="48" height="48" loading="lazy"></div>
+           <div style="grid-row: 1/2; grid-column: 2/3;"><%= tweet.name %></div>
+           <div style="grid-row: 2/3; grid-column: 2/3;"><a href="https://twitter.com/<%= tweet.screen_name %>">@<%= tweet.screen_name %></a></div>
+         </div>
+
+         <div><%= tweet.text %></div>
+         <div><a href="<%= tweet.tweet_url %>"><small><%= tweet_time %></small></a></div>
+
+         <% if i < convert_limit %>
+           </blockquote>
+         <% else %>
+           </div>
+         <% end %>
+       <% end %>
+     </div>
+
+   </body>
+ </html>
data/lib/twitterscraper/tweet.rb CHANGED
@@ -6,6 +6,7 @@ module Twitterscraper
        :screen_name,
        :name,
        :user_id,
+       :profile_image_url,
        :tweet_id,
        :text,
        :links,
@@ -51,6 +52,11 @@ module Twitterscraper
        end
      end
 
+     # .js-stream-item
+     #   .js-stream-tweet{data: {screen-name:, tweet-id:}}
+     #     .stream-item-header
+     #     .js-tweet-text-container
+     #     .stream-item-footer
      def from_html(text)
        html = Nokogiri::HTML(text)
        from_tweets_html(html.xpath("//li[@class[contains(., 'js-stream-item')]]/div[@class[contains(., 'js-stream-tweet')]]"))
@@ -72,6 +78,8 @@ module Twitterscraper
        end
 
        inner_html = Nokogiri::HTML(html.inner_html)
+
+       profile_image_url = inner_html.xpath("//img[@class[contains(., 'js-action-profile-avatar')]]").first.attr('src').gsub(/_bigger/, '')
        text = inner_html.xpath("//div[@class[contains(., 'js-tweet-text-container')]]/p[@class[contains(., 'js-tweet-text')]]").first.text
        links = inner_html.xpath("//a[@class[contains(., 'twitter-timeline-link')]]").map { |elem| elem.attr('data-expanded-url') }.select { |link| link && !link.include?('pic.twitter') }
        image_urls = inner_html.xpath("//div[@class[contains(., 'AdaptiveMedia-photoContainer')]]").map { |elem| elem.attr('data-image-url') }
@@ -99,6 +107,7 @@ module Twitterscraper
        screen_name: screen_name,
        name: html.attr('data-name'),
        user_id: html.attr('data-user-id').to_i,
+       profile_image_url: profile_image_url,
        tweet_id: tweet_id,
        text: text,
        links: links,
data/lib/twitterscraper/type.rb ADDED
@@ -0,0 +1,15 @@
+ module Twitterscraper
+   class Type
+     def initialize(value)
+       @value = value
+     end
+
+     def search?
+       @value == 'search'
+     end
+
+     def user?
+       @value == 'user'
+     end
+   end
+ end
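`Type` is a small value object replacing the former `from_user` boolean flag in `build_query_url`; usage:

```ruby
type = Twitterscraper::Type.new('user')
type.user?   #=> true
type.search? #=> false
```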
data/lib/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Twitterscraper
-   VERSION = '0.15.0'
+   VERSION = '0.18.0'
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: twitterscraper-ruby
  version: !ruby/object:Gem::Version
-   version: 0.15.0
+   version: 0.18.0
  platform: ruby
  authors:
  - ts-3156
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2020-07-17 00:00:00.000000000 Z
+ date: 2020-07-19 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: nokogiri
@@ -72,7 +72,9 @@ files:
  - lib/twitterscraper/proxy.rb
  - lib/twitterscraper/query.rb
  - lib/twitterscraper/template.rb
+ - lib/twitterscraper/template/tweets.html.erb
  - lib/twitterscraper/tweet.rb
+ - lib/twitterscraper/type.rb
  - lib/version.rb
  - twitterscraper-ruby.gemspec
  homepage: https://github.com/ts-3156/twitterscraper-ruby