twitterscraper-ruby 0.14.0 → 0.17.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: cf902c947e866cc99e79fbb9f8a51c829accd44aed03ef7657562bf41932c73d
-  data.tar.gz: 1bc5a0698a17b244ee9228d7728767dd00218179a5a49e0852a74cc722322ef0
+  metadata.gz: ac0c10b18d836983cc6b73e25b9ed333af2f620106a07c6bc6a40058fb127895
+  data.tar.gz: e6fc18219d9127fb30ba57e39dc4656c0f0a3c108428d959de5bac9e7d317088
 SHA512:
-  metadata.gz: 629de8698af1391c210b496e9aadb51ad5f9d7157b1be5d0aa669ae821671e2b5624ba51083fb14b61f93618ff3e90aea1ac0eccb6ea00360fac48a2dfc436c7
-  data.tar.gz: 3f3706bee5f2a92a2addae034201e2e8cee3fef43efdc323be963cbaf1b94c31c53aa49a19e58a068498722dfe07e9796e097fb04364a9afda56d06132e6b935
+  metadata.gz: 90cbf06b606878dc36b4bba44669139c273bf03b08a777ad87036834841bcb4b052e0559813dc56e4be124442abfc5a7fc44c5c9524c74929ca02b1d287d346b
+  data.tar.gz: ada0b74ee42ff62964b73ad9b49358227cdaf4fc87420cf12cf65af95168ad9775615a504345ebc83d3b791e9c0d892691c55bc477eddd647b3e8934f752fb9c
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    twitterscraper-ruby (0.14.0)
+    twitterscraper-ruby (0.17.0)
       nokogiri
       parallel
 
data/README.md CHANGED
@@ -5,15 +5,17 @@
 
 A gem to scrape https://twitter.com/search. This gem is inspired by [taspinar/twitterscraper](https://github.com/taspinar/twitterscraper).
 
+Please feel free to ask [@ts_3156](https://twitter.com/ts_3156) if you have any questions.
+
 
 ## Twitter Search API vs. twitterscraper-ruby
 
-### Twitter Search API
+#### Twitter Search API
 
 - The number of tweets: 180 - 450 requests/15 minutes (18,000 - 45,000 tweets/15 minutes)
 - The time window: the past 7 days
 
-### twitterscraper-ruby
+#### twitterscraper-ruby
 
 - The number of tweets: Unlimited
 - The time window: from 2006-3-21 to today
@@ -30,48 +32,98 @@ $ gem install twitterscraper-ruby
 
 ## Usage
 
-Command-line interface:
+#### Command-line interface:
+
+Returns a collection of relevant tweets matching a specified query.
 
 ```shell script
-$ twitterscraper --query KEYWORD --start_date 2020-06-01 --end_date 2020-06-30 --lang ja \
-    --limit 100 --threads 10 --output output.json
+$ twitterscraper --type search --query KEYWORD --start_date 2020-06-01 --end_date 2020-06-30 --lang ja \
+    --limit 100 --threads 10 --output tweets.json
 ```
 
-From Within Ruby:
+Returns a collection of the most recent tweets posted by the user indicated by the screen_name.
+
+```shell script
+$ twitterscraper --type user --query SCREEN_NAME --limit 100 --output tweets.json
+```
+
+#### From Within Ruby:
 
 ```ruby
 require 'twitterscraper'
+client = Twitterscraper::Client.new(cache: true, proxy: true)
+```
 
-options = {
-  start_date: '2020-06-01',
-  end_date: '2020-06-30',
-  lang: 'ja',
-  limit: 100,
-  threads: 10,
-}
+Returns a collection of relevant tweets matching a specified query.
 
-client = Twitterscraper::Client.new(cache: true, proxy: true)
-tweets = client.query_tweets(KEYWORD, options)
+```ruby
+tweets = client.search(KEYWORD, start_date: '2020-06-01', end_date: '2020-06-30', lang: 'ja', limit: 100, threads: 10)
+```
+
+Returns a collection of the most recent tweets posted by the user indicated by the screen_name.
+
+```ruby
+tweets = client.user_timeline(SCREEN_NAME, limit: 100)
+```
+
+
+## Examples
+
+```shell script
+$ twitterscraper --query twitter --limit 1000
+$ cat tweets.json | jq . | less
+```
+
+
+## Attributes
 
+### Tweet
+
+```ruby
 tweets.each do |tweet|
   puts tweet.tweet_id
   puts tweet.text
   puts tweet.tweet_url
   puts tweet.created_at
 
   hash = tweet.attrs
-  puts hash.keys
+  attr_names = hash.keys
+  json = tweet.to_json
 end
 ```
 
-
-## Attributes
-
-### Tweet
+```json
+[
+  {
+    "screen_name": "@name",
+    "name": "Name",
+    "user_id": 12340000,
+    "profile_image_url": "https://pbs.twimg.com/profile_images/1826000000/0000.png",
+    "tweet_id": 1234000000000000,
+    "text": "Thanks Twitter!",
+    "links": [],
+    "hashtags": [],
+    "image_urls": [],
+    "video_url": null,
+    "has_media": null,
+    "likes": 10,
+    "retweets": 20,
+    "replies": 0,
+    "is_replied": false,
+    "is_reply_to": false,
+    "parent_tweet_id": null,
+    "reply_to_users": [],
+    "tweet_url": "https://twitter.com/name/status/1234000000000000",
+    "timestamp": 1594793000,
+    "created_at": "2020-07-15 00:00:00 +0000"
+  }
+]
+```
 
 - screen_name
 - name
 - user_id
+- profile_image_url
 - tweet_id
 - text
 - links
@@ -110,44 +162,25 @@ end
 Search operators documentation is in [Standard search operators](https://developer.twitter.com/en/docs/tweets/rules-and-filtering/overview/standard-operators).
 
 
-## Examples
-
-```shell script
-$ twitterscraper --query twitter --limit 1000
-$ cat tweets.json | jq . | less
-```
-
-```json
-[
-  {
-    "screen_name": "@screenname",
-    "name": "name",
-    "user_id": 1194529546483000000,
-    "tweet_id": 1282659891992000000,
-    "tweet_url": "https://twitter.com/screenname/status/1282659891992000000",
-    "created_at": "2020-07-13 12:00:00 +0000",
-    "text": "Thanks Twitter!"
-  }
-]
-```
-
 ## CLI Options
 
-| Option | Description | Default |
-| ------------- | ------------- | ------------- |
-| `-h`, `--help` | This option displays a summary of twitterscraper. | |
-| `--query` | Specify a keyword used during the search. | |
-| `--start_date` | Used as "since:yyyy-mm-dd for your query. This means "since the date". | |
-| `--end_date` | Used as "until:yyyy-mm-dd for your query. This means "before the date". | |
-| `--lang` | Retrieve tweets written in a specific language. | |
-| `--limit` | Stop scraping when *at least* the number of tweets indicated with --limit is scraped. | 100 |
-| `--order` | Sort order of the results. | desc |
-| `--threads` | Set the number of threads twitterscraper-ruby should initiate while scraping for your query. | 2 |
-| `--proxy` | Scrape https://twitter.com/search via proxies. | true |
-| `--cache` | Enable caching. | true |
-| `--format` | The format of the output. | json |
-| `--output` | The name of the output file. | tweets.json |
-| `--verbose` | Print debug messages. | tweets.json |
+| Option | Type | Description | Value |
+| ------------- | ------------- | ------------- | ------------- |
+| `--help` | string | This option displays a summary of twitterscraper. | |
+| `--type` | string | Specify a search type. | search (default) or user |
+| `--query` | string | Specify a keyword used during the search. | |
+| `--start_date` | string | Used as "since:yyyy-mm-dd" for your query. This means "since the date". | |
+| `--end_date` | string | Used as "until:yyyy-mm-dd" for your query. This means "before the date". | |
+| `--lang` | string | Retrieve tweets written in a specific language. | |
+| `--limit` | integer | Stop scraping when *at least* the number of tweets indicated with --limit is scraped. | 100 |
+| `--order` | string | Sort order of the results. | desc (default) or asc |
+| `--threads` | integer | Set the number of threads twitterscraper-ruby should initiate while scraping for your query. | 10 |
+| `--threads_granularity` | string | Granularity used to split the date range across threads. | auto |
+| `--proxy` | boolean | Scrape https://twitter.com/search via proxies. | true (default) or false |
+| `--cache` | boolean | Enable caching. | true (default) or false |
+| `--format` | string | The format of the output. | json (default) or html |
+| `--output` | string | The name of the output file. | tweets.json |
+| `--verbose` | | Print debug messages. | |
 
 
 ## Contributing
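
For reference, the CLI options above map one-to-one onto keyword arguments of `Twitterscraper::Client#query_tweets`; the cli.rb hunk later in this diff builds exactly this hash before calling the client. A hedged Ruby equivalent of the first shell example (`'KEYWORD'` is a placeholder):

```ruby
require 'twitterscraper'

client = Twitterscraper::Client.new(cache: true, proxy: true)
tweets = client.query_tweets('KEYWORD',
                             type: 'search',
                             start_date: '2020-06-01',
                             end_date: '2020-06-30',
                             lang: 'ja',
                             limit: 100,
                             threads: 10,
                             threads_granularity: 'auto')
```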
data/lib/twitterscraper.rb CHANGED
@@ -4,6 +4,7 @@ require 'twitterscraper/http'
 require 'twitterscraper/lang'
 require 'twitterscraper/cache'
 require 'twitterscraper/query'
+require 'twitterscraper/type'
 require 'twitterscraper/client'
 require 'twitterscraper/tweet'
 require 'twitterscraper/template'
data/lib/twitterscraper/cache.rb CHANGED
@@ -4,7 +4,7 @@ require 'digest/md5'
 module Twitterscraper
   class Cache
     def initialize()
-      @ttl = 3600 # 1 hour
+      @ttl = 86400 # 1 day
      @dir = 'cache'
       Dir.mkdir(@dir) unless File.exist?(@dir)
     end
@@ -25,6 +25,12 @@ module Twitterscraper
       File.write(file, entry.to_json)
     end
 
+    def delete(key)
+      key = cache_key(key)
+      file = File.join(@dir, key)
+      File.delete(file) if File.exist?(file)
+    end
+
    def fetch(key, &block)
       if (value = read(key))
         value
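
The TTL now keeps cached page responses for a day instead of an hour, and the new `delete` lets the scraper evict an entry whose cached body turned out to be a rate-limit message (see the query.rb hunk further down). A rough usage sketch, assuming only the `read`/`write`/`fetch`/`delete` methods visible in this file; `fetch_page` is a hypothetical stand-in for the real HTTP call:

```ruby
require 'twitterscraper'

cache = Twitterscraper::Cache.new   # creates ./cache on first use
url = 'https://twitter.com/i/search/timeline?q=twitter'

cache.write(url, '<html>...</html>')          # stored as JSON with a timestamp
page = cache.fetch(url) { fetch_page(url) }   # block runs only on a miss or expiry
cache.delete(url)                             # evict, e.g. a cached rate-limit response
```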
data/lib/twitterscraper/cli.rb CHANGED
@@ -16,6 +16,7 @@ module Twitterscraper
       print_version || return if print_version?
 
       query_options = {
+        type: options['type'],
         start_date: options['start_date'],
         end_date: options['end_date'],
         lang: options['lang'],
@@ -23,19 +24,20 @@ module Twitterscraper
         daily_limit: options['daily_limit'],
         order: options['order'],
         threads: options['threads'],
+        threads_granularity: options['threads_granularity'],
       }
       client = Twitterscraper::Client.new(cache: options['cache'], proxy: options['proxy'])
       tweets = client.query_tweets(options['query'], query_options)
-      export(tweets) unless tweets.empty?
+      export(options['query'], tweets) unless tweets.empty?
     end
 
-    def export(tweets)
+    def export(name, tweets)
       write_json = lambda { File.write(options['output'], generate_json(tweets)) }
 
       if options['format'] == 'json'
         write_json.call
       elsif options['format'] == 'html'
-        File.write('tweets.html', Template.tweets_embedded_html(tweets))
+        File.write(options['output'], Template.new.tweets_embedded_html(name, tweets, options))
       else
         write_json.call
       end
@@ -59,6 +61,7 @@ module Twitterscraper
         'help',
         'v',
         'version',
+        'type:',
         'query:',
         'start_date:',
         'end_date:',
@@ -67,6 +70,7 @@ module Twitterscraper
         'daily_limit:',
         'order:',
         'threads:',
+        'threads_granularity:',
         'output:',
         'format:',
         'cache:',
@@ -75,14 +79,16 @@ module Twitterscraper
         'verbose',
       )
 
+      options['type'] ||= 'search'
       options['start_date'] = Query::OLDEST_DATE if options['start_date'] == 'oldest'
       options['lang'] ||= ''
       options['limit'] = (options['limit'] || 100).to_i
       options['daily_limit'] = options['daily_limit'].to_i if options['daily_limit']
-      options['threads'] = (options['threads'] || 2).to_i
+      options['threads'] = (options['threads'] || 10).to_i
+      options['threads_granularity'] ||= 'auto'
       options['format'] ||= 'json'
       options['order'] ||= 'desc'
-      options['output'] ||= "tweets.#{options['format']}"
+      options['output'] ||= build_output_name(options)
 
       options['cache'] = options['cache'] != 'false'
       options['proxy'] = options['proxy'] != 'false'
@@ -90,6 +96,12 @@ module Twitterscraper
       options
     end
 
+    def build_output_name(options)
+      query = ERB::Util.url_encode(options['query'])
+      date = [options['start_date'], options['end_date']].select { |val| val && !val.empty? }.join('_')
+      [options['type'], 'tweets', date, query].compact.join('_') + '.' + options['format']
+    end
+
     def initialize_logger
       Twitterscraper.logger.level = ::Logger::DEBUG if options['verbose']
     end
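
Default output names are now derived from the run itself rather than the fixed `tweets.json`. A quick illustration of `build_output_name` with the logic copied verbatim from the hunk above (the options hash is just sample input):

```ruby
require 'erb'

# Same logic as the hunk above, extracted for illustration.
def build_output_name(options)
  query = ERB::Util.url_encode(options['query'])
  date = [options['start_date'], options['end_date']].select { |val| val && !val.empty? }.join('_')
  [options['type'], 'tweets', date, query].compact.join('_') + '.' + options['format']
end

name = build_output_name(
  'type' => 'search', 'query' => 'ruby lang', 'format' => 'json',
  'start_date' => '2020-06-01', 'end_date' => '2020-06-30'
)
puts name # => "search_tweets_2020-06-01_2020-06-30_ruby%20lang.json"
```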
data/lib/twitterscraper/query.rb CHANGED
@@ -22,23 +22,24 @@ module Twitterscraper
     RELOAD_URL = 'https://twitter.com/i/search/timeline?f=tweets&vertical=' +
         'default&include_available_features=1&include_entities=1&' +
         'reset_error_state=false&src=typd&max_position=__POS__&q=__QUERY__&l=__LANG__'
-    INIT_URL_USER = 'https://twitter.com/{u}'
-    RELOAD_URL_USER = 'https://twitter.com/i/profiles/show/{u}/timeline/tweets?' +
+    INIT_URL_USER = 'https://twitter.com/__USER__'
+    RELOAD_URL_USER = 'https://twitter.com/i/profiles/show/__USER__/timeline/tweets?' +
         'include_available_features=1&include_entities=1&' +
-        'max_position={pos}&reset_error_state=false'
-
-    def build_query_url(query, lang, pos, from_user = false)
-      # if from_user
-      #   if !pos
-      #     INIT_URL_USER.format(u = query)
-      #   else
-      #     RELOAD_URL_USER.format(u = query, pos = pos)
-      #   end
-      # end
-      if pos
-        RELOAD_URL.sub('__QUERY__', query).sub('__LANG__', lang.to_s).sub('__POS__', pos)
+        'max_position=__POS__&reset_error_state=false'
+
+    def build_query_url(query, lang, type, pos)
+      if type.user?
+        if pos
+          RELOAD_URL_USER.sub('__USER__', query).sub('__POS__', pos.to_s)
+        else
+          INIT_URL_USER.sub('__USER__', query)
+        end
       else
-        INIT_URL.sub('__QUERY__', query).sub('__LANG__', lang.to_s)
+        if pos
+          RELOAD_URL.sub('__QUERY__', query).sub('__LANG__', lang.to_s).sub('__POS__', pos)
+        else
+          INIT_URL.sub('__QUERY__', query).sub('__LANG__', lang.to_s)
+        end
       end
     end
 
@@ -50,7 +51,7 @@ module Twitterscraper
       end
       Http.get(url, headers, proxy, timeout)
     rescue => e
-      logger.debug "query_single_page: #{e.inspect}"
+      logger.debug "get_single_page: #{e.inspect}"
       if (retries -= 1) > 0
         logger.info "Retrying... (Attempts left: #{retries - 1})"
         retry
@@ -68,26 +69,25 @@ module Twitterscraper
       else
         json_resp = JSON.parse(text)
         items_html = json_resp['items_html'] || ''
-        logger.warn json_resp['message'] if json_resp['message'] # Sorry, you are rate limited.
       end
 
       [items_html, json_resp]
     end
 
-    def query_single_page(query, lang, pos, from_user = false, headers: [], proxies: [])
+    def query_single_page(query, lang, type, pos, headers: [], proxies: [])
       logger.info "Querying #{query}"
       query = ERB::Util.url_encode(query)
 
-      url = build_query_url(query, lang, pos, from_user)
+      url = build_query_url(query, lang, type, pos)
       http_request = lambda do
-        logger.debug "Scraping tweets from #{url}"
+        logger.debug "Scraping tweets from url=#{url}"
         get_single_page(url, headers, proxies)
       end
 
       if cache_enabled?
         client = Cache.new
         if (response = client.read(url))
-          logger.debug 'Fetching tweets from cache'
+          logger.debug "Fetching tweets from cache url=#{url}"
         else
           response = http_request.call
           client.write(url, response) unless stop_requested?
@@ -99,6 +99,12 @@ module Twitterscraper
 
       html, json_resp = parse_single_page(response, pos.nil?)
 
+      if json_resp && json_resp['message']
+        logger.warn json_resp['message'] # Sorry, you are rate limited.
+        @stop_requested = true
+        Cache.new.delete(url) if cache_enabled?
+      end
+
       tweets = Tweet.from_html(html)
 
       if tweets.empty?
@@ -107,8 +113,8 @@ module Twitterscraper
 
       if json_resp
         [tweets, json_resp['min_position']]
-      elsif from_user
-        raise NotImplementedError
+      elsif type.user?
+        [tweets, tweets[-1].tweet_id]
       else
         [tweets, "TWEET-#{tweets[-1].tweet_id}-#{tweets[0].tweet_id}"]
       end
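
Each page thus returns the scraped tweets plus a position token for the next request: JSON responses carry their own `min_position`, a user timeline continues from the oldest tweet id on the page, and a search continues from a `TWEET-<oldest>-<newest>` token. A rough sketch of the loop shape this implies, assuming it runs where `query_single_page` is in scope (inside the Query module):

```ruby
# Illustrative pagination loop mirroring main_loop's use of the
# (tweets, position) pairs returned above.
type = Twitterscraper::Type.new('user')
pos = nil
all_tweets = []
loop do
  tweets, pos = query_single_page('ts_3156', nil, type, pos)
  break if tweets.empty?
  all_tweets.concat(tweets)
end
```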
@@ -116,7 +122,7 @@ module Twitterscraper
 
     OLDEST_DATE = Date.parse('2006-03-21')
 
-    def validate_options!(queries, start_date:, end_date:, lang:, limit:, threads:)
+    def validate_options!(queries, type:, start_date:, end_date:, lang:, limit:, threads:)
       query = queries[0]
       if query.nil? || query == ''
         raise Error.new('Please specify a search query.')
@@ -139,19 +145,33 @@ module Twitterscraper
           raise Error.new(":start_date must be greater than or equal to #{OLDEST_DATE}")
         end
       end
-
-      if end_date
-        today = Date.today
-        if end_date > Date.today
-          raise Error.new(":end_date must be less than or equal to today(#{today})")
-        end
-      end
     end
 
-    def build_queries(query, start_date, end_date)
+    def build_queries(query, start_date, end_date, threads_granularity)
       if start_date && end_date
-        date_range = start_date.upto(end_date - 1)
-        date_range.map { |date| query + " since:#{date} until:#{date + 1}" }
+        if threads_granularity == 'auto'
+          threads_granularity = start_date.upto(end_date - 1).to_a.size >= 28 ? 'day' : 'hour'
+        end
+
+        if threads_granularity == 'day'
+          date_range = start_date.upto(end_date - 1)
+          queries = date_range.map { |date| query + " since:#{date} until:#{date + 1}" }
+        else
+          time = Time.utc(start_date.year, start_date.month, start_date.day, 0, 0, 0)
+          end_time = Time.utc(end_date.year, end_date.month, end_date.day, 0, 0, 0)
+          queries = []
+
+          while true
+            if time < Time.now.utc
+              queries << (query + " since:#{time.strftime('%Y-%m-%d_%H:00:00')}_UTC until:#{(time + 3600).strftime('%Y-%m-%d_%H:00:00')}_UTC")
+            end
+            time += 3600
+            break if time >= end_time
+          end
+        end
+
+        queries
+
       elsif start_date
         [query + " since:#{start_date}"]
       elsif end_date
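
`build_queries` now fans the date range out into per-day or per-hour sub-queries so each thread scrapes a slice; `auto` picks days for ranges of 28 days or more and hours otherwise. A quick illustration of the day-granularity output, using the exact expression from the hunk above (the hour branch emits `since:YYYY-MM-DD_HH:00:00_UTC` operators instead):

```ruby
require 'date'

query = 'twitter'
start_date = Date.parse('2020-06-01')
end_date = Date.parse('2020-06-04')

# Day granularity: one sub-query per day in [start_date, end_date)
queries = start_date.upto(end_date - 1).map { |date| query + " since:#{date} until:#{date + 1}" }
# => ["twitter since:2020-06-01 until:2020-06-02",
#     "twitter since:2020-06-02 until:2020-06-03",
#     "twitter since:2020-06-03 until:2020-06-04"]
```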
@@ -161,12 +181,12 @@ module Twitterscraper
       end
     end
 
-    def main_loop(query, lang, limit, daily_limit, headers, proxies)
+    def main_loop(query, lang, type, limit, daily_limit, headers, proxies)
       pos = nil
       daily_tweets = []
 
       while true
-        new_tweets, new_pos = query_single_page(query, lang, pos, headers: headers, proxies: proxies)
+        new_tweets, new_pos = query_single_page(query, lang, type, pos, headers: headers, proxies: proxies)
         unless new_tweets.empty?
           daily_tweets.concat(new_tweets)
           daily_tweets.uniq! { |t| t.tweet_id }
@@ -195,12 +215,18 @@ module Twitterscraper
       @stop_requested
     end
 
-    def query_tweets(query, start_date: nil, end_date: nil, lang: '', limit: 100, daily_limit: nil, order: 'desc', threads: 2)
-      start_date = Date.parse(start_date) if start_date && start_date.is_a?(String)
-      end_date = Date.parse(end_date) if end_date && end_date.is_a?(String)
-      queries = build_queries(query, start_date, end_date)
+    def query_tweets(query, type: 'search', start_date: nil, end_date: nil, lang: nil, limit: 100, daily_limit: nil, order: 'desc', threads: 10, threads_granularity: 'auto')
+      type = Type.new(type)
+      if type.search?
+        start_date = Date.parse(start_date) if start_date && start_date.is_a?(String)
+        end_date = Date.parse(end_date) if end_date && end_date.is_a?(String)
+      elsif type.user?
+        start_date = nil
+        end_date = nil
+      end
+
+      queries = build_queries(query, start_date, end_date, threads_granularity)
       if threads > queries.size
-        logger.warn 'The maximum number of :threads is the number of dates between :start_date and :end_date.'
         threads = queries.size
       end
       if proxy_enabled?
@@ -212,9 +238,9 @@ module Twitterscraper
       end
       logger.debug "Cache #{cache_enabled? ? 'enabled' : 'disabled'}"
 
+      validate_options!(queries, type: type, start_date: start_date, end_date: end_date, lang: lang, limit: limit, threads: threads)
 
-      validate_options!(queries, start_date: start_date, end_date: end_date, lang: lang, limit: limit, threads: threads)
-
+      logger.info "The number of queries #{queries.size}"
       logger.info "The number of threads #{threads}"
 
       headers = {'User-Agent': USER_AGENT_LIST.sample, 'X-Requested-With': 'XMLHttpRequest'}
@@ -229,17 +255,27 @@ module Twitterscraper
         logger.debug "Set 'Thread.abort_on_exception' to true"
 
         Parallel.each(queries, in_threads: threads) do |query|
-          main_loop(query, lang, limit, daily_limit, headers, proxies)
+          main_loop(query, lang, type, limit, daily_limit, headers, proxies)
           raise Parallel::Break if stop_requested?
         end
       else
         queries.each do |query|
-          main_loop(query, lang, limit, daily_limit, headers, proxies)
+          main_loop(query, lang, type, limit, daily_limit, headers, proxies)
           break if stop_requested?
         end
       end
 
+      logger.info "Return #{@all_tweets.size} tweets"
+
       @all_tweets.sort_by { |tweet| (order == 'desc' ? -1 : 1) * tweet.created_at.to_i }
     end
+
+    def search(query, start_date: nil, end_date: nil, lang: '', limit: 100, daily_limit: nil, order: 'desc', threads: 10, threads_granularity: 'auto')
+      query_tweets(query, type: 'search', start_date: start_date, end_date: end_date, lang: lang, limit: limit, daily_limit: daily_limit, order: order, threads: threads, threads_granularity: threads_granularity)
+    end
+
+    def user_timeline(screen_name, limit: 100, order: 'desc')
+      query_tweets(screen_name, type: 'user', start_date: nil, end_date: nil, lang: nil, limit: limit, daily_limit: nil, order: order, threads: 1, threads_granularity: nil)
+    end
   end
 end
data/lib/twitterscraper/template.rb CHANGED
@@ -1,48 +1,30 @@
 module Twitterscraper
-  module Template
-    module_function
+  class Template
+    def tweets_embedded_html(name, tweets, options)
+      path = File.join(File.dirname(__FILE__), 'template/tweets.html.erb')
+      template = ERB.new(File.read(path))
 
-    def tweets_embedded_html(tweets)
-      tweets_html = tweets.map { |t| EMBED_TWEET_HTML.sub('__TWEET_URL__', t.tweet_url) }
-      EMBED_TWEETS_HTML.sub('__TWEETS__', tweets_html.join)
+      template.result_with_hash(
+        chart_name: name,
+        chart_data: chart_data(tweets).to_json,
+        first_tweet: tweets.sort_by { |t| t.created_at.to_i }[0],
+        last_tweet: tweets.sort_by { |t| t.created_at.to_i }[-1],
+        tweets: tweets,
+        convert_limit: 30,
+      )
     end
 
-    EMBED_TWEET_HTML = <<~'HTML'
-      <blockquote class="twitter-tweet">
-        <a href="__TWEET_URL__"></a>
-      </blockquote>
-    HTML
+    def chart_data(tweets)
+      data = tweets.each_with_object(Hash.new(0)) do |tweet, memo|
+        t = tweet.created_at
+        min = (t.min.to_f / 5).floor * 5
+        time = Time.new(t.year, t.month, t.day, t.hour, min, 0, '+00:00')
+        memo[time.to_i] += 1
+      end
 
-    EMBED_TWEETS_HTML = <<~'HTML'
-      <html>
-      <head>
-        <style type=text/css>
-          .twitter-tweet {
-            margin: 30px auto 0 auto !important;
-          }
-        </style>
-        <script>
-          window.twttr = (function(d, s, id) {
-            var js, fjs = d.getElementsByTagName(s)[0], t = window.twttr || {};
-            if (d.getElementById(id)) return t;
-            js = d.createElement(s);
-            js.id = id;
-            js.src = "https://platform.twitter.com/widgets.js";
-            fjs.parentNode.insertBefore(js, fjs);
-
-            t._e = [];
-            t.ready = function(f) {
-              t._e.push(f);
-            };
-
-            return t;
-          }(document, "script", "twitter-wjs"));
-        </script>
-      </head>
-      <body>
-      __TWEETS__
-      </body>
-      </html>
-    HTML
+      data.sort_by { |k, v| k }.map do |timestamp, count|
+        [timestamp * 1000, count]
+      end
+    end
   end
 end
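
`chart_data` buckets tweets into five-minute bins and emits `[millisecond-epoch, count]` pairs in the shape Highcharts expects. A worked example of the bucketing arithmetic from the method above (the timestamp is sample input):

```ruby
# A tweet created at 12:37:42 UTC falls into the 12:35 bucket.
t = Time.utc(2020, 7, 15, 12, 37, 42)
min = (t.min.to_f / 5).floor * 5                               # => 35
bucket = Time.new(t.year, t.month, t.day, t.hour, min, 0, '+00:00')
key = bucket.to_i * 1000  # millisecond-epoch key for the chart series
```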
data/lib/twitterscraper/template/tweets.html.erb ADDED
@@ -0,0 +1,98 @@
+<html>
+<head>
+  <script src="https://cdnjs.cloudflare.com/ajax/libs/moment.js/2.27.0/moment.min.js" integrity="sha512-rmZcZsyhe0/MAjquhTgiUcb4d9knaFc7b5xAfju483gbEXTkeJRUMIPk6s3ySZMYUHEcjKbjLjyddGWMrNEvZg==" crossorigin="anonymous"></script>
+  <script src="https://cdnjs.cloudflare.com/ajax/libs/moment-timezone/0.5.31/moment-timezone-with-data.min.js" integrity="sha512-HZcf3uHWA+Y2P5KNv+F/xa87/flKVP92kUTe/KXjU8URPshczF1Dx+cL5bw0VBGhmqWAK0UbhcqxBbyiNtAnWQ==" crossorigin="anonymous"></script>
+  <script src="https://code.highcharts.com/stock/highstock.js"></script>
+  <script>
+    function updateTweets() {
+      window.twttr = (function (d, s, id) {
+        var js, fjs = d.getElementsByTagName(s)[0], t = window.twttr || {};
+        if (d.getElementById(id)) return t;
+        js = d.createElement(s);
+        js.id = id;
+        js.src = "https://platform.twitter.com/widgets.js";
+        fjs.parentNode.insertBefore(js, fjs);
+
+        t._e = [];
+        t.ready = function (f) {
+          t._e.push(f);
+        };
+
+        return t;
+      }(document, "script", "twitter-wjs"));
+    }
+
+    function drawChart() {
+      var data = <%= chart_data %>;
+      Highcharts.setOptions({
+        time: {
+          timezone: moment.tz.guess()
+        }
+      });
+
+      Highcharts.stockChart('chart', {
+        title: {
+          text: '<%= tweets.size %> tweets of <%= chart_name %>'
+        },
+        subtitle: {
+          text: 'since:<%= first_tweet.created_at.localtime.strftime('%Y-%m-%d %H:%M') %> until:<%= last_tweet.created_at.localtime.strftime('%Y-%m-%d %H:%M') %>'
+        },
+        series: [{
+          data: data
+        }],
+        rangeSelector: {enabled: false},
+        scrollbar: {enabled: false},
+        navigator: {enabled: false},
+        exporting: {enabled: false},
+        credits: {enabled: false}
+      });
+    }
+
+    document.addEventListener("DOMContentLoaded", function () {
+      drawChart();
+      updateTweets();
+    });
+  </script>
+
+  <style type=text/css>
+    .tweets-container {
+      max-width: 550px;
+      margin: 0 auto 0 auto;
+    }
+
+    .twitter-tweet {
+      margin: 15px 0 15px 0 !important;
+    }
+  </style>
+</head>
+<body>
+<div id="chart" style="width: 100vw; height: 400px;"></div>
+
+<div class="tweets-container">
+  <% tweets.each.with_index do |tweet, i| %>
+    <% tweet_time = tweet.created_at.localtime.strftime('%Y-%m-%d %H:%M') %>
+    <% if i < convert_limit %>
+      <blockquote class="twitter-tweet">
+    <% else %>
+      <div class="twitter-tweet" style="border: 1px solid rgb(204, 214, 221);">
+    <% end %>
+
+    <div style="display: grid; grid-template-rows: 24px 24px; grid-template-columns: 48px 1fr;">
+      <div style="grid-row: 1/3; grid-column: 1/2;"><img src="<%= tweet.profile_image_url %>" width="48" height="48" loading="lazy"></div>
+      <div style="grid-row: 1/2; grid-column: 2/3;"><%= tweet.name %></div>
+      <div style="grid-row: 2/3; grid-column: 2/3;"><a href="https://twitter.com/<%= tweet.screen_name %>">@<%= tweet.screen_name %></a></div>
+    </div>
+
+    <div><%= tweet.text %></div>
+    <div><a href="<%= tweet.tweet_url %>"><small><%= tweet_time %></small></a></div>
+
+    <% if i < convert_limit %>
+      </blockquote>
+    <% else %>
+      </div>
+    <% end %>
+  <% end %>
+</div>
+
+</body>
+</html>
data/lib/twitterscraper/tweet.rb CHANGED
@@ -6,6 +6,7 @@ module Twitterscraper
       :screen_name,
       :name,
       :user_id,
+      :profile_image_url,
       :tweet_id,
       :text,
       :links,
@@ -51,6 +52,11 @@ module Twitterscraper
       end
     end
 
+    # .js-stream-item
+    # .js-stream-tweet{data: {screen-name:, tweet-id:}}
+    # .stream-item-header
+    # .js-tweet-text-container
+    # .stream-item-footer
     def from_html(text)
       html = Nokogiri::HTML(text)
       from_tweets_html(html.xpath("//li[@class[contains(., 'js-stream-item')]]/div[@class[contains(., 'js-stream-tweet')]]"))
@@ -72,6 +78,8 @@ module Twitterscraper
       end
 
       inner_html = Nokogiri::HTML(html.inner_html)
+
+      profile_image_url = inner_html.xpath("//img[@class[contains(., 'js-action-profile-avatar')]]").first.attr('src').gsub(/_bigger/, '')
       text = inner_html.xpath("//div[@class[contains(., 'js-tweet-text-container')]]/p[@class[contains(., 'js-tweet-text')]]").first.text
       links = inner_html.xpath("//a[@class[contains(., 'twitter-timeline-link')]]").map { |elem| elem.attr('data-expanded-url') }.select { |link| link && !link.include?('pic.twitter') }
       image_urls = inner_html.xpath("//div[@class[contains(., 'AdaptiveMedia-photoContainer')]]").map { |elem| elem.attr('data-image-url') }
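
The new `profile_image_url` extraction strips Twitter's `_bigger` size suffix so the stored URL points at the base avatar variant. For example (sample URL, not from the diff):

```ruby
src = 'https://pbs.twimg.com/profile_images/1826000000/0000_bigger.png'
src.gsub(/_bigger/, '')
# => "https://pbs.twimg.com/profile_images/1826000000/0000.png"
```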
@@ -99,6 +107,7 @@ module Twitterscraper
         screen_name: screen_name,
         name: html.attr('data-name'),
         user_id: html.attr('data-user-id').to_i,
+        profile_image_url: profile_image_url,
         tweet_id: tweet_id,
         text: text,
         links: links,
data/lib/twitterscraper/type.rb ADDED
@@ -0,0 +1,15 @@
+module Twitterscraper
+  class Type
+    def initialize(value)
+      @value = value
+    end
+
+    def search?
+      @value == 'search'
+    end
+
+    def user?
+      @value == 'user'
+    end
+  end
+end
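
`Type` is a small value object that replaces the old `from_user` boolean flag threaded through Query. Usage is as simple as it looks:

```ruby
type = Twitterscraper::Type.new('user')
type.user?   # => true
type.search? # => false
```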
data/lib/version.rb CHANGED
@@ -1,3 +1,3 @@
 module Twitterscraper
-  VERSION = '0.14.0'
+  VERSION = '0.17.0'
 end
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: twitterscraper-ruby
 version: !ruby/object:Gem::Version
-  version: 0.14.0
+  version: 0.17.0
 platform: ruby
 authors:
 - ts-3156
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2020-07-16 00:00:00.000000000 Z
+date: 2020-07-18 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -72,7 +72,9 @@ files:
 - lib/twitterscraper/proxy.rb
 - lib/twitterscraper/query.rb
 - lib/twitterscraper/template.rb
+- lib/twitterscraper/template/tweets.html.erb
 - lib/twitterscraper/tweet.rb
+- lib/twitterscraper/type.rb
 - lib/version.rb
 - twitterscraper-ruby.gemspec
 homepage: https://github.com/ts-3156/twitterscraper-ruby