twitterscraper-ruby 0.5.0 → 0.10.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: e6701ff59f3eb13db9e3b2d024ea983264b528194d64f4f03a95f3576338ed77
- data.tar.gz: 74106816dd406ef1b355b4d4fc94b1baf4509465f6d5bf1ea8f7c654e518eec0
+ metadata.gz: c2429cf6172b5f19caede64ac35f5c796a7c8a67e76fff8dd2f08677fb15406b
+ data.tar.gz: 0f32ca6b559a18c4e3aac3205f6503149e372d4d7d1976b1e83db26036d9ff17
  SHA512:
- metadata.gz: 9710fb74c90dcbc17a22dd613cfe4dce75106951f1e55cd9cfa94a825ecf0b6773a2851ff1cca842f83b5207d3744ac63bcce29031061b0a0ac84cc12d62b8a3
- data.tar.gz: f0e7cd90ecb773a1837be9245b83f51d60f25d48cab716e3584a8d3e1b6f0fe4951eadaf034850599205209d2bf8d7cd4ebfda97f9f862a528c393a2f81887a7
+ metadata.gz: a36ce6c91a363b64b36deeb3abbaaaebb725f3449f280b70be92532497a94dc5915ba449926acfacfc0d852d52471d258d41140a8891e64b6040bf262d0c347f
+ data.tar.gz: a737c7db151190a1493b1a2a92bea304cfcf7512b2ee03fc13c6f25794f5dc727fe548e52cb39eccc2a63261fee0d58fc005920a0e7cd7650d20600e184d79cb
@@ -1,7 +1,7 @@
  PATH
  remote: .
  specs:
- twitterscraper-ruby (0.5.0)
+ twitterscraper-ruby (0.10.0)
  nokogiri
  parallel
 
data/README.md CHANGED
@@ -1,46 +1,162 @@
  # twitterscraper-ruby
 
- Welcome to your new gem! In this directory, you'll find the files you need to be able to package up your Ruby library into a gem. Put your Ruby code in the file `lib/twitterscraper/ruby`. To experiment with that code, run `bin/console` for an interactive prompt.
+ [![Gem Version](https://badge.fury.io/rb/twitterscraper-ruby.svg)](https://badge.fury.io/rb/twitterscraper-ruby)
 
- TODO: Delete this and the text above, and describe your gem
+ A gem to scrape https://twitter.com/search. This gem is inspired by [taspinar/twitterscraper](https://github.com/taspinar/twitterscraper).
 
- ## Installation
 
- Add this line to your application's Gemfile:
+ ## Twitter Search API vs. twitterscraper-ruby
 
- ```ruby
- gem 'twitterscraper-ruby'
- ```
+ ### Twitter Search API
 
- And then execute:
+ - The number of tweets: 180 - 450 requests/15 minutes (18,000 - 45,000 tweets/15 minutes)
+ - The time window: the past 7 days
 
- $ bundle install
+ ### twitterscraper-ruby
 
- Or install it yourself as:
+ - The number of tweets: Unlimited
+ - The time window: from 2006-03-21 to today
 
- $ gem install twitterscraper-ruby
+
+ ## Installation
+
+ First install the library:
+
+ ```shell script
+ $ gem install twitterscraper-ruby
+ ```
+
 
  ## Usage
 
+ Command-line interface:
+
+ ```shell script
+ $ twitterscraper --query KEYWORD --start_date 2020-06-01 --end_date 2020-06-30 --lang ja \
+     --limit 100 --threads 10 --proxy --output output.json
+ ```
+
+ From within Ruby:
+
  ```ruby
  require 'twitterscraper'
+
+ options = {
+   start_date: '2020-06-01',
+   end_date: '2020-06-30',
+   lang: 'ja',
+   limit: 100,
+   threads: 10,
+   proxy: true
+ }
+
+ client = Twitterscraper::Client.new
+ tweets = client.query_tweets(KEYWORD, options)
+
+ tweets.each do |tweet|
+   puts tweet.tweet_id
+   puts tweet.text
+   puts tweet.tweet_url
+   puts tweet.created_at
+
+   hash = tweet.attrs
+   puts hash.keys
+ end
  ```
 
- ## Development
 
- After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake test` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+ ## Attributes
+
+ ### Tweet
+
+ - screen_name
+ - name
+ - user_id
+ - tweet_id
+ - text
+ - links
+ - hashtags
+ - image_urls
+ - video_url
+ - has_media
+ - likes
+ - retweets
+ - replies
+ - is_replied
+ - is_reply_to
+ - parent_tweet_id
+ - reply_to_users
+ - tweet_url
+ - created_at
+
+
+ ## Search operators
+
+ | Operator | Finds Tweets... |
+ | ------------- | ------------- |
+ | watching now | containing both "watching" and "now". This is the default operator. |
+ | "happy hour" | containing the exact phrase "happy hour". |
+ | love OR hate | containing either "love" or "hate" (or both). |
+ | beer -root | containing "beer" but not "root". |
+ | #haiku | containing the hashtag "haiku". |
+ | from:interior | sent from Twitter account "interior". |
+ | to:NASA | a Tweet authored in reply to Twitter account "NASA". |
+ | @NASA | mentioning Twitter account "NASA". |
+ | puppy filter:media | containing "puppy" and an image or video. |
+ | puppy -filter:retweets | containing "puppy", filtering out retweets. |
+ | superhero since:2015-12-21 | containing "superhero" and sent since date "2015-12-21" (year-month-day). |
+ | puppy until:2015-12-21 | containing "puppy" and sent before the date "2015-12-21". |
+
+ Search operators are documented in [Standard search operators](https://developer.twitter.com/en/docs/tweets/rules-and-filtering/overview/standard-operators).
+
+
+ ## Examples
+
+ ```shell script
+ $ twitterscraper --query twitter --limit 1000
+ $ cat tweets.json | jq . | less
+ ```
+
+ ```json
+ [
+   {
+     "screen_name": "@screenname",
+     "name": "name",
+     "user_id": 1194529546483000000,
+     "tweet_id": 1282659891992000000,
+     "tweet_url": "https://twitter.com/screenname/status/1282659891992000000",
+     "created_at": "2020-07-13 12:00:00 +0000",
+     "text": "Thanks Twitter!"
+   }
+ ]
+ ```
+
+ ## CLI Options
+
+ | Option | Description | Default |
+ | ------------- | ------------- | ------------- |
+ | `-h`, `--help` | This option displays a summary of twitterscraper. | |
+ | `--query` | Specify a keyword used during the search. | |
+ | `--start_date` | Set the date from which twitterscraper-ruby should start scraping for your query. | |
+ | `--end_date` | Set the date at which twitterscraper-ruby should stop scraping for your query. | |
+ | `--lang` | Retrieve tweets written in a specific language. | |
+ | `--limit` | Stop scraping once *at least* the number of tweets indicated with --limit has been scraped. | 100 |
+ | `--threads` | Set the number of threads twitterscraper-ruby should initiate while scraping for your query. | 2 |
+ | `--proxy` | Scrape https://twitter.com/search via proxies. | false |
+ | `--format` | The format of the output. | json |
+ | `--output` | The name of the output file. | tweets.json |
 
- To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
 
  ## Contributing
 
- Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/twitterscraper-ruby. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/[USERNAME]/twitterscraper-ruby/blob/master/CODE_OF_CONDUCT.md).
+ Bug reports and pull requests are welcome on GitHub at https://github.com/ts-3156/twitterscraper-ruby. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/ts-3156/twitterscraper-ruby/blob/master/CODE_OF_CONDUCT.md).
 
 
  ## License
 
  The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
 
+
  ## Code of Conduct
 
- Everyone interacting in the Twitterscraper::Ruby project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/[USERNAME]/twitterscraper-ruby/blob/master/CODE_OF_CONDUCT.md).
+ Everyone interacting in the twitterscraper-ruby project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/ts-3156/twitterscraper-ruby/blob/master/CODE_OF_CONDUCT.md).
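The README's search-operator and date-window documentation above can be illustrated with a small standalone sketch. `build_search_query` is a hypothetical helper, not part of the gem; it only shows how a keyword combines with the `since:`/`until:` operators (year-month-day dates) that the gem appends for you.

```ruby
require 'date'

# Hypothetical helper (not part of the gem): compose a keyword with the
# since:/until: search operators documented in the README above.
def build_search_query(keyword, start_date: nil, end_date: nil)
  parts = [keyword]
  parts << "since:#{Date.parse(start_date)}" if start_date
  parts << "until:#{Date.parse(end_date)}" if end_date
  parts.join(' ')
end

puts build_search_query('superhero', start_date: '2015-12-21')
# superhero since:2015-12-21
```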
@@ -5,14 +5,14 @@ require 'twitterscraper/lang'
  require 'twitterscraper/query'
  require 'twitterscraper/client'
  require 'twitterscraper/tweet'
+ require 'twitterscraper/template'
  require 'version'
 
  module Twitterscraper
  class Error < StandardError; end
- # Your code goes here...
 
  def self.logger
- @logger ||= ::Logger.new(STDOUT)
+ @logger ||= ::Logger.new(STDOUT, level: ::Logger::INFO)
  end
 
  def self.logger=(logger)
@@ -8,18 +8,36 @@ module Twitterscraper
  class Cli
  def parse
  @options = parse_options(ARGV)
+ initialize_logger
  end
 
  def run
+ print_help || return if print_help?
+ print_version || return if print_version?
+
+ query_options = {
+ start_date: options['start_date'],
+ end_date: options['end_date'],
+ lang: options['lang'],
+ limit: options['limit'],
+ threads: options['threads'],
+ proxy: options['proxy']
+ }
  client = Twitterscraper::Client.new
- limit = options['limit'] ? options['limit'].to_i : 100
- threads = options['threads'] ? options['threads'].to_i : 2
- tweets = client.query_tweets(options['query'], limit: limit, threads: threads, start_date: options['start_date'], end_date: options['end_date'])
- File.write('tweets.json', generate_json(tweets))
+ tweets = client.query_tweets(options['query'], query_options)
+ export(tweets) unless tweets.empty?
  end
 
- def options
- @options
+ def export(tweets)
+ write_json = lambda { File.write(options['output'], generate_json(tweets)) }
+
+ if options['format'] == 'json'
+ write_json.call
+ elsif options['format'] == 'html'
+ File.write('tweets.html', Template.tweets_embedded_html(tweets))
+ else
+ write_json.call
+ end
  end
 
  def generate_json(tweets)
@@ -30,16 +48,59 @@ module Twitterscraper
  end
  end
 
+ def options
+ @options
+ end
+
  def parse_options(argv)
- argv.getopts(
+ options = argv.getopts(
  'h',
+ 'help',
+ 'v',
+ 'version',
  'query:',
- 'limit:',
  'start_date:',
  'end_date:',
+ 'lang:',
+ 'limit:',
  'threads:',
+ 'output:',
+ 'format:',
+ 'proxy',
  'pretty',
+ 'verbose',
  )
+
+ options['lang'] ||= ''
+ options['limit'] = (options['limit'] || 100).to_i
+ options['threads'] = (options['threads'] || 2).to_i
+ options['format'] ||= 'json'
+ options['output'] ||= "tweets.#{options['format']}"
+
+ options
+ end
+
+ def initialize_logger
+ Twitterscraper.logger.level = ::Logger::DEBUG if options['verbose']
+ end
+
+ def print_help?
+ options['h'] || options['help']
+ end
+
+ def print_help
+ puts <<~'SHELL'
+ Usage:
+ twitterscraper --query KEYWORD --limit 100 --threads 10 --start_date 2020-07-01 --end_date 2020-07-10 --lang ja --proxy --output output.json
+ SHELL
+ end
+
+ def print_version?
+ options['v'] || options['version']
+ end
+
+ def print_version
+ puts "twitterscraper-#{Twitterscraper::VERSION}"
  end
  end
  end
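The new `parse_options` above parses with `Array#getopts` (all values arrive as strings or `nil`) and then normalizes defaults afterwards. That defaulting pattern can be exercised standalone with a made-up argv; `Array#getopts` is available once the array is extended with `OptionParser::Arguable`, which `ARGV` gets automatically when `optparse` is loaded.

```ruby
require 'optparse'

# Sketch of the defaulting logic in Cli#parse_options, run against a
# made-up argv rather than ARGV.
argv = %w[--query ruby --limit 5]
argv.extend(OptionParser::Arguable)

options = argv.getopts('h', 'query:', 'limit:', 'threads:', 'format:', 'output:')

options['limit']   = (options['limit'] || 100).to_i
options['threads'] = (options['threads'] || 2).to_i
options['format']  ||= 'json'
options['output']  ||= "tweets.#{options['format']}"

p options
```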
@@ -6,9 +6,9 @@ module Twitterscraper
  class RetryExhausted < StandardError
  end
 
- class Result
- def initialize(items)
- @items = items
+ class Pool
+ def initialize
+ @items = Proxy.get_proxies
  @cur_index = 0
  end
 
@@ -31,7 +31,6 @@ module Twitterscraper
  def reload
  @items = Proxy.get_proxies
  @cur_index = 0
- Twitterscraper.logger.debug "Reload #{proxies.size} proxies"
  end
  end
 
@@ -46,13 +45,14 @@ module Twitterscraper
 
  table.xpath('tbody/tr').each do |tr|
  cells = tr.xpath('td')
- ip, port, https = [0, 1, 6].map { |i| cells[i].text.strip }
+ ip, port, anonymity, https = [0, 1, 4, 6].map { |i| cells[i].text.strip }
+ next unless ['elite proxy', 'anonymous'].include?(anonymity)
  next if https == 'no'
  proxies << ip + ':' + port
  end
 
  Twitterscraper.logger.debug "Fetch #{proxies.size} proxies"
- Result.new(proxies.shuffle)
+ proxies.shuffle
  rescue => e
  if (retries -= 1) > 0
  retry
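The renamed `Pool` above keeps a list and a cursor (`@items`, `@cur_index`) and refetches on `reload`. Its sampling method is not shown in this diff, so the sketch below is an assumption about the round-robin behavior implied by `initialize`/`reload`, with the network fetch replaced by a fixed list.

```ruby
# Standalone sketch of a round-robin pool like Proxy::Pool above.
# The real Pool refetches proxies in #reload; here reload only rewinds.
class RoundRobinPool
  def initialize(items)
    @items = items
    @cur_index = 0
  end

  # Hand out items in order, wrapping around when the list is exhausted.
  # (Assumed behavior; the gem's sampling method is not part of this diff.)
  def sample
    reload if @cur_index >= @items.size
    item = @items[@cur_index]
    @cur_index += 1
    item
  end

  def reload
    @cur_index = 0
  end
end

pool = RoundRobinPool.new(%w[10.0.0.1:80 10.0.0.2:80])
p [pool.sample, pool.sample, pool.sample]
# ["10.0.0.1:80", "10.0.0.2:80", "10.0.0.1:80"]
```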
@@ -3,6 +3,7 @@ require 'net/http'
  require 'nokogiri'
  require 'date'
  require 'json'
+ require 'erb'
  require 'parallel'
 
  module Twitterscraper
@@ -41,7 +42,8 @@ module Twitterscraper
  end
  end
 
- def get_single_page(url, headers, proxies, timeout = 10, retries = 30)
+ def get_single_page(url, headers, proxies, timeout = 6, retries = 30)
+ return nil if stop_requested?
  Twitterscraper::Http.get(url, headers, proxies.sample, timeout)
  rescue => e
  logger.debug "query_single_page: #{e.inspect}"
@@ -54,6 +56,8 @@ module Twitterscraper
  end
 
  def parse_single_page(text, html = true)
+ return [nil, nil] if text.nil? || text == ''
+
  if html
  json_resp = nil
  items_html = text
@@ -68,12 +72,14 @@ module Twitterscraper
 
  def query_single_page(query, lang, pos, from_user = false, headers: [], proxies: [])
  logger.info("Querying #{query}")
- query = query.gsub(' ', '%20').gsub('#', '%23').gsub(':', '%3A').gsub('&', '%26')
+ query = ERB::Util.url_encode(query)
 
  url = build_query_url(query, lang, pos, from_user)
  logger.debug("Scraping tweets from #{url}")
 
  response = get_single_page(url, headers, proxies)
+ return [], nil if response.nil?
+
  html, json_resp = parse_single_page(response, pos.nil?)
 
  tweets = Tweet.from_html(html)
@@ -91,54 +97,112 @@ module Twitterscraper
  end
  end
 
- def query_tweets(query, start_date: nil, end_date: nil, limit: 100, threads: 2, lang: '')
- start_date = start_date ? Date.parse(start_date) : Date.parse('2006-3-21')
- end_date = end_date ? Date.parse(end_date) : Date.today
- if start_date == end_date
- raise 'Please specify different values for :start_date and :end_date.'
- elsif start_date > end_date
- raise ':start_date must occur before :end_date.'
- end
+ OLDEST_DATE = Date.parse('2006-03-21')
 
- proxies = Twitterscraper::Proxy.get_proxies
+ def validate_options!(query, start_date:, end_date:, lang:, limit:, threads:, proxy:)
+ if query.nil? || query == ''
+ raise 'Please specify a search query.'
+ end
 
- date_range = start_date.upto(end_date - 1)
- queries = date_range.map { |date| query + " since:#{date} until:#{date + 1}" }
- threads = queries.size if threads > queries.size
- logger.info("Threads #{threads}")
+ if ERB::Util.url_encode(query).length >= 500
+ raise ':query must be a UTF-8, URL-encoded search query of 500 characters maximum, including operators.'
+ end
 
- all_tweets = []
- mutex = Mutex.new
+ if start_date && end_date
+ if start_date == end_date
+ raise 'Please specify different values for :start_date and :end_date.'
+ elsif start_date > end_date
+ raise ':start_date must occur before :end_date.'
+ end
+ end
 
- Parallel.each(queries, in_threads: threads) do |query|
- headers = {'User-Agent': USER_AGENT_LIST.sample, 'X-Requested-With': 'XMLHttpRequest'}
- logger.info("Headers #{headers}")
+ if start_date
+ if start_date < OLDEST_DATE
+ raise ":start_date must be greater than or equal to #{OLDEST_DATE}"
+ end
+ end
 
- pos = nil
+ if end_date
+ today = Date.today
+ if end_date > Date.today
+ raise ":end_date must be less than or equal to today(#{today})"
+ end
+ end
+ end
 
- while true
- new_tweets, new_pos = query_single_page(query, lang, pos, headers: headers, proxies: proxies)
- unless new_tweets.empty?
- mutex.synchronize {
- all_tweets.concat(new_tweets)
- all_tweets.uniq! { |t| t.tweet_id }
- }
- end
- logger.info("Got #{new_tweets.size} tweets (total #{all_tweets.size}) worker=#{Parallel.worker_number}")
+ def build_queries(query, start_date, end_date)
+ if start_date && end_date
+ date_range = start_date.upto(end_date - 1)
+ date_range.map { |date| query + " since:#{date} until:#{date + 1}" }
+ elsif start_date
+ [query + " since:#{start_date}"]
+ elsif end_date
+ [query + " until:#{end_date}"]
+ else
+ [query]
+ end
+ end
 
- break unless new_pos
- break if all_tweets.size >= limit
+ def main_loop(query, lang, limit, headers, proxies)
+ pos = nil
 
- pos = new_pos
+ while true
+ new_tweets, new_pos = query_single_page(query, lang, pos, headers: headers, proxies: proxies)
+ unless new_tweets.empty?
+ @mutex.synchronize {
+ @all_tweets.concat(new_tweets)
+ @all_tweets.uniq! { |t| t.tweet_id }
+ }
  end
+ logger.info("Got #{new_tweets.size} tweets (total #{@all_tweets.size})")
 
- if all_tweets.size >= limit
- logger.info("Reached limit #{all_tweets.size}")
- raise Parallel::Break
+ break unless new_pos
+ break if @all_tweets.size >= limit
+
+ pos = new_pos
+ end
+
+ if @all_tweets.size >= limit
+ logger.info("Limit reached #{@all_tweets.size}")
+ @stop_requested = true
+ end
+ end
+
+ def stop_requested?
+ @stop_requested
+ end
+
+ def query_tweets(query, start_date: nil, end_date: nil, lang: '', limit: 100, threads: 2, proxy: false)
+ start_date = Date.parse(start_date) if start_date && start_date.is_a?(String)
+ end_date = Date.parse(end_date) if end_date && end_date.is_a?(String)
+ queries = build_queries(query, start_date, end_date)
+ threads = queries.size if threads > queries.size
+ proxies = proxy ? Twitterscraper::Proxy::Pool.new : []
+
+ validate_options!(queries[0], start_date: start_date, end_date: end_date, lang: lang, limit: limit, threads: threads, proxy: proxy)
+
+ logger.info("The number of threads #{threads}")
+
+ headers = {'User-Agent': USER_AGENT_LIST.sample, 'X-Requested-With': 'XMLHttpRequest'}
+ logger.info("Headers #{headers}")
+
+ @all_tweets = []
+ @mutex = Mutex.new
+ @stop_requested = false
+
+ if threads > 1
+ Parallel.each(queries, in_threads: threads) do |query|
+ main_loop(query, lang, limit, headers, proxies)
+ raise Parallel::Break if stop_requested?
+ end
+ else
+ queries.each do |query|
+ main_loop(query, lang, limit, headers, proxies)
+ break if stop_requested?
  end
  end
 
- all_tweets.sort_by { |tweet| -tweet.created_at.to_i }
+ @all_tweets.sort_by { |tweet| -tweet.created_at.to_i }
  end
  end
  end
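The new `build_queries` above is the heart of the parallelization change: a date range expands into one single-day `since:/until:` query per day, and each becomes a unit of work for a thread. Lifted out of the module so it runs standalone:

```ruby
require 'date'

# build_queries from the diff above: a date range expands into one
# single-day since:/until: query per day.
def build_queries(query, start_date, end_date)
  if start_date && end_date
    start_date.upto(end_date - 1).map { |date| query + " since:#{date} until:#{date + 1}" }
  elsif start_date
    [query + " since:#{start_date}"]
  elsif end_date
    [query + " until:#{end_date}"]
  else
    [query]
  end
end

build_queries('ruby', Date.new(2020, 6, 1), Date.new(2020, 6, 3)).each { |q| puts q }
# ruby since:2020-06-01 until:2020-06-02
# ruby since:2020-06-02 until:2020-06-03
```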
@@ -0,0 +1,48 @@
+ module Twitterscraper
+ module Template
+ module_function
+
+ def tweets_embedded_html(tweets)
+ tweets_html = tweets.map { |t| EMBED_TWEET_HTML.sub('__TWEET_URL__', t.tweet_url) }
+ EMBED_TWEETS_HTML.sub('__TWEETS__', tweets_html.join)
+ end
+
+ EMBED_TWEET_HTML = <<~'HTML'
+ <blockquote class="twitter-tweet">
+ <a href="__TWEET_URL__"></a>
+ </blockquote>
+ HTML
+
+ EMBED_TWEETS_HTML = <<~'HTML'
+ <html>
+ <head>
+ <style type=text/css>
+ .twitter-tweet {
+ margin: 30px auto 0 auto !important;
+ }
+ </style>
+ <script>
+ window.twttr = (function(d, s, id) {
+ var js, fjs = d.getElementsByTagName(s)[0], t = window.twttr || {};
+ if (d.getElementById(id)) return t;
+ js = d.createElement(s);
+ js.id = id;
+ js.src = "https://platform.twitter.com/widgets.js";
+ fjs.parentNode.insertBefore(js, fjs);
+
+ t._e = [];
+ t.ready = function(f) {
+ t._e.push(f);
+ };
+
+ return t;
+ }(document, "script", "twitter-wjs"));
+ </script>
+ </head>
+ <body>
+ __TWEETS__
+ </body>
+ </html>
+ HTML
+ end
+ end
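The new `Template` module fills placeholders with `String#sub`, which replaces only the first occurrence in each copy of the fragment. The same pattern, trimmed down to run standalone (plain URL strings instead of Tweet objects):

```ruby
# Minimal version of Template.tweets_embedded_html above: one blockquote
# fragment per URL, placeholders filled with String#sub.
EMBED_TWEET_HTML = <<~'HTML'
  <blockquote class="twitter-tweet">
    <a href="__TWEET_URL__"></a>
  </blockquote>
HTML

def tweets_embedded_html(tweet_urls)
  tweet_urls.map { |url| EMBED_TWEET_HTML.sub('__TWEET_URL__', url) }.join
end

html = tweets_embedded_html(['https://twitter.com/a/status/1', 'https://twitter.com/b/status/2'])
puts html
```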
@@ -2,7 +2,28 @@ require 'time'
 
  module Twitterscraper
  class Tweet
- KEYS = [:screen_name, :name, :user_id, :tweet_id, :tweet_url, :created_at, :text]
+ KEYS = [
+ :screen_name,
+ :name,
+ :user_id,
+ :tweet_id,
+ :text,
+ :links,
+ :hashtags,
+ :image_urls,
+ :video_url,
+ :has_media,
+ :likes,
+ :retweets,
+ :replies,
+ :is_replied,
+ :is_reply_to,
+ :parent_tweet_id,
+ :reply_to_users,
+ :tweet_url,
+ :timestamp,
+ :created_at,
+ ]
  attr_reader *KEYS
 
  def initialize(attrs)
@@ -11,10 +32,14 @@ module Twitterscraper
  end
  end
 
- def to_json(options = {})
+ def attrs
  KEYS.map do |key|
  [key, send(key)]
- end.to_h.to_json
+ end.to_h
+ end
+
+ def to_json(options = {})
+ attrs.to_json
  end
 
  class << self
@@ -31,15 +56,51 @@ module Twitterscraper
 
  def from_tweet_html(html)
  inner_html = Nokogiri::HTML(html.inner_html)
+ tweet_id = html.attr('data-tweet-id').to_i
+ text = inner_html.xpath("//div[@class[contains(., 'js-tweet-text-container')]]/p[@class[contains(., 'js-tweet-text')]]").first.text
+ links = inner_html.xpath("//a[@class[contains(., 'twitter-timeline-link')]]").map { |elem| elem.attr('data-expanded-url') }.select { |link| link && !link.include?('pic.twitter') }
+ image_urls = inner_html.xpath("//div[@class[contains(., 'AdaptiveMedia-photoContainer')]]").map { |elem| elem.attr('data-image-url') }
+ video_url = inner_html.xpath("//div[@class[contains(., 'PlayableMedia-container')]]/a").map { |elem| elem.attr('href') }[0]
+ has_media = !image_urls.empty? || (video_url && !video_url.empty?)
+
+ actions = inner_html.xpath("//div[@class[contains(., 'ProfileTweet-actionCountList')]]")
+ likes = actions.xpath("//span[@class[contains(., 'ProfileTweet-action--favorite')]]/span[@class[contains(., 'ProfileTweet-actionCount')]]").first.attr('data-tweet-stat-count').to_i || 0
+ retweets = actions.xpath("//span[@class[contains(., 'ProfileTweet-action--retweet')]]/span[@class[contains(., 'ProfileTweet-actionCount')]]").first.attr('data-tweet-stat-count').to_i || 0
+ replies = actions.xpath("//span[@class[contains(., 'ProfileTweet-action--reply u-hiddenVisually')]]/span[@class[contains(., 'ProfileTweet-actionCount')]]").first.attr('data-tweet-stat-count').to_i || 0
+ is_replied = replies != 0
+
+ parent_tweet_id = inner_html.xpath('//*[@data-conversation-id]').first.attr('data-conversation-id').to_i
+ if tweet_id == parent_tweet_id
+ is_reply_to = false
+ parent_tweet_id = nil
+ reply_to_users = []
+ else
+ is_reply_to = true
+ reply_to_users = inner_html.xpath("//div[@class[contains(., 'ReplyingToContextBelowAuthor')]]/a").map { |user| {screen_name: user.text.delete_prefix('@'), user_id: user.attr('data-user-id')} }
+ end
+
  timestamp = inner_html.xpath("//span[@class[contains(., 'js-short-timestamp')]]").first.attr('data-time').to_i
  new(
  screen_name: html.attr('data-screen-name'),
  name: html.attr('data-name'),
  user_id: html.attr('data-user-id').to_i,
- tweet_id: html.attr('data-tweet-id').to_i,
+ tweet_id: tweet_id,
+ text: text,
+ links: links,
+ hashtags: text.scan(/#\w+/).map { |tag| tag.delete_prefix('#') },
+ image_urls: image_urls,
+ video_url: video_url,
+ has_media: has_media,
+ likes: likes,
+ retweets: retweets,
+ replies: replies,
+ is_replied: is_replied,
+ is_reply_to: is_reply_to,
+ parent_tweet_id: parent_tweet_id,
+ reply_to_users: reply_to_users,
  tweet_url: 'https://twitter.com' + html.attr('data-permalink-path'),
+ timestamp: timestamp,
  created_at: Time.at(timestamp, in: '+00:00'),
- text: inner_html.xpath("//div[@class[contains(., 'js-tweet-text-container')]]/p[@class[contains(., 'js-tweet-text')]]").first.text,
  )
  end
  end
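The expanded `Tweet` class keeps one `KEYS` list driving the readers, the new `attrs` hash, and `to_json`. A self-contained miniature of that pattern (the `initialize` body is assumed from the truncated hunk; `MiniTweet` is not part of the gem):

```ruby
require 'json'

# Sketch of the KEYS-driven attribute pattern in Tweet above: one list
# drives attr_reader, the attrs hash, and JSON serialization.
class MiniTweet
  KEYS = [:screen_name, :tweet_id, :text]
  attr_reader(*KEYS)

  # Assumed initializer shape; the diff truncates Tweet#initialize.
  def initialize(attrs)
    attrs.each { |key, value| instance_variable_set(:"@#{key}", value) }
  end

  def attrs
    KEYS.map { |key| [key, send(key)] }.to_h
  end

  def to_json(options = {})
    attrs.to_json
  end
end

puts MiniTweet.new(screen_name: 'a', tweet_id: 1, text: 'hi').to_json
# {"screen_name":"a","tweet_id":1,"text":"hi"}
```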
@@ -1,3 +1,3 @@
  module Twitterscraper
- VERSION = '0.5.0'
+ VERSION = '0.10.0'
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: twitterscraper-ruby
  version: !ruby/object:Gem::Version
- version: 0.5.0
+ version: 0.10.0
  platform: ruby
  authors:
  - ts-3156
@@ -68,6 +68,7 @@ files:
  - lib/twitterscraper/logger.rb
  - lib/twitterscraper/proxy.rb
  - lib/twitterscraper/query.rb
+ - lib/twitterscraper/template.rb
  - lib/twitterscraper/tweet.rb
  - lib/version.rb
  - twitterscraper-ruby.gemspec