twitterscraper-ruby 0.5.0 → 0.10.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: e6701ff59f3eb13db9e3b2d024ea983264b528194d64f4f03a95f3576338ed77
- data.tar.gz: 74106816dd406ef1b355b4d4fc94b1baf4509465f6d5bf1ea8f7c654e518eec0
+ metadata.gz: c2429cf6172b5f19caede64ac35f5c796a7c8a67e76fff8dd2f08677fb15406b
+ data.tar.gz: 0f32ca6b559a18c4e3aac3205f6503149e372d4d7d1976b1e83db26036d9ff17
  SHA512:
- metadata.gz: 9710fb74c90dcbc17a22dd613cfe4dce75106951f1e55cd9cfa94a825ecf0b6773a2851ff1cca842f83b5207d3744ac63bcce29031061b0a0ac84cc12d62b8a3
- data.tar.gz: f0e7cd90ecb773a1837be9245b83f51d60f25d48cab716e3584a8d3e1b6f0fe4951eadaf034850599205209d2bf8d7cd4ebfda97f9f862a528c393a2f81887a7
+ metadata.gz: a36ce6c91a363b64b36deeb3abbaaaebb725f3449f280b70be92532497a94dc5915ba449926acfacfc0d852d52471d258d41140a8891e64b6040bf262d0c347f
+ data.tar.gz: a737c7db151190a1493b1a2a92bea304cfcf7512b2ee03fc13c6f25794f5dc727fe548e52cb39eccc2a63261fee0d58fc005920a0e7cd7650d20600e184d79cb
@@ -1,7 +1,7 @@
  PATH
  remote: .
  specs:
- twitterscraper-ruby (0.5.0)
+ twitterscraper-ruby (0.10.0)
  nokogiri
  parallel
 
data/README.md CHANGED
@@ -1,46 +1,162 @@
  # twitterscraper-ruby
 
- Welcome to your new gem! In this directory, you'll find the files you need to be able to package up your Ruby library into a gem. Put your Ruby code in the file `lib/twitterscraper/ruby`. To experiment with that code, run `bin/console` for an interactive prompt.
+ [![Gem Version](https://badge.fury.io/rb/twitterscraper-ruby.svg)](https://badge.fury.io/rb/twitterscraper-ruby)
 
- TODO: Delete this and the text above, and describe your gem
+ A gem to scrape https://twitter.com/search. This gem is inspired by [taspinar/twitterscraper](https://github.com/taspinar/twitterscraper).
 
- ## Installation
 
- Add this line to your application's Gemfile:
+ ## Twitter Search API vs. twitterscraper-ruby
 
- ```ruby
- gem 'twitterscraper-ruby'
- ```
+ ### Twitter Search API
 
- And then execute:
+ - The number of tweets: 180 - 450 requests/15 minutes (18,000 - 45,000 tweets/15 minutes)
+ - The time window: the past 7 days
 
- $ bundle install
+ ### twitterscraper-ruby
 
- Or install it yourself as:
+ - The number of tweets: Unlimited
+ - The time window: from 2006-03-21 to today
 
- $ gem install twitterscraper-ruby
+
+ ## Installation
+
+ First install the library:
+
+ ```shell script
+ $ gem install twitterscraper-ruby
+ ```
+
 
  ## Usage
 
+ Command-line interface:
+
+ ```shell script
+ $ twitterscraper --query KEYWORD --start_date 2020-06-01 --end_date 2020-06-30 --lang ja \
+ --limit 100 --threads 10 --proxy --output output.json
+ ```
+
+ From within Ruby:
+
  ```ruby
  require 'twitterscraper'
+
+ options = {
+ start_date: '2020-06-01',
+ end_date: '2020-06-30',
+ lang: 'ja',
+ limit: 100,
+ threads: 10,
+ proxy: true
+ }
+
+ client = Twitterscraper::Client.new
+ tweets = client.query_tweets(KEYWORD, options)
+
+ tweets.each do |tweet|
+ puts tweet.tweet_id
+ puts tweet.text
+ puts tweet.tweet_url
+ puts tweet.created_at
+
+ hash = tweet.attrs
+ puts hash.keys
+ end
  ```
 
- ## Development
 
- After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake test` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+ ## Attributes
+
+ ### Tweet
+
+ - screen_name
+ - name
+ - user_id
+ - tweet_id
+ - text
+ - links
+ - hashtags
+ - image_urls
+ - video_url
+ - has_media
+ - likes
+ - retweets
+ - replies
+ - is_replied
+ - is_reply_to
+ - parent_tweet_id
+ - reply_to_users
+ - tweet_url
+ - created_at
+
+
+ ## Search operators
+
+ | Operator | Finds Tweets... |
+ | ------------- | ------------- |
+ | watching now | containing both "watching" and "now". This is the default operator. |
+ | "happy hour" | containing the exact phrase "happy hour". |
+ | love OR hate | containing either "love" or "hate" (or both). |
+ | beer -root | containing "beer" but not "root". |
+ | #haiku | containing the hashtag "haiku". |
+ | from:interior | sent from Twitter account "interior". |
+ | to:NASA | a Tweet authored in reply to Twitter account "NASA". |
+ | @NASA | mentioning Twitter account "NASA". |
+ | puppy filter:media | containing "puppy" and an image or video. |
+ | puppy -filter:retweets | containing "puppy", filtering out retweets. |
+ | superhero since:2015-12-21 | containing "superhero" and sent since date "2015-12-21" (year-month-day). |
+ | puppy until:2015-12-21 | containing "puppy" and sent before the date "2015-12-21". |
+
+ Search operator documentation is available at [Standard search operators](https://developer.twitter.com/en/docs/tweets/rules-and-filtering/overview/standard-operators).
+
+
+ ## Examples
+
+ ```shell script
+ $ twitterscraper --query twitter --limit 1000
+ $ cat tweets.json | jq . | less
+ ```
+
+ ```json
+ [
+ {
+ "screen_name": "@screenname",
+ "name": "name",
+ "user_id": 1194529546483000000,
+ "tweet_id": 1282659891992000000,
+ "tweet_url": "https://twitter.com/screenname/status/1282659891992000000",
+ "created_at": "2020-07-13 12:00:00 +0000",
+ "text": "Thanks Twitter!"
+ }
+ ]
+ ```
+
+ ## CLI Options
+
+ | Option | Description | Default |
+ | ------------- | ------------- | ------------- |
+ | `-h`, `--help` | Display a summary of twitterscraper options. | |
+ | `--query` | Specify a keyword used during the search. | |
+ | `--start_date` | Set the date from which twitterscraper-ruby should start scraping for your query. | |
+ | `--end_date` | Set the end date at which twitterscraper-ruby should stop scraping for your query. | |
+ | `--lang` | Retrieve tweets written in a specific language. | |
+ | `--limit` | Stop scraping once *at least* the number of tweets specified with `--limit` has been collected. | 100 |
+ | `--threads` | Set the number of threads twitterscraper-ruby should initiate while scraping for your query. | 2 |
+ | `--proxy` | Scrape https://twitter.com/search via proxies. | false |
+ | `--format` | The format of the output. | json |
+ | `--output` | The name of the output file. | tweets.json |
 
 
  ## Contributing
 
- Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/twitterscraper-ruby. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/[USERNAME]/twitterscraper-ruby/blob/master/CODE_OF_CONDUCT.md).
+ Bug reports and pull requests are welcome on GitHub at https://github.com/ts-3156/twitterscraper-ruby. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/ts-3156/twitterscraper-ruby/blob/master/CODE_OF_CONDUCT.md).
 
 
  ## License
 
  The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
 
+
  ## Code of Conduct
 
- Everyone interacting in the Twitterscraper::Ruby project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/[USERNAME]/twitterscraper-ruby/blob/master/CODE_OF_CONDUCT.md).
+ Everyone interacting in the twitterscraper-ruby project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/ts-3156/twitterscraper-ruby/blob/master/CODE_OF_CONDUCT.md).
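The operators in the README's search-operators table are plain text inside the query string; a minimal sketch of composing one (the operator values are illustrative):

```ruby
# Search operators are ordinary text in the query string; the scraper
# URL-encodes the assembled query before building the search URL.
terms = ['"happy hour"', 'from:interior', 'since:2015-12-21', '-filter:retweets']
query = terms.join(' ')
puts query
# => "happy hour" from:interior since:2015-12-21 -filter:retweets
```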
@@ -5,14 +5,14 @@ require 'twitterscraper/lang'
  require 'twitterscraper/query'
  require 'twitterscraper/client'
  require 'twitterscraper/tweet'
+ require 'twitterscraper/template'
  require 'version'
 
  module Twitterscraper
  class Error < StandardError; end
- # Your code goes here...
 
  def self.logger
- @logger ||= ::Logger.new(STDOUT)
+ @logger ||= ::Logger.new(STDOUT, level: ::Logger::INFO)
  end
 
  def self.logger=(logger)
@@ -8,18 +8,36 @@ module Twitterscraper
  class Cli
  def parse
  @options = parse_options(ARGV)
+ initialize_logger
  end
 
  def run
+ print_help || return if print_help?
+ print_version || return if print_version?
+
+ query_options = {
+ start_date: options['start_date'],
+ end_date: options['end_date'],
+ lang: options['lang'],
+ limit: options['limit'],
+ threads: options['threads'],
+ proxy: options['proxy']
+ }
  client = Twitterscraper::Client.new
- limit = options['limit'] ? options['limit'].to_i : 100
- threads = options['threads'] ? options['threads'].to_i : 2
- tweets = client.query_tweets(options['query'], limit: limit, threads: threads, start_date: options['start_date'], end_date: options['end_date'])
- File.write('tweets.json', generate_json(tweets))
+ tweets = client.query_tweets(options['query'], query_options)
+ export(tweets) unless tweets.empty?
  end
 
- def options
- @options
+ def export(tweets)
+ write_json = lambda { File.write(options['output'], generate_json(tweets)) }
+
+ if options['format'] == 'json'
+ write_json.call
+ elsif options['format'] == 'html'
+ File.write('tweets.html', Template.tweets_embedded_html(tweets))
+ else
+ write_json.call
+ end
  end
 
  def generate_json(tweets)
@@ -30,16 +48,59 @@ module Twitterscraper
  end
  end
 
+ def options
+ @options
+ end
+
  def parse_options(argv)
- argv.getopts(
+ options = argv.getopts(
  'h',
+ 'help',
+ 'v',
+ 'version',
  'query:',
- 'limit:',
  'start_date:',
  'end_date:',
+ 'lang:',
+ 'limit:',
  'threads:',
+ 'output:',
+ 'format:',
+ 'proxy',
  'pretty',
+ 'verbose',
  )
+
+ options['lang'] ||= ''
+ options['limit'] = (options['limit'] || 100).to_i
+ options['threads'] = (options['threads'] || 2).to_i
+ options['format'] ||= 'json'
+ options['output'] ||= "tweets.#{options['format']}"
+
+ options
+ end
+
+ def initialize_logger
+ Twitterscraper.logger.level = ::Logger::DEBUG if options['verbose']
+ end
+
+ def print_help?
+ options['h'] || options['help']
+ end
+
+ def print_help
+ puts <<~'SHELL'
+ Usage:
+ twitterscraper --query KEYWORD --limit 100 --threads 10 --start_date 2020-07-01 --end_date 2020-07-10 --lang ja --proxy --output output.json
+ SHELL
+ end
+
+ def print_version?
+ options['v'] || options['version']
+ end
+
+ def print_version
+ puts "twitterscraper-#{Twitterscraper::VERSION}"
  end
  end
  end
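The defaults applied at the end of `parse_options` can be exercised in isolation; a minimal sketch (the `apply_defaults` wrapper name is hypothetical, the default values are the ones from the diff):

```ruby
# Mirrors the option defaults added at the end of Cli#parse_options.
def apply_defaults(options)
  options['lang'] ||= ''
  options['limit'] = (options['limit'] || 100).to_i
  options['threads'] = (options['threads'] || 2).to_i
  options['format'] ||= 'json'
  # The output filename tracks the chosen format unless given explicitly.
  options['output'] ||= "tweets.#{options['format']}"
  options
end

opts = apply_defaults({'format' => 'html'})
# opts['output'] is "tweets.html", opts['limit'] is 100
```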
@@ -6,9 +6,9 @@ module Twitterscraper
  class RetryExhausted < StandardError
  end
 
- class Result
- def initialize(items)
- @items = items
+ class Pool
+ def initialize
+ @items = Proxy.get_proxies
  @cur_index = 0
  end
 
@@ -31,7 +31,6 @@ module Twitterscraper
  def reload
  @items = Proxy.get_proxies
  @cur_index = 0
- Twitterscraper.logger.debug "Reload #{proxies.size} proxies"
  end
  end
 
@@ -46,13 +45,14 @@ module Twitterscraper
 
  table.xpath('tbody/tr').each do |tr|
  cells = tr.xpath('td')
- ip, port, https = [0, 1, 6].map { |i| cells[i].text.strip }
+ ip, port, anonymity, https = [0, 1, 4, 6].map { |i| cells[i].text.strip }
+ next unless ['elite proxy', 'anonymous'].include?(anonymity)
  next if https == 'no'
  proxies << ip + ':' + port
  end
 
  Twitterscraper.logger.debug "Fetch #{proxies.size} proxies"
- Result.new(proxies.shuffle)
+ proxies.shuffle
  rescue => e
  if (retries -= 1) > 0
  retry
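The new anonymity check in `Proxy.get_proxies` keeps only anonymous or elite HTTPS-capable proxies. A runnable sketch of the same filter over stand-in rows (the sample data is illustrative, not scraped):

```ruby
# Mirror of the filtering added to Proxy.get_proxies: keep only
# anonymous/elite proxies that support HTTPS. Hashes stand in for
# the scraped table cells.
rows = [
  { ip: '1.2.3.4', port: '8080', anonymity: 'elite proxy', https: 'yes' },
  { ip: '5.6.7.8', port: '3128', anonymity: 'transparent', https: 'yes' },
  { ip: '9.9.9.9', port: '80',   anonymity: 'anonymous',   https: 'no'  },
]
proxies = rows.select { |r| ['elite proxy', 'anonymous'].include?(r[:anonymity]) && r[:https] != 'no' }
              .map { |r| "#{r[:ip]}:#{r[:port]}" }
# proxies == ["1.2.3.4:8080"]
```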
@@ -3,6 +3,7 @@ require 'net/http'
  require 'nokogiri'
  require 'date'
  require 'json'
+ require 'erb'
  require 'parallel'
 
  module Twitterscraper
@@ -41,7 +42,8 @@ module Twitterscraper
  end
  end
 
- def get_single_page(url, headers, proxies, timeout = 10, retries = 30)
+ def get_single_page(url, headers, proxies, timeout = 6, retries = 30)
+ return nil if stop_requested?
  Twitterscraper::Http.get(url, headers, proxies.sample, timeout)
  rescue => e
  logger.debug "query_single_page: #{e.inspect}"
@@ -54,6 +56,8 @@ module Twitterscraper
  end
 
  def parse_single_page(text, html = true)
+ return [nil, nil] if text.nil? || text == ''
+
  if html
  json_resp = nil
  items_html = text
@@ -68,12 +72,14 @@ module Twitterscraper
 
  def query_single_page(query, lang, pos, from_user = false, headers: [], proxies: [])
  logger.info("Querying #{query}")
- query = query.gsub(' ', '%20').gsub('#', '%23').gsub(':', '%3A').gsub('&', '%26')
+ query = ERB::Util.url_encode(query)
 
  url = build_query_url(query, lang, pos, from_user)
  logger.debug("Scraping tweets from #{url}")
 
  response = get_single_page(url, headers, proxies)
+ return [], nil if response.nil?
+
  html, json_resp = parse_single_page(response, pos.nil?)
 
  tweets = Tweet.from_html(html)
@@ -91,54 +97,112 @@ module Twitterscraper
  end
  end
 
- def query_tweets(query, start_date: nil, end_date: nil, limit: 100, threads: 2, lang: '')
- start_date = start_date ? Date.parse(start_date) : Date.parse('2006-3-21')
- end_date = end_date ? Date.parse(end_date) : Date.today
- if start_date == end_date
- raise 'Please specify different values for :start_date and :end_date.'
- elsif start_date > end_date
- raise ':start_date must occur before :end_date.'
- end
+ OLDEST_DATE = Date.parse('2006-03-21')
 
- proxies = Twitterscraper::Proxy.get_proxies
+ def validate_options!(query, start_date:, end_date:, lang:, limit:, threads:, proxy:)
+ if query.nil? || query == ''
+ raise 'Please specify a search query.'
+ end
 
- date_range = start_date.upto(end_date - 1)
- queries = date_range.map { |date| query + " since:#{date} until:#{date + 1}" }
- threads = queries.size if threads > queries.size
- logger.info("Threads #{threads}")
+ if ERB::Util.url_encode(query).length >= 500
+ raise ':query must be a UTF-8, URL-encoded search query of 500 characters maximum, including operators.'
+ end
 
- all_tweets = []
- mutex = Mutex.new
+ if start_date && end_date
+ if start_date == end_date
+ raise 'Please specify different values for :start_date and :end_date.'
+ elsif start_date > end_date
+ raise ':start_date must occur before :end_date.'
+ end
+ end
 
- Parallel.each(queries, in_threads: threads) do |query|
- headers = {'User-Agent': USER_AGENT_LIST.sample, 'X-Requested-With': 'XMLHttpRequest'}
- logger.info("Headers #{headers}")
+ if start_date
+ if start_date < OLDEST_DATE
+ raise ":start_date must be greater than or equal to #{OLDEST_DATE}"
+ end
+ end
 
- pos = nil
+ if end_date
+ today = Date.today
+ if end_date > Date.today
+ raise ":end_date must be less than or equal to today(#{today})"
+ end
+ end
+ end
 
- while true
- new_tweets, new_pos = query_single_page(query, lang, pos, headers: headers, proxies: proxies)
- unless new_tweets.empty?
- mutex.synchronize {
- all_tweets.concat(new_tweets)
- all_tweets.uniq! { |t| t.tweet_id }
- }
- end
- logger.info("Got #{new_tweets.size} tweets (total #{all_tweets.size}) worker=#{Parallel.worker_number}")
+ def build_queries(query, start_date, end_date)
+ if start_date && end_date
+ date_range = start_date.upto(end_date - 1)
+ date_range.map { |date| query + " since:#{date} until:#{date + 1}" }
+ elsif start_date
+ [query + " since:#{start_date}"]
+ elsif end_date
+ [query + " until:#{end_date}"]
+ else
+ [query]
+ end
+ end
 
- break unless new_pos
- break if all_tweets.size >= limit
+ def main_loop(query, lang, limit, headers, proxies)
+ pos = nil
 
- pos = new_pos
+ while true
+ new_tweets, new_pos = query_single_page(query, lang, pos, headers: headers, proxies: proxies)
+ unless new_tweets.empty?
+ @mutex.synchronize {
+ @all_tweets.concat(new_tweets)
+ @all_tweets.uniq! { |t| t.tweet_id }
+ }
  end
+ logger.info("Got #{new_tweets.size} tweets (total #{@all_tweets.size})")
 
- if all_tweets.size >= limit
- logger.info("Reached limit #{all_tweets.size}")
- raise Parallel::Break
+ break unless new_pos
+ break if @all_tweets.size >= limit
+
+ pos = new_pos
+ end
+
+ if @all_tweets.size >= limit
+ logger.info("Limit reached #{@all_tweets.size}")
+ @stop_requested = true
+ end
+ end
+
+ def stop_requested?
+ @stop_requested
+ end
+
+ def query_tweets(query, start_date: nil, end_date: nil, lang: '', limit: 100, threads: 2, proxy: false)
+ start_date = Date.parse(start_date) if start_date && start_date.is_a?(String)
+ end_date = Date.parse(end_date) if end_date && end_date.is_a?(String)
+ queries = build_queries(query, start_date, end_date)
+ threads = queries.size if threads > queries.size
+ proxies = proxy ? Twitterscraper::Proxy::Pool.new : []
+
+ validate_options!(queries[0], start_date: start_date, end_date: end_date, lang: lang, limit: limit, threads: threads, proxy: proxy)
+
+ logger.info("The number of threads #{threads}")
+
+ headers = {'User-Agent': USER_AGENT_LIST.sample, 'X-Requested-With': 'XMLHttpRequest'}
+ logger.info("Headers #{headers}")
+
+ @all_tweets = []
+ @mutex = Mutex.new
+ @stop_requested = false
+
+ if threads > 1
+ Parallel.each(queries, in_threads: threads) do |query|
+ main_loop(query, lang, limit, headers, proxies)
+ raise Parallel::Break if stop_requested?
+ end
+ else
+ queries.each do |query|
+ main_loop(query, lang, limit, headers, proxies)
+ break if stop_requested?
  end
  end
 
- all_tweets.sort_by { |tweet| -tweet.created_at.to_i }
+ @all_tweets.sort_by { |tweet| -tweet.created_at.to_i }
  end
  end
  end
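The `build_queries` method added above slices a date range into one-day `since:`/`until:` windows, so each thread scrapes a single day of results. Isolated as a runnable sketch:

```ruby
require 'date'

# Mirrors the build_queries method from the diff: one query per day
# when both dates are given, a single open-ended query otherwise.
def build_queries(query, start_date, end_date)
  if start_date && end_date
    date_range = start_date.upto(end_date - 1)
    date_range.map { |date| query + " since:#{date} until:#{date + 1}" }
  elsif start_date
    [query + " since:#{start_date}"]
  elsif end_date
    [query + " until:#{end_date}"]
  else
    [query]
  end
end

queries = build_queries('ruby', Date.new(2020, 6, 1), Date.new(2020, 6, 3))
# => ["ruby since:2020-06-01 until:2020-06-02", "ruby since:2020-06-02 until:2020-06-03"]
```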
@@ -0,0 +1,48 @@
+ module Twitterscraper
+ module Template
+ module_function
+
+ def tweets_embedded_html(tweets)
+ tweets_html = tweets.map { |t| EMBED_TWEET_HTML.sub('__TWEET_URL__', t.tweet_url) }
+ EMBED_TWEETS_HTML.sub('__TWEETS__', tweets_html.join)
+ end
+
+ EMBED_TWEET_HTML = <<~'HTML'
+ <blockquote class="twitter-tweet">
+ <a href="__TWEET_URL__"></a>
+ </blockquote>
+ HTML
+
+ EMBED_TWEETS_HTML = <<~'HTML'
+ <html>
+ <head>
+ <style type=text/css>
+ .twitter-tweet {
+ margin: 30px auto 0 auto !important;
+ }
+ </style>
+ <script>
+ window.twttr = (function(d, s, id) {
+ var js, fjs = d.getElementsByTagName(s)[0], t = window.twttr || {};
+ if (d.getElementById(id)) return t;
+ js = d.createElement(s);
+ js.id = id;
+ js.src = "https://platform.twitter.com/widgets.js";
+ fjs.parentNode.insertBefore(js, fjs);
+
+ t._e = [];
+ t.ready = function(f) {
+ t._e.push(f);
+ };
+
+ return t;
+ }(document, "script", "twitter-wjs"));
+ </script>
+ </head>
+ <body>
+ __TWEETS__
+ </body>
+ </html>
+ HTML
+ end
+ end
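The new `Template` module builds the HTML export by plain string substitution: each tweet URL into an embed snippet, then all snippets into a page shell. A minimal stand-in (the markup here is trimmed down from the real templates):

```ruby
# Stand-in for Template.tweets_embedded_html: substitute each URL into
# an embed snippet, then join the snippets into a page shell.
embed = %(<blockquote class="twitter-tweet"><a href="__TWEET_URL__"></a></blockquote>\n)
page  = '<html><body>__TWEETS__</body></html>'
urls  = ['https://twitter.com/a/status/1', 'https://twitter.com/b/status/2']
html  = page.sub('__TWEETS__', urls.map { |u| embed.sub('__TWEET_URL__', u) }.join)
```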
@@ -2,7 +2,28 @@ require 'time'
 
  module Twitterscraper
  class Tweet
- KEYS = [:screen_name, :name, :user_id, :tweet_id, :tweet_url, :created_at, :text]
+ KEYS = [
+ :screen_name,
+ :name,
+ :user_id,
+ :tweet_id,
+ :text,
+ :links,
+ :hashtags,
+ :image_urls,
+ :video_url,
+ :has_media,
+ :likes,
+ :retweets,
+ :replies,
+ :is_replied,
+ :is_reply_to,
+ :parent_tweet_id,
+ :reply_to_users,
+ :tweet_url,
+ :timestamp,
+ :created_at,
+ ]
  attr_reader *KEYS
 
  def initialize(attrs)
@@ -11,10 +32,14 @@ module Twitterscraper
  end
  end
 
- def to_json(options = {})
+ def attrs
  KEYS.map do |key|
  [key, send(key)]
- end.to_h.to_json
+ end.to_h
+ end
+
+ def to_json(options = {})
+ attrs.to_json
  end
 
  class << self
@@ -31,15 +56,51 @@ module Twitterscraper
 
  def from_tweet_html(html)
  inner_html = Nokogiri::HTML(html.inner_html)
+ tweet_id = html.attr('data-tweet-id').to_i
+ text = inner_html.xpath("//div[@class[contains(., 'js-tweet-text-container')]]/p[@class[contains(., 'js-tweet-text')]]").first.text
+ links = inner_html.xpath("//a[@class[contains(., 'twitter-timeline-link')]]").map { |elem| elem.attr('data-expanded-url') }.select { |link| link && !link.include?('pic.twitter') }
+ image_urls = inner_html.xpath("//div[@class[contains(., 'AdaptiveMedia-photoContainer')]]").map { |elem| elem.attr('data-image-url') }
+ video_url = inner_html.xpath("//div[@class[contains(., 'PlayableMedia-container')]]/a").map { |elem| elem.attr('href') }[0]
+ has_media = !image_urls.empty? || (video_url && !video_url.empty?)
+
+ actions = inner_html.xpath("//div[@class[contains(., 'ProfileTweet-actionCountList')]]")
+ likes = actions.xpath("//span[@class[contains(., 'ProfileTweet-action--favorite')]]/span[@class[contains(., 'ProfileTweet-actionCount')]]").first.attr('data-tweet-stat-count').to_i || 0
+ retweets = actions.xpath("//span[@class[contains(., 'ProfileTweet-action--retweet')]]/span[@class[contains(., 'ProfileTweet-actionCount')]]").first.attr('data-tweet-stat-count').to_i || 0
+ replies = actions.xpath("//span[@class[contains(., 'ProfileTweet-action--reply u-hiddenVisually')]]/span[@class[contains(., 'ProfileTweet-actionCount')]]").first.attr('data-tweet-stat-count').to_i || 0
+ is_replied = replies != 0
+
+ parent_tweet_id = inner_html.xpath('//*[@data-conversation-id]').first.attr('data-conversation-id').to_i
+ if tweet_id == parent_tweet_id
+ is_reply_to = false
+ parent_tweet_id = nil
+ reply_to_users = []
+ else
+ is_reply_to = true
+ reply_to_users = inner_html.xpath("//div[@class[contains(., 'ReplyingToContextBelowAuthor')]]/a").map { |user| {screen_name: user.text.delete_prefix('@'), user_id: user.attr('data-user-id')} }
+ end
+
  timestamp = inner_html.xpath("//span[@class[contains(., 'js-short-timestamp')]]").first.attr('data-time').to_i
  new(
  screen_name: html.attr('data-screen-name'),
  name: html.attr('data-name'),
  user_id: html.attr('data-user-id').to_i,
- tweet_id: html.attr('data-tweet-id').to_i,
+ tweet_id: tweet_id,
+ text: text,
+ links: links,
+ hashtags: text.scan(/#\w+/).map { |tag| tag.delete_prefix('#') },
+ image_urls: image_urls,
+ video_url: video_url,
+ has_media: has_media,
+ likes: likes,
+ retweets: retweets,
+ replies: replies,
+ is_replied: is_replied,
+ is_reply_to: is_reply_to,
+ parent_tweet_id: parent_tweet_id,
+ reply_to_users: reply_to_users,
  tweet_url: 'https://twitter.com' + html.attr('data-permalink-path'),
+ timestamp: timestamp,
  created_at: Time.at(timestamp, in: '+00:00'),
- text: inner_html.xpath("//div[@class[contains(., 'js-tweet-text-container')]]/p[@class[contains(., 'js-tweet-text')]]").first.text,
  )
  end
  end
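Among the new `Tweet` attributes, hashtags are derived from the tweet text rather than the DOM; the extraction added in `from_tweet_html`, isolated (the sample text is illustrative):

```ruby
# The hashtag extraction from the diff: scan for #word tokens, then
# strip the leading '#'. Requires Ruby 2.5+ for String#delete_prefix.
text = 'Shipping #ruby gems with #opensource love'
hashtags = text.scan(/#\w+/).map { |tag| tag.delete_prefix('#') }
# => ["ruby", "opensource"]
```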
@@ -1,3 +1,3 @@
  module Twitterscraper
- VERSION = '0.5.0'
+ VERSION = '0.10.0'
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: twitterscraper-ruby
  version: !ruby/object:Gem::Version
- version: 0.5.0
+ version: 0.10.0
  platform: ruby
  authors:
  - ts-3156
@@ -68,6 +68,7 @@ files:
  - lib/twitterscraper/logger.rb
  - lib/twitterscraper/proxy.rb
  - lib/twitterscraper/query.rb
+ - lib/twitterscraper/template.rb
  - lib/twitterscraper/tweet.rb
  - lib/version.rb
  - twitterscraper-ruby.gemspec