twitterscraper-ruby 0.6.0 → 0.7.0

This diff shows the changes between two publicly released versions of a package, as they appear in their public registry. It is provided for informational purposes only.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: fe6db831d59218f3e701e0211487d79ef2354524610f8338a1a17e4cc426e437
-  data.tar.gz: 34cd890b8d2837bcacb3ab7b03fb43845294eecb9d7b0c4891c048eacedbe233
+  metadata.gz: 9cfd03782734642da8ac29839788f142399d2a3f4ec601e8b6f47ae1ca38c17f
+  data.tar.gz: 07a398e51fd2fbdc735ae27008d9a23e97dc390632179738045db4c81bd4fcad
 SHA512:
-  metadata.gz: 6c16c89ca290cc3c9ed5fd245c5aa26e5386c95011cfa14277e774e860359495cafec1624fba0af55de98ebbb34abb599e75e210bdbb18b3b11e49bc1527b643
-  data.tar.gz: d54e25e0294eddf8226c0e27a1d46c6128e9066c9d04edc429b382498c0de1af7ccf2e5c1333ad2031210bbefd7f9b7edc73b9a0e79ab2bd3673674b2e648f3c
+  metadata.gz: 6f417fe3379a3d9d134c308a9ea9d4e01b458018c9c5a3f8508a85e7f5890d01991838cfcabe87b8246f69edf4458c66d17924359798017907862071353f643d
+  data.tar.gz: 758bcb55ded936c3696f99647f64bc9921386b3cb0c783c218510c0e36991ae6b95a9d08fa071e02072c8b727bbadb6674ceeb19a74e356a842d62c1ec4c038f
Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    twitterscraper-ruby (0.6.0)
+    twitterscraper-ruby (0.7.0)
       nokogiri
       parallel
 
data/README.md CHANGED
@@ -1,46 +1,127 @@
 # twitterscraper-ruby
 
-Welcome to your new gem! In this directory, you'll find the files you need to be able to package up your Ruby library into a gem. Put your Ruby code in the file `lib/twitterscraper/ruby`. To experiment with that code, run `bin/console` for an interactive prompt.
+[![Gem Version](https://badge.fury.io/rb/twitterscraper-ruby.svg)](https://badge.fury.io/rb/twitterscraper-ruby)
 
-TODO: Delete this and the text above, and describe your gem
+A gem to scrape https://twitter.com/search. This gem is inspired by [taspinar/twitterscraper](https://github.com/taspinar/twitterscraper).
 
-## Installation
 
-Add this line to your application's Gemfile:
+## Twitter Search API vs. twitterscraper-ruby
 
-```ruby
-gem 'twitterscraper-ruby'
-```
+### Twitter Search API
+
+- The number of tweets: 180 - 450 requests/15 minutes (18,000 - 45,000 tweets/15 minutes)
+- The time window: the past 7 days
+
+### twitterscraper-ruby
+
+- The number of tweets: Unlimited
+- The time window: from 2006-3-21 to today
 
-And then execute:
 
-    $ bundle install
+## Installation
 
-Or install it yourself as:
+First install the library:
 
-    $ gem install twitterscraper-ruby
+```shell script
+$ gem install twitterscraper-ruby
+```
+
 
 ## Usage
 
+Command-line interface:
+
+```shell script
+$ twitterscraper --query KEYWORD --start_date 2020-06-01 --end_date 2020-06-30 --lang ja \
+    --limit 100 --threads 10 --proxy --output output.json
+```
+
+From within Ruby:
+
 ```ruby
 require 'twitterscraper'
+
+options = {
+  start_date: '2020-06-01',
+  end_date: '2020-06-30',
+  lang: 'ja',
+  limit: 100,
+  threads: 10,
+  proxy: true
+}
+
+client = Twitterscraper::Client.new
+tweets = client.query_tweets(KEYWORD, options)
+
+tweets.each do |tweet|
+  puts tweet.tweet_id
+  puts tweet.text
+  puts tweet.created_at
+  puts tweet.tweet_url
+end
 ```
 
-## Development
 
-After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake test` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+## Examples
+
+```shell script
+$ twitterscraper --query twitter --limit 1000
+$ cat tweets.json | jq . | less
+```
+
+```json
+[
+  {
+    "screen_name": "@screenname",
+    "name": "name",
+    "user_id": 1194529546483000000,
+    "tweet_id": 1282659891992000000,
+    "tweet_url": "https://twitter.com/screenname/status/1282659891992000000",
+    "created_at": "2020-07-13 12:00:00 +0000",
+    "text": "Thanks Twitter!"
+  },
+  ...
+]
+```
+
+## Attributes
+
+### Tweet
+
+- tweet_id
+- text
+- user_id
+- screen_name
+- name
+- tweet_url
+- created_at
+
+
+## CLI Options
+
+| Option | Description | Default |
+| ------------- | ------------- | ------------- |
+| `-h`, `--help` | This option displays a summary of twitterscraper. | |
+| `--query` | Specify a keyword used during the search. | |
+| `--start_date` | Set the date from which twitterscraper-ruby should start scraping for your query. | |
+| `--end_date` | Set the end date at which twitterscraper-ruby should stop scraping for your query. | |
+| `--lang` | Retrieve tweets written in a specific language. | |
+| `--limit` | Stop scraping when *at least* the number of tweets indicated with --limit is scraped. | 100 |
+| `--threads` | Set the number of threads twitterscraper-ruby should initiate while scraping for your query. | 2 |
+| `--proxy` | Scrape https://twitter.com/search via proxies. | false |
+| `--output` | The name of the output file. | tweets.json |
 
-To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
 
 ## Contributing
 
-Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/twitterscraper-ruby. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/[USERNAME]/twitterscraper-ruby/blob/master/CODE_OF_CONDUCT.md).
+Bug reports and pull requests are welcome on GitHub at https://github.com/ts-3156/twitterscraper-ruby. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/ts-3156/twitterscraper-ruby/blob/master/CODE_OF_CONDUCT.md).
 
 
 ## License
 
 The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
 
+
 ## Code of Conduct
 
-Everyone interacting in the Twitterscraper::Ruby project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/[USERNAME]/twitterscraper-ruby/blob/master/CODE_OF_CONDUCT.md).
+Everyone interacting in the twitterscraper-ruby project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/ts-3156/twitterscraper-ruby/blob/master/CODE_OF_CONDUCT.md).
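The README's Attributes list maps one-to-one onto the JSON records shown in its Examples section. As a minimal sketch (using a plain `Struct` as a hypothetical stand-in for the gem's real `Twitterscraper::Tweet` class), the serialization looks like this:

```ruby
require 'json'

# Hypothetical stand-in for Twitterscraper::Tweet — only the attribute
# names are taken from the README's Attributes section.
Tweet = Struct.new(:screen_name, :name, :user_id, :tweet_id,
                   :tweet_url, :created_at, :text)

tweet = Tweet.new('@screenname', 'name', 1194529546483000000,
                  1282659891992000000,
                  'https://twitter.com/screenname/status/1282659891992000000',
                  '2020-07-13 12:00:00 +0000', 'Thanks Twitter!')

# Struct#to_h yields the same shape as one element of tweets.json.
puts JSON.pretty_generate(tweet.to_h)
```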
@@ -15,7 +15,6 @@ module Twitterscraper
       print_help || return if print_help?
       print_version || return if print_version?
 
-      client = Twitterscraper::Client.new
       query_options = {
         start_date: options['start_date'],
         end_date: options['end_date'],
@@ -24,6 +23,7 @@ module Twitterscraper
         threads: options['threads'],
         proxy: options['proxy']
       }
+      client = Twitterscraper::Client.new
       tweets = client.query_tweets(options['query'], query_options)
       File.write(options['output'], generate_json(tweets))
     end
@@ -3,6 +3,7 @@ require 'net/http'
 require 'nokogiri'
 require 'date'
 require 'json'
+require 'erb'
 require 'parallel'
 
 module Twitterscraper
@@ -41,7 +42,8 @@ module Twitterscraper
       end
     end
 
-    def get_single_page(url, headers, proxies, timeout = 10, retries = 30)
+    def get_single_page(url, headers, proxies, timeout = 6, retries = 30)
+      return nil if stop_requested?
       Twitterscraper::Http.get(url, headers, proxies.sample, timeout)
     rescue => e
       logger.debug "query_single_page: #{e.inspect}"
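The new `return nil if stop_requested?` guard sits in front of a rescue/retry loop with a retry budget. A standalone sketch of that shape (the `fetch_with_retries` helper is hypothetical, not part of the gem):

```ruby
# Hypothetical helper illustrating the rescue/retry-with-budget pattern
# get_single_page follows: retry on any error until the budget runs out,
# then re-raise the last error.
def fetch_with_retries(retries = 30)
  yield
rescue => e
  retry if (retries -= 1) > 0
  raise e
end

calls = 0
result = fetch_with_retries(5) do
  calls += 1
  raise 'transient' if calls < 3  # fail twice, then succeed
  :ok
end
```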
@@ -54,6 +56,8 @@
     end
 
     def parse_single_page(text, html = true)
+      return [nil, nil] if text.nil? || text == ''
+
       if html
         json_resp = nil
         items_html = text
@@ -68,12 +72,14 @@
 
     def query_single_page(query, lang, pos, from_user = false, headers: [], proxies: [])
       logger.info("Querying #{query}")
-      query = query.gsub(' ', '%20').gsub('#', '%23').gsub(':', '%3A').gsub('&', '%26')
+      query = ERB::Util.url_encode(query)
 
       url = build_query_url(query, lang, pos, from_user)
       logger.debug("Scraping tweets from #{url}")
 
       response = get_single_page(url, headers, proxies)
+      return [], nil if response.nil?
+
       html, json_resp = parse_single_page(response, pos.nil?)
 
       tweets = Tweet.from_html(html)
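The switch from a hand-rolled `gsub` chain to `ERB::Util.url_encode` is behavior-preserving for the four characters the old code handled, while also percent-encoding everything else outside the unreserved set `[A-Za-z0-9_.~-]`. A quick check:

```ruby
require 'erb'

query = '#ruby since:2020-06-01'

# 0.6.0 escaped exactly four characters by hand:
old_style = query.gsub(' ', '%20').gsub('#', '%23').gsub(':', '%3A').gsub('&', '%26')

# 0.7.0 delegates to the standard library instead:
new_style = ERB::Util.url_encode(query)

# Identical here, but url_encode also covers characters the gsub chain
# missed, e.g. '?' or '+'.
```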
@@ -91,55 +97,112 @@
       end
     end
 
-    def query_tweets(query, start_date: nil, end_date: nil, lang: '', limit: 100, threads: 2, proxy: false)
-      start_date = start_date ? Date.parse(start_date) : Date.parse('2006-3-21')
-      end_date = end_date ? Date.parse(end_date) : Date.today
-      if start_date == end_date
-        raise 'Please specify different values for :start_date and :end_date.'
-      elsif start_date > end_date
-        raise ':start_date must occur before :end_date.'
+    OLDEST_DATE = Date.parse('2006-3-21')
+
+    def validate_options!(query, start_date:, end_date:, lang:, limit:, threads:, proxy:)
+      if query.nil? || query == ''
+        raise 'Please specify a search query.'
       end
 
-      proxies = proxy ? Twitterscraper::Proxy::Pool.new : []
+      if ERB::Util.url_encode(query).length >= 500
+        raise ':query must be a UTF-8, URL-encoded search query of 500 characters maximum, including operators.'
+      end
 
-      date_range = start_date.upto(end_date - 1)
-      queries = date_range.map { |date| query + " since:#{date} until:#{date + 1}" }
-      threads = queries.size if threads > queries.size
-      logger.info("Threads #{threads}")
+      if start_date && end_date
+        if start_date == end_date
+          raise 'Please specify different values for :start_date and :end_date.'
+        elsif start_date > end_date
+          raise ':start_date must occur before :end_date.'
+        end
+      end
 
-      headers = {'User-Agent': USER_AGENT_LIST.sample, 'X-Requested-With': 'XMLHttpRequest'}
-      logger.info("Headers #{headers}")
+      if start_date
+        if start_date < OLDEST_DATE
+          raise ":start_date must be greater than or equal to #{OLDEST_DATE}"
+        end
+      end
 
-      all_tweets = []
-      mutex = Mutex.new
+      if end_date
+        today = Date.today
+        if end_date > Date.today
+          raise ":end_date must be less than or equal to today(#{today})"
+        end
+      end
+    end
 
-      Parallel.each(queries, in_threads: threads) do |query|
+    def build_queries(query, start_date, end_date)
+      if start_date && end_date
+        date_range = start_date.upto(end_date - 1)
+        date_range.map { |date| query + " since:#{date} until:#{date + 1}" }
+      elsif start_date
+        [query + " since:#{start_date}"]
+      elsif end_date
+        [query + " until:#{end_date}"]
+      else
+        [query]
+      end
+    end
 
-        pos = nil
+    def main_loop(query, lang, limit, headers, proxies)
+      pos = nil
 
-        while true
-          new_tweets, new_pos = query_single_page(query, lang, pos, headers: headers, proxies: proxies)
-          unless new_tweets.empty?
-            mutex.synchronize {
-              all_tweets.concat(new_tweets)
-              all_tweets.uniq! { |t| t.tweet_id }
-            }
-          end
-          logger.info("Got #{new_tweets.size} tweets (total #{all_tweets.size}) worker=#{Parallel.worker_number}")
+      while true
+        new_tweets, new_pos = query_single_page(query, lang, pos, headers: headers, proxies: proxies)
+        unless new_tweets.empty?
+          @mutex.synchronize {
+            @all_tweets.concat(new_tweets)
+            @all_tweets.uniq! { |t| t.tweet_id }
+          }
+        end
+        logger.info("Got #{new_tweets.size} tweets (total #{@all_tweets.size})")
 
-          break unless new_pos
-          break if all_tweets.size >= limit
+        break unless new_pos
+        break if @all_tweets.size >= limit
 
-          pos = new_pos
-        end
+        pos = new_pos
+      end
+
+      if @all_tweets.size >= limit
+        logger.info("Limit reached #{@all_tweets.size}")
+        @stop_requested = true
+      end
+    end
 
-      if all_tweets.size >= limit
-        logger.info("Reached limit #{all_tweets.size}")
-        raise Parallel::Break
+    def stop_requested?
+      @stop_requested
+    end
+
+    def query_tweets(query, start_date: nil, end_date: nil, lang: '', limit: 100, threads: 2, proxy: false)
+      start_date = Date.parse(start_date) if start_date && start_date.is_a?(String)
+      end_date = Date.parse(end_date) if end_date && end_date.is_a?(String)
+      queries = build_queries(query, start_date, end_date)
+      threads = queries.size if threads > queries.size
+      proxies = proxy ? Twitterscraper::Proxy::Pool.new : []
+
+      validate_options!(queries[0], start_date: start_date, end_date: end_date, lang: lang, limit: limit, threads: threads, proxy: proxy)
+
+      logger.info("The number of threads #{threads}")
+
+      headers = {'User-Agent': USER_AGENT_LIST.sample, 'X-Requested-With': 'XMLHttpRequest'}
+      logger.info("Headers #{headers}")
+
+      @all_tweets = []
+      @mutex = Mutex.new
+      @stop_requested = false
+
+      if threads > 1
+        Parallel.each(queries, in_threads: threads) do |query|
+          main_loop(query, lang, limit, headers, proxies)
+          raise Parallel::Break if stop_requested?
+        end
+      else
+        queries.each do |query|
+          main_loop(query, lang, limit, headers, proxies)
+          break if stop_requested?
         end
       end
 
-      all_tweets.sort_by { |tweet| -tweet.created_at.to_i }
+      @all_tweets.sort_by { |tweet| -tweet.created_at.to_i }
     end
   end
 end
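The `build_queries` method added in 0.7.0 is a pure function of its arguments, so its sharding behavior can be exercised standalone. This sketch reproduces that logic outside the gem: with both dates it splits the query into one-day `since:`/`until:` windows, with one date it emits a single operator, and with neither it passes the query through.

```ruby
require 'date'

# Standalone reproduction of 0.7.0's build_queries logic for illustration.
def build_queries(query, start_date, end_date)
  if start_date && end_date
    # One sub-query per day in [start_date, end_date).
    start_date.upto(end_date - 1).map { |date| query + " since:#{date} until:#{date + 1}" }
  elsif start_date
    [query + " since:#{start_date}"]
  elsif end_date
    [query + " until:#{end_date}"]
  else
    [query]
  end
end

queries = build_queries('ruby', Date.new(2020, 6, 1), Date.new(2020, 6, 3))
```

Each one-day sub-query then runs in its own thread, which is why the new `query_tweets` caps the thread count with `threads = queries.size if threads > queries.size`.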
@@ -1,3 +1,3 @@
 module Twitterscraper
-  VERSION = '0.6.0'
+  VERSION = '0.7.0'
 end
metadata CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: twitterscraper-ruby
 version: !ruby/object:Gem::Version
-  version: 0.6.0
+  version: 0.7.0
 platform: ruby
 authors:
 - ts-3156