twitterscraper-ruby 0.6.0 → 0.7.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: fe6db831d59218f3e701e0211487d79ef2354524610f8338a1a17e4cc426e437
- data.tar.gz: 34cd890b8d2837bcacb3ab7b03fb43845294eecb9d7b0c4891c048eacedbe233
+ metadata.gz: 9cfd03782734642da8ac29839788f142399d2a3f4ec601e8b6f47ae1ca38c17f
+ data.tar.gz: 07a398e51fd2fbdc735ae27008d9a23e97dc390632179738045db4c81bd4fcad
  SHA512:
- metadata.gz: 6c16c89ca290cc3c9ed5fd245c5aa26e5386c95011cfa14277e774e860359495cafec1624fba0af55de98ebbb34abb599e75e210bdbb18b3b11e49bc1527b643
- data.tar.gz: d54e25e0294eddf8226c0e27a1d46c6128e9066c9d04edc429b382498c0de1af7ccf2e5c1333ad2031210bbefd7f9b7edc73b9a0e79ab2bd3673674b2e648f3c
+ metadata.gz: 6f417fe3379a3d9d134c308a9ea9d4e01b458018c9c5a3f8508a85e7f5890d01991838cfcabe87b8246f69edf4458c66d17924359798017907862071353f643d
+ data.tar.gz: 758bcb55ded936c3696f99647f64bc9921386b3cb0c783c218510c0e36991ae6b95a9d08fa071e02072c8b727bbadb6674ceeb19a74e356a842d62c1ec4c038f
Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
  PATH
  remote: .
  specs:
- twitterscraper-ruby (0.6.0)
+ twitterscraper-ruby (0.7.0)
  nokogiri
  parallel
 
data/README.md CHANGED
@@ -1,46 +1,127 @@
  # twitterscraper-ruby
 
- Welcome to your new gem! In this directory, you'll find the files you need to be able to package up your Ruby library into a gem. Put your Ruby code in the file `lib/twitterscraper/ruby`. To experiment with that code, run `bin/console` for an interactive prompt.
+ [![Gem Version](https://badge.fury.io/rb/twitterscraper-ruby.svg)](https://badge.fury.io/rb/twitterscraper-ruby)
 
- TODO: Delete this and the text above, and describe your gem
+ A gem to scrape https://twitter.com/search. This gem is inspired by [taspinar/twitterscraper](https://github.com/taspinar/twitterscraper).
 
- ## Installation
 
- Add this line to your application's Gemfile:
+ ## Twitter Search API vs. twitterscraper-ruby
 
- ```ruby
- gem 'twitterscraper-ruby'
- ```
+ ### Twitter Search API
+
+ - The number of tweets: 180 - 450 requests/15 minutes (18,000 - 45,000 tweets/15 minutes)
+ - The time window: the past 7 days
+
+ ### twitterscraper-ruby
+
+ - The number of tweets: Unlimited
+ - The time window: from 2006-3-21 to today
 
- And then execute:
 
- $ bundle install
+ ## Installation
 
- Or install it yourself as:
+ First install the library:
 
- $ gem install twitterscraper-ruby
+ ```shell script
+ $ gem install twitterscraper-ruby
+ ```
+
 
  ## Usage
 
+ Command-line interface:
+
+ ```shell script
+ $ twitterscraper --query KEYWORD --start_date 2020-06-01 --end_date 2020-06-30 --lang ja \
+     --limit 100 --threads 10 --proxy --output output.json
+ ```
+
+ From within Ruby:
+
  ```ruby
  require 'twitterscraper'
+
+ options = {
+   start_date: '2020-06-01',
+   end_date: '2020-06-30',
+   lang: 'ja',
+   limit: 100,
+   threads: 10,
+   proxy: true
+ }
+
+ client = Twitterscraper::Client.new
+ tweets = client.query_tweets(KEYWORD, options)
+
+ tweets.each do |tweet|
+   puts tweet.tweet_id
+   puts tweet.text
+   puts tweet.created_at
+   puts tweet.tweet_url
+ end
  ```
 
- ## Development
 
- After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake test` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+ ## Examples
+
+ ```shell script
+ $ twitterscraper --query twitter --limit 1000
+ $ cat tweets.json | jq . | less
+ ```
+
+ ```json
+ [
+   {
+     "screen_name": "@screenname",
+     "name": "name",
+     "user_id": 1194529546483000000,
+     "tweet_id": 1282659891992000000,
+     "tweet_url": "https://twitter.com/screenname/status/1282659891992000000",
+     "created_at": "2020-07-13 12:00:00 +0000",
+     "text": "Thanks Twitter!"
+   },
+   ...
+ ]
+ ```
+
+ ## Attributes
+
+ ### Tweet
+
+ - tweet_id
+ - text
+ - user_id
+ - screen_name
+ - name
+ - tweet_url
+ - created_at
+
+
+ ## CLI Options
+
+ | Option | Description | Default |
+ | ------------- | ------------- | ------------- |
+ | `-h`, `--help` | Display a summary of twitterscraper-ruby's options. | |
+ | `--query` | Specify a keyword used during the search. | |
+ | `--start_date` | Set the date from which twitterscraper-ruby should start scraping for your query. | |
+ | `--end_date` | Set the end date at which twitterscraper-ruby should stop scraping for your query. | |
+ | `--lang` | Retrieve tweets written in a specific language. | |
+ | `--limit` | Stop scraping once *at least* the number of tweets indicated with `--limit` has been scraped. | 100 |
+ | `--threads` | Set the number of threads twitterscraper-ruby should initiate while scraping for your query. | 2 |
+ | `--proxy` | Scrape https://twitter.com/search via proxies. | false |
+ | `--output` | The name of the output file. | tweets.json |
 
- To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
 
  ## Contributing
 
- Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/twitterscraper-ruby. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/[USERNAME]/twitterscraper-ruby/blob/master/CODE_OF_CONDUCT.md).
+ Bug reports and pull requests are welcome on GitHub at https://github.com/ts-3156/twitterscraper-ruby. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/ts-3156/twitterscraper-ruby/blob/master/CODE_OF_CONDUCT.md).
 
 
  ## License
 
  The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
 
+
  ## Code of Conduct
 
- Everyone interacting in the Twitterscraper::Ruby project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/[USERNAME]/twitterscraper-ruby/blob/master/CODE_OF_CONDUCT.md).
+ Everyone interacting in the twitterscraper-ruby project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/ts-3156/twitterscraper-ruby/blob/master/CODE_OF_CONDUCT.md).
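
As a companion to the README's Usage section: a minimal sketch, assuming each `Tweet` exposes the readers listed under Attributes, of writing results to a JSON file from Ruby, roughly what the CLI's `--output` option does.

```ruby
require 'json'
require 'twitterscraper'

# Minimal sketch: query, then dump the tweets as JSON, mirroring the CLI's --output.
# Assumes each Tweet exposes the attribute readers listed in the README
# (tweet_id, text, user_id, screen_name, name, tweet_url, created_at).
client = Twitterscraper::Client.new
tweets = client.query_tweets('ruby', start_date: '2020-06-01', end_date: '2020-06-30', limit: 100)

records = tweets.map do |tweet|
  {
    screen_name: tweet.screen_name,
    name: tweet.name,
    user_id: tweet.user_id,
    tweet_id: tweet.tweet_id,
    tweet_url: tweet.tweet_url,
    created_at: tweet.created_at,
    text: tweet.text
  }
end

File.write('tweets.json', JSON.pretty_generate(records))
```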
@@ -15,7 +15,6 @@ module Twitterscraper
  print_help || return if print_help?
  print_version || return if print_version?
 
- client = Twitterscraper::Client.new
  query_options = {
  start_date: options['start_date'],
  end_date: options['end_date'],
@@ -24,6 +23,7 @@ module Twitterscraper
  threads: options['threads'],
  proxy: options['proxy']
  }
+ client = Twitterscraper::Client.new
  tweets = client.query_tweets(options['query'], query_options)
  File.write(options['output'], generate_json(tweets))
  end
@@ -3,6 +3,7 @@ require 'net/http'
  require 'nokogiri'
  require 'date'
  require 'json'
+ require 'erb'
  require 'parallel'
 
  module Twitterscraper
@@ -41,7 +42,8 @@ module Twitterscraper
  end
  end
 
- def get_single_page(url, headers, proxies, timeout = 10, retries = 30)
+ def get_single_page(url, headers, proxies, timeout = 6, retries = 30)
+ return nil if stop_requested?
  Twitterscraper::Http.get(url, headers, proxies.sample, timeout)
  rescue => e
  logger.debug "query_single_page: #{e.inspect}"
@@ -54,6 +56,8 @@ module Twitterscraper
  end
 
  def parse_single_page(text, html = true)
+ return [nil, nil] if text.nil? || text == ''
+
  if html
  json_resp = nil
  items_html = text
@@ -68,12 +72,14 @@ module Twitterscraper
 
  def query_single_page(query, lang, pos, from_user = false, headers: [], proxies: [])
  logger.info("Querying #{query}")
- query = query.gsub(' ', '%20').gsub('#', '%23').gsub(':', '%3A').gsub('&', '%26')
+ query = ERB::Util.url_encode(query)
 
  url = build_query_url(query, lang, pos, from_user)
  logger.debug("Scraping tweets from #{url}")
 
  response = get_single_page(url, headers, proxies)
+ return [], nil if response.nil?
+
  html, json_resp = parse_single_page(response, pos.nil?)
 
  tweets = Tweet.from_html(html)
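
The hunk above swaps a hand-maintained list of `gsub` escapes for `ERB::Util.url_encode` from Ruby's standard library. A small sketch of the difference (the sample query is illustrative; expected output shown in comments):

```ruby
require 'erb'

query = 'ruby on rails #rails since:2020-06-01'

# Old approach: only a fixed set of characters was escaped.
old_encoding = query.gsub(' ', '%20').gsub('#', '%23').gsub(':', '%3A').gsub('&', '%26')
# => "ruby%20on%20rails%20%23rails%20since%3A2020-06-01"

# New approach: every reserved character is percent-encoded.
new_encoding = ERB::Util.url_encode(query)
# => "ruby%20on%20rails%20%23rails%20since%3A2020-06-01"

# Identical here, but the two diverge for characters the old list never handled,
# e.g. "?" stays raw with the gsub chain and becomes "%3F" with url_encode.
old_encoding == new_encoding # => true
```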
@@ -91,55 +97,112 @@ module Twitterscraper
  end
  end
 
- def query_tweets(query, start_date: nil, end_date: nil, lang: '', limit: 100, threads: 2, proxy: false)
- start_date = start_date ? Date.parse(start_date) : Date.parse('2006-3-21')
- end_date = end_date ? Date.parse(end_date) : Date.today
- if start_date == end_date
- raise 'Please specify different values for :start_date and :end_date.'
- elsif start_date > end_date
- raise ':start_date must occur before :end_date.'
+ OLDEST_DATE = Date.parse('2006-3-21')
+
+ def validate_options!(query, start_date:, end_date:, lang:, limit:, threads:, proxy:)
+ if query.nil? || query == ''
+ raise 'Please specify a search query.'
  end
 
- proxies = proxy ? Twitterscraper::Proxy::Pool.new : []
+ if ERB::Util.url_encode(query).length >= 500
+ raise ':query must be a UTF-8, URL-encoded search query of 500 characters maximum, including operators.'
+ end
 
- date_range = start_date.upto(end_date - 1)
- queries = date_range.map { |date| query + " since:#{date} until:#{date + 1}" }
- threads = queries.size if threads > queries.size
- logger.info("Threads #{threads}")
+ if start_date && end_date
+ if start_date == end_date
+ raise 'Please specify different values for :start_date and :end_date.'
+ elsif start_date > end_date
+ raise ':start_date must occur before :end_date.'
+ end
+ end
 
- headers = {'User-Agent': USER_AGENT_LIST.sample, 'X-Requested-With': 'XMLHttpRequest'}
- logger.info("Headers #{headers}")
+ if start_date
+ if start_date < OLDEST_DATE
+ raise ":start_date must be greater than or equal to #{OLDEST_DATE}"
+ end
+ end
 
- all_tweets = []
- mutex = Mutex.new
+ if end_date
+ today = Date.today
+ if end_date > Date.today
+ raise ":end_date must be less than or equal to today(#{today})"
+ end
+ end
+ end
 
- Parallel.each(queries, in_threads: threads) do |query|
+ def build_queries(query, start_date, end_date)
+ if start_date && end_date
+ date_range = start_date.upto(end_date - 1)
+ date_range.map { |date| query + " since:#{date} until:#{date + 1}" }
+ elsif start_date
+ [query + " since:#{start_date}"]
+ elsif end_date
+ [query + " until:#{end_date}"]
+ else
+ [query]
+ end
+ end
 
- pos = nil
+ def main_loop(query, lang, limit, headers, proxies)
+ pos = nil
 
- while true
- new_tweets, new_pos = query_single_page(query, lang, pos, headers: headers, proxies: proxies)
- unless new_tweets.empty?
- mutex.synchronize {
- all_tweets.concat(new_tweets)
- all_tweets.uniq! { |t| t.tweet_id }
- }
- end
- logger.info("Got #{new_tweets.size} tweets (total #{all_tweets.size}) worker=#{Parallel.worker_number}")
+ while true
+ new_tweets, new_pos = query_single_page(query, lang, pos, headers: headers, proxies: proxies)
+ unless new_tweets.empty?
+ @mutex.synchronize {
+ @all_tweets.concat(new_tweets)
+ @all_tweets.uniq! { |t| t.tweet_id }
+ }
+ end
+ logger.info("Got #{new_tweets.size} tweets (total #{@all_tweets.size})")
 
- break unless new_pos
- break if all_tweets.size >= limit
+ break unless new_pos
+ break if @all_tweets.size >= limit
 
- pos = new_pos
- end
+ pos = new_pos
+ end
+
+ if @all_tweets.size >= limit
+ logger.info("Limit reached #{@all_tweets.size}")
+ @stop_requested = true
+ end
+ end
 
- if all_tweets.size >= limit
- logger.info("Reached limit #{all_tweets.size}")
- raise Parallel::Break
+ def stop_requested?
+ @stop_requested
+ end
+
+ def query_tweets(query, start_date: nil, end_date: nil, lang: '', limit: 100, threads: 2, proxy: false)
+ start_date = Date.parse(start_date) if start_date && start_date.is_a?(String)
+ end_date = Date.parse(end_date) if end_date && end_date.is_a?(String)
+ queries = build_queries(query, start_date, end_date)
+ threads = queries.size if threads > queries.size
+ proxies = proxy ? Twitterscraper::Proxy::Pool.new : []
+
+ validate_options!(queries[0], start_date: start_date, end_date: end_date, lang: lang, limit: limit, threads: threads, proxy: proxy)
+
+ logger.info("The number of threads #{threads}")
+
+ headers = {'User-Agent': USER_AGENT_LIST.sample, 'X-Requested-With': 'XMLHttpRequest'}
+ logger.info("Headers #{headers}")
+
+ @all_tweets = []
+ @mutex = Mutex.new
+ @stop_requested = false
+
+ if threads > 1
+ Parallel.each(queries, in_threads: threads) do |query|
+ main_loop(query, lang, limit, headers, proxies)
+ raise Parallel::Break if stop_requested?
+ end
+ else
+ queries.each do |query|
+ main_loop(query, lang, limit, headers, proxies)
+ break if stop_requested?
  end
  end
 
- all_tweets.sort_by { |tweet| -tweet.created_at.to_i }
+ @all_tweets.sort_by { |tweet| -tweet.created_at.to_i }
  end
  end
  end
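
The refactor above replaces the old inline date handling: `build_queries` now splits an optional date range into one `since:`/`until:` window per day, and those per-day queries are what `Parallel.each` distributes across threads (or runs sequentially when only one thread is used). A standalone paraphrase of that splitting logic, with sample output:

```ruby
require 'date'

# Paraphrase of the new build_queries: one one-day since:/until: window per date,
# falling back to a single query when only one bound (or neither) is given.
def build_queries(query, start_date, end_date)
  return [query] unless start_date || end_date
  return ["#{query} since:#{start_date}"] unless end_date
  return ["#{query} until:#{end_date}"] unless start_date

  start_date.upto(end_date - 1).map do |date|
    "#{query} since:#{date} until:#{date + 1}"
  end
end

build_queries('ruby', Date.parse('2020-06-01'), Date.parse('2020-06-03'))
# => ["ruby since:2020-06-01 until:2020-06-02",
#     "ruby since:2020-06-02 until:2020-06-03"]
```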
@@ -1,3 +1,3 @@
  module Twitterscraper
- VERSION = '0.6.0'
+ VERSION = '0.7.0'
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: twitterscraper-ruby
  version: !ruby/object:Gem::Version
- version: 0.6.0
+ version: 0.7.0
  platform: ruby
  authors:
  - ts-3156