twitterscraper-ruby 0.11.0 → 0.15.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: f4382801b03a5384095aad6a955caea438787fa2eed96e3e001237df368925a2
-  data.tar.gz: 6722b4edce7242b3006e5c097dd78847f36e2da7edea009e2d7b89b09f5b25ff
+  metadata.gz: 7f04cb0ba394884918271b5485b596c07203b7a6e9f4fec42d074ef4f02b6a0a
+  data.tar.gz: a4f618df53d1e8b54954619e87d383e43dbe5a63bbf83b33ee38f975998f2678
 SHA512:
-  metadata.gz: 4ca72a0bbce553c38061e0362f755a5e82b47a5288108508410c19a7eef9a2514b58682e88ed1bf89654d5b89c84c41edd8a5fa34fd7d1e5fbf92b267402884a
-  data.tar.gz: 8853b015cb37180d6814710d971a757d08aa4ddd4579af4131e204e34bb10c80ef3139c082f17be92303d9efc2e3f8eb4ba0d15bdf4f264fb4fba0cf87ed42d7
+  metadata.gz: fa9f02cf3ef0bf280f45b18ebacaec0b06dbd610477355602fcc59d382b5590c990695297e1e793457fdcff4cb7dd037f076c1f0fa4706eb69c67c3a165243e4
+  data.tar.gz: 9c08d9e4d1ee56fa133675bc73a50f502040cc9a2844d9a46a39c38ccdffdf43c15b17c2e4a8b74561f523493ccbc4a055f0add239574d2f5129ee4abe1f5ed9
data/.circleci/config.yml ADDED
@@ -0,0 +1,31 @@
+version: 2.1
+orbs:
+  ruby: circleci/ruby@0.1.2
+
+jobs:
+  build:
+    docker:
+      - image: circleci/ruby:2.6.4-stretch-node
+        environment:
+          BUNDLER_VERSION: 2.1.4
+    executor: ruby/default
+    steps:
+      - checkout
+      - run:
+          name: Update bundler
+          command: gem update bundler
+      - run:
+          name: Which bundler?
+          command: bundle -v
+      - restore_cache:
+          keys:
+            - gem-cache-v1-{{ arch }}-{{ .Branch }}-{{ checksum "Gemfile.lock" }}
+            - gem-cache-v1-{{ arch }}-{{ .Branch }}
+            - gem-cache-v1
+      - run: bundle install --path vendor/bundle
+      - run: bundle clean
+      - save_cache:
+          key: gem-cache-v1-{{ arch }}-{{ .Branch }}-{{ checksum "Gemfile.lock" }}
+          paths:
+            - vendor/bundle
+      - run: bundle exec rspec
data/.rspec ADDED
@@ -0,0 +1,2 @@
+-fd
+--require spec_helper
data/Gemfile CHANGED
@@ -5,3 +5,4 @@ gemspec
 
 gem "rake", "~> 12.0"
 gem "minitest", "~> 5.0"
+gem "rspec"
data/Gemfile.lock CHANGED
@@ -1,19 +1,33 @@
 PATH
   remote: .
   specs:
-    twitterscraper-ruby (0.11.0)
+    twitterscraper-ruby (0.15.1)
       nokogiri
       parallel
 
 GEM
   remote: https://rubygems.org/
   specs:
+    diff-lcs (1.4.4)
     mini_portile2 (2.4.0)
     minitest (5.14.1)
     nokogiri (1.10.10)
       mini_portile2 (~> 2.4.0)
     parallel (1.19.2)
     rake (12.3.3)
+    rspec (3.9.0)
+      rspec-core (~> 3.9.0)
+      rspec-expectations (~> 3.9.0)
+      rspec-mocks (~> 3.9.0)
+    rspec-core (3.9.2)
+      rspec-support (~> 3.9.3)
+    rspec-expectations (3.9.2)
+      diff-lcs (>= 1.2.0, < 2.0)
+      rspec-support (~> 3.9.0)
+    rspec-mocks (3.9.1)
+      diff-lcs (>= 1.2.0, < 2.0)
+      rspec-support (~> 3.9.0)
+    rspec-support (3.9.3)
 
 PLATFORMS
   ruby
@@ -21,6 +35,7 @@ PLATFORMS
 DEPENDENCIES
   minitest (~> 5.0)
   rake (~> 12.0)
+  rspec
   twitterscraper-ruby!
 
 BUNDLED WITH
data/README.md CHANGED
@@ -1,18 +1,21 @@
 # twitterscraper-ruby
 
+[![Build Status](https://circleci.com/gh/ts-3156/twitterscraper-ruby.svg?style=svg)](https://circleci.com/gh/ts-3156/twitterscraper-ruby)
 [![Gem Version](https://badge.fury.io/rb/twitterscraper-ruby.svg)](https://badge.fury.io/rb/twitterscraper-ruby)
 
 A gem to scrape https://twitter.com/search. This gem is inspired by [taspinar/twitterscraper](https://github.com/taspinar/twitterscraper).
 
+Please feel free to ask [@ts_3156](https://twitter.com/ts_3156) if you have any questions.
+
 
 ## Twitter Search API vs. twitterscraper-ruby
 
-### Twitter Search API
+#### Twitter Search API
 
 - The number of tweets: 180 - 450 requests/15 minutes (18,000 - 45,000 tweets/15 minutes)
 - The time window: the past 7 days
 
-### twitterscraper-ruby
+#### twitterscraper-ruby
 
 - The number of tweets: Unlimited
 - The time window: from 2006-3-21 to today
@@ -29,45 +32,92 @@ $ gem install twitterscraper-ruby
 
 ## Usage
 
-Command-line interface:
+#### Command-line interface:
+
+Returns a collection of relevant tweets matching a specified query.
 
 ```shell script
-$ twitterscraper --query KEYWORD --start_date 2020-06-01 --end_date 2020-06-30 --lang ja \
-    --limit 100 --threads 10 --proxy --cache --output output.json
+$ twitterscraper --type search --query KEYWORD --start_date 2020-06-01 --end_date 2020-06-30 --lang ja \
+    --limit 100 --threads 10 --output tweets.json
 ```
 
-From Within Ruby:
+Returns a collection of the most recent tweets posted by the user indicated by the screen_name.
+
+```shell script
+$ twitterscraper --type user --query SCREEN_NAME --limit 100 --output tweets.json
+```
+
+#### From Within Ruby:
 
 ```ruby
 require 'twitterscraper'
+client = Twitterscraper::Client.new(cache: true, proxy: true)
+```
 
-options = {
-  start_date: '2020-06-01',
-  end_date: '2020-06-30',
-  lang: 'ja',
-  limit: 100,
-  threads: 10,
-  proxy: true
-}
+Returns a collection of relevant tweets matching a specified query.
 
-client = Twitterscraper::Client.new
-tweets = client.query_tweets(KEYWORD, options)
+```ruby
+tweets = client.search(KEYWORD, start_date: '2020-06-01', end_date: '2020-06-30', lang: 'ja', limit: 100, threads: 10)
+```
+
+Returns a collection of the most recent tweets posted by the user indicated by the screen_name.
+
+```ruby
+tweets = client.user_timeline(SCREEN_NAME, limit: 100)
+```
 
+
+## Examples
+
+```shell script
+$ twitterscraper --query twitter --limit 1000
+$ cat tweets.json | jq . | less
+```
+
+
+## Attributes
+
+### Tweet
+
+```ruby
 tweets.each do |tweet|
   puts tweet.tweet_id
   puts tweet.text
   puts tweet.tweet_url
   puts tweet.created_at
 
   hash = tweet.attrs
-  puts hash.keys
+  attr_names = hash.keys
+  json = tweet.to_json
 end
 ```
 
-
-## Attributes
-
-### Tweet
+```json
+[
+  {
+    "screen_name": "@name",
+    "name": "Name",
+    "user_id": 12340000,
+    "tweet_id": 1234000000000000,
+    "text": "Thanks Twitter!",
+    "links": [],
+    "hashtags": [],
+    "image_urls": [],
+    "video_url": null,
+    "has_media": null,
+    "likes": 10,
+    "retweets": 20,
+    "replies": 0,
+    "is_replied": false,
+    "is_reply_to": false,
+    "parent_tweet_id": null,
+    "reply_to_users": [],
+    "tweet_url": "https://twitter.com/name/status/1234000000000000",
+    "timestamp": 1594793000,
+    "created_at": "2020-07-15 00:00:00 +0000"
+  }
+]
+```
 
 - screen_name
 - name
@@ -110,43 +160,24 @@ end
 Search operators documentation is in [Standard search operators](https://developer.twitter.com/en/docs/tweets/rules-and-filtering/overview/standard-operators).
 
 
-## Examples
-
-```shell script
-$ twitterscraper --query twitter --limit 1000
-$ cat tweets.json | jq . | less
-```
-
-```json
-[
-  {
-    "screen_name": "@screenname",
-    "name": "name",
-    "user_id": 1194529546483000000,
-    "tweet_id": 1282659891992000000,
-    "tweet_url": "https://twitter.com/screenname/status/1282659891992000000",
-    "created_at": "2020-07-13 12:00:00 +0000",
-    "text": "Thanks Twitter!"
-  }
-]
-```
-
 ## CLI Options
 
-| Option | Description | Default |
-| ------------- | ------------- | ------------- |
-| `-h`, `--help` | This option displays a summary of twitterscraper. | |
-| `--query` | Specify a keyword used during the search. | |
-| `--start_date` | Set the date from which twitterscraper-ruby should start scraping for your query. | |
-| `--end_date` | Set the end date which twitterscraper-ruby should use to stop scraping for your query. | |
-| `--lang` | Retrieve tweets written in a specific language. | |
-| `--limit` | Stop scraping when *at least* the number of tweets indicated with --limit is scraped. | 100 |
-| `--threads` | Set the number of threads twitterscraper-ruby should initiate while scraping for your query. | 2 |
-| `--proxy` | Scrape https://twitter.com/search via proxies. | false |
-| `--cache` | Enable caching. | false |
-| `--format` | The format of the output. | json |
-| `--output` | The name of the output file. | tweets.json |
-| `--verbose` | Print debug messages. | |
+| Option | Type | Description | Value |
+| ------------- | ------------- | ------------- | ------------- |
+| `--help` | string | This option displays a summary of twitterscraper. | |
+| `--type` | string | Specify a search type. | search (default) or user |
+| `--query` | string | Specify a keyword used during the search. | |
+| `--start_date` | string | Used as "since:yyyy-mm-dd" for your query. This means "since the date". | |
+| `--end_date` | string | Used as "until:yyyy-mm-dd" for your query. This means "before the date". | |
+| `--lang` | string | Retrieve tweets written in a specific language. | |
+| `--limit` | integer | Stop scraping when *at least* the number of tweets indicated with --limit is scraped. | 100 |
+| `--order` | string | Sort order of the results. | desc (default) or asc |
+| `--threads` | integer | Set the number of threads twitterscraper-ruby should initiate while scraping for your query. | 2 |
+| `--proxy` | boolean | Scrape https://twitter.com/search via proxies. | true (default) or false |
+| `--cache` | boolean | Enable caching. | true (default) or false |
+| `--format` | string | The format of the output. | json (default) or html |
+| `--output` | string | The name of the output file. | tweets.json |
+| `--verbose` | | Print debug messages. | |
 
 
 ## Contributing
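Taken together, the README changes above describe the new two-type API. For reference, a minimal end-to-end sketch (`ts_3156` is just an example account; `KEYWORD`/`SCREEN_NAME` in the README are placeholders):

```ruby
require 'twitterscraper'

client = Twitterscraper::Client.new(cache: true, proxy: true)

# --type search: relevant tweets matching a query, one day-slice per thread.
tweets = client.search('twitter', start_date: '2020-06-01', end_date: '2020-06-30',
                       lang: 'ja', limit: 100, threads: 10)

# --type user: the most recent tweets from a single timeline.
timeline = client.user_timeline('ts_3156', limit: 100)

tweets.each { |tweet| puts tweet.tweet_url }
```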
data/lib/twitterscraper/cli.rb CHANGED
@@ -16,14 +16,16 @@ module Twitterscraper
       print_version || return if print_version?
 
       query_options = {
+        type: options['type'],
         start_date: options['start_date'],
        end_date: options['end_date'],
         lang: options['lang'],
         limit: options['limit'],
+        daily_limit: options['daily_limit'],
+        order: options['order'],
         threads: options['threads'],
-        proxy: options['proxy']
       }
-      client = Twitterscraper::Client.new(cache: options['cache'])
+      client = Twitterscraper::Client.new(cache: options['cache'], proxy: options['proxy'])
       tweets = client.query_tweets(options['query'], query_options)
       export(tweets) unless tweets.empty?
     end
@@ -58,26 +60,36 @@ module Twitterscraper
         'help',
         'v',
         'version',
+        'type:',
         'query:',
         'start_date:',
         'end_date:',
         'lang:',
         'limit:',
+        'daily_limit:',
+        'order:',
         'threads:',
         'output:',
         'format:',
-        'cache',
-        'proxy',
+        'cache:',
+        'proxy:',
         'pretty',
         'verbose',
       )
 
+      options['type'] ||= 'search'
+      options['start_date'] = Query::OLDEST_DATE if options['start_date'] == 'oldest'
       options['lang'] ||= ''
      options['limit'] = (options['limit'] || 100).to_i
+      options['daily_limit'] = options['daily_limit'].to_i if options['daily_limit']
      options['threads'] = (options['threads'] || 2).to_i
      options['format'] ||= 'json'
+      options['order'] ||= 'desc'
      options['output'] ||= "tweets.#{options['format']}"
 
+      options['cache'] = options['cache'] != 'false'
+      options['proxy'] = options['proxy'] != 'false'
+
      options
    end
 
@@ -101,7 +113,7 @@ module Twitterscraper
     end
 
     def print_version
-      puts "twitterscraper-#{Twitterscraper::VERSION}"
+      puts "twitterscraper-#{VERSION}"
     end
   end
 end
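Note the changed CLI semantics here: `--cache` and `--proxy` now take a value (the trailing `:` in the Getopt spec) and default to on; only the literal string `false` disables them. The parsing above boils down to:

```ruby
# Values arrive from the option parser as strings, or nil when the flag is omitted.
options = { 'cache' => 'false', 'proxy' => nil }

options['cache'] = options['cache'] != 'false'  # => false (explicitly disabled)
options['proxy'] = options['proxy'] != 'false'  # => true  (omitted still means enabled)
```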
data/lib/twitterscraper/client.rb CHANGED
@@ -2,12 +2,17 @@ module Twitterscraper
   class Client
     include Query
 
-    def initialize(cache:)
+    def initialize(cache: true, proxy: true)
       @cache = cache
+      @proxy = proxy
     end
 
     def cache_enabled?
       @cache
     end
+
+    def proxy_enabled?
+      @proxy
+    end
   end
 end
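Both knobs now default to on, so a bare constructor behaves like the old `--proxy --cache` invocation. A quick sketch of the resulting API:

```ruby
client   = Twitterscraper::Client.new               # cache: true, proxy: true
no_proxy = Twitterscraper::Client.new(proxy: false)

client.proxy_enabled?    # => true
no_proxy.proxy_enabled?  # => false
no_proxy.cache_enabled?  # => true
```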
data/lib/twitterscraper/proxy.rb CHANGED
@@ -17,15 +17,17 @@ module Twitterscraper
         reload
       end
       @cur_index += 1
-      item = @items[@cur_index - 1]
-      Twitterscraper.logger.info("Using proxy #{item}")
-      item
+      @items[@cur_index - 1]
     end
 
     def size
       @items.size
     end
 
+    def empty?
+      @items.empty?
+    end
+
     private
 
     def reload
@@ -51,7 +53,6 @@ module Twitterscraper
         proxies << ip + ':' + port
       end
 
-      Twitterscraper.logger.debug "Fetch #{proxies.size} proxies"
      proxies.shuffle
    rescue => e
      if (retries -= 1) > 0
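The per-proxy logging moves out of the pool and into `Query#get_single_page` (next section), and the new `empty?` predicate lets the query layer skip proxy selection entirely. A sketch of how a caller can use the pool after this change (the rotation method name `sample` is taken from the call site in query.rb, not from this hunk):

```ruby
pool = Twitterscraper::Proxy::Pool.new
proxy = pool.empty? ? nil : pool.sample  # nil falls back to a direct connection
```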
data/lib/twitterscraper/query.rb CHANGED
@@ -22,36 +22,41 @@ module Twitterscraper
     RELOAD_URL = 'https://twitter.com/i/search/timeline?f=tweets&vertical=' +
         'default&include_available_features=1&include_entities=1&' +
         'reset_error_state=false&src=typd&max_position=__POS__&q=__QUERY__&l=__LANG__'
-    INIT_URL_USER = 'https://twitter.com/{u}'
-    RELOAD_URL_USER = 'https://twitter.com/i/profiles/show/{u}/timeline/tweets?' +
+    INIT_URL_USER = 'https://twitter.com/__USER__'
+    RELOAD_URL_USER = 'https://twitter.com/i/profiles/show/__USER__/timeline/tweets?' +
         'include_available_features=1&include_entities=1&' +
-        'max_position={pos}&reset_error_state=false'
-
-    def build_query_url(query, lang, pos, from_user = false)
-      # if from_user
-      #   if !pos
-      #     INIT_URL_USER.format(u = query)
-      #   else
-      #     RELOAD_URL_USER.format(u = query, pos = pos)
-      #   end
-      # end
-      if pos
-        RELOAD_URL.sub('__QUERY__', query).sub('__LANG__', lang.to_s).sub('__POS__', pos)
+        'max_position=__POS__&reset_error_state=false'
+
+    def build_query_url(query, lang, type, pos)
+      if type == 'user'
+        if pos
+          RELOAD_URL_USER.sub('__USER__', query).sub('__POS__', pos.to_s)
+        else
+          INIT_URL_USER.sub('__USER__', query)
+        end
       else
-        INIT_URL.sub('__QUERY__', query).sub('__LANG__', lang.to_s)
+        if pos
+          RELOAD_URL.sub('__QUERY__', query).sub('__LANG__', lang.to_s).sub('__POS__', pos)
+        else
+          INIT_URL.sub('__QUERY__', query).sub('__LANG__', lang.to_s)
+        end
       end
     end
 
     def get_single_page(url, headers, proxies, timeout = 6, retries = 30)
       return nil if stop_requested?
-      Twitterscraper::Http.get(url, headers, proxies.sample, timeout)
+      unless proxies.empty?
+        proxy = proxies.sample
+        logger.info("Using proxy #{proxy}")
+      end
+      Http.get(url, headers, proxy, timeout)
     rescue => e
-      logger.debug "query_single_page: #{e.inspect}"
+      logger.debug "get_single_page: #{e.inspect}"
       if (retries -= 1) > 0
-        logger.info("Retrying... (Attempts left: #{retries - 1})")
+        logger.info "Retrying... (Attempts left: #{retries - 1})"
         retry
       else
-        raise
+        raise Error.new("#{e.inspect} url=#{url}")
      end
    end
 
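The old `{u}`/`{pos}` placeholders were Python `str.format` leftovers from the taspinar/twitterscraper port (note the commented-out `.format(...)` code removed above); the `__USER__`/`__POS__` markers work with plain `String#sub`. For example (URL trimmed for brevity):

```ruby
url = 'https://twitter.com/i/profiles/show/__USER__/timeline/tweets?max_position=__POS__'
url.sub('__USER__', 'ts_3156').sub('__POS__', '1234000000000000')
# => "https://twitter.com/i/profiles/show/ts_3156/timeline/tweets?max_position=1234000000000000"
```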
@@ -70,28 +75,28 @@ module Twitterscraper
       [items_html, json_resp]
     end
 
-    def query_single_page(query, lang, pos, from_user = false, headers: [], proxies: [])
-      logger.info("Querying #{query}")
+    def query_single_page(query, lang, type, pos, headers: [], proxies: [])
+      logger.info "Querying #{query}"
       query = ERB::Util.url_encode(query)
 
-      url = build_query_url(query, lang, pos, from_user)
+      url = build_query_url(query, lang, type, pos)
       http_request = lambda do
-        logger.debug("Scraping tweets from #{url}")
+        logger.debug "Scraping tweets from #{url}"
         get_single_page(url, headers, proxies)
       end
 
       if cache_enabled?
         client = Cache.new
         if (response = client.read(url))
-          logger.debug('Fetching tweets from cache')
+          logger.debug 'Fetching tweets from cache'
         else
           response = http_request.call
-          client.write(url, response)
+          client.write(url, response) unless stop_requested?
         end
       else
         response = http_request.call
       end
-      return [], nil if response.nil?
+      return [], nil if response.nil? || response.empty?
 
       html, json_resp = parse_single_page(response, pos.nil?)
 
@@ -103,8 +108,8 @@ module Twitterscraper
 
       if json_resp
         [tweets, json_resp['min_position']]
-      elsif from_user
-        raise NotImplementedError
+      elsif type
+        [tweets, tweets[-1].tweet_id]
       else
         [tweets, "TWEET-#{tweets[-1].tweet_id}-#{tweets[0].tweet_id}"]
       end
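User-timeline pagination replaces the old `raise NotImplementedError`: when the JSON payload carries no `min_position`, the oldest tweet id on the page becomes the next `max_position`. Since `type` is now always `'search'` or `'user'` (both truthy), the old `TWEET-<min>-<max>` branch is effectively a dead fallback. A mirror of the logic, as a hypothetical helper for illustration:

```ruby
# Illustrative only; next_position is not a method in the gem.
def next_position(json_resp, type, tweets)
  if json_resp
    json_resp['min_position']                             # token from the JSON payload
  elsif type                                              # 'search' or 'user' -- always truthy now
    tweets[-1].tweet_id                                   # oldest tweet id on the page
  else
    "TWEET-#{tweets[-1].tweet_id}-#{tweets[0].tweet_id}"  # old search token, now unreachable
  end
end
```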
@@ -112,33 +117,34 @@ module Twitterscraper
 
     OLDEST_DATE = Date.parse('2006-03-21')
 
-    def validate_options!(query, start_date:, end_date:, lang:, limit:, threads:, proxy:)
+    def validate_options!(queries, type:, start_date:, end_date:, lang:, limit:, threads:)
+      query = queries[0]
       if query.nil? || query == ''
-        raise 'Please specify a search query.'
+        raise Error.new('Please specify a search query.')
       end
 
       if ERB::Util.url_encode(query).length >= 500
-        raise ':query must be a UTF-8, URL-encoded search query of 500 characters maximum, including operators.'
+        raise Error.new(':query must be a UTF-8, URL-encoded search query of 500 characters maximum, including operators.')
       end
 
       if start_date && end_date
         if start_date == end_date
-          raise 'Please specify different values for :start_date and :end_date.'
+          raise Error.new('Please specify different values for :start_date and :end_date.')
         elsif start_date > end_date
-          raise ':start_date must occur before :end_date.'
+          raise Error.new(':start_date must occur before :end_date.')
         end
       end
 
       if start_date
         if start_date < OLDEST_DATE
-          raise ":start_date must be greater than or equal to #{OLDEST_DATE}"
+          raise Error.new(":start_date must be greater than or equal to #{OLDEST_DATE}")
        end
       end
 
       if end_date
         today = Date.today
         if end_date > Date.today
-          raise ":end_date must be less than or equal to today(#{today})"
+          raise Error.new(":end_date must be less than or equal to today(#{today})")
         end
       end
     end
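Validation failures now raise `Twitterscraper::Error` instead of bare strings (which Ruby wraps in `RuntimeError`), so callers can rescue them specifically. A sketch, assuming `Twitterscraper::Error` is a `StandardError` subclass defined elsewhere in the gem (its definition is not part of this diff):

```ruby
client = Twitterscraper::Client.new

begin
  client.query_tweets('')
rescue Twitterscraper::Error => e
  warn e.message # => "Please specify a search query."
end
```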
@@ -156,27 +162,32 @@ module Twitterscraper
       end
     end
 
-    def main_loop(query, lang, limit, headers, proxies)
+    def main_loop(query, lang, type, limit, daily_limit, headers, proxies)
       pos = nil
+      daily_tweets = []
 
       while true
-        new_tweets, new_pos = query_single_page(query, lang, pos, headers: headers, proxies: proxies)
+        new_tweets, new_pos = query_single_page(query, lang, type, pos, headers: headers, proxies: proxies)
         unless new_tweets.empty?
+          daily_tweets.concat(new_tweets)
+          daily_tweets.uniq! { |t| t.tweet_id }
+
           @mutex.synchronize {
             @all_tweets.concat(new_tweets)
             @all_tweets.uniq! { |t| t.tweet_id }
           }
         end
-        logger.info("Got #{new_tweets.size} tweets (total #{@all_tweets.size})")
+        logger.info "Got #{new_tweets.size} tweets (total #{@all_tweets.size})"
 
         break unless new_pos
+        break if daily_limit && daily_tweets.size >= daily_limit
         break if @all_tweets.size >= limit
 
         pos = new_pos
       end
 
-      if @all_tweets.size >= limit
-        logger.info("Limit reached #{@all_tweets.size}")
+      if !@stop_requested && @all_tweets.size >= limit
+        logger.warn "The limit you specified has been reached limit=#{limit} tweets=#{@all_tweets.size}"
         @stop_requested = true
       end
     end
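`daily_limit` caps how many tweets are collected per day-slice query (each date between `:start_date` and `:end_date` becomes its own query, as the thread warning in the next hunk notes), while `limit` still caps the total. A usage sketch:

```ruby
client = Twitterscraper::Client.new

tweets = client.search('twitter',
                       start_date: '2020-06-01', end_date: '2020-06-04',
                       limit: 1000,      # overall cap across all day slices
                       daily_limit: 100, # cap per day slice
                       threads: 3)
```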
@@ -185,37 +196,59 @@ module Twitterscraper
       @stop_requested
     end
 
-    def query_tweets(query, start_date: nil, end_date: nil, lang: '', limit: 100, threads: 2, proxy: false)
+    def query_tweets(query, type: 'search', start_date: nil, end_date: nil, lang: nil, limit: 100, daily_limit: nil, order: 'desc', threads: 2)
       start_date = Date.parse(start_date) if start_date && start_date.is_a?(String)
       end_date = Date.parse(end_date) if end_date && end_date.is_a?(String)
       queries = build_queries(query, start_date, end_date)
-      threads = queries.size if threads > queries.size
-      proxies = proxy ? Twitterscraper::Proxy::Pool.new : []
+      if threads > queries.size
+        logger.warn 'The maximum number of :threads is the number of dates between :start_date and :end_date.'
+        threads = queries.size
+      end
+      if proxy_enabled?
+        proxies = Proxy::Pool.new
+        logger.debug "Fetch #{proxies.size} proxies"
+      else
+        proxies = []
+        logger.debug 'Proxy disabled'
+      end
+      logger.debug "Cache #{cache_enabled? ? 'enabled' : 'disabled'}"
 
-      validate_options!(queries[0], start_date: start_date, end_date: end_date, lang: lang, limit: limit, threads: threads, proxy: proxy)
 
-      logger.info("The number of threads #{threads}")
+      validate_options!(queries, type: type, start_date: start_date, end_date: end_date, lang: lang, limit: limit, threads: threads)
+
+      logger.info "The number of threads #{threads}"
 
       headers = {'User-Agent': USER_AGENT_LIST.sample, 'X-Requested-With': 'XMLHttpRequest'}
-      logger.info("Headers #{headers}")
+      logger.info "Headers #{headers}"
 
       @all_tweets = []
       @mutex = Mutex.new
       @stop_requested = false
 
       if threads > 1
+        Thread.abort_on_exception = true
+        logger.debug "Set 'Thread.abort_on_exception' to true"
+
         Parallel.each(queries, in_threads: threads) do |query|
-          main_loop(query, lang, limit, headers, proxies)
+          main_loop(query, lang, type, limit, daily_limit, headers, proxies)
           raise Parallel::Break if stop_requested?
         end
       else
         queries.each do |query|
-          main_loop(query, lang, limit, headers, proxies)
+          main_loop(query, lang, type, limit, daily_limit, headers, proxies)
           break if stop_requested?
         end
       end
 
-      @all_tweets.sort_by { |tweet| -tweet.created_at.to_i }
+      @all_tweets.sort_by { |tweet| (order == 'desc' ? -1 : 1) * tweet.created_at.to_i }
+    end
+
+    def search(query, start_date: nil, end_date: nil, lang: '', limit: 100, daily_limit: nil, order: 'desc', threads: 2)
+      query_tweets(query, type: 'search', start_date: start_date, end_date: end_date, lang: lang, limit: limit, daily_limit: daily_limit, order: order, threads: threads)
+    end
+
+    def user_timeline(screen_name, limit: 100, order: 'desc')
+      query_tweets(screen_name, type: 'user', start_date: nil, end_date: nil, lang: nil, limit: limit, daily_limit: nil, order: order, threads: 1)
     end
   end
 end
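`search` and `user_timeline` are thin wrappers over `query_tweets`: `user_timeline` pins `threads: 1` with no date range, and `order: 'asc'` flips the final sort to oldest-first. For example:

```ruby
client = Twitterscraper::Client.new

newest_first = client.search('twitter', limit: 100)                # order: 'desc' (default)
oldest_first = client.search('twitter', limit: 100, order: 'asc')
timeline     = client.user_timeline('ts_3156', limit: 100)         # always single-threaded
```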
data/lib/twitterscraper/tweet.rb CHANGED
@@ -59,12 +59,19 @@ module Twitterscraper
      def from_tweets_html(html)
        html.map do |tweet|
          from_tweet_html(tweet)
-        end
+        end.compact
      end
 
      def from_tweet_html(html)
+        screen_name = html.attr('data-screen-name')
+        tweet_id = html.attr('data-tweet-id')&.to_i
+
+        unless html.to_s.include?('js-tweet-text-container')
+          Twitterscraper.logger.warn "html doesn't include div.js-tweet-text-container url=https://twitter.com/#{screen_name}/status/#{tweet_id}"
+          return nil
+        end
+
        inner_html = Nokogiri::HTML(html.inner_html)
-        tweet_id = html.attr('data-tweet-id').to_i
        text = inner_html.xpath("//div[@class[contains(., 'js-tweet-text-container')]]/p[@class[contains(., 'js-tweet-text')]]").first.text
        links = inner_html.xpath("//a[@class[contains(., 'twitter-timeline-link')]]").map { |elem| elem.attr('data-expanded-url') }.select { |link| link && !link.include?('pic.twitter') }
        image_urls = inner_html.xpath("//div[@class[contains(., 'AdaptiveMedia-photoContainer')]]").map { |elem| elem.attr('data-image-url') }
@@ -89,7 +96,7 @@ module Twitterscraper
 
        timestamp = inner_html.xpath("//span[@class[contains(., 'js-short-timestamp')]]").first.attr('data-time').to_i
        new(
-          screen_name: html.attr('data-screen-name'),
+          screen_name: screen_name,
          name: html.attr('data-name'),
          user_id: html.attr('data-user-id').to_i,
          tweet_id: tweet_id,
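`from_tweet_html` now returns nil for nodes without a `js-tweet-text-container` (which `.compact` strips out) instead of crashing on `.first.text`, and the id lookup is nil-safe for the same reason:

```ruby
require 'nokogiri'

node = Nokogiri::HTML('<div></div>').at('div')
node.attr('data-tweet-id')       # => nil (attribute missing)
node.attr('data-tweet-id')&.to_i # => nil rather than NoMethodError
```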
data/lib/twitterscraper/version.rb CHANGED
@@ -1,3 +1,3 @@
 module Twitterscraper
-  VERSION = '0.11.0'
+  VERSION = '0.15.1'
 end
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: twitterscraper-ruby
 version: !ruby/object:Gem::Version
-  version: 0.11.0
+  version: 0.15.1
 platform: ruby
 authors:
 - ts-3156
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2020-07-15 00:00:00.000000000 Z
+date: 2020-07-17 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -46,8 +46,10 @@ executables:
 extensions: []
 extra_rdoc_files: []
 files:
+- ".circleci/config.yml"
 - ".gitignore"
 - ".irbrc"
+- ".rspec"
 - ".ruby-version"
 - ".travis.yml"
 - CODE_OF_CONDUCT.md