news_scraper 1.0.0 → 1.1.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: db7d631f3f6cf73ff2e57b9e472804651b9fe1e0
- data.tar.gz: 1045878eb97749d6b264a486ac34bfb89f4796dd
+ metadata.gz: 608b90149fbc8977b1fc3b42c923557b128ad4df
+ data.tar.gz: 0e0914d81488d9630234860b3a9e732a1471158a
  SHA512:
- metadata.gz: a53423be5dbda33ead7dbb46bc494e40fcf412172a291496128b40512c985fc157c646481e1b8a183be2f709486e2cc7ec27d47a17bb9715c3b19dbe09dd7e42
- data.tar.gz: eb43f129a0ca1a9f6eb02f24bfeb891583c96b0cce548afe43cf613231e8908a66e381347aeb16f5be9ea4af3ca96f82dcaf4da005951fb5ae5936d3f3530cbc
+ metadata.gz: 1e344051f216c10b320b324db5dbaeccaa9f034a1bd2d94f1905ea8797ed6389d4780db125bcd354dd553bc0279f75aecc507f5de91d30eb6d00098474e69173
+ data.tar.gz: 1823fe68329a466385e16dd224d49633c122afec26d6c71cd66d9b428888e81e68a73dfc78133a995ad0a9885248ae03389f2d9b3b192b4d514970bc385990b6
data/Gemfile CHANGED
@@ -1,3 +1,7 @@
  source "https://rubygems.org"
 
  gemspec
+
+ group :test do
+   gem 'hashdiff'
+ end
data/README.md CHANGED
@@ -33,14 +33,23 @@ Optionally, you can pass in a block and it will yield the transformed data on a
  It takes in 1 parameter `query:`.
 
  Array notation
- ```
+ ```ruby
  article_hashes = NewsScraper::Scraper.new(query: 'Shopify').scrape # [ { author: ... }, { author: ... } ... ]
  ```
 
+ *Note:* the array notation may raise `NewsScraper::Transformers::ScrapePatternNotDefined` (the domain is not in the configuration) or `NewsScraper::ResponseError` (non-200 response). For this reason, it is suggested to use the block notation, where these errors can be handled properly.
+
  Block notation
- ```
- NewsScraper::Scraper.new(query: 'Shopify').scrape do |article_hash|
-   # { author: ... }
+ ```ruby
+ NewsScraper::Scraper.new(query: 'Shopify').scrape do |a|
+   case a.class.to_s
+   when "NewsScraper::Transformers::ScrapePatternNotDefined"
+     puts "#{a.root_domain} was not trained"
+   when "NewsScraper::ResponseError"
+     puts "#{a.url} returned an error: #{a.error_code}-#{a.message}"
+   else
+     # { author: ... }
+   end
  end
  ```
 
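The block notation's error handling relies on the scraper yielding error *objects* instead of raising, so one bad domain does not abort the whole run. A self-contained sketch of that pattern (the classes and `scrape` method below are illustrative stand-ins, not the gem's actual API):

```ruby
# Stand-in for NewsScraper::Transformers::ScrapePatternNotDefined:
# an error object carrying the domain that could not be scraped.
class ScrapePatternNotDefined < StandardError
  attr_reader :root_domain
  def initialize(root_domain)
    @root_domain = root_domain
    super("#{root_domain} was not trained")
  end
end

# Hypothetical trained-domain registry for the sketch.
TRAINED = { 'trained.com' => { author: 'Jane Doe' } }.freeze

def scrape(domains)
  domains.map do |domain|
    # Untrained domains yield an error object instead of raising.
    result = TRAINED[domain] || ScrapePatternNotDefined.new(domain)
    block_given? ? yield(result) : result
  end
end

scrape(%w(trained.com unknown.com)) do |a|
  case a
  when ScrapePatternNotDefined then "skipped #{a.root_domain}"
  else a[:author]
  end
end
# => ["Jane Doe", "skipped unknown.com"]
```

With the array notation (no block), the caller would instead receive the error objects mixed into the result array, or, in the real gem, an exception.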
@@ -48,12 +57,12 @@ How the `Scraper` extracts and parses for the information is determined by scrap
 
  ### Transformed Data
 
- Calling `NewsScraper::Scraper#scrape` with either the array or block notation will yield `transformed_data` hashes. [`article_scrape_patterns.yml`](https://github.com/richardwu/news_scraper/blob/master/config/article_scrape_patterns.yml) defines the data types that will be scraped for.
+ Calling `NewsScraper::Scraper#scrape` with either the array or block notation will yield `transformed_data` hashes. [`article_scrape_patterns.yml`](https://github.com/news-scraper/news_scraper/blob/master/config/article_scrape_patterns.yml) defines the data types that will be scraped for.
 
  In addition, the `url` and `root_domain` (hostname) of the article will be returned in the hash too.
 
  Example
- ```
+ ```ruby
  {
    author: 'Linus Torvald',
    body: 'The Linux kernel developed by Linus Torvald has become the backbone of most electronic devices we use to-date. It powers mobile phones, laptops, embedded devices, and even rockets...',
@@ -71,12 +80,34 @@ Example
 
  Scrape patterns are xpath or CSS patterns used by Nokogiri to extract relevant HTML elements.
 
- Extracting each `:data_type` (see Example under **Transformed Data**) requires a scrape pattern. A few `:presets` are specified in [`article_scrape_patterns.yml`](https://github.com/richardwu/news_scraper/blob/master/config/article_scrape_patterns.yml).
+ Extracting each `:data_type` (see Example under **Transformed Data**) requires a scrape pattern. A few `:presets` are specified in [`article_scrape_patterns.yml`](https://github.com/news-scraper/news_scraper/blob/master/config/article_scrape_patterns.yml).
 
  Since each news site (identified with `:root_domain`) uses different markup, scrape patterns are defined on a per-`:root_domain` basis.
 
  Specifying scrape patterns for new, undefined `:root_domains` is called training (see **Training**).
 
+ #### Customizing Scrape Patterns
+
+ `NewsScraper.configuration` is the entry point for scrape patterns. By default, it loads the contents of [`article_scrape_patterns.yml`](https://github.com/news-scraper/news_scraper/blob/master/config/article_scrape_patterns.yml), but you can override this with the `fetch_method`, which accepts a proc.
+
+ For example, to override the domains section you can do so like this:
+
+ ```ruby
+ @default_configuration = NewsScraper.configuration.scrape_patterns.dup
+ NewsScraper.configure do |config|
+   config.fetch_method = proc do
+     @default_configuration['domains'] = { ... }
+     @default_configuration
+   end
+ end
+ ```
+
+ Using this method, you can override any part of the configuration individually, or the entire thing; it is fully customizable.
+
+ This helps separate apps that track domain training themselves. If the configuration is not set correctly, a newly trained domain will not be in the configuration and a `NewsScraper::Transformers::ScrapePatternNotDefined` error will be raised.
+
+ It would be appreciated if any domains you train outside of this gem eventually made their way back as a pull request to [`article_scrape_patterns.yml`](https://github.com/news-scraper/news_scraper/blob/master/config/article_scrape_patterns.yml).
+
  ### Training
 
  For each `:root_domain`, it is necessary to specify a scrape pattern for each of the `:data_type`s. A rake task was written to provide a CLI for appending new `:root_domain`s using `:preset` scrape patterns.
@@ -88,6 +119,44 @@ bundle exec rake scraper:train QUERY=<query>
 
  where the CLI will step through the articles and `:root_domain`s of the articles relevant to `<query>`.
 
+ Of course, this simply creates an entry for a `domain` with `domain_entries`, so as long as your application provides the same functionality, this can be overridden in your app. Just provide a domain entry like so:
+
+ ```yaml
+ domains:
+   root_domain.com:
+     author:
+       method: method
+       pattern: pattern
+     body:
+       method: method
+       pattern: pattern
+     description:
+       method: method
+       pattern: pattern
+     keywords:
+       method: method
+       pattern: pattern
+     section:
+       method: method
+       pattern: pattern
+     datetime:
+       method: method
+       pattern: pattern
+     title:
+       method: method
+       pattern: pattern
+ ```
+
+ The options using the presets in [`article_scrape_patterns.yml`](https://github.com/news-scraper/news_scraper/blob/master/config/article_scrape_patterns.yml) can be obtained using this snippet:
+
+ ```ruby
+ include NewsScraper::ExtractorsHelpers
+
+ transformed_data = NewsScraper::Transformers::TrainerArticle.new(
+   url: url,
+   payload: http_request(url).body
+ ).transform
+ ```
+
  ## Development
 
  After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
@@ -96,7 +165,7 @@ To install this gem onto your local machine, run `bundle exec rake install`. To
 
  ## Contributing
 
- Bug reports and pull requests are welcome on GitHub at https://github.com/richardwu/news_scraper. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
+ Bug reports and pull requests are welcome on GitHub at https://github.com/news-scraper/news_scraper. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
 
 
  ## License
data/config/article_scrape_patterns.yml CHANGED
@@ -45,6 +45,9 @@ presets:
    og: &og_description
      method: "xpath"
      pattern: "//meta[@property='og:description']/@content"
+   metainspector: &metainspector_description
+     method: 'metainspector'
+     pattern: :description
  keywords:
    meta: &meta_keywords
      method: "xpath"
@@ -55,6 +58,9 @@ presets:
    news_keywords: &news_keywords_keywords
      method: "xpath"
      pattern: "//meta[@name='news_keywords']/@content"
+   highscore: &highscore_keywords
+     method: highscore
+     pattern: ""
  section:
    meta: &meta_section
      method: "xpath"
@@ -109,6 +115,9 @@ presets:
    og: &og_title
      method: "xpath"
      pattern: "//meta[@property='og:title']/@content"
+   metainspector: &metainspector_title
+     method: 'metainspector'
+     pattern: :best_title
 
  domains:
    investors.com:
data/config/stopwords.yml ADDED
@@ -0,0 +1,459 @@
+ ---
+ - "-"
+ - "--"
+ - ":"
+ - ":d"
+ - 'no'
+ - 'off'
+ - 'on'
+ - about
+ - above
+ - across
+ - after
+ - again
+ - against
+ - ahahaha
+ - all
+ - almost
+ - alone
+ - along
+ - already
+ - also
+ - although
+ - always
+ - am
+ - among
+ - an
+ - and
+ - another
+ - any
+ - anybody
+ - anyone
+ - anything
+ - anywhere
+ - ar
+ - are
+ - area
+ - areas
+ - around
+ - as
+ - ask
+ - asked
+ - asking
+ - asks
+ - at
+ - aw
+ - away
+ - aww
+ - awww
+ - back
+ - backed
+ - backing
+ - backs
+ - be
+ - became
+ - because
+ - become
+ - becomes
+ - been
+ - before
+ - began
+ - behind
+ - being
+ - beings
+ - best
+ - better
+ - between
+ - big
+ - bit
+ - blud
+ - both
+ - bt
+ - but
+ - by
+ - call
+ - came
+ - can
+ - cannot
+ - case
+ - cases
+ - certain
+ - certainly
+ - chat
+ - clear
+ - clearly
+ - come
+ - comments
+ - could
+ - d
+ - did
+ - differ
+ - different
+ - differently
+ - do
+ - does
+ - done
+ - dont
+ - down
+ - downed
+ - downing
+ - downs
+ - dunno
+ - during
+ - each
+ - early
+ - eh
+ - either
+ - email
+ - end
+ - ended
+ - ending
+ - ends
+ - enough
+ - even
+ - evenly
+ - ever
+ - every
+ - everybody
+ - everyone
+ - everything
+ - everywhere
+ - face
+ - faces
+ - fact
+ - facts
+ - far
+ - felt
+ - few
+ - find
+ - finds
+ - first
+ - for
+ - four
+ - from
+ - full
+ - fully
+ - further
+ - furthered
+ - furthering
+ - furthers
+ - gave
+ - general
+ - generally
+ - get
+ - gets
+ - give
+ - given
+ - gives
+ - go
+ - going
+ - good
+ - goods
+ - got
+ - great
+ - greater
+ - greatest
+ - group
+ - grouped
+ - grouping
+ - groups
+ - ha
+ - haaaa
+ - had
+ - haha
+ - has
+ - have
+ - having
+ - he
+ - heh
+ - hehe
+ - hehehe
+ - her
+ - here
+ - herself
+ - high
+ - higher
+ - highest
+ - him
+ - himself
+ - his
+ - hola
+ - how
+ - however
+ - i
+ - id
+ - if
+ - il
+ - im
+ - important
+ - in
+ - init
+ - interest
+ - interested
+ - interesting
+ - interests
+ - into
+ - is
+ - it
+ - its
+ - itself
+ - iv
+ - ive
+ - jst
+ - just
+ - keep
+ - keeps
+ - kind
+ - knew
+ - know
+ - known
+ - knows
+ - large
+ - largely
+ - last
+ - later
+ - latest
+ - least
+ - less
+ - let
+ - lets
+ - like
+ - likely
+ - lol
+ - long
+ - longer
+ - longest
+ - lool
+ - loool
+ - looool
+ - made
+ - mah
+ - make
+ - making
+ - man
+ - many
+ - may
+ - me
+ - member
+ - members
+ - men
+ - might
+ - more
+ - most
+ - mostly
+ - mr
+ - mrs
+ - much
+ - must
+ - my
+ - myself
+ - necessary
+ - need
+ - needed
+ - needing
+ - needs
+ - never
+ - new
+ - newer
+ - newest
+ - next
+ - nobody
+ - non
+ - noone
+ - not
+ - nothing
+ - now
+ - nowhere
+ - number
+ - numbers
+ - of
+ - often
+ - oh
+ - old
+ - older
+ - oldest
+ - once
+ - one
+ - only
+ - ooh
+ - ooo
+ - open
+ - opened
+ - opening
+ - opens
+ - or
+ - order
+ - ordered
+ - ordering
+ - orders
+ - other
+ - others
+ - our
+ - out
+ - over
+ - part
+ - parted
+ - parting
+ - parts
+ - per
+ - perhaps
+ - place
+ - places
+ - pls
+ - point
+ - pointed
+ - pointing
+ - points
+ - possible
+ - powered
+ - present
+ - presented
+ - presenting
+ - presents
+ - problem
+ - problems
+ - put
+ - puts
+ - quite
+ - rather
+ - really
+ - right
+ - room
+ - rooms
+ - run
+ - safe
+ - said
+ - same
+ - saw
+ - say
+ - says
+ - second
+ - seconds
+ - see
+ - seem
+ - seemed
+ - seeming
+ - seems
+ - sees
+ - several
+ - shall
+ - she
+ - should
+ - show
+ - showed
+ - showing
+ - shows
+ - side
+ - sides
+ - since
+ - small
+ - smaller
+ - smallest
+ - so
+ - some
+ - somebody
+ - someone
+ - something
+ - somewhere
+ - state
+ - states
+ - still
+ - stop
+ - such
+ - sure
+ - ta
+ - tail
+ - take
+ - taken
+ - team
+ - than
+ - thank
+ - thanks
+ - that
+ - the
+ - their
+ - them
+ - then
+ - there
+ - therefore
+ - theres
+ - these
+ - they
+ - thing
+ - things
+ - think
+ - thinks
+ - this
+ - those
+ - though
+ - thought
+ - thoughts
+ - three
+ - through
+ - thus
+ - to
+ - today
+ - together
+ - too
+ - took
+ - toward
+ - tryna
+ - turn
+ - turned
+ - turning
+ - turns
+ - two
+ - under
+ - until
+ - up
+ - upon
+ - ur
+ - us
+ - use
+ - used
+ - uses
+ - very
+ - want
+ - wanted
+ - wanting
+ - wants
+ - was
+ - way
+ - ways
+ - we
+ - welcome
+ - well
+ - wells
+ - went
+ - were
+ - what
+ - when
+ - where
+ - whether
+ - which
+ - while
+ - who
+ - whole
+ - whose
+ - why
+ - will
+ - with
+ - within
+ - without
+ - work
+ - worked
+ - working
+ - works
+ - would
+ - ya
+ - yeah
+ - year
+ - years
+ - yet
+ - yo
+ - you
+ - young
+ - younger
+ - youngest
+ - your
+ - yours
data/lib/news_scraper/configuration.rb CHANGED
@@ -1,26 +1,39 @@
  module NewsScraper
    class Configuration
      DEFAULT_SCRAPE_PATTERNS_FILEPATH = File.expand_path('../../../config/article_scrape_patterns.yml', __FILE__)
-     attr_accessor :fetch_method, :scrape_patterns_filepath
+     STOPWORDS_FILEPATH = File.expand_path('../../../config/stopwords.yml', __FILE__)
+     attr_accessor :scrape_patterns_fetch_method, :stopwords_fetch_method, :scrape_patterns_filepath
 
      # <code>NewsScraper::Configuration.initialize</code> initializes the scrape_patterns_filepath
-     # and the fetch_method to the <code>DEFAULT_SCRAPE_PATTERNS_FILEPATH</code>
+     # and the scrape_patterns_fetch_method to the <code>DEFAULT_SCRAPE_PATTERNS_FILEPATH</code>.
+     # It also sets stopwords to be used during extraction to a default set contained in <code>STOPWORDS_FILEPATH</code>
      #
      # Set the <code>scrape_patterns_filepath</code> to <code>nil</code> to disable saving during training
      #
      def initialize
        self.scrape_patterns_filepath = DEFAULT_SCRAPE_PATTERNS_FILEPATH
-       self.fetch_method = proc { default_scrape_patterns }
+       self.scrape_patterns_fetch_method = proc { default_scrape_patterns }
+       self.stopwords_fetch_method = proc { YAML.load_file(STOPWORDS_FILEPATH) }
      end
 
      # <code>NewsScraper::Configuration.scrape_patterns</code> proxies scrape_patterns
-     # requests to <code>fetch_method</code>:
+     # requests to <code>scrape_patterns_fetch_method</code>:
      #
      # *Returns*
-     # - The result of calling the <code>fetch_method</code> proc, expected to be a hash
+     # - The result of calling the <code>scrape_patterns_fetch_method</code> proc, expected to be a hash
      #
      def scrape_patterns
-       fetch_method.call
+       scrape_patterns_fetch_method.call
+     end
+
+     # <code>NewsScraper::Configuration.stopwords</code> proxies stopwords
+     # requests to <code>stopwords_fetch_method</code>:
+     #
+     # *Returns*
+     # - The result of calling the <code>stopwords_fetch_method</code> proc, expected to be an array
+     #
+     def stopwords
+       stopwords_fetch_method.call
      end
 
      private
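The configuration change above follows a proc-based fetch pattern: callers ask the configuration object for data, and the configuration delegates to a swappable proc, so the data source can be replaced without touching callers. A minimal, self-contained sketch of that pattern (`TinyConfig` and its inline YAML are illustrative, not the gem's classes):

```ruby
require 'yaml'

# A stripped-down analogue of NewsScraper::Configuration's fetch-method pattern.
class TinyConfig
  attr_accessor :stopwords_fetch_method

  def initialize
    # Default source: an inline YAML document standing in for stopwords.yml
    @stopwords_fetch_method = proc { YAML.safe_load("---\n- a\n- the\n") }
  end

  # Callers never see the proc; they just ask for stopwords.
  def stopwords
    stopwords_fetch_method.call
  end
end

config = TinyConfig.new
config.stopwords # => ["a", "the"]

# Overriding the proc swaps the data source (e.g. a database or remote service)
# without any change to code that calls config.stopwords.
config.stopwords_fetch_method = proc { %w(foo bar) }
config.stopwords # => ["foo", "bar"]
```

This is why the rename from `fetch_method` to `scrape_patterns_fetch_method` matters: with two independent procs, scrape patterns and stopwords can each be overridden separately.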
data/lib/news_scraper/transformers/article.rb CHANGED
@@ -1,8 +1,11 @@
  require 'nokogiri'
  require 'sanitize'
+ require 'news_scraper/transformers/nokogiri/functions'
+
  require 'readability'
  require 'htmlbeautifier'
- require 'news_scraper/transformers/nokogiri/functions'
+ require 'metainspector'
+ require 'news_scraper/transformers/helpers/highscore_parser'
 
  module NewsScraper
    module Transformers
@@ -69,6 +72,11 @@ module NewsScraper
          # Remove any newlines in the text
          content = content.squeeze("\n").strip
          HtmlBeautifier.beautify(content)
+       when :metainspector
+         page = MetaInspector.new(@url, document: @payload)
+         page.respond_to?(scrape_pattern.to_sym) ? page.send(scrape_pattern.to_sym) : nil
+       when :highscore
+         NewsScraper::Transformers::Helpers::HighScoreParser.keywords(url: @url, payload: @payload)
        end
      end
    end
data/lib/news_scraper/transformers/helpers/highscore_parser.rb ADDED
@@ -0,0 +1,50 @@
+ require 'metainspector'
+ require 'highscore'
+ require 'readability'
+
+ module NewsScraper
+   module Transformers
+     module Helpers
+       class HighScoreParser
+         class << self
+           # <code>NewsScraper::Transformers::Helpers::HighScoreParser.keywords</code> parses out keywords
+           #
+           # *Params*
+           # - <code>url:</code>: keyword for the url to parse to a uri
+           # - <code>payload:</code>: keyword for the payload from a request to the url (html body)
+           #
+           # *Returns*
+           # - <code>keywords</code>: Top 5 keywords from the body of text
+           #
+           def keywords(url:, payload:)
+             blacklist = Highscore::Blacklist.load(stopwords(url, payload))
+             content = Readability::Document.new(payload, remove_empty_nodes: true, tags: [], attributes: []).content
+             highscore(content, blacklist)
+           end
+
+           private
+
+           def highscore(content, blacklist)
+             text = Highscore::Content.new(content, blacklist)
+             text.configure do
+               set :multiplier, 2
+               set :upper_case, 3
+               set :long_words, 2
+               set :long_words_threshold, 15
+               set :ignore_case, true
+             end
+             text.keywords.top(5).collect(&:text)
+           end
+
+           def stopwords(url, payload)
+             page = MetaInspector.new(url, document: payload)
+             stopwords = NewsScraper.configuration.stopwords
+             # Add the site name to the stop words
+             stopwords += page.meta['og:site_name'].downcase.split(' ') if page.meta['og:site_name']
+             stopwords
+           end
+         end
+       end
+     end
+   end
+ end
data/lib/news_scraper/version.rb CHANGED
@@ -1,3 +1,3 @@
  module NewsScraper
-   VERSION = "1.0.0".freeze
+   VERSION = "1.1.0".freeze
  end
data/news_scraper.gemspec CHANGED
@@ -1,3 +1,4 @@
+ # rubocop:disable BlockLength
  # coding: utf-8
  lib = File.expand_path('../lib', __FILE__)
  $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
@@ -29,7 +30,9 @@ Gem::Specification.new do |spec|
    spec.add_dependency 'sanitize', '~> 4.2', '>= 4.2.0'
    spec.add_dependency 'ruby-readability', '~> 0.7', '>= 0.7.0'
    spec.add_dependency 'htmlbeautifier', '~> 1.1', '>= 1.1.1'
-   spec.add_dependency 'terminal-table', '~> 1.5', '>= 1.5.2'
+   spec.add_dependency 'terminal-table', '~> 1.7.0', '>= 1.7.0'
+   spec.add_dependency 'metainspector', '~> 5.3.0', '>= 5.3.0'
+   spec.add_dependency 'highscore', '~> 1.2.0', '>= 1.2.0'
 
    spec.add_development_dependency 'bundler', '~> 1.12', '>= 1.12.0'
    spec.add_development_dependency 'rake', '~> 10.0', '>= 10.0.0'
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: news_scraper
  version: !ruby/object:Gem::Version
-   version: 1.0.0
+   version: 1.1.0
  platform: ruby
  authors:
  - Richard Wu
@@ -9,7 +9,7 @@ authors:
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2016-09-25 00:00:00.000000000 Z
+ date: 2016-10-16 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: nokogiri
@@ -117,20 +117,60 @@ dependencies:
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
-       version: '1.5'
+       version: 1.7.0
    - - ">="
      - !ruby/object:Gem::Version
-       version: 1.5.2
+       version: 1.7.0
    type: :runtime
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
      requirements:
      - - "~>"
        - !ruby/object:Gem::Version
-         version: '1.5'
+         version: 1.7.0
      - - ">="
        - !ruby/object:Gem::Version
-         version: 1.5.2
+         version: 1.7.0
+ - !ruby/object:Gem::Dependency
+   name: metainspector
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: 5.3.0
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: 5.3.0
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: 5.3.0
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: 5.3.0
+ - !ruby/object:Gem::Dependency
+   name: highscore
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: 1.2.0
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: 1.2.0
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: 1.2.0
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: 1.2.0
  - !ruby/object:Gem::Dependency
    name: bundler
    requirement: !ruby/object:Gem::Requirement
@@ -325,6 +365,7 @@ files:
  - bin/setup
  - circle.yml
  - config/article_scrape_patterns.yml
+ - config/stopwords.yml
  - config/temp_dirs.yml
  - dev.yml
  - lib/news_scraper.rb
@@ -340,6 +381,7 @@ files:
  - lib/news_scraper/trainer/preset_selector.rb
  - lib/news_scraper/trainer/url_trainer.rb
  - lib/news_scraper/transformers/article.rb
+ - lib/news_scraper/transformers/helpers/highscore_parser.rb
  - lib/news_scraper/transformers/nokogiri/functions.rb
  - lib/news_scraper/transformers/trainer_article.rb
  - lib/news_scraper/uri_parser.rb