news_scraper 1.0.0 → 1.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: db7d631f3f6cf73ff2e57b9e472804651b9fe1e0
4
- data.tar.gz: 1045878eb97749d6b264a486ac34bfb89f4796dd
3
+ metadata.gz: 608b90149fbc8977b1fc3b42c923557b128ad4df
4
+ data.tar.gz: 0e0914d81488d9630234860b3a9e732a1471158a
5
5
  SHA512:
6
- metadata.gz: a53423be5dbda33ead7dbb46bc494e40fcf412172a291496128b40512c985fc157c646481e1b8a183be2f709486e2cc7ec27d47a17bb9715c3b19dbe09dd7e42
7
- data.tar.gz: eb43f129a0ca1a9f6eb02f24bfeb891583c96b0cce548afe43cf613231e8908a66e381347aeb16f5be9ea4af3ca96f82dcaf4da005951fb5ae5936d3f3530cbc
6
+ metadata.gz: 1e344051f216c10b320b324db5dbaeccaa9f034a1bd2d94f1905ea8797ed6389d4780db125bcd354dd553bc0279f75aecc507f5de91d30eb6d00098474e69173
7
+ data.tar.gz: 1823fe68329a466385e16dd224d49633c122afec26d6c71cd66d9b428888e81e68a73dfc78133a995ad0a9885248ae03389f2d9b3b192b4d514970bc385990b6
data/Gemfile CHANGED
@@ -1,3 +1,7 @@
1
1
  source "https://rubygems.org"
2
2
 
3
3
  gemspec
4
+
5
+ group :test do
6
+ gem 'hashdiff'
7
+ end
data/README.md CHANGED
@@ -33,14 +33,23 @@ Optionally, you can pass in a block and it will yield the transformed data on a
33
33
  It takes in 1 parameter `query:`.
34
34
 
35
35
  Array notation
36
- ```
36
+ ```ruby
37
37
  article_hashes = NewsScraper::Scraper.new(query: 'Shopify').scrape # [ { author: ... }, { author: ... } ... ]
38
38
  ```
39
39
 
40
+ *Note:* the array notation may raise `NewsScraper::Transformers::ScrapePatternNotDefined` (the domain is not in the configuration) or `NewsScraper::ResponseError` (a non-200 response). For this reason, the block notation is recommended, as it allows these errors to be handled properly.
41
+
40
42
  Block notation
41
- ```
42
- NewsScraper::Scraper.new(query: 'Shopify').scrape do |article_hash|
43
- # { author: ... }
43
+ ```ruby
44
+ NewsScraper::Scraper.new(query: 'Shopify').scrape do |a|
45
+ case a.class.to_s
46
+ when "NewsScraper::Transformers::ScrapePatternNotDefined"
47
+ puts "#{a.root_domain} was not trained"
48
+ when "NewsScraper::ResponseError"
49
+ puts "#{a.url} returned an error: #{a.error_code}-#{a.message}"
50
+ else
51
+ # { author: ... }
52
+ end
44
53
  end
45
54
  ```
46
55
 
@@ -48,12 +57,12 @@ How the `Scraper` extracts and parses for the information is determined by scrap
48
57
 
49
58
  ### Transformed Data
50
59
 
51
- Calling `NewsScraper::Scraper#scrape` with either the array or block notation will yield `transformed_data` hashes. [`article_scrape_patterns.yml`](https://github.com/richardwu/news_scraper/blob/master/config/article_scrape_patterns.yml) defines the data types that will be scraped for.
60
+ Calling `NewsScraper::Scraper#scrape` with either the array or block notation will yield `transformed_data` hashes. [`article_scrape_patterns.yml`](https://github.com/news-scraper/news_scraper/blob/master/config/article_scrape_patterns.yml) defines the data types that will be scraped for.
52
61
 
53
62
  In addition, the `url` and `root_domain` (hostname) of the article will be returned in the hash too.
54
63
 
55
64
  Example
56
- ```
65
+ ```ruby
57
66
  {
58
67
  author: 'Linus Torvalds',
59
68
  body: 'The Linux kernel developed by Linus Torvalds has become the backbone of most electronic devices we use to-date. It powers mobile phones, laptops, embedded devices, and even rockets...',
@@ -71,12 +80,34 @@ Example
71
80
 
72
81
  Scrape patterns are xpath or CSS patterns used by Nokogiri to extract relevant HTML elements.
73
82
 
74
- Extracting each `:data_type` (see Example under **Transformed Data**) requires a scrape pattern. A few `:presets` are specified in [`article_scrape_patterns.yml`](https://github.com/richardwu/news_scraper/blob/master/config/article_scrape_patterns.yml).
83
+ Extracting each `:data_type` (see Example under **Transformed Data**) requires a scrape pattern. A few `:presets` are specified in [`article_scrape_patterns.yml`](https://github.com/news-scraper/news_scraper/blob/master/config/article_scrape_patterns.yml).
75
84
 
76
85
  Since each news site (identified with `:root_domain`) uses a different markup, scrape patterns are defined on a per-`:root_domain` basis.
77
86
 
78
87
  Specifying scrape patterns for new, undefined `:root_domains` is called training (see **Training**).
79
88
 
89
+ #### Customizing Scrape Patterns
90
+
91
+ `NewsScraper.configuration` is the entry point for scrape patterns. By default, it loads the contents of [`article_scrape_patterns.yml`](https://github.com/news-scraper/news_scraper/blob/master/config/article_scrape_patterns.yml), but you can override this by assigning a proc to `scrape_patterns_fetch_method`.
92
+
93
+ For example, you can override the `domains` section like so:
94
+
95
+ ```ruby
96
+ @default_configuration = NewsScraper.configuration.scrape_patterns.dup
97
+ NewsScraper.configure do |config|
98
+ config.scrape_patterns_fetch_method = proc do
99
+ @default_configuration['domains'] = { ... }
100
+ @default_configuration
101
+ end
102
+ end
103
+ ```
104
+
105
+ Using this method you can override any part of the configuration individually, or replace it entirely; it is fully customizable.
106
+
107
+ This is useful for applications that track domain training themselves. If the configuration is not set up correctly, a newly trained domain will not be present in the configuration and a `NewsScraper::Transformers::ScrapePatternNotDefined` error will be raised.
108
+
109
+ If you train domains outside of this gem, a pull request contributing them back to [`article_scrape_patterns.yml`](https://github.com/news-scraper/news_scraper/blob/master/config/article_scrape_patterns.yml) would be appreciated.
110
+
80
111
  ### Training
81
112
 
82
113
  For each `:root_domain`, it is necessary to specify a scrape pattern for each of the `:data_type`s. A rake task provides a CLI for appending new `:root_domain`s using `:preset` scrape patterns.
@@ -88,6 +119,44 @@ bundle exec rake scraper:train QUERY=<query>
88
119
 
89
120
  where the CLI will step through the articles and `:root_domain`s of the articles relevant to `<query>`.
90
121
 
122
+ This simply creates an entry for a `domain` with its `domain_entries`, so as long as your application provides the same structure, training can be overridden in your app. Just provide a domain entry like so:
123
+
124
+ ```yaml
125
+ domains:
126
+ root_domain.com:
127
+ author:
128
+ method: method
129
+ pattern: pattern
130
+ body:
131
+ method: method
132
+ pattern: pattern
133
+ description:
134
+ method: method
135
+ pattern: pattern
136
+ keywords:
137
+ method: method
138
+ pattern: pattern
139
+ section:
140
+ method: method
141
+ pattern: pattern
142
+ datetime:
143
+ method: method
144
+ pattern: pattern
145
+ title:
146
+ method: method
147
+ pattern: pattern
148
+ ```
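+ As an illustration, an entry like the one above can be merged into an existing scrape-patterns hash app-side. The following is a standalone sketch using plain YAML; the variable names (`entry`, `patterns`) are illustrative and not part of the gem's API:
+
+ ```ruby
+ require 'yaml'
+
+ # Hypothetical app-side sketch: parse a domain entry like the one above and
+ # merge it into an existing scrape-patterns hash.
+ entry = YAML.safe_load(<<~YAML)
+   domains:
+     root_domain.com:
+       title:
+         method: xpath
+         pattern: "//meta[@property='og:title']/@content"
+ YAML
+
+ patterns = { 'domains' => { 'existing.com' => {} } }
+ patterns['domains'].merge!(entry['domains'])
+ puts patterns['domains'].keys.inspect
+ # => ["existing.com", "root_domain.com"]
+ ```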
149
+
150
+ The preset options defined in [`article_scrape_patterns.yml`](https://github.com/news-scraper/news_scraper/blob/master/config/article_scrape_patterns.yml) can be obtained using this snippet:
151
+ ```ruby
152
+ include NewsScraper::ExtractorsHelpers
153
+
154
+ transformed_data = NewsScraper::Transformers::TrainerArticle.new(
155
+ url: url,
156
+ payload: http_request(url).body
157
+ ).transform
158
+ ```
159
+
91
160
  ## Development
92
161
 
93
162
  After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
@@ -96,7 +165,7 @@ To install this gem onto your local machine, run `bundle exec rake install`. To
96
165
 
97
166
  ## Contributing
98
167
 
99
- Bug reports and pull requests are welcome on GitHub at https://github.com/richardwu/news_scraper. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
168
+ Bug reports and pull requests are welcome on GitHub at https://github.com/news-scraper/news_scraper. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
100
169
 
101
170
 
102
171
  ## License
@@ -45,6 +45,9 @@ presets:
45
45
  og: &og_description
46
46
  method: "xpath"
47
47
  pattern: "//meta[@property='og:description']/@content"
48
+ metainspector: &metainspector_description
49
+ method: 'metainspector'
50
+ pattern: :description
48
51
  keywords:
49
52
  meta: &meta_keywords
50
53
  method: "xpath"
@@ -55,6 +58,9 @@ presets:
55
58
  news_keywords: &news_keywords_keywords
56
59
  method: "xpath"
57
60
  pattern: "//meta[@name='news_keywords']/@content"
61
+ highscore: &highscore_keywords
62
+ method: highscore
63
+ pattern: ""
58
64
  section:
59
65
  meta: &meta_section
60
66
  method: "xpath"
@@ -109,6 +115,9 @@ presets:
109
115
  og: &og_title
110
116
  method: "xpath"
111
117
  pattern: "//meta[@property='og:title']/@content"
118
+ metainspector: &metainspector_title
119
+ method: 'metainspector'
120
+ pattern: :best_title
112
121
 
113
122
  domains:
114
123
  investors.com:
@@ -0,0 +1,459 @@
1
+ ---
2
+ - "-"
3
+ - "--"
4
+ - ":"
5
+ - ":d"
6
+ - 'no'
7
+ - 'off'
8
+ - 'on'
9
+ - about
10
+ - above
11
+ - across
12
+ - after
13
+ - again
14
+ - against
15
+ - ahahaha
16
+ - all
17
+ - almost
18
+ - alone
19
+ - along
20
+ - already
21
+ - also
22
+ - although
23
+ - always
24
+ - am
25
+ - among
26
+ - an
27
+ - and
28
+ - another
29
+ - any
30
+ - anybody
31
+ - anyone
32
+ - anything
33
+ - anywhere
34
+ - ar
35
+ - are
36
+ - area
37
+ - areas
38
+ - around
39
+ - as
40
+ - ask
41
+ - asked
42
+ - asking
43
+ - asks
44
+ - at
45
+ - aw
46
+ - away
47
+ - aww
48
+ - awww
49
+ - back
50
+ - backed
51
+ - backing
52
+ - backs
53
+ - be
54
+ - became
55
+ - because
56
+ - become
57
+ - becomes
58
+ - been
59
+ - before
60
+ - began
61
+ - behind
62
+ - being
63
+ - beings
64
+ - best
65
+ - better
66
+ - between
67
+ - big
68
+ - bit
69
+ - blud
70
+ - both
71
+ - bt
72
+ - but
73
+ - by
74
+ - call
75
+ - came
76
+ - can
77
+ - cannot
78
+ - case
79
+ - cases
80
+ - certain
81
+ - certainly
82
+ - chat
83
+ - clear
84
+ - clearly
85
+ - come
86
+ - comments
87
+ - could
88
+ - d
89
+ - did
90
+ - differ
91
+ - different
92
+ - differently
93
+ - do
94
+ - does
95
+ - done
96
+ - dont
97
+ - down
98
+ - downed
99
+ - downing
100
+ - downs
101
+ - dunno
102
+ - during
103
+ - each
104
+ - early
105
+ - eh
106
+ - either
107
+ - email
108
+ - end
109
+ - ended
110
+ - ending
111
+ - ends
112
+ - enough
113
+ - even
114
+ - evenly
115
+ - ever
116
+ - every
117
+ - everybody
118
+ - everyone
119
+ - everything
120
+ - everywhere
121
+ - face
122
+ - faces
123
+ - fact
124
+ - facts
125
+ - far
126
+ - felt
127
+ - few
128
+ - find
129
+ - finds
130
+ - first
131
+ - for
132
+ - four
133
+ - from
134
+ - full
135
+ - fully
136
+ - further
137
+ - furthered
138
+ - furthering
139
+ - furthers
140
+ - gave
141
+ - general
142
+ - generally
143
+ - get
144
+ - gets
145
+ - give
146
+ - given
147
+ - gives
148
+ - go
149
+ - going
150
+ - good
151
+ - goods
152
+ - got
153
+ - great
154
+ - greater
155
+ - greatest
156
+ - group
157
+ - grouped
158
+ - grouping
159
+ - groups
160
+ - ha
161
+ - haaaa
162
+ - had
163
+ - haha
164
+ - has
165
+ - have
166
+ - having
167
+ - he
168
+ - heh
169
+ - hehe
170
+ - hehehe
171
+ - her
172
+ - here
173
+ - herself
174
+ - high
175
+ - higher
176
+ - highest
177
+ - him
178
+ - himself
179
+ - his
180
+ - hola
181
+ - how
182
+ - however
183
+ - i
184
+ - id
185
+ - if
186
+ - il
187
+ - im
188
+ - important
189
+ - in
190
+ - init
191
+ - interest
192
+ - interested
193
+ - interesting
194
+ - interests
195
+ - into
196
+ - is
197
+ - it
198
+ - its
199
+ - itself
200
+ - iv
201
+ - ive
202
+ - jst
203
+ - just
204
+ - keep
205
+ - keeps
206
+ - kind
207
+ - knew
208
+ - know
209
+ - known
210
+ - knows
211
+ - large
212
+ - largely
213
+ - last
214
+ - later
215
+ - latest
216
+ - least
217
+ - less
218
+ - let
219
+ - lets
220
+ - like
221
+ - likely
222
+ - lol
223
+ - long
224
+ - longer
225
+ - longest
226
+ - lool
227
+ - loool
228
+ - looool
229
+ - made
230
+ - mah
231
+ - make
232
+ - making
233
+ - man
234
+ - many
235
+ - may
236
+ - me
237
+ - member
238
+ - members
239
+ - men
240
+ - might
241
+ - more
242
+ - most
243
+ - mostly
244
+ - mr
245
+ - mrs
246
+ - much
247
+ - must
248
+ - my
249
+ - myself
250
+ - necessary
251
+ - need
252
+ - needed
253
+ - needing
254
+ - needs
255
+ - never
256
+ - new
257
+ - newer
258
+ - newest
259
+ - next
260
+ - nobody
261
+ - non
262
+ - noone
263
+ - not
264
+ - nothing
265
+ - now
266
+ - nowhere
267
+ - number
268
+ - numbers
269
+ - of
270
+ - often
271
+ - oh
272
+ - old
273
+ - older
274
+ - oldest
275
+ - once
276
+ - one
277
+ - only
278
+ - ooh
279
+ - ooo
280
+ - open
281
+ - opened
282
+ - opening
283
+ - opens
284
+ - or
285
+ - order
286
+ - ordered
287
+ - ordering
288
+ - orders
289
+ - other
290
+ - others
291
+ - our
292
+ - out
293
+ - over
294
+ - part
295
+ - parted
296
+ - parting
297
+ - parts
298
+ - per
299
+ - perhaps
300
+ - place
301
+ - places
302
+ - pls
303
+ - point
304
+ - pointed
305
+ - pointing
306
+ - points
307
+ - possible
308
+ - powered
309
+ - present
310
+ - presented
311
+ - presenting
312
+ - presents
313
+ - problem
314
+ - problems
315
+ - put
316
+ - puts
317
+ - quite
318
+ - rather
319
+ - really
320
+ - right
321
+ - room
322
+ - rooms
323
+ - run
324
+ - safe
325
+ - said
326
+ - same
327
+ - saw
328
+ - say
329
+ - says
330
+ - second
331
+ - seconds
332
+ - see
333
+ - seem
334
+ - seemed
335
+ - seeming
336
+ - seems
337
+ - sees
338
+ - several
339
+ - shall
340
+ - she
341
+ - should
342
+ - show
343
+ - showed
344
+ - showing
345
+ - shows
346
+ - side
347
+ - sides
348
+ - since
349
+ - small
350
+ - smaller
351
+ - smallest
352
+ - so
353
+ - some
354
+ - somebody
355
+ - someone
356
+ - something
357
+ - somewhere
358
+ - state
359
+ - states
360
+ - still
361
+ - stop
362
+ - such
363
+ - sure
364
+ - ta
365
+ - tail
366
+ - take
367
+ - taken
368
+ - team
369
+ - than
370
+ - thank
371
+ - thanks
372
+ - that
373
+ - the
374
+ - their
375
+ - them
376
+ - then
377
+ - there
378
+ - therefore
379
+ - theres
380
+ - these
381
+ - they
382
+ - thing
383
+ - things
384
+ - think
385
+ - thinks
386
+ - this
387
+ - those
388
+ - though
389
+ - thought
390
+ - thoughts
391
+ - three
392
+ - through
393
+ - thus
394
+ - to
395
+ - today
396
+ - together
397
+ - too
398
+ - took
399
+ - toward
400
+ - tryna
401
+ - turn
402
+ - turned
403
+ - turning
404
+ - turns
405
+ - two
406
+ - under
407
+ - until
408
+ - up
409
+ - upon
410
+ - ur
411
+ - us
412
+ - use
413
+ - used
414
+ - uses
415
+ - very
416
+ - want
417
+ - wanted
418
+ - wanting
419
+ - wants
420
+ - was
421
+ - way
422
+ - ways
423
+ - we
424
+ - welcome
425
+ - well
426
+ - wells
427
+ - went
428
+ - were
429
+ - what
430
+ - when
431
+ - where
432
+ - whether
433
+ - which
434
+ - while
435
+ - who
436
+ - whole
437
+ - whose
438
+ - why
439
+ - will
440
+ - with
441
+ - within
442
+ - without
443
+ - work
444
+ - worked
445
+ - working
446
+ - works
447
+ - would
448
+ - ya
449
+ - yeah
450
+ - year
451
+ - years
452
+ - yet
453
+ - yo
454
+ - you
455
+ - young
456
+ - younger
457
+ - youngest
458
+ - your
459
+ - yours
@@ -1,26 +1,39 @@
1
1
  module NewsScraper
2
2
  class Configuration
3
3
  DEFAULT_SCRAPE_PATTERNS_FILEPATH = File.expand_path('../../../config/article_scrape_patterns.yml', __FILE__)
4
- attr_accessor :fetch_method, :scrape_patterns_filepath
4
+ STOPWORDS_FILEPATH = File.expand_path('../../../config/stopwords.yml', __FILE__)
5
+ attr_accessor :scrape_patterns_fetch_method, :stopwords_fetch_method, :scrape_patterns_filepath
5
6
 
6
7
  # <code>NewsScraper::Configuration.initialize</code> initializes the scrape_patterns_filepath
7
- # and the fetch_method to the <code>DEFAULT_SCRAPE_PATTERNS_FILEPATH</code>
8
+ # and the scrape_patterns_fetch_method to the <code>DEFAULT_SCRAPE_PATTERNS_FILEPATH</code>.
9
+ # It also sets stopwords to be used during extraction to a default set contained in <code>STOPWORDS_FILEPATH</code>
8
10
  #
9
11
  # Set the <code>scrape_patterns_filepath</code> to <code>nil</code> to disable saving during training
10
12
  #
11
13
  def initialize
12
14
  self.scrape_patterns_filepath = DEFAULT_SCRAPE_PATTERNS_FILEPATH
13
- self.fetch_method = proc { default_scrape_patterns }
15
+ self.scrape_patterns_fetch_method = proc { default_scrape_patterns }
16
+ self.stopwords_fetch_method = proc { YAML.load_file(STOPWORDS_FILEPATH) }
14
17
  end
15
18
 
16
19
  # <code>NewsScraper::Configuration.scrape_patterns</code> proxies scrape_patterns
17
- # requests to <code>fetch_method</code>:
20
+ # requests to <code>scrape_patterns_fetch_method</code>:
18
21
  #
19
22
  # *Returns*
20
- # - The result of calling the <code>fetch_method</code> proc, expected to be a hash
23
+ # - The result of calling the <code>scrape_patterns_fetch_method</code> proc, expected to be a hash
21
24
  #
22
25
  def scrape_patterns
23
- fetch_method.call
26
+ scrape_patterns_fetch_method.call
27
+ end
28
+
29
+ # <code>NewsScraper::Configuration.stopwords</code> proxies stopwords
30
+ # requests to <code>stopwords_fetch_method</code>:
31
+ #
32
+ # *Returns*
33
+ # - The result of calling the <code>stopwords_fetch_method</code> proc, expected to be an array
34
+ #
35
+ def stopwords
36
+ stopwords_fetch_method.call
24
37
  end
25
38
 
26
39
  private
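The proc-based accessors above can be illustrated with a self-contained sketch. This is a simplified stand-in for demonstration, not the gem's actual class:

```ruby
# Simplified stand-in for NewsScraper::Configuration's proc-based settings:
# each value is stored as a proc, so callers can swap in their own source
# (a file, a database, a remote service) without subclassing.
class MiniConfiguration
  attr_accessor :stopwords_fetch_method

  def initialize
    # The gem's default loads config/stopwords.yml; here we use a static list.
    self.stopwords_fetch_method = proc { %w(a an the) }
  end

  def stopwords
    # The proc is called on every access, so overrides take effect immediately.
    stopwords_fetch_method.call
  end
end

config = MiniConfiguration.new
puts config.stopwords.inspect   # default list
config.stopwords_fetch_method = proc { %w(shopify inc) }
puts config.stopwords.inspect   # overridden list
```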
@@ -1,8 +1,11 @@
1
1
  require 'nokogiri'
2
2
  require 'sanitize'
3
+ require 'news_scraper/transformers/nokogiri/functions'
4
+
3
5
  require 'readability'
4
6
  require 'htmlbeautifier'
5
- require 'news_scraper/transformers/nokogiri/functions'
7
+ require 'metainspector'
8
+ require 'news_scraper/transformers/helpers/highscore_parser'
6
9
 
7
10
  module NewsScraper
8
11
  module Transformers
@@ -69,6 +72,11 @@ module NewsScraper
69
72
  # Remove any newlines in the text
70
73
  content = content.squeeze("\n").strip
71
74
  HtmlBeautifier.beautify(content)
75
+ when :metainspector
76
+ page = MetaInspector.new(@url, document: @payload)
77
+ page.respond_to?(scrape_pattern.to_sym) ? page.send(scrape_pattern.to_sym) : nil
78
+ when :highscore
79
+ NewsScraper::Transformers::Helpers::HighScoreParser.keywords(url: @url, payload: @payload)
72
80
  end
73
81
  end
74
82
  end
@@ -0,0 +1,50 @@
1
+ require 'metainspector'
2
+ require 'highscore'
3
+ require 'readability'
4
+
5
+ module NewsScraper
6
+ module Transformers
7
+ module Helpers
8
+ class HighScoreParser
9
+ class << self
10
+ # <code>NewsScraper::Transformers::Helpers::HighScoreParser.keywords</code> parses out keywords
11
+ #
12
+ # *Params*
13
+ # - <code>url:</code>: keyword for the url to parse to a uri
14
+ # - <code>payload:</code>: keyword for the payload from a request to the url (html body)
15
+ #
16
+ # *Returns*
17
+ # - <code>keywords</code>: Top 5 keywords from the body of text
18
+ #
19
+ def keywords(url:, payload:)
20
+ blacklist = Highscore::Blacklist.load(stopwords(url, payload))
21
+ content = Readability::Document.new(payload, remove_empty_nodes: true, tags: [], attributes: []).content
22
+ highscore(content, blacklist)
23
+ end
24
+
25
+ private
26
+
27
+ def highscore(content, blacklist)
28
+ text = Highscore::Content.new(content, blacklist)
29
+ text.configure do
30
+ set :multiplier, 2
31
+ set :upper_case, 3
32
+ set :long_words, 2
33
+ set :long_words_threshold, 15
34
+ set :ignore_case, true
35
+ end
36
+ text.keywords.top(5).collect(&:text)
37
+ end
38
+
39
+ def stopwords(url, payload)
40
+ page = MetaInspector.new(url, document: payload)
41
+ stopwords = NewsScraper.configuration.stopwords
42
+ # Add the site name to the stop words
43
+ stopwords += page.meta['og:site_name'].downcase.split(' ') if page.meta['og:site_name']
44
+ stopwords
45
+ end
46
+ end
47
+ end
48
+ end
49
+ end
50
+ end
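The essence of `HighScoreParser` can be sketched without the `highscore` and `metainspector` gems: drop stopwords, count the remaining words, and take the top five. This is a rough dependency-free approximation, not the gem's actual scoring (the real implementation delegates weighting to the highscore gem):

```ruby
# Rough approximation of HighScoreParser: strip stopwords, count the
# remaining words, and return the top 5 by frequency.
STOPWORDS = %w(the a an and of to in is was has by).freeze

def top_keywords(text, stopwords: STOPWORDS, limit: 5)
  words = text.downcase.scan(/[a-z']+/)
  counts = Hash.new(0)
  words.each { |w| counts[w] += 1 unless stopwords.include?(w) }
  # Sort by descending count, then alphabetically for a deterministic order
  counts.sort_by { |word, count| [-count, word] }.first(limit).map(&:first)
end

puts top_keywords('the linux kernel powers the linux ecosystem').inspect
# => ["linux", "ecosystem", "kernel", "powers"]
```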
@@ -1,3 +1,3 @@
1
1
  module NewsScraper
2
- VERSION = "1.0.0".freeze
2
+ VERSION = "1.1.0".freeze
3
3
  end
data/news_scraper.gemspec CHANGED
@@ -1,3 +1,4 @@
1
+ # rubocop:disable BlockLength
1
2
  # coding: utf-8
2
3
  lib = File.expand_path('../lib', __FILE__)
3
4
  $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
@@ -29,7 +30,9 @@ Gem::Specification.new do |spec|
29
30
  spec.add_dependency 'sanitize', '~> 4.2', '>= 4.2.0'
30
31
  spec.add_dependency 'ruby-readability', '~> 0.7', '>= 0.7.0'
31
32
  spec.add_dependency 'htmlbeautifier', '~> 1.1', '>= 1.1.1'
32
- spec.add_dependency 'terminal-table', '~> 1.5', '>= 1.5.2'
33
+ spec.add_dependency 'terminal-table', '~> 1.7.0', '>= 1.7.0'
34
+ spec.add_dependency 'metainspector', '~> 5.3.0', '>= 5.3.0'
35
+ spec.add_dependency 'highscore', '~> 1.2.0', '>= 1.2.0'
33
36
 
34
37
  spec.add_development_dependency 'bundler', '~> 1.12', '>= 1.12.0'
35
38
  spec.add_development_dependency 'rake', '~> 10.0', '>= 10.0.0'
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: news_scraper
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.0.0
4
+ version: 1.1.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Richard Wu
@@ -9,7 +9,7 @@ authors:
9
9
  autorequire:
10
10
  bindir: exe
11
11
  cert_chain: []
12
- date: 2016-09-25 00:00:00.000000000 Z
12
+ date: 2016-10-16 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: nokogiri
@@ -117,20 +117,60 @@ dependencies:
117
117
  requirements:
118
118
  - - "~>"
119
119
  - !ruby/object:Gem::Version
120
- version: '1.5'
120
+ version: 1.7.0
121
121
  - - ">="
122
122
  - !ruby/object:Gem::Version
123
- version: 1.5.2
123
+ version: 1.7.0
124
124
  type: :runtime
125
125
  prerelease: false
126
126
  version_requirements: !ruby/object:Gem::Requirement
127
127
  requirements:
128
128
  - - "~>"
129
129
  - !ruby/object:Gem::Version
130
- version: '1.5'
130
+ version: 1.7.0
131
131
  - - ">="
132
132
  - !ruby/object:Gem::Version
133
- version: 1.5.2
133
+ version: 1.7.0
134
+ - !ruby/object:Gem::Dependency
135
+ name: metainspector
136
+ requirement: !ruby/object:Gem::Requirement
137
+ requirements:
138
+ - - "~>"
139
+ - !ruby/object:Gem::Version
140
+ version: 5.3.0
141
+ - - ">="
142
+ - !ruby/object:Gem::Version
143
+ version: 5.3.0
144
+ type: :runtime
145
+ prerelease: false
146
+ version_requirements: !ruby/object:Gem::Requirement
147
+ requirements:
148
+ - - "~>"
149
+ - !ruby/object:Gem::Version
150
+ version: 5.3.0
151
+ - - ">="
152
+ - !ruby/object:Gem::Version
153
+ version: 5.3.0
154
+ - !ruby/object:Gem::Dependency
155
+ name: highscore
156
+ requirement: !ruby/object:Gem::Requirement
157
+ requirements:
158
+ - - "~>"
159
+ - !ruby/object:Gem::Version
160
+ version: 1.2.0
161
+ - - ">="
162
+ - !ruby/object:Gem::Version
163
+ version: 1.2.0
164
+ type: :runtime
165
+ prerelease: false
166
+ version_requirements: !ruby/object:Gem::Requirement
167
+ requirements:
168
+ - - "~>"
169
+ - !ruby/object:Gem::Version
170
+ version: 1.2.0
171
+ - - ">="
172
+ - !ruby/object:Gem::Version
173
+ version: 1.2.0
134
174
  - !ruby/object:Gem::Dependency
135
175
  name: bundler
136
176
  requirement: !ruby/object:Gem::Requirement
@@ -325,6 +365,7 @@ files:
325
365
  - bin/setup
326
366
  - circle.yml
327
367
  - config/article_scrape_patterns.yml
368
+ - config/stopwords.yml
328
369
  - config/temp_dirs.yml
329
370
  - dev.yml
330
371
  - lib/news_scraper.rb
@@ -340,6 +381,7 @@ files:
340
381
  - lib/news_scraper/trainer/preset_selector.rb
341
382
  - lib/news_scraper/trainer/url_trainer.rb
342
383
  - lib/news_scraper/transformers/article.rb
384
+ - lib/news_scraper/transformers/helpers/highscore_parser.rb
343
385
  - lib/news_scraper/transformers/nokogiri/functions.rb
344
386
  - lib/news_scraper/transformers/trainer_article.rb
345
387
  - lib/news_scraper/uri_parser.rb