news_scraper 1.0.0 → 1.1.0
- checksums.yaml +4 -4
- data/Gemfile +4 -0
- data/README.md +77 -8
- data/config/article_scrape_patterns.yml +9 -0
- data/config/stopwords.yml +459 -0
- data/lib/news_scraper/configuration.rb +19 -6
- data/lib/news_scraper/transformers/article.rb +9 -1
- data/lib/news_scraper/transformers/helpers/highscore_parser.rb +50 -0
- data/lib/news_scraper/version.rb +1 -1
- data/news_scraper.gemspec +4 -1
- metadata +48 -6
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 608b90149fbc8977b1fc3b42c923557b128ad4df
+  data.tar.gz: 0e0914d81488d9630234860b3a9e732a1471158a
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 1e344051f216c10b320b324db5dbaeccaa9f034a1bd2d94f1905ea8797ed6389d4780db125bcd354dd553bc0279f75aecc507f5de91d30eb6d00098474e69173
+  data.tar.gz: 1823fe68329a466385e16dd224d49633c122afec26d6c71cd66d9b428888e81e68a73dfc78133a995ad0a9885248ae03389f2d9b3b192b4d514970bc385990b6
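The checksums above let a consumer verify a downloaded gem part (`metadata.gz`, `data.tar.gz`) before trusting it. A minimal sketch using only Ruby's stdlib `Digest`; the file and digest here are illustrative stand-ins, not the gem's real artifacts:

```ruby
require 'digest'
require 'tempfile'

# Compare a file's SHA-512 hex digest against an expected value,
# as published in checksums.yaml (digests here are illustrative).
def checksum_ok?(path, expected_sha512)
  Digest::SHA512.file(path).hexdigest == expected_sha512
end

# Usage sketch with a temp file standing in for a downloaded gem part.
file = Tempfile.new('gem-part')
file.write('example payload')
file.close
expected = Digest::SHA512.hexdigest('example payload')
puts checksum_ok?(file.path, expected) # => true
```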
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -33,14 +33,23 @@ Optionally, you can pass in a block and it will yield the transformed data on a
 It takes in 1 parameter `query:`.
 
 Array notation
-```
+```ruby
 article_hashes = NewsScraper::Scraper.new(query: 'Shopify').scrape # [ { author: ... }, { author: ... } ... ]
 ```
 
+*Note:* the array notation may raise `NewsScraper::Transformers::ScrapePatternNotDefined` (the domain is not in the configuration) or `NewsScraper::ResponseError` (a non-200 response). For this reason, it is suggested to use the block notation, where these errors can be handled properly.
+
 Block notation
-```
-NewsScraper::Scraper.new(query: 'Shopify').scrape do |
-
+```ruby
+NewsScraper::Scraper.new(query: 'Shopify').scrape do |a|
+  case a.class.to_s
+  when "NewsScraper::Transformers::ScrapePatternNotDefined"
+    puts "#{a.root_domain} was not trained"
+  when "NewsScraper::ResponseError"
+    puts "#{a.url} returned an error: #{a.error_code}-#{a.message}"
+  else
+    # { author: ... }
+  end
 end
 ```
 
@@ -48,12 +57,12 @@ How the `Scraper` extracts and parses for the information is determined by scrap
 
 ### Transformed Data
 
-Calling `NewsScraper::Scraper#scrape` with either the array or block notation will yield `transformed_data` hashes. [`article_scrape_patterns.yml`](https://github.com/
+Calling `NewsScraper::Scraper#scrape` with either the array or block notation will yield `transformed_data` hashes. [`article_scrape_patterns.yml`](https://github.com/news-scraper/news_scraper/blob/master/config/article_scrape_patterns.yml) defines the data types that will be scraped for.
 
 In addition, the `url` and `root_domain` (hostname) of the article will be returned in the hash too.
 
 Example
-```
+```ruby
 {
   author: 'Linus Torvalds',
   body: 'The Linux kernel developed by Linus Torvalds has become the backbone of most electronic devices we use to-date. It powers mobile phones, laptops, embedded devices, and even rockets...',
@@ -71,12 +80,34 @@ Example
 
 Scrape patterns are xpath or CSS patterns used by Nokogiri to extract relevant HTML elements.
 
-Extracting each `:data_type` (see Example under **Transformed Data**) requires a scrape pattern. A few `:presets` are specified in [`article_scrape_patterns.yml`](https://github.com/
+Extracting each `:data_type` (see Example under **Transformed Data**) requires a scrape pattern. A few `:presets` are specified in [`article_scrape_patterns.yml`](https://github.com/news-scraper/news_scraper/blob/master/config/article_scrape_patterns.yml).
 
 Since each news site (identified with `:root_domain`) uses a different markup, scrape patterns are defined on a per-`:root_domain` basis.
 
 Specifying scrape patterns for new, undefined `:root_domain`s is called training (see **Training**).
 
+#### Customizing Scrape Patterns
+
+`NewsScraper.configuration` is the entry point for scrape patterns. By default, it loads the contents of [`article_scrape_patterns.yml`](https://github.com/news-scraper/news_scraper/blob/master/config/article_scrape_patterns.yml), but you can override this with the `scrape_patterns_fetch_method`, which accepts a proc.
+
+For example, to override the domains section:
+
+```ruby
+@default_configuration = NewsScraper.configuration.scrape_patterns.dup
+NewsScraper.configure do |config|
+  config.scrape_patterns_fetch_method = proc do
+    @default_configuration['domains'] = { ... }
+    @default_configuration
+  end
+end
+```
+
+Using this method you can override any part of the configuration individually, or the entire thing; it is fully customizable.
+
+This helps separate applications that track domain training themselves. If the configuration is not set correctly, a newly trained domain will not be in the configuration and a `NewsScraper::Transformers::ScrapePatternNotDefined` error will be raised.
+
+It would be appreciated if any domains you train outside of this gem eventually end up as a pull request back to [`article_scrape_patterns.yml`](https://github.com/news-scraper/news_scraper/blob/master/config/article_scrape_patterns.yml).
+
 ### Training
 
 For each `:root_domain`, it is necessary to specify a scrape pattern for each of the `:data_type`s. A rake task provides a CLI for appending new `:root_domain`s using `:preset` scrape patterns.
@@ -88,6 +119,44 @@ bundle exec rake scraper:train QUERY=<query>
 
 where the CLI will step through the articles and `:root_domain`s of the articles relevant to `<query>`.
 
+This simply creates an entry for a `domain` with `domain_entries`, so as long as your application provides the same functionality, this can be overridden in your app. Just provide a domain entry like so:
+
+```yaml
+domains:
+  root_domain.com:
+    author:
+      method: method
+      pattern: pattern
+    body:
+      method: method
+      pattern: pattern
+    description:
+      method: method
+      pattern: pattern
+    keywords:
+      method: method
+      pattern: pattern
+    section:
+      method: method
+      pattern: pattern
+    datetime:
+      method: method
+      pattern: pattern
+    title:
+      method: method
+      pattern: pattern
+```
+
+The options using the presets in [`article_scrape_patterns.yml`](https://github.com/news-scraper/news_scraper/blob/master/config/article_scrape_patterns.yml) can be obtained using this snippet:
+```ruby
+include NewsScraper::ExtractorsHelpers
+
+transformed_data = NewsScraper::Transformers::TrainerArticle.new(
+  url: url,
+  payload: http_request(url).body
+).transform
+```
+
 ## Development
 
 After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
@@ -96,7 +165,7 @@ To install this gem onto your local machine, run `bundle exec rake install`. To
 
 ## Contributing
 
-Bug reports and pull requests are welcome on GitHub at https://github.com/
+Bug reports and pull requests are welcome on GitHub at https://github.com/news-scraper/news_scraper. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
 
 
 ## License
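The block notation above dispatches on the class of the yielded object, since the scraper yields error objects as well as article hashes. That dispatch pattern can be sketched standalone; the `Struct` classes below are stand-ins for illustration, not the gem's real error types:

```ruby
# Stand-ins for the gem's error objects (illustrative only).
ScrapePatternNotDefined = Struct.new(:root_domain)
ResponseError = Struct.new(:url, :error_code, :message)

def handle(result)
  case result
  when ScrapePatternNotDefined
    "#{result.root_domain} was not trained"
  when ResponseError
    "#{result.url} returned an error: #{result.error_code}-#{result.message}"
  else
    result # the transformed_data hash
  end
end

puts handle(ScrapePatternNotDefined.new('example.com'))
# => example.com was not trained
```

Matching on the class directly (`when ScrapePatternNotDefined`) rather than on `a.class.to_s` strings, as the README does, is the more idiomatic Ruby form.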
data/config/article_scrape_patterns.yml
CHANGED
@@ -45,6 +45,9 @@ presets:
     og: &og_description
       method: "xpath"
       pattern: "//meta[@property='og:description']/@content"
+    metainspector: &metainspector_description
+      method: 'metainspector'
+      pattern: :description
   keywords:
     meta: &meta_keywords
       method: "xpath"
@@ -55,6 +58,9 @@ presets:
     news_keywords: &news_keywords_keywords
       method: "xpath"
       pattern: "//meta[@name='news_keywords']/@content"
+    highscore: &highscore_keywords
+      method: highscore
+      pattern: ""
   section:
     meta: &meta_section
       method: "xpath"
@@ -109,6 +115,9 @@ presets:
     og: &og_title
       method: "xpath"
       pattern: "//meta[@property='og:title']/@content"
+    metainspector: &metainspector_title
+      method: 'metainspector'
+      pattern: :best_title
 
 domains:
   investors.com:
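Each preset above pairs a `method` (e.g. `xpath`) with a `pattern`. The gem applies xpath patterns with Nokogiri; purely for illustration, the `og:title` pattern can be exercised with Ruby's stdlib REXML against a made-up HTML snippet:

```ruby
require 'rexml/document'

html = <<~HTML
<html><head>
  <meta property="og:title" content="Example Headline"/>
  <meta property="og:description" content="An example description."/>
</head><body/></html>
HTML

doc = REXML::Document.new(html)
# The og:title preset selects //meta[@property='og:title']/@content;
# here we select the element and read its content attribute.
meta = REXML::XPath.first(doc, "//meta[@property='og:title']")
puts meta.attributes['content'] # => Example Headline
```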
data/config/stopwords.yml
ADDED
@@ -0,0 +1,459 @@
+---
+- "-"
+- "--"
+- ":"
+- ":d"
+- 'no'
+- 'off'
+- 'on'
+- about
+- above
+- across
+- after
+- again
+- against
+- ahahaha
+- all
+- almost
+- alone
+- along
+- already
+- also
+- although
+- always
+- am
+- among
+- an
+- and
+- another
+- any
+- anybody
+- anyone
+- anything
+- anywhere
+- ar
+- are
+- area
+- areas
+- around
+- as
+- ask
+- asked
+- asking
+- asks
+- at
+- aw
+- away
+- aww
+- awww
+- back
+- backed
+- backing
+- backs
+- be
+- became
+- because
+- become
+- becomes
+- been
+- before
+- began
+- behind
+- being
+- beings
+- best
+- better
+- between
+- big
+- bit
+- blud
+- both
+- bt
+- but
+- by
+- call
+- came
+- can
+- cannot
+- case
+- cases
+- certain
+- certainly
+- chat
+- clear
+- clearly
+- come
+- comments
+- could
+- d
+- did
+- differ
+- different
+- differently
+- do
+- does
+- done
+- dont
+- down
+- downed
+- downing
+- downs
+- dunno
+- during
+- each
+- early
+- eh
+- either
+- email
+- end
+- ended
+- ending
+- ends
+- enough
+- even
+- evenly
+- ever
+- every
+- everybody
+- everyone
+- everything
+- everywhere
+- face
+- faces
+- fact
+- facts
+- far
+- felt
+- few
+- find
+- finds
+- first
+- for
+- four
+- from
+- full
+- fully
+- further
+- furthered
+- furthering
+- furthers
+- gave
+- general
+- generally
+- get
+- gets
+- give
+- given
+- gives
+- go
+- going
+- good
+- goods
+- got
+- great
+- greater
+- greatest
+- group
+- grouped
+- grouping
+- groups
+- ha
+- haaaa
+- had
+- haha
+- has
+- have
+- having
+- he
+- heh
+- hehe
+- hehehe
+- her
+- here
+- herself
+- high
+- higher
+- highest
+- him
+- himself
+- his
+- hola
+- how
+- however
+- i
+- id
+- if
+- il
+- im
+- important
+- in
+- init
+- interest
+- interested
+- interesting
+- interests
+- into
+- is
+- it
+- its
+- itself
+- iv
+- ive
+- jst
+- just
+- keep
+- keeps
+- kind
+- knew
+- know
+- known
+- knows
+- large
+- largely
+- last
+- later
+- latest
+- least
+- less
+- let
+- lets
+- like
+- likely
+- lol
+- long
+- longer
+- longest
+- lool
+- loool
+- looool
+- made
+- mah
+- make
+- making
+- man
+- many
+- may
+- me
+- member
+- members
+- men
+- might
+- more
+- most
+- mostly
+- mr
+- mrs
+- much
+- must
+- my
+- myself
+- necessary
+- need
+- needed
+- needing
+- needs
+- never
+- new
+- newer
+- newest
+- next
+- nobody
+- non
+- noone
+- not
+- nothing
+- now
+- nowhere
+- number
+- numbers
+- of
+- often
+- oh
+- old
+- older
+- oldest
+- once
+- one
+- only
+- ooh
+- ooo
+- open
+- opened
+- opening
+- opens
+- or
+- order
+- ordered
+- ordering
+- orders
+- other
+- others
+- our
+- out
+- over
+- part
+- parted
+- parting
+- parts
+- per
+- perhaps
+- place
+- places
+- pls
+- point
+- pointed
+- pointing
+- points
+- possible
+- powered
+- present
+- presented
+- presenting
+- presents
+- problem
+- problems
+- put
+- puts
+- quite
+- rather
+- really
+- right
+- room
+- rooms
+- run
+- safe
+- said
+- same
+- saw
+- say
+- says
+- second
+- seconds
+- see
+- seem
+- seemed
+- seeming
+- seems
+- sees
+- several
+- shall
+- she
+- should
+- show
+- showed
+- showing
+- shows
+- side
+- sides
+- since
+- small
+- smaller
+- smallest
+- so
+- some
+- somebody
+- someone
+- something
+- somewhere
+- state
+- states
+- still
+- stop
+- such
+- sure
+- ta
+- tail
+- take
+- taken
+- team
+- than
+- thank
+- thanks
+- that
+- the
+- their
+- them
+- then
+- there
+- therefore
+- theres
+- these
+- they
+- thing
+- things
+- think
+- thinks
+- this
+- those
+- though
+- thought
+- thoughts
+- three
+- through
+- thus
+- to
+- today
+- together
+- too
+- took
+- toward
+- tryna
+- turn
+- turned
+- turning
+- turns
+- two
+- under
+- until
+- up
+- upon
+- ur
+- us
+- use
+- used
+- uses
+- very
+- want
+- wanted
+- wanting
+- wants
+- was
+- way
+- ways
+- we
+- welcome
+- well
+- wells
+- went
+- were
+- what
+- when
+- where
+- whether
+- which
+- while
+- who
+- whole
+- whose
+- why
+- will
+- with
+- within
+- without
+- work
+- worked
+- working
+- works
+- would
+- ya
+- yeah
+- year
+- years
+- yet
+- yo
+- you
+- young
+- younger
+- youngest
+- your
+- yours
data/lib/news_scraper/configuration.rb
CHANGED
@@ -1,26 +1,39 @@
 module NewsScraper
   class Configuration
     DEFAULT_SCRAPE_PATTERNS_FILEPATH = File.expand_path('../../../config/article_scrape_patterns.yml', __FILE__)
-
+    STOPWORDS_FILEPATH = File.expand_path('../../../config/stopwords.yml', __FILE__)
+    attr_accessor :scrape_patterns_fetch_method, :stopwords_fetch_method, :scrape_patterns_filepath
 
     # <code>NewsScraper::Configuration.initialize</code> initializes the scrape_patterns_filepath
-    # and the
+    # and the scrape_patterns_fetch_method to the <code>DEFAULT_SCRAPE_PATTERNS_FILEPATH</code>.
+    # It also sets stopwords to be used during extraction to a default set contained in <code>STOPWORDS_FILEPATH</code>
     #
     # Set the <code>scrape_patterns_filepath</code> to <code>nil</code> to disable saving during training
     #
     def initialize
       self.scrape_patterns_filepath = DEFAULT_SCRAPE_PATTERNS_FILEPATH
-      self.
+      self.scrape_patterns_fetch_method = proc { default_scrape_patterns }
+      self.stopwords_fetch_method = proc { YAML.load_file(STOPWORDS_FILEPATH) }
     end
 
     # <code>NewsScraper::Configuration.scrape_patterns</code> proxies scrape_patterns
-    # requests to <code>
+    # requests to <code>scrape_patterns_fetch_method</code>:
     #
     # *Returns*
-    # - The result of calling the <code>
+    # - The result of calling the <code>scrape_patterns_fetch_method</code> proc, expected to be a hash
     #
     def scrape_patterns
-
+      scrape_patterns_fetch_method.call
+    end
+
+    # <code>NewsScraper::Configuration.stopwords</code> proxies stopwords
+    # requests to <code>stopwords_fetch_method</code>:
+    #
+    # *Returns*
+    # - The result of calling the <code>stopwords_fetch_method</code> proc, expected to be an array
+    #
+    def stopwords
+      stopwords_fetch_method.call
     end
 
     private
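The configuration above routes every lookup through a stored proc, so an application can swap the data source (a file, a database, an API) without subclassing. A stripped-down standalone sketch of that fetch-method indirection; `Config` here is a stand-in, not the gem's actual class:

```ruby
# Minimal stand-in for NewsScraper::Configuration's proc indirection.
class Config
  attr_accessor :stopwords_fetch_method

  def initialize
    # Default source; the real class loads config/stopwords.yml here.
    self.stopwords_fetch_method = proc { %w(a an the) }
  end

  def stopwords
    stopwords_fetch_method.call
  end
end

config = Config.new
puts config.stopwords.inspect          # the default set
config.stopwords_fetch_method = proc { %w(foo bar) }
puts config.stopwords.inspect          # the overridden set
```

Because the proc is called on every access, an override takes effect immediately and can even return different data over time.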
data/lib/news_scraper/transformers/article.rb
CHANGED
@@ -1,8 +1,11 @@
 require 'nokogiri'
 require 'sanitize'
+require 'news_scraper/transformers/nokogiri/functions'
+
 require 'readability'
 require 'htmlbeautifier'
-require '
+require 'metainspector'
+require 'news_scraper/transformers/helpers/highscore_parser'
 
 module NewsScraper
   module Transformers
@@ -69,6 +72,11 @@ module NewsScraper
       # Remove any newlines in the text
       content = content.squeeze("\n").strip
       HtmlBeautifier.beautify(content)
+    when :metainspector
+      page = MetaInspector.new(@url, document: @payload)
+      page.respond_to?(scrape_pattern.to_sym) ? page.send(scrape_pattern.to_sym) : nil
+    when :highscore
+      NewsScraper::Transformers::Helpers::HighScoreParser.keywords(url: @url, payload: @payload)
     end
   end
 end
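The `:metainspector` branch above calls an arbitrary method named by the scrape pattern, guarded with `respond_to?` so an unknown pattern yields `nil` instead of raising `NoMethodError`. The guard in isolation, with a stand-in `Page` struct instead of a MetaInspector page:

```ruby
# Stand-in object with a couple of readable attributes.
Page = Struct.new(:title, :description)

# Call the method named by pattern if the object supports it, else nil.
def extract(page, pattern)
  page.respond_to?(pattern.to_sym) ? page.send(pattern.to_sym) : nil
end

page = Page.new('A Headline', 'A description')
puts extract(page, :title).inspect        # => "A Headline"
puts extract(page, :nonexistent).inspect  # => nil
```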
data/lib/news_scraper/transformers/helpers/highscore_parser.rb
ADDED
@@ -0,0 +1,50 @@
+require 'metainspector'
+require 'highscore'
+require 'readability'
+
+module NewsScraper
+  module Transformers
+    module Helpers
+      class HighScoreParser
+        class << self
+          # <code>NewsScraper::Transformers::Helpers::HighScoreParser.keywords</code> parses out keywords
+          #
+          # *Params*
+          # - <code>url:</code>: keyword for the url to parse to a uri
+          # - <code>payload:</code>: keyword for the payload from a request to the url (html body)
+          #
+          # *Returns*
+          # - <code>keywords</code>: Top 5 keywords from the body of text
+          #
+          def keywords(url:, payload:)
+            blacklist = Highscore::Blacklist.load(stopwords(url, payload))
+            content = Readability::Document.new(payload, remove_empty_nodes: true, tags: [], attributes: []).content
+            highscore(content, blacklist)
+          end
+
+          private
+
+          def highscore(content, blacklist)
+            text = Highscore::Content.new(content, blacklist)
+            text.configure do
+              set :multiplier, 2
+              set :upper_case, 3
+              set :long_words, 2
+              set :long_words_threshold, 15
+              set :ignore_case, true
+            end
+            text.keywords.top(5).collect(&:text)
+          end
+
+          def stopwords(url, payload)
+            page = MetaInspector.new(url, document: payload)
+            stopwords = NewsScraper.configuration.stopwords
+            # Add the site name to the stop words
+            stopwords += page.meta['og:site_name'].downcase.split(' ') if page.meta['og:site_name']
+            stopwords
+          end
+        end
+      end
+    end
+  end
+end
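`HighScoreParser` delegates the actual scoring to the `highscore` gem. The core idea, counting word frequencies over non-stopwords and taking the top 5, can be sketched standalone in plain Ruby; this is a simplified illustration, not the gem's weighting scheme:

```ruby
# Simplified stand-in for Highscore's keyword ranking: count word
# frequencies, drop blacklisted stopwords, return the top 5 words.
def top_keywords(text, stopwords, limit: 5)
  words = text.downcase.scan(/[a-z']+/)
  counts = Hash.new(0)
  words.each { |w| counts[w] += 1 unless stopwords.include?(w) }
  # Sort by descending count, then alphabetically for a stable order.
  counts.sort_by { |word, count| [-count, word] }.first(limit).map(&:first)
end

stopwords = %w(the a of and to)
text = 'The kernel powers phones and the kernel powers rockets'
puts top_keywords(text, stopwords).inspect
# => ["kernel", "powers", "phones", "rockets"]
```

The real Highscore configuration above additionally boosts upper-case and long words via the `:multiplier`, `:upper_case`, and `:long_words` settings.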
data/lib/news_scraper/version.rb
CHANGED
data/news_scraper.gemspec
CHANGED
@@ -1,3 +1,4 @@
+# rubocop:disable BlockLength
 # coding: utf-8
 lib = File.expand_path('../lib', __FILE__)
 $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
@@ -29,7 +30,9 @@ Gem::Specification.new do |spec|
   spec.add_dependency 'sanitize', '~> 4.2', '>= 4.2.0'
   spec.add_dependency 'ruby-readability', '~> 0.7', '>= 0.7.0'
   spec.add_dependency 'htmlbeautifier', '~> 1.1', '>= 1.1.1'
-  spec.add_dependency 'terminal-table', '~> 1.
+  spec.add_dependency 'terminal-table', '~> 1.7.0', '>= 1.7.0'
+  spec.add_dependency 'metainspector', '~> 5.3.0', '>= 5.3.0'
+  spec.add_dependency 'highscore', '~> 1.2.0', '>= 1.2.0'
 
   spec.add_development_dependency 'bundler', '~> 1.12', '>= 1.12.0'
   spec.add_development_dependency 'rake', '~> 10.0', '>= 10.0.0'
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: news_scraper
 version: !ruby/object:Gem::Version
-  version: 1.
+  version: 1.1.0
 platform: ruby
 authors:
 - Richard Wu
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2016-
+date: 2016-10-16 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -117,20 +117,60 @@ dependencies:
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version:
+        version: 1.7.0
     - - ">="
       - !ruby/object:Gem::Version
-        version: 1.
+        version: 1.7.0
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version:
+        version: 1.7.0
     - - ">="
       - !ruby/object:Gem::Version
-        version: 1.
+        version: 1.7.0
+- !ruby/object:Gem::Dependency
+  name: metainspector
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 5.3.0
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: 5.3.0
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 5.3.0
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: 5.3.0
+- !ruby/object:Gem::Dependency
+  name: highscore
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 1.2.0
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: 1.2.0
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 1.2.0
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: 1.2.0
 - !ruby/object:Gem::Dependency
   name: bundler
   requirement: !ruby/object:Gem::Requirement
@@ -325,6 +365,7 @@ files:
 - bin/setup
 - circle.yml
 - config/article_scrape_patterns.yml
+- config/stopwords.yml
 - config/temp_dirs.yml
 - dev.yml
 - lib/news_scraper.rb
@@ -340,6 +381,7 @@ files:
 - lib/news_scraper/trainer/preset_selector.rb
 - lib/news_scraper/trainer/url_trainer.rb
 - lib/news_scraper/transformers/article.rb
+- lib/news_scraper/transformers/helpers/highscore_parser.rb
 - lib/news_scraper/transformers/nokogiri/functions.rb
 - lib/news_scraper/transformers/trainer_article.rb
 - lib/news_scraper/uri_parser.rb