news_scraper 1.0.0 → 1.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/Gemfile +4 -0
- data/README.md +77 -8
- data/config/article_scrape_patterns.yml +9 -0
- data/config/stopwords.yml +459 -0
- data/lib/news_scraper/configuration.rb +19 -6
- data/lib/news_scraper/transformers/article.rb +9 -1
- data/lib/news_scraper/transformers/helpers/highscore_parser.rb +50 -0
- data/lib/news_scraper/version.rb +1 -1
- data/news_scraper.gemspec +4 -1
- metadata +48 -6
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 608b90149fbc8977b1fc3b42c923557b128ad4df
+  data.tar.gz: 0e0914d81488d9630234860b3a9e732a1471158a
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 1e344051f216c10b320b324db5dbaeccaa9f034a1bd2d94f1905ea8797ed6389d4780db125bcd354dd553bc0279f75aecc507f5de91d30eb6d00098474e69173
+  data.tar.gz: 1823fe68329a466385e16dd224d49633c122afec26d6c71cd66d9b428888e81e68a73dfc78133a995ad0a9885248ae03389f2d9b3b192b4d514970bc385990b6
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -33,14 +33,23 @@ Optionally, you can pass in a block and it will yield the transformed data on a
 It takes in 1 parameter `query:`.

 Array notation
-```
+```ruby
 article_hashes = NewsScraper::Scraper.new(query: 'Shopify').scrape # [ { author: ... }, { author: ... } ... ]
 ```

+*Note:* the array notation may raise `NewsScraper::Transformers::ScrapePatternNotDefined` (the domain is not in the configuration) or `NewsScraper::ResponseError` (a non-200 response). For this reason, it is suggested to use the block notation, where these can be handled properly.
+
 Block notation
-```
-NewsScraper::Scraper.new(query: 'Shopify').scrape do |
-
+```ruby
+NewsScraper::Scraper.new(query: 'Shopify').scrape do |a|
+  case a.class.to_s
+  when "NewsScraper::Transformers::ScrapePatternNotDefined"
+    puts "#{a.root_domain} was not trained"
+  when "NewsScraper::ResponseError"
+    puts "#{a.url} returned an error: #{a.error_code}-#{a.message}"
+  else
+    # { author: ... }
+  end
 end
 ```

@@ -48,12 +57,12 @@ How the `Scraper` extracts and parses for the information is determined by scrap

 ### Transformed Data

-Calling `NewsScraper::Scraper#scrape` with either the array or block notation will yield `transformed_data` hashes. [`article_scrape_patterns.yml`](https://github.com/
+Calling `NewsScraper::Scraper#scrape` with either the array or block notation will yield `transformed_data` hashes. [`article_scrape_patterns.yml`](https://github.com/news-scraper/news_scraper/blob/master/config/article_scrape_patterns.yml) defines the data types that will be scraped for.

 In addition, the `url` and `root_domain`(hostname) of the article will be returned in the hash too.

 Example
-```
+```ruby
 {
   author: 'Linus Torvald',
   body: 'The Linux kernel developed by Linus Torvald has become the backbone of most electronic devices we use to-date. It powers mobile phones, laptops, embedded devices, and even rockets...',
@@ -71,12 +80,34 @@ Example

 Scrape patterns are xpath or CSS patterns used by Nokogiri to extract relevant HTML elements.

-Extracting each `:data_type` (see Example under **Transformed Data**) requires a scrape pattern. A few `:presets` are specified in [`article_scrape_patterns.yml`](https://github.com/
+Extracting each `:data_type` (see Example under **Transformed Data**) requires a scrape pattern. A few `:presets` are specified in [`article_scrape_patterns.yml`](https://github.com/news-scraper/news_scraper/blob/master/config/article_scrape_patterns.yml).

 Since each news site (identified with `:root_domain`) uses a different markup, scrape patterns are defined on a per-`:root_domain` basis.

 Specifying scrape patterns for new, undefined `:root_domains` is called training (see **Training**).

+#### Customizing Scrape Patterns
+
+`NewsScraper.configuration` is the entry point for scrape patterns. By default, it loads the contents of [`article_scrape_patterns.yml`](https://github.com/news-scraper/news_scraper/blob/master/config/article_scrape_patterns.yml), but you can override this with `scrape_patterns_fetch_method`, which accepts a proc.
+
+For example, to override the domains section:
+
+```ruby
+@default_configuration = NewsScraper.configuration.scrape_patterns.dup
+NewsScraper.configure do |config|
+  config.scrape_patterns_fetch_method = proc do
+    @default_configuration['domains'] = { ... }
+    @default_configuration
+  end
+end
+```
+
+Of course, using this method you can override any part of the configuration individually, or the entire thing. It is fully customizable.
+
+This helps separate apps that track domain training themselves. If the configuration is not set correctly, a newly trained domain will not be in the configuration and a `NewsScraper::Transformers::ScrapePatternNotDefined` error will be raised.
+
+It would be appreciated if any domains you train outside of this gem eventually end up as a pull request back to [`article_scrape_patterns.yml`](https://github.com/news-scraper/news_scraper/blob/master/config/article_scrape_patterns.yml).
+
 ### Training

 For each `:root_domain`, it is necessary to specify a scrape pattern for each of the `:data_type`s. A rake task was written to provide a CLI for appending new `:root_domain`s using `:preset` scrape patterns.
@@ -88,6 +119,44 @@ bundle exec rake scraper:train QUERY=<query>

 where the CLI will step through the articles and `:root_domain`s of the articles relevant to `<query>`.

+Of course, this will simply create an entry for a `domain` with `domain_entries`, so as long as your application provides the same functionality, this can be overridden in your app. Just provide a domain entry like so:
+
+```yaml
+domains:
+  root_domain.com:
+    author:
+      method: method
+      pattern: pattern
+    body:
+      method: method
+      pattern: pattern
+    description:
+      method: method
+      pattern: pattern
+    keywords:
+      method: method
+      pattern: pattern
+    section:
+      method: method
+      pattern: pattern
+    datetime:
+      method: method
+      pattern: pattern
+    title:
+      method: method
+      pattern: pattern
+```
+
+The options using the presets in [`article_scrape_patterns.yml`](https://github.com/news-scraper/news_scraper/blob/master/config/article_scrape_patterns.yml) can be obtained using this snippet:
+```ruby
+include NewsScraper::ExtractorsHelpers
+
+transformed_data = NewsScraper::Transformers::TrainerArticle.new(
+  url: url,
+  payload: http_request(url).body
+).transform
+```
+
 ## Development

 After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
@@ -96,7 +165,7 @@ To install this gem onto your local machine, run `bundle exec rake install`. To

 ## Contributing

-Bug reports and pull requests are welcome on GitHub at https://github.com/
+Bug reports and pull requests are welcome on GitHub at https://github.com/news-scraper/news_scraper. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.


 ## License
data/config/article_scrape_patterns.yml
CHANGED
@@ -45,6 +45,9 @@ presets:
     og: &og_description
       method: "xpath"
       pattern: "//meta[@property='og:description']/@content"
+    metainspector: &metainspector_description
+      method: 'metainspector'
+      pattern: :description
   keywords:
     meta: &meta_keywords
       method: "xpath"
@@ -55,6 +58,9 @@ presets:
     news_keywords: &news_keywords_keywords
       method: "xpath"
       pattern: "//meta[@name='news_keywords']/@content"
+    highscore: &highscore_keywords
+      method: highscore
+      pattern: ""
   section:
     meta: &meta_section
       method: "xpath"
@@ -109,6 +115,9 @@ presets:
     og: &og_title
       method: "xpath"
       pattern: "//meta[@property='og:title']/@content"
+    metainspector: &metainspector_title
+      method: 'metainspector'
+      pattern: :best_title

 domains:
   investors.com:
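The new `metainspector` presets above name MetaInspector accessors (`:description`, `:best_title`) instead of xpath patterns, and (as the `article.rb` diff further down shows) the pattern is dispatched with `respond_to?`/`send`. A self-contained sketch of that dispatch, using a hypothetical stand-in page object rather than a real MetaInspector page:

```ruby
# Stand-in for a MetaInspector page: for the `metainspector` method,
# the scrape pattern is simply the name of a method on the page object.
PageStub = Struct.new(:description, :best_title)

# Mirrors the dispatch in Transformers::Article: call the accessor if
# the page responds to it, otherwise return nil.
def apply_pattern(page, scrape_pattern)
  page.respond_to?(scrape_pattern.to_sym) ? page.send(scrape_pattern.to_sym) : nil
end

page = PageStub.new('A short description', 'The Best Title')
apply_pattern(page, :best_title)  # => "The Best Title"
apply_pattern(page, :missing)     # => nil
```

Because unknown accessors fall through to `nil`, a bad `pattern` value degrades to a missing data type rather than an exception.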
data/config/stopwords.yml
ADDED
@@ -0,0 +1,459 @@
+---
+- "-"
+- "--"
+- ":"
+- ":d"
+- 'no'
+- 'off'
+- 'on'
+- about
+- above
+- across
+- after
+- again
+- against
+- ahahaha
+- all
+- almost
+- alone
+- along
+- already
+- also
+- although
+- always
+- am
+- among
+- an
+- and
+- another
+- any
+- anybody
+- anyone
+- anything
+- anywhere
+- ar
+- are
+- area
+- areas
+- around
+- as
+- ask
+- asked
+- asking
+- asks
+- at
+- aw
+- away
+- aww
+- awww
+- back
+- backed
+- backing
+- backs
+- be
+- became
+- because
+- become
+- becomes
+- been
+- before
+- began
+- behind
+- being
+- beings
+- best
+- better
+- between
+- big
+- bit
+- blud
+- both
+- bt
+- but
+- by
+- call
+- came
+- can
+- cannot
+- case
+- cases
+- certain
+- certainly
+- chat
+- clear
+- clearly
+- come
+- comments
+- could
+- d
+- did
+- differ
+- different
+- differently
+- do
+- does
+- done
+- dont
+- down
+- downed
+- downing
+- downs
+- dunno
+- during
+- each
+- early
+- eh
+- either
+- email
+- end
+- ended
+- ending
+- ends
+- enough
+- even
+- evenly
+- ever
+- every
+- everybody
+- everyone
+- everything
+- everywhere
+- face
+- faces
+- fact
+- facts
+- far
+- felt
+- few
+- find
+- finds
+- first
+- for
+- four
+- from
+- full
+- fully
+- further
+- furthered
+- furthering
+- furthers
+- gave
+- general
+- generally
+- get
+- gets
+- give
+- given
+- gives
+- go
+- going
+- good
+- goods
+- got
+- great
+- greater
+- greatest
+- group
+- grouped
+- grouping
+- groups
+- ha
+- haaaa
+- had
+- haha
+- has
+- have
+- having
+- he
+- heh
+- hehe
+- hehehe
+- her
+- here
+- herself
+- high
+- higher
+- highest
+- him
+- himself
+- his
+- hola
+- how
+- however
+- i
+- id
+- if
+- il
+- im
+- important
+- in
+- init
+- interest
+- interested
+- interesting
+- interests
+- into
+- is
+- it
+- its
+- itself
+- iv
+- ive
+- jst
+- just
+- keep
+- keeps
+- kind
+- knew
+- know
+- known
+- knows
+- large
+- largely
+- last
+- later
+- latest
+- least
+- less
+- let
+- lets
+- like
+- likely
+- lol
+- long
+- longer
+- longest
+- lool
+- loool
+- looool
+- made
+- mah
+- make
+- making
+- man
+- many
+- may
+- me
+- member
+- members
+- men
+- might
+- more
+- most
+- mostly
+- mr
+- mrs
+- much
+- must
+- my
+- myself
+- necessary
+- need
+- needed
+- needing
+- needs
+- never
+- new
+- newer
+- newest
+- next
+- nobody
+- non
+- noone
+- not
+- nothing
+- now
+- nowhere
+- number
+- numbers
+- of
+- often
+- oh
+- old
+- older
+- oldest
+- once
+- one
+- only
+- ooh
+- ooo
+- open
+- opened
+- opening
+- opens
+- or
+- order
+- ordered
+- ordering
+- orders
+- other
+- others
+- our
+- out
+- over
+- part
+- parted
+- parting
+- parts
+- per
+- perhaps
+- place
+- places
+- pls
+- point
+- pointed
+- pointing
+- points
+- possible
+- powered
+- present
+- presented
+- presenting
+- presents
+- problem
+- problems
+- put
+- puts
+- quite
+- rather
+- really
+- right
+- room
+- rooms
+- run
+- safe
+- said
+- same
+- saw
+- say
+- says
+- second
+- seconds
+- see
+- seem
+- seemed
+- seeming
+- seems
+- sees
+- several
+- shall
+- she
+- should
+- show
+- showed
+- showing
+- shows
+- side
+- sides
+- since
+- small
+- smaller
+- smallest
+- so
+- some
+- somebody
+- someone
+- something
+- somewhere
+- state
+- states
+- still
+- stop
+- such
+- sure
+- ta
+- tail
+- take
+- taken
+- team
+- than
+- thank
+- thanks
+- that
+- the
+- their
+- them
+- then
+- there
+- therefore
+- theres
+- these
+- they
+- thing
+- things
+- think
+- thinks
+- this
+- those
+- though
+- thought
+- thoughts
+- three
+- through
+- thus
+- to
+- today
+- together
+- too
+- took
+- toward
+- tryna
+- turn
+- turned
+- turning
+- turns
+- two
+- under
+- until
+- up
+- upon
+- ur
+- us
+- use
+- used
+- uses
+- very
+- want
+- wanted
+- wanting
+- wants
+- was
+- way
+- ways
+- we
+- welcome
+- well
+- wells
+- went
+- were
+- what
+- when
+- where
+- whether
+- which
+- while
+- who
+- whole
+- whose
+- why
+- will
+- with
+- within
+- without
+- work
+- worked
+- working
+- works
+- would
+- ya
+- yeah
+- year
+- years
+- yet
+- yo
+- you
+- young
+- younger
+- youngest
+- your
+- yours
data/lib/news_scraper/configuration.rb
CHANGED
@@ -1,26 +1,39 @@
 module NewsScraper
   class Configuration
     DEFAULT_SCRAPE_PATTERNS_FILEPATH = File.expand_path('../../../config/article_scrape_patterns.yml', __FILE__)
-
+    STOPWORDS_FILEPATH = File.expand_path('../../../config/stopwords.yml', __FILE__)
+    attr_accessor :scrape_patterns_fetch_method, :stopwords_fetch_method, :scrape_patterns_filepath

     # <code>NewsScraper::Configuration.initialize</code> initializes the scrape_patterns_filepath
-    # and the
+    # and the scrape_patterns_fetch_method to the <code>DEFAULT_SCRAPE_PATTERNS_FILEPATH</code>.
+    # It also sets stopwords to be used during extraction to a default set contained in <code>STOPWORDS_FILEPATH</code>
     #
     # Set the <code>scrape_patterns_filepath</code> to <code>nil</code> to disable saving during training
     #
     def initialize
       self.scrape_patterns_filepath = DEFAULT_SCRAPE_PATTERNS_FILEPATH
-      self.
+      self.scrape_patterns_fetch_method = proc { default_scrape_patterns }
+      self.stopwords_fetch_method = proc { YAML.load_file(STOPWORDS_FILEPATH) }
     end

     # <code>NewsScraper::Configuration.scrape_patterns</code> proxies scrape_patterns
-    # requests to <code>
+    # requests to <code>scrape_patterns_fetch_method</code>:
     #
     # *Returns*
-    # - The result of calling the <code>
+    # - The result of calling the <code>scrape_patterns_fetch_method</code> proc, expected to be a hash
     #
     def scrape_patterns
-
+      scrape_patterns_fetch_method.call
+    end
+
+    # <code>NewsScraper::Configuration.stopwords</code> proxies stopwords
+    # requests to <code>stopwords_fetch_method</code>:
+    #
+    # *Returns*
+    # - The result of calling the <code>stopwords_fetch_method</code> proc, expected to be an array
+    #
+    def stopwords
+      stopwords_fetch_method.call
     end

     private
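The fetch-method indirection added above is just a stored proc that is re-invoked on each access. A minimal, self-contained sketch of the same pattern (an illustration, not the gem's actual class):

```ruby
# Sketch of the proc-based configuration pattern: the accessor stores a
# proc, and the reader method calls it on every access, so a host app can
# swap the data source at any time.
class ConfigSketch
  attr_accessor :stopwords_fetch_method

  def initialize
    # Default source; the real Configuration loads config/stopwords.yml here.
    @stopwords_fetch_method = proc { ['the', 'and', 'of'] }
  end

  def stopwords
    stopwords_fetch_method.call # re-evaluated on every call
  end
end

config = ConfigSketch.new
config.stopwords_fetch_method = proc { ['acme', 'inc'] }
config.stopwords # => ["acme", "inc"]
```

Because the proc is evaluated lazily, the override takes effect even for objects that captured the configuration before the proc was replaced.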
data/lib/news_scraper/transformers/article.rb
CHANGED
@@ -1,8 +1,11 @@
 require 'nokogiri'
 require 'sanitize'
+require 'news_scraper/transformers/nokogiri/functions'
+
 require 'readability'
 require 'htmlbeautifier'
-require '
+require 'metainspector'
+require 'news_scraper/transformers/helpers/highscore_parser'

 module NewsScraper
   module Transformers
@@ -69,6 +72,11 @@ module NewsScraper
       # Remove any newlines in the text
       content = content.squeeze("\n").strip
       HtmlBeautifier.beautify(content)
+    when :metainspector
+      page = MetaInspector.new(@url, document: @payload)
+      page.respond_to?(scrape_pattern.to_sym) ? page.send(scrape_pattern.to_sym) : nil
+    when :highscore
+      NewsScraper::Transformers::Helpers::HighScoreParser.keywords(url: @url, payload: @payload)
     end
   end
 end
data/lib/news_scraper/transformers/helpers/highscore_parser.rb
ADDED
@@ -0,0 +1,50 @@
+require 'metainspector'
+require 'highscore'
+require 'readability'
+
+module NewsScraper
+  module Transformers
+    module Helpers
+      class HighScoreParser
+        class << self
+          # <code>NewsScraper::Transformers::Helpers::HighScoreParser.keywords</code> parses out keywords
+          #
+          # *Params*
+          # - <code>url:</code>: keyword for the url to parse to a uri
+          # - <code>payload:</code>: keyword for the payload from a request to the url (html body)
+          #
+          # *Returns*
+          # - <code>keywords</code>: Top 5 keywords from the body of text
+          #
+          def keywords(url:, payload:)
+            blacklist = Highscore::Blacklist.load(stopwords(url, payload))
+            content = Readability::Document.new(payload, remove_empty_nodes: true, tags: [], attributes: []).content
+            highscore(content, blacklist)
+          end
+
+          private
+
+          def highscore(content, blacklist)
+            text = Highscore::Content.new(content, blacklist)
+            text.configure do
+              set :multiplier, 2
+              set :upper_case, 3
+              set :long_words, 2
+              set :long_words_threshold, 15
+              set :ignore_case, true
+            end
+            text.keywords.top(5).collect(&:text)
+          end
+
+          def stopwords(url, payload)
+            page = MetaInspector.new(url, document: payload)
+            stopwords = NewsScraper.configuration.stopwords
+            # Add the site name to the stop words
+            stopwords += page.meta['og:site_name'].downcase.split(' ') if page.meta['og:site_name']
+            stopwords
+          end
+        end
+      end
+    end
+  end
+end
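The new parser above combines Readability extraction, a stopword blacklist, and Highscore's weighted frequency scoring. As a rough, self-contained illustration of the core idea (stopword filtering plus frequency ranking) without the Highscore, MetaInspector, or Readability gems:

```ruby
# Toy keyword extraction: drop stopwords, count word frequencies, take the
# top N. This only illustrates the idea; the gem delegates the real scoring
# (case weighting, word-length bonuses, etc.) to Highscore.
STOPWORDS = %w(the a of and to in is).freeze

def top_keywords(text, limit = 5)
  words = text.downcase.scan(/[a-z]+/) - STOPWORDS
  counts = words.tally
  # Sort by descending count, then alphabetically for stable ties.
  counts.sort_by { |word, count| [-count, word] }.first(limit).map(&:first)
end

top_keywords('the linux kernel powers linux phones and linux rockets')
# => ["linux", "kernel", "phones", "powers", "rockets"]
```

Highscore additionally weights capitalized and long words, which is why the gem configures `:upper_case` and `:long_words` multipliers rather than using raw counts like this sketch.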
data/lib/news_scraper/version.rb
CHANGED
data/news_scraper.gemspec
CHANGED
@@ -1,3 +1,4 @@
+# rubocop:disable BlockLength
 # coding: utf-8
 lib = File.expand_path('../lib', __FILE__)
 $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
@@ -29,7 +30,9 @@ Gem::Specification.new do |spec|
   spec.add_dependency 'sanitize', '~> 4.2', '>= 4.2.0'
   spec.add_dependency 'ruby-readability', '~> 0.7', '>= 0.7.0'
   spec.add_dependency 'htmlbeautifier', '~> 1.1', '>= 1.1.1'
-  spec.add_dependency 'terminal-table', '~> 1.
+  spec.add_dependency 'terminal-table', '~> 1.7.0', '>= 1.7.0'
+  spec.add_dependency 'metainspector', '~> 5.3.0', '>= 5.3.0'
+  spec.add_dependency 'highscore', '~> 1.2.0', '>= 1.2.0'

   spec.add_development_dependency 'bundler', '~> 1.12', '>= 1.12.0'
   spec.add_development_dependency 'rake', '~> 10.0', '>= 10.0.0'
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: news_scraper
 version: !ruby/object:Gem::Version
-  version: 1.
+  version: 1.1.0
 platform: ruby
 authors:
 - Richard Wu
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2016-
+date: 2016-10-16 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -117,20 +117,60 @@ dependencies:
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version:
+        version: 1.7.0
     - - ">="
       - !ruby/object:Gem::Version
-        version: 1.
+        version: 1.7.0
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
    - - "~>"
       - !ruby/object:Gem::Version
-        version:
+        version: 1.7.0
     - - ">="
       - !ruby/object:Gem::Version
-        version: 1.
+        version: 1.7.0
+- !ruby/object:Gem::Dependency
+  name: metainspector
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 5.3.0
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: 5.3.0
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 5.3.0
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: 5.3.0
+- !ruby/object:Gem::Dependency
+  name: highscore
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 1.2.0
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: 1.2.0
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 1.2.0
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: 1.2.0
 - !ruby/object:Gem::Dependency
   name: bundler
   requirement: !ruby/object:Gem::Requirement
@@ -325,6 +365,7 @@ files:
 - bin/setup
 - circle.yml
 - config/article_scrape_patterns.yml
+- config/stopwords.yml
 - config/temp_dirs.yml
 - dev.yml
 - lib/news_scraper.rb
@@ -340,6 +381,7 @@ files:
 - lib/news_scraper/trainer/preset_selector.rb
 - lib/news_scraper/trainer/url_trainer.rb
 - lib/news_scraper/transformers/article.rb
+- lib/news_scraper/transformers/helpers/highscore_parser.rb
 - lib/news_scraper/transformers/nokogiri/functions.rb
 - lib/news_scraper/transformers/trainer_article.rb
 - lib/news_scraper/uri_parser.rb