news_scraper 1.0.0 → 1.1.0
- checksums.yaml +4 -4
- data/Gemfile +4 -0
- data/README.md +77 -8
- data/config/article_scrape_patterns.yml +9 -0
- data/config/stopwords.yml +459 -0
- data/lib/news_scraper/configuration.rb +19 -6
- data/lib/news_scraper/transformers/article.rb +9 -1
- data/lib/news_scraper/transformers/helpers/highscore_parser.rb +50 -0
- data/lib/news_scraper/version.rb +1 -1
- data/news_scraper.gemspec +4 -1
- metadata +48 -6
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 608b90149fbc8977b1fc3b42c923557b128ad4df
+  data.tar.gz: 0e0914d81488d9630234860b3a9e732a1471158a
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 1e344051f216c10b320b324db5dbaeccaa9f034a1bd2d94f1905ea8797ed6389d4780db125bcd354dd553bc0279f75aecc507f5de91d30eb6d00098474e69173
+  data.tar.gz: 1823fe68329a466385e16dd224d49633c122afec26d6c71cd66d9b428888e81e68a73dfc78133a995ad0a9885248ae03389f2d9b3b192b4d514970bc385990b6
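The checksums above let a consumer verify a downloaded gem part (`metadata.gz`, `data.tar.gz`) before trusting it. A minimal sketch using only Ruby's stdlib `Digest`; the file and digest here are illustrative stand-ins, not the gem's real artifacts:

```ruby
require 'digest'
require 'tempfile'

# Compare a file's SHA-512 hex digest against an expected value,
# as published in checksums.yaml (digests here are illustrative).
def checksum_ok?(path, expected_sha512)
  Digest::SHA512.file(path).hexdigest == expected_sha512
end

# Usage sketch with a temp file standing in for a downloaded gem part.
file = Tempfile.new('gem-part')
file.write('example payload')
file.close
expected = Digest::SHA512.hexdigest('example payload')
puts checksum_ok?(file.path, expected) # => true
```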
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -33,14 +33,23 @@ Optionally, you can pass in a block and it will yield the transformed data on a
 It takes in 1 parameter `query:`.
 
 Array notation
-```
+```ruby
 article_hashes = NewsScraper::Scraper.new(query: 'Shopify').scrape # [ { author: ... }, { author: ... } ... ]
 ```
 
+*Note:* the array notation may raise `NewsScraper::Transformers::ScrapePatternNotDefined` (the domain is not in the configuration) or `NewsScraper::ResponseError` (a non-200 response). For this reason, it is suggested to use the block notation, where these errors can be handled properly.
+
 Block notation
-```
-NewsScraper::Scraper.new(query: 'Shopify').scrape do |
-
+```ruby
+NewsScraper::Scraper.new(query: 'Shopify').scrape do |a|
+  case a.class.to_s
+  when "NewsScraper::Transformers::ScrapePatternNotDefined"
+    puts "#{a.root_domain} was not trained"
+  when "NewsScraper::ResponseError"
+    puts "#{a.url} returned an error: #{a.error_code}-#{a.message}"
+  else
+    # { author: ... }
+  end
 end
 ```
 
@@ -48,12 +57,12 @@ How the `Scraper` extracts and parses for the information is determined by scrap
 
 ### Transformed Data
 
-Calling `NewsScraper::Scraper#scrape` with either the array or block notation will yield `transformed_data` hashes. [`article_scrape_patterns.yml`](https://github.com/
+Calling `NewsScraper::Scraper#scrape` with either the array or block notation will yield `transformed_data` hashes. [`article_scrape_patterns.yml`](https://github.com/news-scraper/news_scraper/blob/master/config/article_scrape_patterns.yml) defines the data types that will be scraped for.
 
 In addition, the `url` and `root_domain` (hostname) of the article will be returned in the hash too.
 
 Example
-```
+```ruby
 {
   author: 'Linus Torvalds',
   body: 'The Linux kernel developed by Linus Torvalds has become the backbone of most electronic devices we use to-date. It powers mobile phones, laptops, embedded devices, and even rockets...',
@@ -71,12 +80,34 @@ Example
 
 Scrape patterns are xpath or CSS patterns used by Nokogiri to extract relevant HTML elements.
 
-Extracting each `:data_type` (see Example under **Transformed Data**) requires a scrape pattern. A few `:presets` are specified in [`article_scrape_patterns.yml`](https://github.com/
+Extracting each `:data_type` (see Example under **Transformed Data**) requires a scrape pattern. A few `:presets` are specified in [`article_scrape_patterns.yml`](https://github.com/news-scraper/news_scraper/blob/master/config/article_scrape_patterns.yml).
 
 Since each news site (identified with `:root_domain`) uses a different markup, scrape patterns are defined on a per-`:root_domain` basis.
 
 Specifying scrape patterns for new, undefined `:root_domain`s is called training (see **Training**).
 
+#### Customizing Scrape Patterns
+
+`NewsScraper.configuration` is the entry point for scrape patterns. By default, it loads the contents of [`article_scrape_patterns.yml`](https://github.com/news-scraper/news_scraper/blob/master/config/article_scrape_patterns.yml), but you can override this with the `scrape_patterns_fetch_method`, which accepts a proc.
+
+For example, to override the domains section:
+
+```ruby
+@default_configuration = NewsScraper.configuration.scrape_patterns.dup
+NewsScraper.configure do |config|
+  config.scrape_patterns_fetch_method = proc do
+    @default_configuration['domains'] = { ... }
+    @default_configuration
+  end
+end
+```
+
+Using this method you can override any part of the configuration individually, or the entire thing; it is fully customizable.
+
+This helps separate applications that track domain training themselves. If the configuration is not set correctly, a newly trained domain will not be in the configuration and a `NewsScraper::Transformers::ScrapePatternNotDefined` error will be raised.
+
+It would be appreciated if any domains you train outside of this gem eventually end up as a pull request back to [`article_scrape_patterns.yml`](https://github.com/news-scraper/news_scraper/blob/master/config/article_scrape_patterns.yml).
+
 ### Training
 
 For each `:root_domain`, it is necessary to specify a scrape pattern for each of the `:data_type`s. A rake task provides a CLI for appending new `:root_domain`s using `:preset` scrape patterns.
@@ -88,6 +119,44 @@ bundle exec rake scraper:train QUERY=<query>
 
 where the CLI will step through the articles and `:root_domain`s of the articles relevant to `<query>`.
 
+This simply creates an entry for a `domain` with `domain_entries`, so as long as your application provides the same functionality, this can be overridden in your app. Just provide a domain entry like so:
+
+```yaml
+domains:
+  root_domain.com:
+    author:
+      method: method
+      pattern: pattern
+    body:
+      method: method
+      pattern: pattern
+    description:
+      method: method
+      pattern: pattern
+    keywords:
+      method: method
+      pattern: pattern
+    section:
+      method: method
+      pattern: pattern
+    datetime:
+      method: method
+      pattern: pattern
+    title:
+      method: method
+      pattern: pattern
+```
+
+The options using the presets in [`article_scrape_patterns.yml`](https://github.com/news-scraper/news_scraper/blob/master/config/article_scrape_patterns.yml) can be obtained using this snippet:
+```ruby
+include NewsScraper::ExtractorsHelpers
+
+transformed_data = NewsScraper::Transformers::TrainerArticle.new(
+  url: url,
+  payload: http_request(url).body
+).transform
+```
+
 ## Development
 
 After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
@@ -96,7 +165,7 @@ To install this gem onto your local machine, run `bundle exec rake install`. To
 
 ## Contributing
 
-Bug reports and pull requests are welcome on GitHub at https://github.com/
+Bug reports and pull requests are welcome on GitHub at https://github.com/news-scraper/news_scraper. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
 
 
 ## License
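The block notation above dispatches on the class of the yielded object, since the scraper yields error objects as well as article hashes. That dispatch pattern can be sketched standalone; the `Struct` classes below are stand-ins for illustration, not the gem's real error types:

```ruby
# Stand-ins for the gem's error objects (illustrative only).
ScrapePatternNotDefined = Struct.new(:root_domain)
ResponseError = Struct.new(:url, :error_code, :message)

def handle(result)
  case result
  when ScrapePatternNotDefined
    "#{result.root_domain} was not trained"
  when ResponseError
    "#{result.url} returned an error: #{result.error_code}-#{result.message}"
  else
    result # the transformed_data hash
  end
end

puts handle(ScrapePatternNotDefined.new('example.com'))
# => example.com was not trained
```

Matching on the class directly (`when ScrapePatternNotDefined`) rather than on `a.class.to_s` strings, as the README does, is the more idiomatic Ruby form.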
data/config/article_scrape_patterns.yml
CHANGED
@@ -45,6 +45,9 @@ presets:
     og: &og_description
       method: "xpath"
       pattern: "//meta[@property='og:description']/@content"
+    metainspector: &metainspector_description
+      method: 'metainspector'
+      pattern: :description
   keywords:
     meta: &meta_keywords
       method: "xpath"
@@ -55,6 +58,9 @@ presets:
     news_keywords: &news_keywords_keywords
       method: "xpath"
       pattern: "//meta[@name='news_keywords']/@content"
+    highscore: &highscore_keywords
+      method: highscore
+      pattern: ""
   section:
     meta: &meta_section
       method: "xpath"
@@ -109,6 +115,9 @@ presets:
     og: &og_title
       method: "xpath"
       pattern: "//meta[@property='og:title']/@content"
+    metainspector: &metainspector_title
+      method: 'metainspector'
+      pattern: :best_title
 
 domains:
   investors.com:
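Each preset above pairs a `method` (e.g. `xpath`) with a `pattern`. The gem applies xpath patterns with Nokogiri; purely for illustration, the `og:title` pattern can be exercised with Ruby's stdlib REXML against a made-up HTML snippet:

```ruby
require 'rexml/document'

html = <<~HTML
<html><head>
  <meta property="og:title" content="Example Headline"/>
  <meta property="og:description" content="An example description."/>
</head><body/></html>
HTML

doc = REXML::Document.new(html)
# The og:title preset selects //meta[@property='og:title']/@content;
# here we select the element and read its content attribute.
meta = REXML::XPath.first(doc, "//meta[@property='og:title']")
puts meta.attributes['content'] # => Example Headline
```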
data/config/stopwords.yml
ADDED
@@ -0,0 +1,459 @@
+---
+- "-"
+- "--"
+- ":"
+- ":d"
+- 'no'
+- 'off'
+- 'on'
+- about
+- above
+- across
+- after
+- again
+- against
+- ahahaha
+- all
+- almost
+- alone
+- along
+- already
+- also
+- although
+- always
+- am
+- among
+- an
+- and
+- another
+- any
+- anybody
+- anyone
+- anything
+- anywhere
+- ar
+- are
+- area
+- areas
+- around
+- as
+- ask
+- asked
+- asking
+- asks
+- at
+- aw
+- away
+- aww
+- awww
+- back
+- backed
+- backing
+- backs
+- be
+- became
+- because
+- become
+- becomes
+- been
+- before
+- began
+- behind
+- being
+- beings
+- best
+- better
+- between
+- big
+- bit
+- blud
+- both
+- bt
+- but
+- by
+- call
+- came
+- can
+- cannot
+- case
+- cases
+- certain
+- certainly
+- chat
+- clear
+- clearly
+- come
+- comments
+- could
+- d
+- did
+- differ
+- different
+- differently
+- do
+- does
+- done
+- dont
+- down
+- downed
+- downing
+- downs
+- dunno
+- during
+- each
+- early
+- eh
+- either
+- email
+- end
+- ended
+- ending
+- ends
+- enough
+- even
+- evenly
+- ever
+- every
+- everybody
+- everyone
+- everything
+- everywhere
+- face
+- faces
+- fact
+- facts
+- far
+- felt
+- few
+- find
+- finds
+- first
+- for
+- four
+- from
+- full
+- fully
+- further
+- furthered
+- furthering
+- furthers
+- gave
+- general
+- generally
+- get
+- gets
+- give
+- given
+- gives
+- go
+- going
+- good
+- goods
+- got
+- great
+- greater
+- greatest
+- group
+- grouped
+- grouping
+- groups
+- ha
+- haaaa
+- had
+- haha
+- has
+- have
+- having
+- he
+- heh
+- hehe
+- hehehe
+- her
+- here
+- herself
+- high
+- higher
+- highest
+- him
+- himself
+- his
+- hola
+- how
+- however
+- i
+- id
+- if
+- il
+- im
+- important
+- in
+- init
+- interest
+- interested
+- interesting
+- interests
+- into
+- is
+- it
+- its
+- itself
+- iv
+- ive
+- jst
+- just
+- keep
+- keeps
+- kind
+- knew
+- know
+- known
+- knows
+- large
+- largely
+- last
+- later
+- latest
+- least
+- less
+- let
+- lets
+- like
+- likely
+- lol
+- long
+- longer
+- longest
+- lool
+- loool
+- looool
+- made
+- mah
+- make
+- making
+- man
+- many
+- may
+- me
+- member
+- members
+- men
+- might
+- more
+- most
+- mostly
+- mr
+- mrs
+- much
+- must
+- my
+- myself
+- necessary
+- need
+- needed
+- needing
+- needs
+- never
+- new
+- newer
+- newest
+- next
+- nobody
+- non
+- noone
+- not
+- nothing
+- now
+- nowhere
+- number
+- numbers
+- of
+- often
+- oh
+- old
+- older
+- oldest
+- once
+- one
+- only
+- ooh
+- ooo
+- open
+- opened
+- opening
+- opens
+- or
+- order
+- ordered
+- ordering
+- orders
+- other
+- others
+- our
+- out
+- over
+- part
+- parted
+- parting
+- parts
+- per
+- perhaps
+- place
+- places
+- pls
+- point
+- pointed
+- pointing
+- points
+- possible
+- powered
+- present
+- presented
+- presenting
+- presents
+- problem
+- problems
+- put
+- puts
+- quite
+- rather
+- really
+- right
+- room
+- rooms
+- run
+- safe
+- said
+- same
+- saw
+- say
+- says
+- second
+- seconds
+- see
+- seem
+- seemed
+- seeming
+- seems
+- sees
+- several
+- shall
+- she
+- should
+- show
+- showed
+- showing
+- shows
+- side
+- sides
+- since
+- small
+- smaller
+- smallest
+- so
+- some
+- somebody
+- someone
+- something
+- somewhere
+- state
+- states
+- still
+- stop
+- such
+- sure
+- ta
+- tail
+- take
+- taken
+- team
+- than
+- thank
+- thanks
+- that
+- the
+- their
+- them
+- then
+- there
+- therefore
+- theres
+- these
+- they
+- thing
+- things
+- think
+- thinks
+- this
+- those
+- though
+- thought
+- thoughts
+- three
+- through
+- thus
+- to
+- today
+- together
+- too
+- took
+- toward
+- tryna
+- turn
+- turned
+- turning
+- turns
+- two
+- under
+- until
+- up
+- upon
+- ur
+- us
+- use
+- used
+- uses
+- very
+- want
+- wanted
+- wanting
+- wants
+- was
+- way
+- ways
+- we
+- welcome
+- well
+- wells
+- went
+- were
+- what
+- when
+- where
+- whether
+- which
+- while
+- who
+- whole
+- whose
+- why
+- will
+- with
+- within
+- without
+- work
+- worked
+- working
+- works
+- would
+- ya
+- yeah
+- year
+- years
+- yet
+- yo
+- you
+- young
+- younger
+- youngest
+- your
+- yours
data/lib/news_scraper/configuration.rb
CHANGED
@@ -1,26 +1,39 @@
 module NewsScraper
   class Configuration
     DEFAULT_SCRAPE_PATTERNS_FILEPATH = File.expand_path('../../../config/article_scrape_patterns.yml', __FILE__)
-
+    STOPWORDS_FILEPATH = File.expand_path('../../../config/stopwords.yml', __FILE__)
+    attr_accessor :scrape_patterns_fetch_method, :stopwords_fetch_method, :scrape_patterns_filepath
 
     # <code>NewsScraper::Configuration.initialize</code> initializes the scrape_patterns_filepath
-    # and the
+    # and the scrape_patterns_fetch_method to the <code>DEFAULT_SCRAPE_PATTERNS_FILEPATH</code>.
+    # It also sets stopwords to be used during extraction to a default set contained in <code>STOPWORDS_FILEPATH</code>
     #
     # Set the <code>scrape_patterns_filepath</code> to <code>nil</code> to disable saving during training
     #
     def initialize
       self.scrape_patterns_filepath = DEFAULT_SCRAPE_PATTERNS_FILEPATH
-      self.
+      self.scrape_patterns_fetch_method = proc { default_scrape_patterns }
+      self.stopwords_fetch_method = proc { YAML.load_file(STOPWORDS_FILEPATH) }
     end
 
     # <code>NewsScraper::Configuration.scrape_patterns</code> proxies scrape_patterns
-    # requests to <code>
+    # requests to <code>scrape_patterns_fetch_method</code>:
     #
     # *Returns*
-    # - The result of calling the <code>
+    # - The result of calling the <code>scrape_patterns_fetch_method</code> proc, expected to be a hash
     #
     def scrape_patterns
-
+      scrape_patterns_fetch_method.call
+    end
+
+    # <code>NewsScraper::Configuration.stopwords</code> proxies stopwords
+    # requests to <code>stopwords_fetch_method</code>:
+    #
+    # *Returns*
+    # - The result of calling the <code>stopwords_fetch_method</code> proc, expected to be an array
+    #
+    def stopwords
+      stopwords_fetch_method.call
     end
 
     private
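The configuration above routes every lookup through a stored proc, so an application can swap the data source (a file, a database, an API) without subclassing. A stripped-down standalone sketch of that fetch-method indirection; `Config` here is a stand-in, not the gem's actual class:

```ruby
# Minimal stand-in for NewsScraper::Configuration's proc indirection.
class Config
  attr_accessor :stopwords_fetch_method

  def initialize
    # Default source; the real class loads config/stopwords.yml here.
    self.stopwords_fetch_method = proc { %w(a an the) }
  end

  def stopwords
    stopwords_fetch_method.call
  end
end

config = Config.new
puts config.stopwords.inspect          # the default set
config.stopwords_fetch_method = proc { %w(foo bar) }
puts config.stopwords.inspect          # the overridden set
```

Because the proc is called on every access, an override takes effect immediately and can even return different data over time.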
data/lib/news_scraper/transformers/article.rb
CHANGED
@@ -1,8 +1,11 @@
 require 'nokogiri'
 require 'sanitize'
+require 'news_scraper/transformers/nokogiri/functions'
+
 require 'readability'
 require 'htmlbeautifier'
-require '
+require 'metainspector'
+require 'news_scraper/transformers/helpers/highscore_parser'
 
 module NewsScraper
   module Transformers
@@ -69,6 +72,11 @@ module NewsScraper
       # Remove any newlines in the text
       content = content.squeeze("\n").strip
       HtmlBeautifier.beautify(content)
+    when :metainspector
+      page = MetaInspector.new(@url, document: @payload)
+      page.respond_to?(scrape_pattern.to_sym) ? page.send(scrape_pattern.to_sym) : nil
+    when :highscore
+      NewsScraper::Transformers::Helpers::HighScoreParser.keywords(url: @url, payload: @payload)
     end
   end
 end
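The `:metainspector` branch above calls an arbitrary method named by the scrape pattern, guarded with `respond_to?` so an unknown pattern yields `nil` instead of raising `NoMethodError`. The guard in isolation, with a stand-in `Page` struct instead of a MetaInspector page:

```ruby
# Stand-in object with a couple of readable attributes.
Page = Struct.new(:title, :description)

# Call the method named by pattern if the object supports it, else nil.
def extract(page, pattern)
  page.respond_to?(pattern.to_sym) ? page.send(pattern.to_sym) : nil
end

page = Page.new('A Headline', 'A description')
puts extract(page, :title).inspect        # => "A Headline"
puts extract(page, :nonexistent).inspect  # => nil
```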
data/lib/news_scraper/transformers/helpers/highscore_parser.rb
ADDED
@@ -0,0 +1,50 @@
+require 'metainspector'
+require 'highscore'
+require 'readability'
+
+module NewsScraper
+  module Transformers
+    module Helpers
+      class HighScoreParser
+        class << self
+          # <code>NewsScraper::Transformers::Helpers::HighScoreParser.keywords</code> parses out keywords
+          #
+          # *Params*
+          # - <code>url:</code>: keyword for the url to parse to a uri
+          # - <code>payload:</code>: keyword for the payload from a request to the url (html body)
+          #
+          # *Returns*
+          # - <code>keywords</code>: Top 5 keywords from the body of text
+          #
+          def keywords(url:, payload:)
+            blacklist = Highscore::Blacklist.load(stopwords(url, payload))
+            content = Readability::Document.new(payload, remove_empty_nodes: true, tags: [], attributes: []).content
+            highscore(content, blacklist)
+          end
+
+          private
+
+          def highscore(content, blacklist)
+            text = Highscore::Content.new(content, blacklist)
+            text.configure do
+              set :multiplier, 2
+              set :upper_case, 3
+              set :long_words, 2
+              set :long_words_threshold, 15
+              set :ignore_case, true
+            end
+            text.keywords.top(5).collect(&:text)
+          end
+
+          def stopwords(url, payload)
+            page = MetaInspector.new(url, document: payload)
+            stopwords = NewsScraper.configuration.stopwords
+            # Add the site name to the stop words
+            stopwords += page.meta['og:site_name'].downcase.split(' ') if page.meta['og:site_name']
+            stopwords
+          end
+        end
+      end
+    end
+  end
+end
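`HighScoreParser` delegates the actual scoring to the `highscore` gem. The core idea, counting word frequencies over non-stopwords and taking the top 5, can be sketched standalone in plain Ruby; this is a simplified illustration, not the gem's weighting scheme:

```ruby
# Simplified stand-in for Highscore's keyword ranking: count word
# frequencies, drop blacklisted stopwords, return the top 5 words.
def top_keywords(text, stopwords, limit: 5)
  words = text.downcase.scan(/[a-z']+/)
  counts = Hash.new(0)
  words.each { |w| counts[w] += 1 unless stopwords.include?(w) }
  # Sort by descending count, then alphabetically for a stable order.
  counts.sort_by { |word, count| [-count, word] }.first(limit).map(&:first)
end

stopwords = %w(the a of and to)
text = 'The kernel powers phones and the kernel powers rockets'
puts top_keywords(text, stopwords).inspect
# => ["kernel", "powers", "phones", "rockets"]
```

The real Highscore configuration above additionally boosts upper-case and long words via the `:multiplier`, `:upper_case`, and `:long_words` settings.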
data/lib/news_scraper/version.rb
CHANGED
data/news_scraper.gemspec
CHANGED
@@ -1,3 +1,4 @@
+# rubocop:disable BlockLength
 # coding: utf-8
 lib = File.expand_path('../lib', __FILE__)
 $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
@@ -29,7 +30,9 @@ Gem::Specification.new do |spec|
   spec.add_dependency 'sanitize', '~> 4.2', '>= 4.2.0'
   spec.add_dependency 'ruby-readability', '~> 0.7', '>= 0.7.0'
   spec.add_dependency 'htmlbeautifier', '~> 1.1', '>= 1.1.1'
-  spec.add_dependency 'terminal-table', '~> 1.
+  spec.add_dependency 'terminal-table', '~> 1.7.0', '>= 1.7.0'
+  spec.add_dependency 'metainspector', '~> 5.3.0', '>= 5.3.0'
+  spec.add_dependency 'highscore', '~> 1.2.0', '>= 1.2.0'
 
   spec.add_development_dependency 'bundler', '~> 1.12', '>= 1.12.0'
   spec.add_development_dependency 'rake', '~> 10.0', '>= 10.0.0'
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: news_scraper
 version: !ruby/object:Gem::Version
-  version: 1.
+  version: 1.1.0
 platform: ruby
 authors:
 - Richard Wu
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2016-
+date: 2016-10-16 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -117,20 +117,60 @@ dependencies:
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version:
+        version: 1.7.0
     - - ">="
       - !ruby/object:Gem::Version
-        version: 1.
+        version: 1.7.0
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version:
+        version: 1.7.0
     - - ">="
       - !ruby/object:Gem::Version
-        version: 1.
+        version: 1.7.0
+- !ruby/object:Gem::Dependency
+  name: metainspector
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 5.3.0
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: 5.3.0
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 5.3.0
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: 5.3.0
+- !ruby/object:Gem::Dependency
+  name: highscore
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 1.2.0
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: 1.2.0
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 1.2.0
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: 1.2.0
 - !ruby/object:Gem::Dependency
   name: bundler
   requirement: !ruby/object:Gem::Requirement
@@ -325,6 +365,7 @@ files:
 - bin/setup
 - circle.yml
 - config/article_scrape_patterns.yml
+- config/stopwords.yml
 - config/temp_dirs.yml
 - dev.yml
 - lib/news_scraper.rb
@@ -340,6 +381,7 @@ files:
 - lib/news_scraper/trainer/preset_selector.rb
 - lib/news_scraper/trainer/url_trainer.rb
 - lib/news_scraper/transformers/article.rb
+- lib/news_scraper/transformers/helpers/highscore_parser.rb
 - lib/news_scraper/transformers/nokogiri/functions.rb
 - lib/news_scraper/transformers/trainer_article.rb
 - lib/news_scraper/uri_parser.rb