rubyscraper 0.3.0 → 0.9.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/Gemfile.lock +5 -6
- data/README.md +37 -7
- data/lib/rubyscraper.rb +14 -148
- data/lib/rubyscraper/api_dispatcher.rb +31 -0
- data/lib/rubyscraper/binary.rb +9 -6
- data/lib/rubyscraper/option_parser.rb +72 -0
- data/lib/rubyscraper/paginator.rb +59 -0
- data/lib/rubyscraper/processor.rb +47 -0
- data/lib/rubyscraper/sub_page_scraper.rb +53 -0
- data/lib/rubyscraper/summary_scraper.rb +65 -0
- data/lib/rubyscraper/version.rb +1 -1
- data/rubyscraper.gemspec +5 -6
- data/spec/paginator_spec.rb +83 -0
- data/spec/rubyscraper_spec.rb +2 -6
- data/spec/spec_helper.rb +3 -0
- data/spec/sub_page_scraper_spec.rb +51 -0
- data/spec/summary_scraper_spec.rb +125 -0
- metadata +27 -33
- data/lib/assets/scrapes.json +0 -287
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: d243d8f44eb571d03e741459af753e87c10fc254
+  data.tar.gz: a49de3dfd374cd2bfcc6fa2d7e2d0f735a4ce567
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 705135c2b75b796ce8aa7db591d435da5a1643c6a7934241529deca24959249720a00b1b980e335fc29b7590991f275d41c32b275302170316ec0ab121459132
+  data.tar.gz: 23313e00a3eb5d907c27881a1e4c3c0640ba6afc25189a86e53494f2c545bd1f7968b46ef130bbcf93f38aca88c6de2ab345c82716c7d8cac039060657688404
data/Gemfile.lock
CHANGED
@@ -1,11 +1,10 @@
 PATH
   remote: .
   specs:
-    rubyscraper (0.
-      capybara
-      poltergeist
-      rest-client
-      slop
+    rubyscraper (0.1.0)
+      capybara (~> 2.4)
+      poltergeist (~> 1.6)
+      rest-client (~> 1.8)
 
 GEM
   remote: https://rubygems.org/
@@ -75,7 +74,7 @@ PLATFORMS
 
 DEPENDENCIES
   bundler (~> 1.9)
-  pry
+  pry (~> 0.10)
   rake (~> 10.0)
   rspec (~> 3.0)
   rubyscraper!
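The lockfile change above tightens each runtime dependency to a pessimistic (`~>`) version range and drops the unused `slop` dependency. As a sketch, the equivalent gemspec declarations would look like the following (the actual `rubyscraper.gemspec` changes are summarized in the file list but not shown in this excerpt):

```ruby
# Sketch of the runtime dependency pins implied by the new Gemfile.lock.
# "~> 2.4" permits >= 2.4 and < 3.0; "~> 1.6" permits >= 1.6 and < 2.0.
Gem::Specification.new do |spec|
  spec.add_dependency "capybara",    "~> 2.4"
  spec.add_dependency "poltergeist", "~> 1.6"
  spec.add_dependency "rest-client", "~> 1.8"
end
```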
data/README.md
CHANGED
@@ -1,11 +1,24 @@
 # RubyScraper
 
-
-
-TODO: Delete this and the text above, and describe your gem
+RubyScraper is a gem built
 
 ## Installation
+### Dependency
+RubyScraper relies on PhantomJS as its headless web browser. Install this before installing the gem with:
+
+```
+brew install phantomjs
+```
+
+### CLI
+Install RubyScraper by running:
+
+```
+gem install rubyscraper
+```
 
+### Gemfile
+*Work in Progress*
 Add this line to your application's Gemfile:
 
 ```ruby
@@ -22,18 +35,35 @@ Or install it yourself as:
 
 ## Usage
 
-
+```
+Usage: RubyScraper [options]
+
+Specific options:
+
+REQUIRED:
+    -f, --file FILENAME.JSON         Specify the file_name of your RubyScraper config file
 
-
+REQUIRED (if using as service to send results as post requests):
+    -e, --endpoint URL               Enter the api endpoint URL here
+                                       (If using scraper as a service to send post requests to server)
 
-
+OPTIONAL:
+    -r, --record-limit N             Pull N records per site
+                                       (approximate because if there are 25 records per
+                                        page, and 51 is provided, it will go to 3 pages)
+    -d, --delay N                    Delay N seconds before executing
+    -s, --site SITENAME              Scrape a single SITENAME from the config file
 
-
+Common options:
+    -h, --help                       Show this message
+        --version                    Show version
+```
 
 ## Contributing
 
 1. Fork it ( https://github.com/[my-github-username]/rubyscraper/fork )
 2. Create your feature branch (`git checkout -b my-new-feature`)
+3. Write your tests and don't break anything :) *run tests with `rspec`*
 3. Commit your changes (`git commit -am 'Add some feature'`)
 4. Push to the branch (`git push origin my-new-feature`)
 5. Create a new Pull Request
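The new usage text mirrors the `OptionParser` definition added later in this diff. A representative invocation (config file name, endpoint URL, and site name are all hypothetical) would be:

```
rubyscraper -f scrapes.json -e http://localhost:3000/api/jobs --record-limit 100 -s weworkremotely
```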
data/lib/rubyscraper.rb
CHANGED
@@ -1,153 +1,19 @@
-require 'capybara'
-require 'capybara/poltergeist'
-require 'rest-client'
 require 'rubyscraper/version'
+require 'rubyscraper/processor'
+require 'rubyscraper/api_dispatcher'
 
 class RubyScraper
-
-
-
-
-
-
-
-
-
-
-
-
-    @pages = pages
-    @endpoint = endpoint
-    @scrape_file = File.expand_path('../assets/scrapes.json', __FILE__)
-    @scrape_config = JSON.parse(File.read(@scrape_file))
-  end
-
-  def scrape(single_site=nil)
-    if single_site
-      search_site = scrape_config.select { |site| site["name"] == single_site }
-      if search_site
-        get_data(search_site.first)
-      else
-        raise "Invalid single site name #{single_site}. Not in scrape file."
-      end
-    else
-      scrape_config.each do |site|
-        unless site["skip"] == "true"
-          get_data(site)
-        end
-      end
-    end
-    return scraped_jobs, posted_jobs
-  end
-
-  def get_data(site)
-    get_summaries(site)
-    get_bodies(site)
-    send_to_server
-  end
-
-  def get_summaries(site)
-    if site["summary"]["params"].length > 0 && !site["summary"]["no_pagination?"]
-      site["summary"]["params"][0]["SEARCHTERM"].each do |term|
-        summary_url = "#{site["base_url"]}#{site["summary"]["url"].sub("SEARCHTERM", term)}"
-        pagination_start = site["summary"]["pagination_start"].to_i
-        pagination_end = pagination_start + pages - 1
-        (pagination_start..pagination_end).to_a.each do |page|
-          visit "#{summary_url}#{site["summary"]["pagination_fmt"]}#{page * site["summary"]["pagination_scale"].to_i}"
-          all(site["summary"]["loop"]).each do |listing|
-            job = pull_summary_data(site, listing)
-            job = modify_data(site, job)
-            jobs << job
-          end
-          puts "Pulled #{site["name"]}: #{term} (page: #{page}) job summaries."
-        end
-      end
-    else
-      summary_url = "#{site["base_url"]}#{site["summary"]["url"]}"
-      visit summary_url
-      all(site["summary"]["loop"]).each do |listing|
-        job = pull_summary_data(site, listing)
-        job = modify_data(site, job)
-        jobs << job
-      end
-      puts "Pulled #{site["name"]} job summaries."
-    end
-  end
-
-  def pull_summary_data(site, listing)
-    job = Hash.new
-    site["summary"]["fields"].each do |field|
-      if field["attr"]
-        if listing.has_css?(field["path"])
-          job[field["field"]] =
-            listing.send(field["method"].to_sym, field["path"])[field["attr"]]
-        end
-      else
-        if listing.has_css?(field["path"])
-          job[field["field"]] =
-            listing.send(field["method"].to_sym, field["path"]).text
-        end
-      end
-    end; job
-  end
-
-  def modify_data(site, job)
-    job["url"] = "#{site["base_url"]}#{job["url"]}" unless job["url"].match(/^http/)
-    job
-  end
-
-  def get_bodies(site)
-    jobs.each_with_index do |job, i|
-      sleep 1
-      pull_job_data(site, job)
-      puts "Job #{i+1} pulled."
-    end
-  end
-
-  def pull_job_data(site, job)
-    visit job["url"]
-    site["sub_page"]["fields"].each do |field|
-      if field["method"] == "all"
-        if has_css?(field["path"])
-          values = all(field["path"]).map do |elem|
-            elem.send(field["loop_collect"])
-          end
-          job[field["field"]] = values.join(field["join"])
-        end
-      else
-        if has_css?(field["path"])
-          job[field["field"]] =
-            send(field["method"].to_sym, field["path"]).text
-        end
-      end
-    end
-  end
-
-  def send_to_server
-    @scraped_jobs += jobs.length
-    jobs.each do |job|
-      tags = job["tags"] || ""
-      new_job = {
-        position: job["position"],
-        location: job["location"],
-        description: job["description"],
-        source: job["url"],
-        company: job["company"],
-        tags: tags.split(", ")
-      }
-
-      RestClient.post(endpoint, job: new_job){ |response, request, result, &block|
-        case response.code
-        when 201
-          @posted_jobs += 1
-          puts "Job saved."
-        when 302
-          puts "Job already exists."
-        else
-          puts "Bad request."
-        end
-      }
-    end
-    @jobs = []
+  def self.call(opts)
+    record_limit = opts.record_limit
+    config_file = File.expand_path(opts.config_file, Dir.pwd)
+    single_site = opts.single_site
+    scrape_delay = opts.scrape_delay
+    endpoint = opts.endpoint
+
+    processor = Processor.new(config_file, single_site, record_limit, scrape_delay)
+    results = processor.call
+    num_saved = ApiDispatcher.post(results, endpoint)
+
+    return results.count, num_saved
  end
 end
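The rewrite reduces `rubyscraper.rb` to a thin orchestrator: option values go in, a `Processor` scrapes, an `ApiDispatcher` posts, and the counts come back out. A minimal sketch of driving it directly (all values hypothetical; normally `RubyScraper::Binary` builds the options from the command line):

```ruby
require 'ostruct'
require 'rubyscraper'

# Hypothetical option values; the CLI normally builds these via the option parser.
opts = OpenStruct.new(
  config_file:  "scrapes.json",                    # resolved relative to Dir.pwd
  endpoint:     "http://localhost:3000/api/jobs",  # POST target for results
  record_limit: 50,
  single_site:  "",
  scrape_delay: 1
)

scraped, saved = RubyScraper.call(opts)
puts "Scraped #{scraped}, saved #{saved}"
```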
data/lib/rubyscraper/api_dispatcher.rb
ADDED
@@ -0,0 +1,31 @@
+require 'rest-client'
+
+class ApiDispatcher
+  def self.post(results, endpoint)
+    results.inject 0 do |posted, listing|
+      tags = listing["tags"].split(", ") if listing["tags"]
+      new_listing = {
+        position: listing["position"],
+        location: listing["location"],
+        company: listing["company"],
+        description: listing["description"],
+        source: listing["url"],
+        tags: tags
+      }
+
+      RestClient.post(endpoint, job: new_listing){ |response, request, result, &block|
+        case response.code
+        when 201
+          puts "Job saved."
+          posted += 1
+        when 302
+          puts "Job already exists."
+          posted
+        else
+          puts "Bad request."
+          posted
+        end
+      }
+    end
+  end
+end
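`ApiDispatcher.post` folds over the scraped listings with `inject`, so the accumulator `posted` only advances on a 201 response; 302 (duplicate) and error responses pass the count through unchanged. A minimal sketch of calling it, with hypothetical listing fields shaped to match the keys the dispatcher reads:

```ruby
# Hypothetical scraped result and endpoint, for illustration only.
results = [
  { "position"    => "Ruby Developer",
    "location"    => "Chicago, IL",
    "company"     => "Acme",
    "description" => "Build scrapers all day.",
    "url"         => "http://example.com/jobs/1",
    "tags"        => "ruby, rails" }  # split on ", " into an array before posting
]

saved = ApiDispatcher.post(results, "http://localhost:3000/api/jobs")
puts "#{saved} of #{results.count} listings saved"
```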
data/lib/rubyscraper/binary.rb
CHANGED
@@ -1,16 +1,19 @@
 require 'rubyscraper'
+require 'rubyscraper/option_parser'
 
 class RubyScraper
   class Binary
     def self.call(argv, outstream, errstream)
-      outstream.puts "
+      outstream.puts "RubyScraper"
       outstream.puts "---------------------------------------------"
       outstream.puts "Started scraping..."
-
-
-
-
-
+      outstream.puts "---------------------------------------------"
+
+      options = OptparseExample.parse(argv)
+      records_scraped, records_saved = RubyScraper.call(options)
+
+      outstream.puts "---------------------------------------------"
+      outstream.puts "Scraped #{records_scraped} records, succesfully posted #{records_saved} records."
       outstream.puts "---------------------------------------------"
       outstream.puts "Completed!"
     end
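`Binary.call` takes the output and error streams as parameters rather than writing to `$stdout` directly, which keeps it testable. The gem's executable is not part of this diff, but it would presumably wire things up along these lines:

```ruby
#!/usr/bin/env ruby
# Sketch of a bin/rubyscraper entry point; the actual binary is not shown in this diff.
require 'rubyscraper/binary'

RubyScraper::Binary.call(ARGV, $stdout, $stderr)
```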
data/lib/rubyscraper/option_parser.rb
ADDED
@@ -0,0 +1,72 @@
+require 'rubyscraper/version'
+require 'optparse'
+require 'ostruct'
+
+class OptparseExample
+  def self.parse(args)
+    options = OpenStruct.new
+    options.config_file = ""
+    options.endpoint = ""
+    options.record_limit = 50
+    options.single_site = ""
+    options.scrape_delay = 1
+
+    opt_parser = OptionParser.new do |opts|
+      opts.banner = "Usage: RubyScraper [options]"
+
+      opts.separator ""
+      opts.separator "Specific options:"
+      opts.separator ""
+
+      opts.separator "REQUIRED:"
+      # Mandatory argument
+      opts.on("-f", "--file FILENAME.JSON",
+              "Specify the file_name of your RubyScraper config file") do |file|
+        options.config_file = file
+      end
+
+      opts.separator ""
+      opts.separator "REQUIRED (if using as service to send results as post requests):"
+      # Mandatory argument if sending results to POST endpoint
+      opts.on("-e", "--endpoint URL",
+              "Enter the api endpoint URL here",
+              "  (If using scraper as a service to send post requests to server)",) do |url|
+        options.endpoint = url
+      end
+
+      opts.separator ""
+      opts.separator "OPTIONAL:"
+
+      opts.on("-rl", "--record-limit N", Integer,
+              "Pull N records per site",
+              "  (approximate because if there are 25 records per",
+              "   page, and 51 is provided, it will go to 3 pages)") do |limit|
+        options.record_limit = limit
+      end
+
+      opts.on("-d", "--delay N", Float, "Delay N seconds before executing") do |n|
+        options.delay = n
+      end
+
+      opts.on("-s", "--site SITENAME", "Scrape a single SITENAME from the config file") do |site|
+        options.single_site = site
+      end
+
+      opts.separator ""
+      opts.separator "Common options:"
+
+      opts.on_tail("-h", "--help", "Show this message") do
+        puts opts
+        exit
+      end
+
+      opts.on_tail("--version", "Show version") do
+        puts RubyScraper::VERSION
+        exit
+      end
+    end
+
+    opt_parser.parse!(args)
+    options
+  end
+end
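`OptparseExample.parse` returns an `OpenStruct`, so callers read flags as plain attributes. A quick sketch (argument values hypothetical):

```ruby
options = OptparseExample.parse(
  %w[-f scrapes.json -e http://localhost:3000/api/jobs -s weworkremotely]
)

options.config_file   # => "scrapes.json"
options.endpoint      # => "http://localhost:3000/api/jobs"
options.single_site   # => "weworkremotely"
options.record_limit  # => 50 (default)
```

One apparent inconsistency worth noting: the `-d` handler assigns `options.delay`, while `RubyScraper.call` reads `opts.scrape_delay` (which stays at its default of 1), so the delay flag does not appear to reach the processor in this version.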
data/lib/rubyscraper/paginator.rb
ADDED
@@ -0,0 +1,59 @@
+class Paginator
+  attr_reader :site, :record_limit, :pagination
+
+  def initialize(site, record_limit)
+    @site = site
+    @pagination = site["summary"]["pagination"]
+    @record_limit = record_limit
+  end
+
+  def define_pagination_params
+    if paginated_site?
+      @steps = url_page_addons
+      @add_on = pagination["format"]
+    else
+      @steps = [""]
+      @add_on = ""
+    end
+  end
+
+  def add_on
+    @add_on
+  end
+
+  def steps
+    @steps
+  end
+
+  private
+
+  def url_page_addons
+    output = []
+    num_pages.times do |i|
+      output << pagination_start + pagination_scale * i
+    end
+    output
+  end
+
+  def num_pages
+    output = record_limit / records_per_page
+    output += 1 if record_limit % records_per_page != 0
+    output
+  end
+
+  def records_per_page
+    pagination["records_per_page"].to_i
+  end
+
+  def pagination_start
+    pagination["start"].to_i
+  end
+
+  def pagination_scale
+    pagination["scale"].to_i
+  end
+
+  def paginated_site?
+    site["summary"]["paginated"] == "true"
+  end
+end
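`Paginator` converts a per-site record limit into the list of URL offsets to visit, rounding up to a whole number of pages. A minimal sketch follows; the site hash keys are inferred from the Paginator source above, and the gem's actual scrape-file schema may differ:

```ruby
# Hypothetical site config, shaped to match the keys Paginator reads.
site = {
  "summary" => {
    "paginated"  => "true",
    "pagination" => {
      "format"           => "&start=",  # appended to the URL before each step
      "records_per_page" => "25",
      "start"            => "0",
      "scale"            => "25"
    }
  }
}

paginator = Paginator.new(site, 51)
paginator.define_pagination_params
paginator.add_on  # => "&start="
paginator.steps   # => [0, 25, 50] -- 51 records at 25/page rounds up to 3 pages
```

This matches the README's note that a record limit of 51 on a 25-per-page site visits 3 pages.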