wonder_scrape 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: c27721fddd799f4cb631710d07c090a17abdaef56be4b9e725ac15e95bacce36
4
+ data.tar.gz: fc6469515a2d78a505d0911c0d8698b39a56a2ba7d1774aaf53723bf533c4ef3
5
+ SHA512:
6
+ metadata.gz: 23f4a73f08832f3ce85d06991ca879efb6a01c1592d53b8684d3a4a3cd8057c3e1335e8a4cb5f247c69026fa853df71e4d7d55c39f55be3cc14a27e60ffd549a
7
+ data.tar.gz: 8cc004a61c5a3f032c0f3a1e8e2e4028ce14c39ba04f62c77712ead4933418d3b7cdb08cf5fe324446c3f1d7626aaf9a1e2c6a67d2951d9e95ac018c2e90416f
data/.gitignore ADDED
@@ -0,0 +1,11 @@
1
+ /.bundle/
2
+ /.yardoc
3
+ /_yardoc/
4
+ /coverage/
5
+ /doc/
6
+ /pkg/
7
+ /spec/reports/
8
+ /tmp/
9
+
10
+ # rspec failure tracking
11
+ .rspec_status
data/.rspec ADDED
@@ -0,0 +1,3 @@
1
+ --format documentation
2
+ --color
3
+ --require spec_helper
data/.ruby-version ADDED
@@ -0,0 +1 @@
1
+ 2.7.0
data/.travis.yml ADDED
@@ -0,0 +1,6 @@
1
+ ---
2
+ language: ruby
3
+ cache: bundler
4
+ rvm:
5
+ - 2.7.0
6
+ before_install: gem install bundler -v 2.1.4
data/CHANGELOG.md ADDED
File without changes
data/CODE_OF_CONDUCT.md ADDED
@@ -0,0 +1,74 @@
1
+ # Contributor Covenant Code of Conduct
2
+
3
+ ## Our Pledge
4
+
5
+ In the interest of fostering an open and welcoming environment, we as
6
+ contributors and maintainers pledge to making participation in our project and
7
+ our community a harassment-free experience for everyone, regardless of age, body
8
+ size, disability, ethnicity, gender identity and expression, level of experience,
9
+ nationality, personal appearance, race, religion, or sexual identity and
10
+ orientation.
11
+
12
+ ## Our Standards
13
+
14
+ Examples of behavior that contributes to creating a positive environment
15
+ include:
16
+
17
+ * Using welcoming and inclusive language
18
+ * Being respectful of differing viewpoints and experiences
19
+ * Gracefully accepting constructive criticism
20
+ * Focusing on what is best for the community
21
+ * Showing empathy towards other community members
22
+
23
+ Examples of unacceptable behavior by participants include:
24
+
25
+ * The use of sexualized language or imagery and unwelcome sexual attention or
26
+ advances
27
+ * Trolling, insulting/derogatory comments, and personal or political attacks
28
+ * Public or private harassment
29
+ * Publishing others' private information, such as a physical or electronic
30
+ address, without explicit permission
31
+ * Other conduct which could reasonably be considered inappropriate in a
32
+ professional setting
33
+
34
+ ## Our Responsibilities
35
+
36
+ Project maintainers are responsible for clarifying the standards of acceptable
37
+ behavior and are expected to take appropriate and fair corrective action in
38
+ response to any instances of unacceptable behavior.
39
+
40
+ Project maintainers have the right and responsibility to remove, edit, or
41
+ reject comments, commits, code, wiki edits, issues, and other contributions
42
+ that are not aligned to this Code of Conduct, or to ban temporarily or
43
+ permanently any contributor for other behaviors that they deem inappropriate,
44
+ threatening, offensive, or harmful.
45
+
46
+ ## Scope
47
+
48
+ This Code of Conduct applies both within project spaces and in public spaces
49
+ when an individual is representing the project or its community. Examples of
50
+ representing a project or community include using an official project e-mail
51
+ address, posting via an official social media account, or acting as an appointed
52
+ representative at an online or offline event. Representation of a project may be
53
+ further defined and clarified by project maintainers.
54
+
55
+ ## Enforcement
56
+
57
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be
58
+ reported by contacting the project team at bendawson.rb@gmail.com. All
59
+ complaints will be reviewed and investigated and will result in a response that
60
+ is deemed necessary and appropriate to the circumstances. The project team is
61
+ obligated to maintain confidentiality with regard to the reporter of an incident.
62
+ Further details of specific enforcement policies may be posted separately.
63
+
64
+ Project maintainers who do not follow or enforce the Code of Conduct in good
65
+ faith may face temporary or permanent repercussions as determined by other
66
+ members of the project's leadership.
67
+
68
+ ## Attribution
69
+
70
+ This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
71
+ available at [https://contributor-covenant.org/version/1/4][version]
72
+
73
+ [homepage]: https://contributor-covenant.org
74
+ [version]: https://contributor-covenant.org/version/1/4/
data/Gemfile ADDED
@@ -0,0 +1,7 @@
1
+ source "https://rubygems.org"
2
+
3
+ # Specify your gem's dependencies in wonder_scrape.gemspec
4
+ gemspec
5
+
6
+ gem "rake", "~> 12.0"
7
+ gem "rspec", "~> 3.0"
data/Gemfile.lock ADDED
@@ -0,0 +1,87 @@
1
+ PATH
2
+ remote: .
3
+ specs:
4
+ wonder_scrape (0.1.0)
5
+ nokogiri (~> 1.10.9)
6
+ thor
7
+ tty-progressbar
8
+ tty-prompt
9
+ upton (~> 0.3.6)
10
+
11
+ GEM
12
+ remote: https://rubygems.org/
13
+ specs:
14
+ diff-lcs (1.3)
15
+ domain_name (0.5.20190701)
16
+ unf (>= 0.0.5, < 1.0.0)
17
+ equatable (0.6.1)
18
+ http-accept (1.7.0)
19
+ http-cookie (1.0.3)
20
+ domain_name (~> 0.5)
21
+ mime-types (3.3.1)
22
+ mime-types-data (~> 3.2015)
23
+ mime-types-data (3.2020.0425)
24
+ mini_portile2 (2.4.0)
25
+ necromancer (0.5.1)
26
+ netrc (0.11.0)
27
+ nokogiri (1.10.9)
28
+ mini_portile2 (~> 2.4.0)
29
+ pastel (0.7.3)
30
+ equatable (~> 0.6)
31
+ tty-color (~> 0.5)
32
+ rake (12.3.3)
33
+ rest-client (2.1.0)
34
+ http-accept (>= 1.7.0, < 2.0)
35
+ http-cookie (>= 1.0.2, < 2.0)
36
+ mime-types (>= 1.16, < 4.0)
37
+ netrc (~> 0.8)
38
+ rspec (3.9.0)
39
+ rspec-core (~> 3.9.0)
40
+ rspec-expectations (~> 3.9.0)
41
+ rspec-mocks (~> 3.9.0)
42
+ rspec-core (3.9.1)
43
+ rspec-support (~> 3.9.1)
44
+ rspec-expectations (3.9.1)
45
+ diff-lcs (>= 1.2.0, < 2.0)
46
+ rspec-support (~> 3.9.0)
47
+ rspec-mocks (3.9.1)
48
+ diff-lcs (>= 1.2.0, < 2.0)
49
+ rspec-support (~> 3.9.0)
50
+ rspec-support (3.9.2)
51
+ strings-ansi (0.1.0)
52
+ thor (1.0.1)
53
+ tty-color (0.5.1)
54
+ tty-cursor (0.7.1)
55
+ tty-progressbar (0.17.0)
56
+ strings-ansi (~> 0.1.0)
57
+ tty-cursor (~> 0.7)
58
+ tty-screen (~> 0.7)
59
+ unicode-display_width (~> 1.6)
60
+ tty-prompt (0.21.0)
61
+ necromancer (~> 0.5.0)
62
+ pastel (~> 0.7.0)
63
+ tty-reader (~> 0.7.0)
64
+ tty-reader (0.7.0)
65
+ tty-cursor (~> 0.7)
66
+ tty-screen (~> 0.7)
67
+ wisper (~> 2.0.0)
68
+ tty-screen (0.7.1)
69
+ unf (0.1.4)
70
+ unf_ext
71
+ unf_ext (0.0.7.7)
72
+ unicode-display_width (1.7.0)
73
+ upton (0.3.6)
74
+ nokogiri (~> 1.5)
75
+ rest-client (~> 2.0, >= 1.6)
76
+ wisper (2.0.1)
77
+
78
+ PLATFORMS
79
+ ruby
80
+
81
+ DEPENDENCIES
82
+ rake (~> 12.0)
83
+ rspec (~> 3.0)
84
+ wonder_scrape!
85
+
86
+ BUNDLED WITH
87
+ 2.1.4
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2020 Benjamin Dawson
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,47 @@
1
+ # WonderScrape
2
+
3
+ A project to collect useful information from figure collecting websites.
4
+
5
+ ## Installation
6
+
7
+ Add this line to your application's Gemfile:
8
+
9
+ ```ruby
10
+ gem 'wonder_scrape'
11
+ ```
12
+
13
+ And then execute:
14
+
15
+ $ bundle install
16
+
17
+ Or install it yourself as:
18
+
19
+ $ gem install wonder_scrape
20
+
21
+ ## Usage
22
+
23
+ To get started, run:
24
+
25
+ $ wonder_scrape scrape
26
+
27
+ For more configuration options, run:
28
+
29
+ $ wonder_scrape help scrape
30
+
31
+ ## Development
32
+
33
+ After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
34
+
35
+ To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
36
+
37
+ ## Contributing
38
+
39
+ Bug reports and merge requests are welcome on GitLab at https://gitlab.com/maleckai/wonder_scrape. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://gitlab.com/maleckai/wonder_scrape/-/blob/master/CODE_OF_CONDUCT.md).
40
+
41
+ ## License
42
+
43
+ The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
44
+
45
+ ## Code of Conduct
46
+
47
+ Everyone interacting in the WonderScrape project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://gitlab.com/maleckai/wonder_scrape/-/blob/master/CODE_OF_CONDUCT.md).
data/Rakefile ADDED
@@ -0,0 +1,8 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'bundler/gem_tasks'
4
+ require 'rspec/core/rake_task'
5
+
6
+ RSpec::Core::RakeTask.new(:spec)
7
+
8
+ task default: :spec
data/bin/console ADDED
@@ -0,0 +1,15 @@
1
+ #!/usr/bin/env ruby
2
+ # frozen_string_literal: true
3
+
4
+ require 'bundler/setup'
5
+ require 'wonder_scrape'
6
+
7
+ # You can add fixtures and/or initialization code here to make experimenting
8
+ # with your gem easier. You can also use a different console, if you like.
9
+
10
+ # (If you use this, don't forget to add pry to your Gemfile!)
11
+ # require "pry"
12
+ # Pry.start
13
+
14
+ require 'irb'
15
+ IRB.start(__FILE__)
data/bin/setup ADDED
@@ -0,0 +1,8 @@
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+ IFS=$'\n\t'
4
+ set -vx
5
+
6
+ bundle install
7
+
8
+ # Do any other automated setup that you need to do here
data/exe/wonder_scrape ADDED
@@ -0,0 +1,19 @@
1
+ #!/usr/bin/env ruby
2
+ # frozen_string_literal: true
3
+
4
+ lib_path = File.expand_path('../lib', __dir__)
5
+ $LOAD_PATH.unshift(lib_path) unless $LOAD_PATH.include?(lib_path)
6
+ require 'wonder_scrape'
7
+ require 'wonder_scrape/cli'
8
+
9
+ Signal.trap('INT') do
10
+ warn("\n#{caller.join("\n")}: interrupted")
11
+ exit(1)
12
+ end
13
+
14
+ begin
15
+ WonderScrape::CLI.start
16
+ rescue WonderScrape::CLI::Error => e
17
+ puts "ERROR: #{e.message}"
18
+ exit 1
19
+ end
data/lib/wonder_scrape.rb ADDED
@@ -0,0 +1,7 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'wonder_scrape/version'
4
+
5
+ module WonderScrape
6
+ class Error < StandardError; end
7
+ end
data/lib/wonder_scrape/cli.rb ADDED
@@ -0,0 +1,49 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'thor'
4
+ require_relative 'commands/scrape'
5
+
6
+ module WonderScrape
7
+ # Handle the application command line parsing
8
+ # and the dispatch to various command objects
9
+ #
10
+ # @api public
11
+ class CLI < Thor
12
+ # Error raised by this runner
13
+ Error = Class.new(StandardError)
14
+
15
+ desc 'version', 'wonder_scrape version'
16
+ def version
17
+ require_relative 'version'
18
+ puts "v#{WonderScrape::VERSION}"
19
+ end
20
+ map %w[--version -v] => :version
21
+
22
+ desc 'scrape', 'Scrape a target website for item data'
23
+ method_option :target, aliases: '-t', type: :string, banner: 'targetWebsite',
24
+ desc: 'Sets the target website for scraping.',
25
+ enum: WonderScrape::Commands::Scrape::VALID_SCRAPER_NAMES
26
+ method_option :output, aliases: '-o', type: :string, banner: 'csv',
27
+ desc: 'Specifies the output format',
28
+ enum: WonderScrape::Commands::Scrape::VALID_WRITERS
29
+ method_option :file, aliases: '-f', type: :string, banner: 'path/to/file',
30
+ desc: 'Path to the file to write output to. Only necessary for CSV.'
31
+ method_option :num_pages, aliases: '-n', type: :numeric, banner: 2,
32
+ desc: 'Expected number of pages for search results.'
33
+ method_option :start_page, aliases: '-s', type: :numeric, banner: 1,
34
+ desc: 'What page of search results to begin scraping from.'
35
+ method_option :request_delay, aliases: '-d', type: :numeric, banner: 5,
36
+ desc: 'How long in seconds to wait between requests. Useful to avoid tripping rate limits.'
37
+ method_option :verbose, aliases: '-v', type: :boolean,
38
+ desc: 'Runs in verbose mode, outputting in greater detail'
39
+ method_option :help, aliases: '-h', type: :boolean,
40
+ desc: 'Display usage information'
41
+ def scrape(*)
42
+ if options[:help]
43
+ invoke :help, ['scrape']
44
+ else
45
+ WonderScrape::Commands::Scrape.new(options).execute
46
+ end
47
+ end
48
+ end
49
+ end
data/lib/wonder_scrape/command.rb ADDED
@@ -0,0 +1,41 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'forwardable'
4
+
5
+ module WonderScrape
6
+ class Command
7
+ extend Forwardable
8
+
9
+ def_delegators :command, :run
10
+
11
+ # Execute this command
12
+ #
13
+ # @api public
14
+ def execute(*)
15
+ raise(
16
+ NotImplementedError,
17
+ "#{self.class}##{__method__} must be implemented"
18
+ )
19
+ end
20
+
21
+ # The external commands runner
22
+ #
23
+ # @see http://www.rubydoc.info/gems/tty-command
24
+ #
25
+ # @api public
26
+ def command(**options)
27
+ require 'tty-command'
28
+ TTY::Command.new(**options)
29
+ end
30
+
31
+ # The interactive prompt
32
+ #
33
+ # @see http://www.rubydoc.info/gems/tty-prompt
34
+ #
35
+ # @api public
36
+ def prompt
37
+ require 'tty-prompt'
38
+ TTY::Prompt.new(interrupt: :exit)
39
+ end
40
+ end
41
+ end
data/lib/wonder_scrape/commands/.gitkeep ADDED
@@ -0,0 +1 @@
1
+ #
data/lib/wonder_scrape/commands/scrape.rb ADDED
@@ -0,0 +1,93 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'tty-progressbar'
4
+
5
+ require_relative '../command'
6
+ require_relative '../scrapers/mfc/scraper'
7
+ require_relative '../writers/csv'
8
+ require_relative '../writers/hash'
9
+ require_relative '../recorder'
10
+
11
+ module WonderScrape
12
+ module Commands
13
+ class Scrape < WonderScrape::Command
14
+ VALID_SCRAPER_NAMES = [
15
+ WonderScrape::Scrapers::MFC::Scraper::NAME
16
+ ].freeze
17
+
18
+ VALID_WRITERS = [
19
+ WonderScrape::Writers::CSV::NAME,
20
+ WonderScrape::Writers::Hash::NAME
21
+ ].freeze
22
+
23
+ def initialize(raw_options)
24
+ @raw_options = raw_options
25
+ end
26
+
27
+ def execute(input: $stdin, output: $stdout)
28
+ recorder = WonderScrape::Recorder.new(output, options)
29
+ writer = build_writer
30
+ scraper = build_scraper(writer, recorder)
31
+
32
+ scraper.scrape
33
+ writer.output_results
34
+ recorder.print
35
+ end
36
+
37
+ private
38
+
39
+ attr_reader :raw_options
40
+
41
+ def build_scraper(writer, recorder)
42
+ target_module.new(writer, recorder, options)
43
+ end
44
+
45
+ def build_writer
46
+ case output
47
+ when WonderScrape::Writers::CSV::NAME
48
+ WonderScrape::Writers::CSV.new(file, target_module::FIELDS)
49
+ when WonderScrape::Writers::Hash::NAME
50
+ WonderScrape::Writers::Hash.new
51
+ end
52
+ end
53
+
54
+ def target_module
55
+ @target_module ||= case target
56
+ when WonderScrape::Scrapers::MFC::Scraper::NAME
57
+ WonderScrape::Scrapers::MFC::Scraper
58
+ end
59
+ end
60
+
61
+ def progress_bar
62
+ TTY::ProgressBar.new('[:bar] :percent', total: approximate_records)
63
+ end
64
+
65
+ def target
66
+ @target ||= raw_options[:target] || prompt.select('What website would you like to scrape?', VALID_SCRAPER_NAMES)
67
+ end
68
+
69
+ def output
70
+ @output ||= raw_options[:output] || prompt.select('How would you like to output?', VALID_WRITERS)
71
+ end
72
+
73
+ def file
74
+ @file ||= raw_options[:file] || prompt.ask('Please specify the file path you want to write to:', required: true)
75
+ end
76
+
77
+ def options
78
+ @options ||= raw_options.merge({
79
+ progress_bar: progress_bar,
80
+ num_pages: num_pages
81
+ })
82
+ end
83
+
84
+ def approximate_records
85
+ target_module::RESULTS_PER_PAGE * num_pages
86
+ end
87
+
88
+ def num_pages
89
+ @num_pages ||= raw_options[:num_pages] || prompt.ask('How many pages of search results do you want to scrape?', default: target_module::DEFAULT_MAX_PAGES, convert: :int)
90
+ end
91
+ end
92
+ end
93
+ end
data/lib/wonder_scrape/recorder.rb ADDED
@@ -0,0 +1,47 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'json'
4
+
5
+ class WonderScrape::Recorder
6
+ def initialize(output, options = {})
7
+ @output = output
8
+ @verbose = options[:verbose] || false
9
+ @progress_bar = options[:progress_bar]
10
+ @items_scraped = 0
11
+ @item_issues = {}
12
+ @unexpected_fields = []
13
+ end
14
+
15
+ def print
16
+ output.puts "Successfully processed #{items_scraped} items!"
17
+
18
+ if unexpected_fields.count > 0
19
+ output.puts "Encountered the following unexpected fields: #{unexpected_fields}"
20
+ end
21
+
22
+ if item_issues.count > 0
23
+ output.puts "Had issues with #{item_issues.count} items below"
24
+ output.puts JSON.pretty_generate(item_issues)
25
+ end
26
+ end
27
+
28
+ def increment_items_scraped(item)
29
+ @items_scraped += 1
30
+ if verbose
31
+ output.puts JSON.pretty_generate(item)
32
+ else
33
+ progress_bar&.advance(1)
34
+ end
35
+ end
36
+
37
+ def record_unexpected_field(item_id, field_name)
38
+ item_issues[item_id] ||= []
39
+ item_issues[item_id] << "Unexpected field: #{field_name}"
40
+ unexpected_fields << field_name
41
+ end
42
+
43
+ private
44
+
45
+ attr_reader :output, :verbose, :items_scraped, :progress_bar
46
+ attr_accessor :item_issues, :unexpected_fields
47
+ end
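Review note on the recorder above: `record_unexpected_field` appends the field name once per affected item, so the summary line printed by `print` can repeat the same name many times. The sketch below is a stand-alone illustration of the same bookkeeping with a de-duplicated summary. The lambda, the sample item IDs, and the `Sculptor` label are made up for the example and are not part of the gem.

```ruby
require 'json'
require 'stringio'

# Per-item issue log; the default block creates the list on first use.
item_issues = Hash.new { |hash, id| hash[id] = [] }
unexpected_fields = []

# Hypothetical stand-in for Recorder#record_unexpected_field.
record = lambda do |item_id, field_name|
  item_issues[item_id] << "Unexpected field: #{field_name}"
  unexpected_fields << field_name
end

record.call('Item #1', 'Sculptor')
record.call('Item #2', 'Sculptor')

# Write the report to an in-memory IO, the same way the recorder writes to `output`.
report = StringIO.new
report.puts "Encountered the following unexpected fields: #{unexpected_fields.uniq}"
report.puts "Had issues with #{item_issues.count} items below"
report.puts JSON.pretty_generate(item_issues)

puts report.string
```

Calling `uniq` only at print time keeps the raw list intact in case a per-field count is wanted later.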
data/lib/wonder_scrape/scrapers/mfc/field_parsers.rb ADDED
@@ -0,0 +1,71 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'uri'
+ require_relative 'mfc'
4
+
5
+ module WonderScrape::Scrapers::MFC
6
+ module FieldParsers
7
+ class Standard
8
+ def self.parse(field_content)
9
+ field_content.text
10
+ end
11
+ end
12
+
13
+ class StandardList
14
+ def self.parse(field_content)
15
+ field_content.search('a').map(&:text)
16
+ end
17
+ end
18
+
19
+ class Price
20
+ def self.parse(field_content)
21
+ field_content.search('.item-price').text
22
+ end
23
+ end
24
+
25
+ class Dates
26
+ def self.parse(field_content)
27
+ field_content.search('a.time').map(&:text)
28
+ end
29
+ end
30
+
31
+ class Events
32
+ def self.parse(field_content)
33
+ field_content.search('a.item-entry > span').map(&:text)
34
+ end
35
+ end
36
+
37
+ class MainImage
38
+ def self.parse(field_content)
39
+ image_url = field_content.search('#content .item-picture a.main img').attr('src')
40
+
41
+ parsed_uri = URI.parse(image_url)
42
+ parsed_uri.query = nil
43
+ parsed_uri.path = parsed_uri.path.gsub('/big/', '/large/')
44
+ parsed_uri.to_s
45
+ end
46
+ end
47
+
48
+ class AdditionalImages
49
+ STYLE_URL_REGEX = /url\(([^\(\)]+)\)/.freeze
50
+
51
+ class << self
52
+ def parse(field_content)
53
+ field_content.search('#content .item-picture a.more').map do |image_link|
54
+ extract_clean_url(image_link.attr('style'))
55
+ end
56
+ end
57
+
58
+ private
59
+
60
+ def extract_clean_url(style_string)
61
+ image_url = style_string.scan(STYLE_URL_REGEX).flatten.first
62
+
63
+ parsed_uri = URI.parse(image_url)
64
+ parsed_uri.query = nil
65
+ parsed_uri.path = parsed_uri.path.gsub('/thumbnails/', '/')
66
+ parsed_uri.to_s
67
+ end
68
+ end
69
+ end
70
+ end
71
+ end
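Review note on the image parsers above: `MainImage` and `AdditionalImages` both strip the query string and then rewrite one path segment. The sketch below reproduces that clean-up with the standard library only, so it can be tried without Nokogiri. The helper name `clean_image_url` and the example URLs are assumptions made for illustration; the real site's URL layout may differ.

```ruby
require 'uri'

# Hypothetical helper mirroring the query-stripping and path-rewriting steps.
def clean_image_url(raw_url, from:, to:)
  uri = URI.parse(raw_url)
  uri.query = nil                     # drop any query parameters
  uri.path = uri.path.gsub(from, to)  # swap the size segment in the path
  uri.to_s
end

# Same pattern as STYLE_URL_REGEX: capture whatever sits inside url(...).
style_url_regex = /url\(([^()]+)\)/

# Made-up inputs in the shape the parsers appear to expect.
main = clean_image_url('https://static.example.com/pictures/big/123.jpg?rev=4',
                       from: '/big/', to: '/large/')

style = 'background-image: url(https://static.example.com/pictures/thumbnails/456.jpg?rev=2)'
extra = clean_image_url(style[style_url_regex, 1], from: '/thumbnails/', to: '/')

puts main
puts extra
```

Note that the helper raises if the URL is missing, so a caller that may not find an image would need to check for `nil` before parsing.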
data/lib/wonder_scrape/scrapers/mfc/item_parser.rb ADDED
@@ -0,0 +1,146 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'nokogiri'
4
+ require_relative 'mfc'
5
+ require_relative 'field_parsers'
6
+
7
+ module WonderScrape::Scrapers::MFC
8
+ class ItemParser
9
+ DUPLICATE_FIELD_NAMES = {
10
+ 'Artist' => 'Artists',
11
+ 'Character' => 'Characters',
12
+ 'Classification' => 'Classifications',
13
+ 'Event' => 'Events',
14
+ 'Material' => 'Materials',
15
+ 'Release date' => 'Release dates'
16
+ }.freeze
17
+
18
+ VALID_FIELD_NAMES = [
19
+ 'Title',
20
+ 'Artists',
21
+ 'Category',
22
+ 'Characters',
23
+ 'Classifications',
24
+ 'Company',
25
+ 'Events',
26
+ 'JAN',
27
+ 'Materials',
28
+ 'Numbering',
29
+ 'Origin',
30
+ 'Price',
31
+ 'Release dates',
32
+ 'Scale & Dimensions',
33
+ 'Various',
34
+ 'Version',
35
+ 'Images'
36
+ ].freeze
37
+
38
+ ID_SELECTOR = '#content #ariadne > a.current'
39
+ TITLE_SELECTOR = '#content h1 span.headline'
40
+ FIELD_ELEMENTS_SELECTOR = '#content .data > .form > .form-field'
41
+ FIELD_NAME_SELECTOR = '.form-label'
42
+ FIELD_CONTENT_SELECTOR = '.form-input'
43
+
44
+ def self.parse(writer, recorder)
45
+ proc do |item_html_text|
46
+ item_html = ::Nokogiri::HTML(item_html_text)
47
+ new(writer, recorder, item_html).parse
48
+ end
49
+ end
50
+
51
+ def initialize(writer, recorder, item_html)
52
+ @writer = writer
53
+ @recorder = recorder
54
+ @item_html = item_html
55
+ @unexpected_fields = []
56
+ end
57
+
58
+ def parse
59
+ result = {}
60
+ result['Title'] = parsed_title
61
+ result.merge! parsed_fields
62
+ result['Images'] = parsed_images
63
+
64
+ writer.write(result)
65
+ recorder.increment_items_scraped(result)
66
+ end
67
+
68
+ private
69
+
70
+ attr_reader :writer, :recorder, :item_html
71
+
72
+ def parsed_id
73
+ id_element.text
74
+ end
75
+
76
+ def parsed_title
77
+ title_element.text
78
+ end
79
+
80
+ def parsed_fields
81
+ fields = {}
82
+
83
+ field_elements.each do |field_element|
84
+ field_name = dedupe_field_name(field_name_for(field_element))
85
+
86
+ if unexpected_field?(field_name)
87
+ recorder.record_unexpected_field(parsed_id, field_name)
88
+ next
89
+ end
90
+
91
+ field_content_element = field_content_element_for(field_element)
92
+ field_value = case field_name
93
+ when 'Price'
94
+ FieldParsers::Price.parse(field_content_element)
95
+ when 'Release dates'
96
+ FieldParsers::Dates.parse(field_content_element)
97
+ when 'Events'
98
+ FieldParsers::Events.parse(field_content_element)
99
+ when 'Artists', 'Characters', 'Classifications', 'Materials'
100
+ FieldParsers::StandardList.parse(field_content_element)
101
+ else
102
+ FieldParsers::Standard.parse(field_content_element)
103
+ end
104
+
105
+ fields[field_name] = field_value
106
+ end
107
+
108
+ fields
109
+ end
110
+
111
+ def parsed_images
112
+ images = []
113
+ images << FieldParsers::MainImage.parse(item_html)
114
+ images.concat FieldParsers::AdditionalImages.parse(item_html)
115
+ images.compact.uniq
116
+ end
117
+
118
+ def id_element
119
+ item_html.search(ID_SELECTOR)
120
+ end
121
+
122
+ def title_element
123
+ item_html.search(TITLE_SELECTOR)
124
+ end
125
+
126
+ def field_elements
127
+ item_html.search(FIELD_ELEMENTS_SELECTOR)
128
+ end
129
+
130
+ def field_name_for(field_element)
131
+ field_element.search(FIELD_NAME_SELECTOR).text
132
+ end
133
+
134
+ def dedupe_field_name(field_name)
135
+ DUPLICATE_FIELD_NAMES[field_name] || field_name
136
+ end
137
+
138
+ def field_content_element_for(field_element)
139
+ field_element.search(FIELD_CONTENT_SELECTOR)
140
+ end
141
+
142
+ def unexpected_field?(field_name)
143
+ !VALID_FIELD_NAMES.include?(field_name)
144
+ end
145
+ end
146
+ end
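Review note on the item parser above: each label is first mapped from its singular to its plural form (`dedupe_field_name`) and only then checked against the list of valid names (`unexpected_field?`). The sketch below shows that two-step flow on plain strings. The shortened alias and valid-name lists and the `Sculptor` label are chosen for illustration only.

```ruby
# Shortened copies of the alias table and the valid-name list.
aliases = { 'Artist' => 'Artists', 'Release date' => 'Release dates' }.freeze
valid = ['Title', 'Artists', 'Release dates', 'Price'].freeze

# Map a singular label onto its plural column name; leave other labels untouched.
normalize = ->(label) { aliases.fetch(label, label) }

# 'Sculptor' is a made-up label standing in for a field the parser does not know.
labels = ['Artist', 'Release dates', 'Price', 'Sculptor']

known, unexpected = labels.map(&normalize).partition { |name| valid.include?(name) }

puts "known: #{known.inspect}"
puts "unexpected: #{unexpected.inspect}"
```

Normalizing before validating means a page that uses the singular label still lands in the plural column instead of being reported as unexpected.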
data/lib/wonder_scrape/scrapers/mfc/mfc.rb ADDED
@@ -0,0 +1,5 @@
1
+ # frozen_string_literal: true
2
+
3
+ require_relative '../scrapers'
4
+
5
+ module WonderScrape::Scrapers::MFC; end
data/lib/wonder_scrape/scrapers/mfc/scraper.rb ADDED
@@ -0,0 +1,72 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'upton'
4
+ require_relative 'mfc'
5
+ require_relative 'item_parser'
6
+
7
+ module WonderScrape::Scrapers::MFC
8
+ class Scraper
9
+ NAME = 'MFC'
10
+ FIELDS = ItemParser::VALID_FIELD_NAMES
11
+
12
+ BASE_URL = 'myfigurecollection.net'
13
+ SEARCH_PATH = '/browse.v4.php'
14
+ SEARCH_RESULT_ITEM_SELECTOR = 'ul.listing div.item-icons span.item-icon > a.tbx-tooltip'
15
+ RESULTS_PER_PAGE = 81
16
+
17
+ DEFAULT_DELAY_BETWEEN_REQUESTS = 2 # seconds
18
+ DEFAULT_MAX_PAGES = 2
19
+ DEFAULT_START_PAGE = 1
20
+ DEFAULT_SEARCH_CATEGORY = 4 # Garage kits
21
+
22
+ def initialize(writer, recorder, options = {})
23
+ @writer = writer
24
+ @recorder = recorder
25
+ @options = options
26
+ end
27
+
28
+ def scrape
29
+ scraper.scrape(&ItemParser.parse(writer, recorder))
30
+ end
31
+
32
+ private
33
+
34
+ attr_reader :writer, :recorder, :options
35
+
36
+ def scraper
37
+ @scraper ||= build_scraper
38
+ end
39
+
40
+ def build_scraper
41
+ new_scraper = Upton::Scraper.new(
42
+ search_url,
43
+ SEARCH_RESULT_ITEM_SELECTOR
44
+ )
45
+
46
+ new_scraper.paginated = true
47
+ new_scraper.pagination_start_index = options[:start_page] || DEFAULT_START_PAGE
48
+ new_scraper.pagination_max_pages = options[:num_pages] || DEFAULT_MAX_PAGES
49
+ new_scraper.verbose = options[:verbose] || false
50
+ new_scraper.sleep_time_between_requests = options[:request_delay] || DEFAULT_DELAY_BETWEEN_REQUESTS
51
+
52
+ new_scraper
53
+ end
54
+
55
+ def search_url
56
+ URI::HTTPS.build(
57
+ host: BASE_URL,
58
+ path: SEARCH_PATH,
59
+ query: build_search_query_params
60
+ ).to_s
61
+ end
62
+
63
+ def build_search_query_params
64
+ URI.encode_www_form({
65
+ mode: 'search',
66
+ categoryId: DEFAULT_SEARCH_CATEGORY,
67
+ sort: 'date',
68
+ order: 'desc'
69
+ })
70
+ end
71
+ end
72
+ end
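Review note on the scraper above: `search_url` combines `URI.encode_www_form` with `URI::HTTPS.build`. The sketch below runs the same two calls with the constants inlined, which makes it easy to see the exact URL handed to Upton before any pagination parameter is added. The local variable names are illustrative.

```ruby
require 'uri'

# Same parameters as build_search_query_params, with category 4 inlined.
query = URI.encode_www_form({ mode: 'search', categoryId: 4, sort: 'date', order: 'desc' })

# Same components as search_url: BASE_URL is used as the host, SEARCH_PATH as the path.
search_url = URI::HTTPS.build(
  host: 'myfigurecollection.net',
  path: '/browse.v4.php',
  query: query
).to_s

puts search_url
# prints https://myfigurecollection.net/browse.v4.php?mode=search&categoryId=4&sort=date&order=desc
```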
data/lib/wonder_scrape/scrapers/scrapers.rb ADDED
@@ -0,0 +1,5 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'wonder_scrape'
4
+
5
+ module WonderScrape::Scrapers; end
@@ -0,0 +1 @@
1
+ #
data/lib/wonder_scrape/version.rb ADDED
@@ -0,0 +1,5 @@
1
+ # frozen_string_literal: true
2
+
3
+ module WonderScrape
4
+ VERSION = '0.1.0'
5
+ end
data/lib/wonder_scrape/writers/csv.rb ADDED
@@ -0,0 +1,32 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'csv'
4
+ require_relative 'writers'
5
+
6
+ class WonderScrape::Writers::CSV
7
+ NAME = 'csv'
8
+
9
+ def initialize(file_name, headers)
10
+ @headers = headers
11
+ @csv = build_csv_writer(file_name)
12
+ end
13
+
14
+ def write(entry)
15
+ csv << entry.values_at(*headers)
16
+ end
17
+
18
+ def output_results
19
+ csv.close
20
+ end
21
+
22
+ private
23
+
24
+ attr_reader :headers
25
+ attr_accessor :csv
26
+
27
+ def build_csv_writer(file_name)
28
+ new_csv = CSV.open(file_name, 'wb')
29
+ new_csv << headers
30
+ new_csv
31
+ end
32
+ end
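Review note on the CSV writer above: `write` builds each row with `entry.values_at(*headers)`, so the column order follows the header list, missing fields become `nil`, and list fields such as `Artists` arrive as arrays. The sketch below shows that row shape on a made-up entry. The `flatten_cell` lambda and the `'; '` separator are assumptions for illustration, not something the gem does; how the CSV library would serialize a raw array cell is worth checking before relying on it.

```ruby
headers = ['Title', 'Artists', 'Price'].freeze

# Made-up entry: 'Artists' is a list and 'Price' is absent.
entry = { 'Title' => 'Example kit', 'Artists' => %w[A B] }

# Same call as Writers::CSV#write: values in header order, nil for missing keys.
row = entry.values_at(*headers)

# Hypothetical clean-up step: join list values into a single cell.
flatten_cell = ->(value) { value.is_a?(Array) ? value.join('; ') : value }
flat_row = row.map(&flatten_cell)

puts row.inspect
puts flat_row.inspect
```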
data/lib/wonder_scrape/writers/hash.rb ADDED
@@ -0,0 +1,22 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'json'
4
+ require_relative 'writers'
5
+
6
+ class WonderScrape::Writers::Hash
7
+ NAME = 'hash'
8
+
9
+ def initialize
10
+ @results = []
11
+ end
12
+
13
+ attr_reader :results
14
+
15
+ def write(entry)
16
+ @results << entry
17
+ end
18
+
19
+ def output_results
20
+ puts JSON.pretty_generate(@results)
21
+ end
22
+ end
data/lib/wonder_scrape/writers/writers.rb ADDED
@@ -0,0 +1,3 @@
1
+ # frozen_string_literal: true
2
+
3
+ module WonderScrape::Writers; end
data/wonder_scrape.gemspec ADDED
@@ -0,0 +1,38 @@
1
+ # frozen_string_literal: true
2
+
3
+ require_relative 'lib/wonder_scrape/version'
4
+
5
+ Gem::Specification.new do |spec|
6
+ spec.name = 'wonder_scrape'
7
+ spec.version = WonderScrape::VERSION
8
+ spec.authors = ['Ben Dawson']
9
+ spec.email = ['bendawson.rb@gmail.com']
10
+
11
+ spec.summary = 'A project to collect useful information from figure collecting websites.'
12
+ spec.homepage = 'https://gitlab.com/maleckai/wonder_scrape'
13
+ spec.license = 'MIT'
14
+ spec.required_ruby_version = Gem::Requirement.new('>= 2.3.0')
15
+
16
+ spec.metadata['homepage_uri'] = spec.homepage
17
+ spec.metadata['source_code_uri'] = spec.homepage
18
+ spec.metadata['changelog_uri'] = "#{spec.homepage}/-/blob/master/CHANGELOG.md"
20
+
21
+ spec.metadata['allowed_push_host'] = 'https://rubygems.org'
22
+
23
+ # Specify which files should be added to the gem when it is released.
24
+ # The `git ls-files -z` loads the files in the RubyGem that have been added into git.
25
+ spec.files = Dir.chdir(File.expand_path(__dir__)) do
26
+ `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
27
+ end
28
+ spec.bindir = 'exe'
29
+ spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
30
+ spec.require_paths = ['lib']
31
+
32
+ spec.add_dependency 'thor'
33
+ spec.add_dependency 'tty-progressbar'
34
+ spec.add_dependency 'tty-prompt'
35
+
36
+ spec.add_runtime_dependency 'nokogiri', ['~> 1.10.9']
37
+ spec.add_runtime_dependency 'upton', ['~> 0.3.6']
38
+ end
metadata ADDED
@@ -0,0 +1,150 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: wonder_scrape
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - Ben Dawson
8
+ autorequire:
9
+ bindir: exe
10
+ cert_chain: []
11
+ date: 2020-05-05 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: thor
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - ">="
18
+ - !ruby/object:Gem::Version
19
+ version: '0'
20
+ type: :runtime
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - ">="
25
+ - !ruby/object:Gem::Version
26
+ version: '0'
27
+ - !ruby/object:Gem::Dependency
28
+ name: tty-progressbar
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - ">="
32
+ - !ruby/object:Gem::Version
33
+ version: '0'
34
+ type: :runtime
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - ">="
39
+ - !ruby/object:Gem::Version
40
+ version: '0'
41
+ - !ruby/object:Gem::Dependency
42
+ name: tty-prompt
43
+ requirement: !ruby/object:Gem::Requirement
44
+ requirements:
45
+ - - ">="
46
+ - !ruby/object:Gem::Version
47
+ version: '0'
48
+ type: :runtime
49
+ prerelease: false
50
+ version_requirements: !ruby/object:Gem::Requirement
51
+ requirements:
52
+ - - ">="
53
+ - !ruby/object:Gem::Version
54
+ version: '0'
55
+ - !ruby/object:Gem::Dependency
56
+ name: nokogiri
57
+ requirement: !ruby/object:Gem::Requirement
58
+ requirements:
59
+ - - "~>"
60
+ - !ruby/object:Gem::Version
61
+ version: 1.10.9
62
+ type: :runtime
63
+ prerelease: false
64
+ version_requirements: !ruby/object:Gem::Requirement
65
+ requirements:
66
+ - - "~>"
67
+ - !ruby/object:Gem::Version
68
+ version: 1.10.9
69
+ - !ruby/object:Gem::Dependency
70
+ name: upton
71
+ requirement: !ruby/object:Gem::Requirement
72
+ requirements:
73
+ - - "~>"
74
+ - !ruby/object:Gem::Version
75
+ version: 0.3.6
76
+ type: :runtime
77
+ prerelease: false
78
+ version_requirements: !ruby/object:Gem::Requirement
79
+ requirements:
80
+ - - "~>"
81
+ - !ruby/object:Gem::Version
82
+ version: 0.3.6
83
+ description:
84
+ email:
85
+ - bendawson.rb@gmail.com
86
+ executables:
87
+ - wonder_scrape
88
+ extensions: []
89
+ extra_rdoc_files: []
90
+ files:
91
+ - ".gitignore"
92
+ - ".rspec"
93
+ - ".ruby-version"
94
+ - ".travis.yml"
95
+ - CHANGELOG.md
96
+ - CODE_OF_CONDUCT.md
97
+ - Gemfile
98
+ - Gemfile.lock
99
+ - LICENSE.txt
100
+ - README.md
101
+ - Rakefile
102
+ - bin/console
103
+ - bin/setup
104
+ - exe/wonder_scrape
105
+ - lib/wonder_scrape.rb
106
+ - lib/wonder_scrape/cli.rb
107
+ - lib/wonder_scrape/command.rb
108
+ - lib/wonder_scrape/commands/.gitkeep
109
+ - lib/wonder_scrape/commands/scrape.rb
110
+ - lib/wonder_scrape/recorder.rb
111
+ - lib/wonder_scrape/scrapers/mfc/field_parsers.rb
112
+ - lib/wonder_scrape/scrapers/mfc/item_parser.rb
113
+ - lib/wonder_scrape/scrapers/mfc/mfc.rb
114
+ - lib/wonder_scrape/scrapers/mfc/scraper.rb
115
+ - lib/wonder_scrape/scrapers/scrapers.rb
116
+ - lib/wonder_scrape/templates/.gitkeep
117
+ - lib/wonder_scrape/templates/scrape/.gitkeep
118
+ - lib/wonder_scrape/version.rb
119
+ - lib/wonder_scrape/writers/csv.rb
120
+ - lib/wonder_scrape/writers/hash.rb
121
+ - lib/wonder_scrape/writers/writers.rb
122
+ - wonder_scrape.gemspec
123
+ homepage: https://gitlab.com/maleckai/wonder_scrape
124
+ licenses:
125
+ - MIT
126
+ metadata:
127
+ homepage_uri: https://gitlab.com/maleckai/wonder_scrape
128
+ source_code_uri: https://gitlab.com/maleckai/wonder_scrape
129
+ changelog_uri: https://gitlab.com/maleckai/wonder_scrape/-/blob/master/CHANGELOG.md
130
+ allowed_push_host: https://rubygems.org
131
+ post_install_message:
132
+ rdoc_options: []
133
+ require_paths:
134
+ - lib
135
+ required_ruby_version: !ruby/object:Gem::Requirement
136
+ requirements:
137
+ - - ">="
138
+ - !ruby/object:Gem::Version
139
+ version: 2.3.0
140
+ required_rubygems_version: !ruby/object:Gem::Requirement
141
+ requirements:
142
+ - - ">="
143
+ - !ruby/object:Gem::Version
144
+ version: '0'
145
+ requirements: []
146
+ rubygems_version: 3.1.2
147
+ signing_key:
148
+ specification_version: 4
149
+ summary: A project to collect useful information from figure collecting websites.
150
+ test_files: []