wonder_scrape 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: c27721fddd799f4cb631710d07c090a17abdaef56be4b9e725ac15e95bacce36
4
+ data.tar.gz: fc6469515a2d78a505d0911c0d8698b39a56a2ba7d1774aaf53723bf533c4ef3
5
+ SHA512:
6
+ metadata.gz: 23f4a73f08832f3ce85d06991ca879efb6a01c1592d53b8684d3a4a3cd8057c3e1335e8a4cb5f247c69026fa853df71e4d7d55c39f55be3cc14a27e60ffd549a
7
+ data.tar.gz: 8cc004a61c5a3f032c0f3a1e8e2e4028ce14c39ba04f62c77712ead4933418d3b7cdb08cf5fe324446c3f1d7626aaf9a1e2c6a67d2951d9e95ac018c2e90416f
data/.gitignore ADDED
@@ -0,0 +1,11 @@
1
+ /.bundle/
2
+ /.yardoc
3
+ /_yardoc/
4
+ /coverage/
5
+ /doc/
6
+ /pkg/
7
+ /spec/reports/
8
+ /tmp/
9
+
10
+ # rspec failure tracking
11
+ .rspec_status
data/.rspec ADDED
@@ -0,0 +1,3 @@
1
+ --format documentation
2
+ --color
3
+ --require spec_helper
data/.ruby-version ADDED
@@ -0,0 +1 @@
1
+ 2.7.0
data/.travis.yml ADDED
@@ -0,0 +1,6 @@
1
+ ---
2
+ language: ruby
3
+ cache: bundler
4
+ rvm:
5
+ - 2.7.0
6
+ before_install: gem install bundler -v 2.1.4
data/CHANGELOG.md ADDED
File without changes
data/CODE_OF_CONDUCT.md ADDED
@@ -0,0 +1,74 @@
1
+ # Contributor Covenant Code of Conduct
2
+
3
+ ## Our Pledge
4
+
5
+ In the interest of fostering an open and welcoming environment, we as
6
+ contributors and maintainers pledge to making participation in our project and
7
+ our community a harassment-free experience for everyone, regardless of age, body
8
+ size, disability, ethnicity, gender identity and expression, level of experience,
9
+ nationality, personal appearance, race, religion, or sexual identity and
10
+ orientation.
11
+
12
+ ## Our Standards
13
+
14
+ Examples of behavior that contributes to creating a positive environment
15
+ include:
16
+
17
+ * Using welcoming and inclusive language
18
+ * Being respectful of differing viewpoints and experiences
19
+ * Gracefully accepting constructive criticism
20
+ * Focusing on what is best for the community
21
+ * Showing empathy towards other community members
22
+
23
+ Examples of unacceptable behavior by participants include:
24
+
25
+ * The use of sexualized language or imagery and unwelcome sexual attention or
26
+ advances
27
+ * Trolling, insulting/derogatory comments, and personal or political attacks
28
+ * Public or private harassment
29
+ * Publishing others' private information, such as a physical or electronic
30
+ address, without explicit permission
31
+ * Other conduct which could reasonably be considered inappropriate in a
32
+ professional setting
33
+
34
+ ## Our Responsibilities
35
+
36
+ Project maintainers are responsible for clarifying the standards of acceptable
37
+ behavior and are expected to take appropriate and fair corrective action in
38
+ response to any instances of unacceptable behavior.
39
+
40
+ Project maintainers have the right and responsibility to remove, edit, or
41
+ reject comments, commits, code, wiki edits, issues, and other contributions
42
+ that are not aligned to this Code of Conduct, or to ban temporarily or
43
+ permanently any contributor for other behaviors that they deem inappropriate,
44
+ threatening, offensive, or harmful.
45
+
46
+ ## Scope
47
+
48
+ This Code of Conduct applies both within project spaces and in public spaces
49
+ when an individual is representing the project or its community. Examples of
50
+ representing a project or community include using an official project e-mail
51
+ address, posting via an official social media account, or acting as an appointed
52
+ representative at an online or offline event. Representation of a project may be
53
+ further defined and clarified by project maintainers.
54
+
55
+ ## Enforcement
56
+
57
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be
58
+ reported by contacting the project team at bendawson.rb@gmail.com. All
59
+ complaints will be reviewed and investigated and will result in a response that
60
+ is deemed necessary and appropriate to the circumstances. The project team is
61
+ obligated to maintain confidentiality with regard to the reporter of an incident.
62
+ Further details of specific enforcement policies may be posted separately.
63
+
64
+ Project maintainers who do not follow or enforce the Code of Conduct in good
65
+ faith may face temporary or permanent repercussions as determined by other
66
+ members of the project's leadership.
67
+
68
+ ## Attribution
69
+
70
+ This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
71
+ available at [https://contributor-covenant.org/version/1/4][version]
72
+
73
+ [homepage]: https://contributor-covenant.org
74
+ [version]: https://contributor-covenant.org/version/1/4/
data/Gemfile ADDED
@@ -0,0 +1,7 @@
1
+ source "https://rubygems.org"
2
+
3
+ # Specify your gem's dependencies in wonder_scrape.gemspec
4
+ gemspec
5
+
6
+ gem "rake", "~> 12.0"
7
+ gem "rspec", "~> 3.0"
data/Gemfile.lock ADDED
@@ -0,0 +1,87 @@
1
+ PATH
2
+ remote: .
3
+ specs:
4
+ wonder_scrape (0.1.0)
5
+ nokogiri (~> 1.10.9)
6
+ thor
7
+ tty-progressbar
8
+ tty-prompt
9
+ upton (~> 0.3.6)
10
+
11
+ GEM
12
+ remote: https://rubygems.org/
13
+ specs:
14
+ diff-lcs (1.3)
15
+ domain_name (0.5.20190701)
16
+ unf (>= 0.0.5, < 1.0.0)
17
+ equatable (0.6.1)
18
+ http-accept (1.7.0)
19
+ http-cookie (1.0.3)
20
+ domain_name (~> 0.5)
21
+ mime-types (3.3.1)
22
+ mime-types-data (~> 3.2015)
23
+ mime-types-data (3.2020.0425)
24
+ mini_portile2 (2.4.0)
25
+ necromancer (0.5.1)
26
+ netrc (0.11.0)
27
+ nokogiri (1.10.9)
28
+ mini_portile2 (~> 2.4.0)
29
+ pastel (0.7.3)
30
+ equatable (~> 0.6)
31
+ tty-color (~> 0.5)
32
+ rake (12.3.3)
33
+ rest-client (2.1.0)
34
+ http-accept (>= 1.7.0, < 2.0)
35
+ http-cookie (>= 1.0.2, < 2.0)
36
+ mime-types (>= 1.16, < 4.0)
37
+ netrc (~> 0.8)
38
+ rspec (3.9.0)
39
+ rspec-core (~> 3.9.0)
40
+ rspec-expectations (~> 3.9.0)
41
+ rspec-mocks (~> 3.9.0)
42
+ rspec-core (3.9.1)
43
+ rspec-support (~> 3.9.1)
44
+ rspec-expectations (3.9.1)
45
+ diff-lcs (>= 1.2.0, < 2.0)
46
+ rspec-support (~> 3.9.0)
47
+ rspec-mocks (3.9.1)
48
+ diff-lcs (>= 1.2.0, < 2.0)
49
+ rspec-support (~> 3.9.0)
50
+ rspec-support (3.9.2)
51
+ strings-ansi (0.1.0)
52
+ thor (1.0.1)
53
+ tty-color (0.5.1)
54
+ tty-cursor (0.7.1)
55
+ tty-progressbar (0.17.0)
56
+ strings-ansi (~> 0.1.0)
57
+ tty-cursor (~> 0.7)
58
+ tty-screen (~> 0.7)
59
+ unicode-display_width (~> 1.6)
60
+ tty-prompt (0.21.0)
61
+ necromancer (~> 0.5.0)
62
+ pastel (~> 0.7.0)
63
+ tty-reader (~> 0.7.0)
64
+ tty-reader (0.7.0)
65
+ tty-cursor (~> 0.7)
66
+ tty-screen (~> 0.7)
67
+ wisper (~> 2.0.0)
68
+ tty-screen (0.7.1)
69
+ unf (0.1.4)
70
+ unf_ext
71
+ unf_ext (0.0.7.7)
72
+ unicode-display_width (1.7.0)
73
+ upton (0.3.6)
74
+ nokogiri (~> 1.5)
75
+ rest-client (~> 2.0, >= 1.6)
76
+ wisper (2.0.1)
77
+
78
+ PLATFORMS
79
+ ruby
80
+
81
+ DEPENDENCIES
82
+ rake (~> 12.0)
83
+ rspec (~> 3.0)
84
+ wonder_scrape!
85
+
86
+ BUNDLED WITH
87
+ 2.1.4
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2020 Benjamin Dawson
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,47 @@
1
+ # WonderScrape
2
+
3
+ A project to collect useful information from figure collecting websites.
4
+
5
+ ## Installation
6
+
7
+ Add this line to your application's Gemfile:
8
+
9
+ ```ruby
10
+ gem 'wonder_scrape'
11
+ ```
12
+
13
+ And then execute:
14
+
15
+ $ bundle install
16
+
17
+ Or install it yourself as:
18
+
19
+ $ gem install wonder_scrape
20
+
21
+ ## Usage
22
+
23
+ To get started, run:
24
+
25
+ $ wonder_scrape scrape
26
+
27
+ For more configuration options, run:
28
+
29
+ $ wonder_scrape help scrape
30
+
31
+ ## Development
32
+
33
+ After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
34
+
35
+ To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
36
+
37
+ ## Contributing
38
+
39
+ Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/wonder_scrape. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/[USERNAME]/wonder_scrape/blob/master/CODE_OF_CONDUCT.md).
40
+
41
+ ## License
42
+
43
+ The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
44
+
45
+ ## Code of Conduct
46
+
47
+ Everyone interacting in the WonderScrape project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/[USERNAME]/wonder_scrape/blob/master/CODE_OF_CONDUCT.md).
data/Rakefile ADDED
@@ -0,0 +1,8 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'bundler/gem_tasks'
4
+ require 'rspec/core/rake_task'
5
+
6
+ RSpec::Core::RakeTask.new(:spec)
7
+
8
+ task default: :spec
data/bin/console ADDED
@@ -0,0 +1,15 @@
1
+ #!/usr/bin/env ruby
2
+ # frozen_string_literal: true
3
+
4
+ require 'bundler/setup'
5
+ require 'wonder_scrape'
6
+
7
+ # You can add fixtures and/or initialization code here to make experimenting
8
+ # with your gem easier. You can also use a different console, if you like.
9
+
10
+ # (If you use this, don't forget to add pry to your Gemfile!)
11
+ # require "pry"
12
+ # Pry.start
13
+
14
+ require 'irb'
15
+ IRB.start(__FILE__)
data/bin/setup ADDED
@@ -0,0 +1,8 @@
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+ IFS=$'\n\t'
4
+ set -vx
5
+
6
+ bundle install
7
+
8
+ # Do any other automated setup that you need to do here
data/exe/wonder_scrape ADDED
@@ -0,0 +1,19 @@
1
+ #!/usr/bin/env ruby
2
+ # frozen_string_literal: true
3
+
4
+ lib_path = File.expand_path('../lib', __dir__)
5
+ $LOAD_PATH.unshift(lib_path) unless $LOAD_PATH.include?(lib_path)
6
+ require 'wonder_scrape'
7
+ require 'wonder_scrape/cli'
8
+
9
+ Signal.trap('INT') do
10
+ warn("\n#{caller.join("\n")}: interrupted")
11
+ exit(1)
12
+ end
13
+
14
+ begin
15
+ WonderScrape::CLI.start
16
+ rescue WonderScrape::CLI::Error => e
17
+ puts "ERROR: #{e.message}"
18
+ exit 1
19
+ end
data/lib/wonder_scrape.rb ADDED
@@ -0,0 +1,7 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'wonder_scrape/version'
4
+
5
+ module WonderScrape
6
+ class Error < StandardError; end
7
+ end
data/lib/wonder_scrape/cli.rb ADDED
@@ -0,0 +1,49 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'thor'
4
+ require_relative 'commands/scrape'
5
+
6
+ module WonderScrape
7
+ # Handle the application command line parsing
8
+ # and the dispatch to various command objects
9
+ #
10
+ # @api public
11
+ class CLI < Thor
12
+ # Error raised by this runner
13
+ Error = Class.new(StandardError)
14
+
15
+ desc 'version', 'wonder_scrape version'
16
+ def version
17
+ require_relative 'version'
18
+ puts "v#{WonderScrape::VERSION}"
19
+ end
20
+ map %w[--version -v] => :version
21
+
22
+ desc 'scrape', 'Scrape a target website for item data'
23
+ method_option :target, aliases: '-t', type: :string, banner: 'targetWebsite',
24
+ desc: 'Sets the target website for scraping.',
25
+ enum: WonderScrape::Commands::Scrape::VALID_SCRAPER_NAMES
26
+ method_option :output, aliases: '-o', type: :string, banner: 'csv',
27
+ desc: 'Specifies the output format',
28
+ enum: WonderScrape::Commands::Scrape::VALID_WRITERS
29
+ method_option :file, aliases: '-f', type: :string, banner: 'path/to/file',
30
+ desc: 'Path to the file to write output to. Only necessary for CSV.'
31
+ method_option :num_pages, aliases: '-n', type: :numeric, banner: 2,
32
+ desc: 'Expected number of pages for search results.'
33
+ method_option :start_page, aliases: '-s', type: :numeric, banner: 1,
34
+ desc: 'What page of search results to begin scraping from.'
35
+ method_option :request_delay, aliases: '-d', type: :numeric, banner: 5,
36
+ desc: 'How long in seconds to wait between requests. Useful to avoid tripping rate limits.'
37
+ method_option :verbose, aliases: '-v', type: :boolean,
38
+ desc: 'Runs in verbose mode, outputting in greater detail'
39
+ method_option :help, aliases: '-h', type: :boolean,
40
+ desc: 'Display usage information'
41
+ def scrape(*)
42
+ if options[:help]
43
+ invoke :help, ['scrape']
44
+ else
45
+ WonderScrape::Commands::Scrape.new(options).execute
46
+ end
47
+ end
48
+ end
49
+ end
data/lib/wonder_scrape/command.rb ADDED
@@ -0,0 +1,41 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'forwardable'
4
+
5
+ module WonderScrape
6
+ class Command
7
+ extend Forwardable
8
+
9
+ def_delegators :command, :run
10
+
11
+ # Execute this command
12
+ #
13
+ # @api public
14
+ def execute(*)
15
+ raise(
16
+ NotImplementedError,
17
+ "#{self.class}##{__method__} must be implemented"
18
+ )
19
+ end
20
+
21
+ # The external commands runner
22
+ #
23
+ # @see http://www.rubydoc.info/gems/tty-command
24
+ #
25
+ # @api public
26
+ def command(**options)
27
+ require 'tty-command'
28
+ TTY::Command.new(options)
29
+ end
30
+
31
+ # The interactive prompt
32
+ #
33
+ # @see http://www.rubydoc.info/gems/tty-prompt
34
+ #
35
+ # @api public
36
+ def prompt
37
+ require 'tty-prompt'
38
+ TTY::Prompt.new(interrupt: :exit)
39
+ end
40
+ end
41
+ end
data/lib/wonder_scrape/commands/.gitkeep ADDED
@@ -0,0 +1 @@
1
+ #
data/lib/wonder_scrape/commands/scrape.rb ADDED
@@ -0,0 +1,93 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'tty-progressbar'
4
+
5
+ require_relative '../command'
6
+ require_relative '../scrapers/mfc/scraper'
7
+ require_relative '../writers/csv'
8
+ require_relative '../writers/hash'
9
+ require_relative '../recorder'
10
+
11
+ module WonderScrape
12
+ module Commands
13
+ class Scrape < WonderScrape::Command
14
+ VALID_SCRAPER_NAMES = [
15
+ WonderScrape::Scrapers::MFC::Scraper::NAME
16
+ ].freeze
17
+
18
+ VALID_WRITERS = [
19
+ WonderScrape::Writers::CSV::NAME,
20
+ WonderScrape::Writers::Hash::NAME
21
+ ].freeze
22
+
23
+ def initialize(raw_options)
24
+ @raw_options = raw_options
25
+ end
26
+
27
+ def execute(input: $stdin, output: $stdout)
28
+ recorder = WonderScrape::Recorder.new(output, options)
29
+ writer = build_writer
30
+ scraper = build_scraper(writer, recorder)
31
+
32
+ scraper.scrape
33
+ writer.output_results
34
+ recorder.print
35
+ end
36
+
37
+ private
38
+
39
+ attr_reader :raw_options
40
+
41
+ def build_scraper(writer, recorder)
42
+ target_module.new(writer, recorder, options)
43
+ end
44
+
45
+ def build_writer
46
+ case output
47
+ when WonderScrape::Writers::CSV::NAME
48
+ WonderScrape::Writers::CSV.new(file, target_module::FIELDS)
49
+ when WonderScrape::Writers::Hash::NAME
50
+ WonderScrape::Writers::Hash.new
51
+ end
52
+ end
53
+
54
+ def target_module
55
+ @target_module ||= case target
56
+ when WonderScrape::Scrapers::MFC::Scraper::NAME
57
+ WonderScrape::Scrapers::MFC::Scraper
58
+ end
59
+ end
60
+
61
+ def progress_bar
62
+ TTY::ProgressBar.new('[:bar] :percent', total: approximate_records)
63
+ end
64
+
65
+ def target
66
+ @target ||= raw_options[:target] || prompt.select('What website would you like to scrape?', VALID_SCRAPER_NAMES)
67
+ end
68
+
69
+ def output
70
+ @output ||= raw_options[:output] || prompt.select('How would you like to output?', VALID_WRITERS)
71
+ end
72
+
73
+ def file
74
+ @file ||= raw_options[:file] || prompt.ask('Please specify the file path you want to write to:', required: true)
75
+ end
76
+
77
+ def options
78
+ @options ||= raw_options.merge({
79
+ progress_bar: progress_bar,
80
+ num_pages: num_pages
81
+ })
82
+ end
83
+
84
+ def approximate_records
85
+ target_module::RESULTS_PER_PAGE * num_pages
86
+ end
87
+
88
+ def num_pages
89
+ @num_pages ||= raw_options[:num_pages] || prompt.ask('How many pages of search results do you want to scrape?', default: target_module::DEFAULT_MAX_PAGES, convert: :int)
90
+ end
91
+ end
92
+ end
93
+ end
data/lib/wonder_scrape/recorder.rb ADDED
@@ -0,0 +1,47 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'json'
4
+
5
+ class WonderScrape::Recorder
6
+ def initialize(output, options = {})
7
+ @output = output
8
+ @verbose = options[:verbose] || false
9
+ @progress_bar = options[:progress_bar]
10
+ @items_scraped = 0
11
+ @item_issues = {}
12
+ @unexpected_fields = []
13
+ end
14
+
15
+ def print
16
+ output.puts "Successfully processed #{items_scraped} items!"
17
+
18
+ if unexpected_fields.count > 0
19
+ output.puts "Encountered the following unexpected fields: #{unexpected_fields}"
20
+ end
21
+
22
+ if item_issues.count > 0
23
+ output.puts "Had issues with #{item_issues.count} items below"
24
+ output.puts JSON.pretty_generate(item_issues)
25
+ end
26
+ end
27
+
28
+ def increment_items_scraped(item)
29
+ @items_scraped += 1
30
+ if verbose
31
+ output.puts JSON.pretty_generate(item)
32
+ else
33
+ progress_bar&.advance(1)
34
+ end
35
+ end
36
+
37
+ def record_unexpected_field(item_id, field_name)
38
+ item_issues[item_id] ||= []
39
+ item_issues[item_id] << "Unexpected field: #{field_name}"
40
+ unexpected_fields << field_name
41
+ end
42
+
43
+ private
44
+
45
+ attr_reader :output, :verbose, :items_scraped, :progress_bar
46
+ attr_accessor :item_issues, :unexpected_fields
47
+ end
data/lib/wonder_scrape/scrapers/mfc/field_parsers.rb ADDED
@@ -0,0 +1,71 @@
1
+ # frozen_string_literal: true
2
+
3
+ require_relative 'mfc'
4
+
5
+ module WonderScrape::Scrapers::MFC
6
+ module FieldParsers
7
+ class Standard
8
+ def self.parse(field_content)
9
+ field_content.text
10
+ end
11
+ end
12
+
13
+ class StandardList
14
+ def self.parse(field_content)
15
+ field_content.search('a').map(&:text)
16
+ end
17
+ end
18
+
19
+ class Price
20
+ def self.parse(field_content)
21
+ field_content.search('.item-price').text
22
+ end
23
+ end
24
+
25
+ class Dates
26
+ def self.parse(field_content)
27
+ field_content.search('a.time').map(&:text)
28
+ end
29
+ end
30
+
31
+ class Events
32
+ def self.parse(field_content)
33
+ field_content.search('a.item-entry > span').map(&:text)
34
+ end
35
+ end
36
+
37
+ class MainImage
38
+ def self.parse(field_content)
39
+ image_url = field_content.search('#content .item-picture a.main img').attr('src')
40
+
41
+ parsed_uri = URI.parse(image_url)
42
+ parsed_uri.query = nil
43
+ parsed_uri.path = parsed_uri.path.gsub('/big/', '/large/')
44
+ parsed_uri.to_s
45
+ end
46
+ end
47
+
48
+ class AdditionalImages
49
+ STYLE_URL_REGEX = /url\(([^\(\)]+)\)/.freeze
50
+
51
+ class << self
52
+ def parse(field_content)
53
+ field_content.search('#content .item-picture a.more').map do |image_link|
54
+ extract_clean_url(image_link.attr('style'))
55
+ end
56
+ end
57
+
58
+ private
59
+
60
+ def extract_clean_url(style_string)
61
+ image_url = style_string.scan(STYLE_URL_REGEX).flatten.first
62
+
63
+ parsed_uri = URI.parse(image_url)
64
+ parsed_uri.query = nil
65
+ parsed_uri.path = parsed_uri.path.gsub('/thumbnails/', '/')
66
+ parsed_uri.to_s
67
+ end
68
+ end
69
+ end
70
+ end
71
+ end
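The URL clean-up done by `AdditionalImages.extract_clean_url` in the hunk above can be tried in isolation. The following is a minimal sketch, not the gem's code: `clean_thumbnail_url` is a hypothetical helper name, the sample style string and host are made up, and the `nil` guard is an addition that the original method does not have.

```ruby
require 'uri'

# Same pattern as STYLE_URL_REGEX in field_parsers.rb: captures the text inside url(...).
STYLE_URL_REGEX = /url\(([^\(\)]+)\)/.freeze

# Hypothetical helper mirroring extract_clean_url.
def clean_thumbnail_url(style_string)
  image_url = style_string.scan(STYLE_URL_REGEX).flatten.first
  return nil if image_url.nil? # guard added for this sketch only

  parsed_uri = URI.parse(image_url)
  parsed_uri.query = nil                                      # drop cache-busting params
  parsed_uri.path = parsed_uri.path.gsub('/thumbnails/', '/') # point at the full-size image
  parsed_uri.to_s
end

# Made-up inline style, assumed to resemble what the page carries.
style = 'background-image: url(https://static.example.com/pics/thumbnails/123.jpg?rev=4)'
puts clean_thumbnail_url(style)
```

A style string without a `url(...)` part yields `nil` here, whereas the method in the diff would pass `nil` on to `URI.parse`.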
data/lib/wonder_scrape/scrapers/mfc/item_parser.rb ADDED
@@ -0,0 +1,146 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'nokogiri'
4
+ require_relative 'mfc'
5
+ require_relative 'field_parsers'
6
+
7
+ module WonderScrape::Scrapers::MFC
8
+ class ItemParser
9
+ DUPLICATE_FIELD_NAMES = {
10
+ 'Artist' => 'Artists',
11
+ 'Character' => 'Characters',
12
+ 'Classification' => 'Classifications',
13
+ 'Event' => 'Events',
14
+ 'Material' => 'Materials',
15
+ 'Release date' => 'Release dates'
16
+ }.freeze
17
+
18
+ VALID_FIELD_NAMES = [
19
+ 'Title',
20
+ 'Artists',
21
+ 'Category',
22
+ 'Characters',
23
+ 'Classifications',
24
+ 'Company',
25
+ 'Events',
26
+ 'JAN',
27
+ 'Materials',
28
+ 'Numbering',
29
+ 'Origin',
30
+ 'Price',
31
+ 'Release dates',
32
+ 'Scale & Dimensions',
33
+ 'Various',
34
+ 'Version',
35
+ 'Images'
36
+ ].freeze
37
+
38
+ ID_SELECTOR = '#content #ariadne > a.current'
39
+ TITLE_SELECTOR = '#content h1 span.headline'
40
+ FIELD_ELEMENTS_SELECTOR = '#content .data > .form > .form-field'
41
+ FIELD_NAME_SELECTOR = '.form-label'
42
+ FIELD_CONTENT_SELECTOR = '.form-input'
43
+
44
+ def self.parse(writer, recorder)
45
+ proc do |item_html_text|
46
+ item_html = ::Nokogiri::HTML(item_html_text)
47
+ new(writer, recorder, item_html).parse
48
+ end
49
+ end
50
+
51
+ def initialize(writer, recorder, item_html)
52
+ @writer = writer
53
+ @recorder = recorder
54
+ @item_html = item_html
55
+ @unexpected_fields = []
56
+ end
57
+
58
+ def parse
59
+ result = {}
60
+ result['Title'] = parsed_title
61
+ result.merge! parsed_fields
62
+ result['Images'] = parsed_images
63
+
64
+ writer.write(result)
65
+ recorder.increment_items_scraped(result)
66
+ end
67
+
68
+ private
69
+
70
+ attr_reader :writer, :recorder, :item_html
71
+
72
+ def parsed_id
73
+ id_element.text
74
+ end
75
+
76
+ def parsed_title
77
+ title_element.text
78
+ end
79
+
80
+ def parsed_fields
81
+ fields = {}
82
+
83
+ field_elements.each do |field_element|
84
+ field_name = dedupe_field_name(field_name_for(field_element))
85
+
86
+ if unexpected_field?(field_name)
87
+ recorder.record_unexpected_field(parsed_id, field_name)
88
+ next
89
+ end
90
+
91
+ field_content_element = field_content_element_for(field_element)
92
+ field_value = case field_name
93
+ when 'Price'
94
+ FieldParsers::Price.parse(field_content_element)
95
+ when 'Release dates'
96
+ FieldParsers::Dates.parse(field_content_element)
97
+ when 'Events'
98
+ FieldParsers::Events.parse(field_content_element)
99
+ when 'Artists', 'Characters', 'Classifications', 'Materials'
100
+ FieldParsers::StandardList.parse(field_content_element)
101
+ else
102
+ FieldParsers::Standard.parse(field_content_element)
103
+ end
104
+
105
+ fields[field_name] = field_value
106
+ end
107
+
108
+ fields
109
+ end
110
+
111
+ def parsed_images
112
+ images = []
113
+ images << FieldParsers::MainImage.parse(item_html)
114
+ images.concat FieldParsers::AdditionalImages.parse(item_html)
115
+ images.compact.uniq
116
+ end
117
+
118
+ def id_element
119
+ item_html.search(ID_SELECTOR)
120
+ end
121
+
122
+ def title_element
123
+ item_html.search(TITLE_SELECTOR)
124
+ end
125
+
126
+ def field_elements
127
+ item_html.search(FIELD_ELEMENTS_SELECTOR)
128
+ end
129
+
130
+ def field_name_for(field_element)
131
+ field_element.search(FIELD_NAME_SELECTOR).text
132
+ end
133
+
134
+ def dedupe_field_name(field_name)
135
+ DUPLICATE_FIELD_NAMES[field_name] || field_name
136
+ end
137
+
138
+ def field_content_element_for(field_element)
139
+ field_element.search(FIELD_CONTENT_SELECTOR)
140
+ end
141
+
142
+ def unexpected_field?(field_name)
143
+ !VALID_FIELD_NAMES.include?(field_name)
144
+ end
145
+ end
146
+ end
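The field handling in `ItemParser#parsed_fields` above comes down to two steps: map singular labels to their plural form, then skip and record anything outside the allow-list. A rough sketch of that flow with plain hashes follows. The constants are shortened copies of the ones in the hunk, and `raw_fields` is a made-up stand-in for the label/value pairs that the real code reads from Nokogiri nodes.

```ruby
# Shortened copies of the constants in item_parser.rb.
DUPLICATE_FIELD_NAMES = { 'Artist' => 'Artists', 'Release date' => 'Release dates' }.freeze
VALID_FIELD_NAMES = ['Title', 'Artists', 'Release dates', 'Price'].freeze

# Hypothetical input; 'Shop' is an invented label that is not in the allow-list.
raw_fields = { 'Artist' => 'Jane Doe', 'Price' => '5,000 JPY', 'Shop' => 'Example Store' }

fields = {}
unexpected = []

raw_fields.each do |label, value|
  name = DUPLICATE_FIELD_NAMES[label] || label # same lookup as dedupe_field_name
  if VALID_FIELD_NAMES.include?(name)
    fields[name] = value
  else
    unexpected << name # the gem reports this through recorder.record_unexpected_field
  end
end

p fields
p unexpected
```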
data/lib/wonder_scrape/scrapers/mfc/mfc.rb ADDED
@@ -0,0 +1,5 @@
1
+ # frozen_string_literal: true
2
+
3
+ require_relative '../scrapers'
4
+
5
+ module WonderScrape::Scrapers::MFC; end
data/lib/wonder_scrape/scrapers/mfc/scraper.rb ADDED
@@ -0,0 +1,72 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'upton'
4
+ require_relative 'mfc'
5
+ require_relative 'item_parser'
6
+
7
+ module WonderScrape::Scrapers::MFC
8
+ class Scraper
9
+ NAME = 'MFC'
10
+ FIELDS = ItemParser::VALID_FIELD_NAMES
11
+
12
+ BASE_URL = 'myfigurecollection.net'
13
+ SEARCH_PATH = '/browse.v4.php'
14
+ SEARCH_RESULT_ITEM_SELECTOR = 'ul.listing div.item-icons span.item-icon > a.tbx-tooltip'
15
+ RESULTS_PER_PAGE = 81
16
+
17
+ DEFAULT_DELAY_BETWEEN_REQUESTS = 2 # seconds
18
+ DEFAULT_MAX_PAGES = 2
19
+ DEFAULT_START_PAGE = 1
20
+ DEFAULT_SEARCH_CATEGORY = 4 # Garage kits
21
+
22
+ def initialize(writer, recorder, options = {})
23
+ @writer = writer
24
+ @recorder = recorder
25
+ @options = options
26
+ end
27
+
28
+ def scrape
29
+ scraper.scrape(&ItemParser.parse(writer, recorder))
30
+ end
31
+
32
+ private
33
+
34
+ attr_reader :writer, :recorder, :options
35
+
36
+ def scraper
37
+ @scraper ||= build_scraper
38
+ end
39
+
40
+ def build_scraper
41
+ new_scraper = Upton::Scraper.new(
42
+ search_url,
43
+ SEARCH_RESULT_ITEM_SELECTOR
44
+ )
45
+
46
+ new_scraper.paginated = true
47
+ new_scraper.pagination_start_index = options[:start_page] || DEFAULT_START_PAGE
48
+ new_scraper.pagination_max_pages = options[:num_pages] || DEFAULT_MAX_PAGES
49
+ new_scraper.verbose = options[:verbose] || false
50
+ new_scraper.sleep_time_between_requests = options[:request_delay] || DEFAULT_DELAY_BETWEEN_REQUESTS
51
+
52
+ new_scraper
53
+ end
54
+
55
+ def search_url
56
+ URI::HTTPS.build(
57
+ host: BASE_URL,
58
+ path: SEARCH_PATH,
59
+ query: build_search_query_params
60
+ ).to_s
61
+ end
62
+
63
+ def build_search_query_params
64
+ URI.encode_www_form({
65
+ 'mode': 'search',
66
+ 'categoryId': DEFAULT_SEARCH_CATEGORY,
67
+ 'sort': 'date',
68
+ 'order': 'desc'
69
+ })
70
+ end
71
+ end
72
+ end
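To see which URL `search_url` and `build_search_query_params` above produce, the two stdlib calls can be run on their own. This sketch copies the constant values from the hunk and makes no network request. How Upton then appends pagination to this URL is not shown in the diff and is not covered here.

```ruby
require 'uri'

# Values copied from the Scraper constants in the diff.
base_url = 'myfigurecollection.net'
search_path = '/browse.v4.php'
default_search_category = 4 # "Garage kits" according to the comment in scraper.rb

query = URI.encode_www_form(
  mode: 'search',
  categoryId: default_search_category,
  sort: 'date',
  order: 'desc'
)

search_url = URI::HTTPS.build(host: base_url, path: search_path, query: query).to_s
puts search_url
```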
data/lib/wonder_scrape/scrapers/scrapers.rb ADDED
@@ -0,0 +1,5 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'wonder_scrape'
4
+
5
+ module WonderScrape::Scrapers; end
@@ -0,0 +1 @@
1
+ #
data/lib/wonder_scrape/version.rb ADDED
@@ -0,0 +1,5 @@
1
+ # frozen_string_literal: true
2
+
3
+ module WonderScrape
4
+ VERSION = '0.1.0'
5
+ end
data/lib/wonder_scrape/writers/csv.rb ADDED
@@ -0,0 +1,32 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'csv'
4
+ require_relative 'writers'
5
+
6
+ class WonderScrape::Writers::CSV
7
+ NAME = 'csv'
8
+
9
+ def initialize(file_name, headers)
10
+ @headers = headers
11
+ @csv = build_csv_writer(file_name)
12
+ end
13
+
14
+ def write(entry)
15
+ csv << entry.values_at(*headers)
16
+ end
17
+
18
+ def output_results
19
+ csv.close
20
+ end
21
+
22
+ private
23
+
24
+ attr_reader :headers
25
+ attr_accessor :csv
26
+
27
+ def build_csv_writer(file_name)
28
+ new_csv = CSV.open(file_name, 'wb')
29
+ new_csv << headers
30
+ new_csv
31
+ end
32
+ end
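The CSV writer above relies on `Hash#values_at(*headers)` to keep every row in header order, whatever order the parser filled the hash in. A small sketch of that behaviour follows. It uses `CSV.generate` to build a string instead of opening a file as the gem does, and the header subset and sample entry are made up.

```ruby
require 'csv'

# Shortened header list; the gem passes ItemParser::VALID_FIELD_NAMES.
headers = ['Title', 'Price', 'Images']

# Keys deliberately out of order, and 'Images' missing.
entry = { 'Price' => '5,000 JPY', 'Title' => 'Sample kit' }

csv_text = CSV.generate do |csv|
  csv << headers
  csv << entry.values_at(*headers) # missing keys come back as nil, i.e. an empty cell
end

puts csv_text
```

Fields that the parsers return as arrays (for example the list-style fields) are not covered by this sketch.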
data/lib/wonder_scrape/writers/hash.rb ADDED
@@ -0,0 +1,22 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'json'
4
+ require_relative 'writers'
5
+
6
+ class WonderScrape::Writers::Hash
7
+ NAME = 'hash'
8
+
9
+ def initialize
10
+ @results = []
11
+ end
12
+
13
+ attr_reader :results
14
+
15
+ def write(entry)
16
+ @results << entry
17
+ end
18
+
19
+ def output_results
20
+ puts JSON.pretty_generate(@results)
21
+ end
22
+ end
data/lib/wonder_scrape/writers/writers.rb ADDED
@@ -0,0 +1,3 @@
1
+ # frozen_string_literal: true
2
+
3
+ module WonderScrape::Writers; end
data/wonder_scrape.gemspec ADDED
@@ -0,0 +1,38 @@
1
+ # frozen_string_literal: true
2
+
3
+ require_relative 'lib/wonder_scrape/version'
4
+
5
+ Gem::Specification.new do |spec|
6
+ spec.name = 'wonder_scrape'
7
+ spec.version = WonderScrape::VERSION
8
+ spec.authors = ['Ben Dawson']
9
+ spec.email = ['bendawson.rb@gmail.com']
10
+
11
+ spec.summary = 'A project to collect useful information from figure collecting websites.'
12
+ spec.homepage = 'https://gitlab.com/maleckai/wonder_scrape'
13
+ spec.license = 'MIT'
14
+ spec.required_ruby_version = Gem::Requirement.new('>= 2.3.0')
15
+
16
+ spec.metadata['homepage_uri'] = spec.homepage
17
+ spec.metadata['source_code_uri'] = spec.homepage
18
+ spec.metadata['changelog_uri'] = "#{spec.homepage}/-/blob/master/CHANGELOG.md"
19
+ spec.required_ruby_version = Gem::Requirement.new('>= 2.3.0')
20
+
21
+ spec.metadata['allowed_push_host'] = 'https://rubygems.org'
22
+
23
+ # Specify which files should be added to the gem when it is released.
24
+ # The `git ls-files -z` loads the files in the RubyGem that have been added into git.
25
+ spec.files = Dir.chdir(File.expand_path(__dir__)) do
26
+ `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
27
+ end
28
+ spec.bindir = 'exe'
29
+ spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
30
+ spec.require_paths = ['lib']
31
+
32
+ spec.add_dependency 'thor'
33
+ spec.add_dependency 'tty-progressbar'
34
+ spec.add_dependency 'tty-prompt'
35
+
36
+ spec.add_runtime_dependency 'nokogiri', ['~> 1.10.9']
37
+ spec.add_runtime_dependency 'upton', ['~> 0.3.6']
38
+ end
metadata ADDED
@@ -0,0 +1,150 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: wonder_scrape
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - Ben Dawson
8
+ autorequire:
9
+ bindir: exe
10
+ cert_chain: []
11
+ date: 2020-05-05 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: thor
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - ">="
18
+ - !ruby/object:Gem::Version
19
+ version: '0'
20
+ type: :runtime
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - ">="
25
+ - !ruby/object:Gem::Version
26
+ version: '0'
27
+ - !ruby/object:Gem::Dependency
28
+ name: tty-progressbar
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - ">="
32
+ - !ruby/object:Gem::Version
33
+ version: '0'
34
+ type: :runtime
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - ">="
39
+ - !ruby/object:Gem::Version
40
+ version: '0'
41
+ - !ruby/object:Gem::Dependency
42
+ name: tty-prompt
43
+ requirement: !ruby/object:Gem::Requirement
44
+ requirements:
45
+ - - ">="
46
+ - !ruby/object:Gem::Version
47
+ version: '0'
48
+ type: :runtime
49
+ prerelease: false
50
+ version_requirements: !ruby/object:Gem::Requirement
51
+ requirements:
52
+ - - ">="
53
+ - !ruby/object:Gem::Version
54
+ version: '0'
55
+ - !ruby/object:Gem::Dependency
56
+ name: nokogiri
57
+ requirement: !ruby/object:Gem::Requirement
58
+ requirements:
59
+ - - "~>"
60
+ - !ruby/object:Gem::Version
61
+ version: 1.10.9
62
+ type: :runtime
63
+ prerelease: false
64
+ version_requirements: !ruby/object:Gem::Requirement
65
+ requirements:
66
+ - - "~>"
67
+ - !ruby/object:Gem::Version
68
+ version: 1.10.9
69
+ - !ruby/object:Gem::Dependency
70
+ name: upton
71
+ requirement: !ruby/object:Gem::Requirement
72
+ requirements:
73
+ - - "~>"
74
+ - !ruby/object:Gem::Version
75
+ version: 0.3.6
76
+ type: :runtime
77
+ prerelease: false
78
+ version_requirements: !ruby/object:Gem::Requirement
79
+ requirements:
80
+ - - "~>"
81
+ - !ruby/object:Gem::Version
82
+ version: 0.3.6
83
+ description:
84
+ email:
85
+ - bendawson.rb@gmail.com
86
+ executables:
87
+ - wonder_scrape
88
+ extensions: []
89
+ extra_rdoc_files: []
90
+ files:
91
+ - ".gitignore"
92
+ - ".rspec"
93
+ - ".ruby-version"
94
+ - ".travis.yml"
95
+ - CHANGELOG.md
96
+ - CODE_OF_CONDUCT.md
97
+ - Gemfile
98
+ - Gemfile.lock
99
+ - LICENSE.txt
100
+ - README.md
101
+ - Rakefile
102
+ - bin/console
103
+ - bin/setup
104
+ - exe/wonder_scrape
105
+ - lib/wonder_scrape.rb
106
+ - lib/wonder_scrape/cli.rb
107
+ - lib/wonder_scrape/command.rb
108
+ - lib/wonder_scrape/commands/.gitkeep
109
+ - lib/wonder_scrape/commands/scrape.rb
110
+ - lib/wonder_scrape/recorder.rb
111
+ - lib/wonder_scrape/scrapers/mfc/field_parsers.rb
112
+ - lib/wonder_scrape/scrapers/mfc/item_parser.rb
113
+ - lib/wonder_scrape/scrapers/mfc/mfc.rb
114
+ - lib/wonder_scrape/scrapers/mfc/scraper.rb
115
+ - lib/wonder_scrape/scrapers/scrapers.rb
116
+ - lib/wonder_scrape/templates/.gitkeep
117
+ - lib/wonder_scrape/templates/scrape/.gitkeep
118
+ - lib/wonder_scrape/version.rb
119
+ - lib/wonder_scrape/writers/csv.rb
120
+ - lib/wonder_scrape/writers/hash.rb
121
+ - lib/wonder_scrape/writers/writers.rb
122
+ - wonder_scrape.gemspec
123
+ homepage: https://gitlab.com/maleckai/wonder_scrape
124
+ licenses:
125
+ - MIT
126
+ metadata:
127
+ homepage_uri: https://gitlab.com/maleckai/wonder_scrape
128
+ source_code_uri: https://gitlab.com/maleckai/wonder_scrape
129
+ changelog_uri: https://gitlab.com/maleckai/wonder_scrape/-/blob/master/CHANGELOG.md
130
+ allowed_push_host: https://rubygems.org
131
+ post_install_message:
132
+ rdoc_options: []
133
+ require_paths:
134
+ - lib
135
+ required_ruby_version: !ruby/object:Gem::Requirement
136
+ requirements:
137
+ - - ">="
138
+ - !ruby/object:Gem::Version
139
+ version: 2.3.0
140
+ required_rubygems_version: !ruby/object:Gem::Requirement
141
+ requirements:
142
+ - - ">="
143
+ - !ruby/object:Gem::Version
144
+ version: '0'
145
+ requirements: []
146
+ rubygems_version: 3.1.2
147
+ signing_key:
148
+ specification_version: 4
149
+ summary: A project to collect useful information from figure collecting websites.
150
+ test_files: []