spidy 0.0.1

@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: a1034253dcc3f68d566c3b67ff9ec5c6aeca4ec1b6a2ed66723bda8041154011
4
+ data.tar.gz: '08ef4a5426111b1824c5547465d0473507f68d9f0ea499bacddc4411395dd25a'
5
+ SHA512:
6
+ metadata.gz: '01745823727ff14e7b8a4fc97a0487fa32000ae7e09c0241a4bceeab5722df060162b275e99fe77991139788e41f7cb46d1f9d113c5cf93d96efc98855910af3'
7
+ data.tar.gz: ae0d7b3b6707b939f83e1b8e453c0ad87c60faec363017665ddb38baf77f763ed44994f3223208e92545eaa362b8ae28bcafb5b0c818b050c35aa221d87e00b7
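The values above are hex digests of the two archives packed inside the `.gem` file. A minimal sketch of how one such digest could be checked with Ruby's standard `digest` library follows; the `verify_sha256` helper and the sample bytes are hypothetical and are not part of RubyGems or of this gem.

```ruby
require 'digest'

# Hypothetical helper: compare the SHA256 hex digest of some bytes
# against an expected value, as one might for metadata.gz or data.tar.gz.
def verify_sha256(bytes, expected_hex)
  Digest::SHA256.hexdigest(bytes) == expected_hex.downcase
end

sample = 'example archive bytes'             # stand-in for File.binread('data.tar.gz')
expected = Digest::SHA256.hexdigest(sample)  # in practice, read from checksums.yaml

puts verify_sha256(sample, expected)         # digest of the untouched bytes
puts verify_sha256(sample + 'x', expected)   # digest of modified bytes
```

In real use the bytes would come from the extracted archive and the expected value from the checksum file shown above.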
@@ -0,0 +1,14 @@
1
+ /.bundle/
2
+ /.yardoc
3
+ /_yardoc/
4
+ /coverage/
5
+ /doc/
6
+ /pkg/
7
+ /spec/reports/
8
+ /tmp/
9
+
10
+ # example crawlers
11
+ examples/
12
+
13
+ # rspec failure tracking
14
+ .rspec_status
data/.rspec ADDED
@@ -0,0 +1,3 @@
1
+ --format documentation
2
+ --color
3
+ --require spec_helper
@@ -0,0 +1,23 @@
1
+ inherit_from: .rubocop_todo.yml
2
+ AllCops:
3
+ DisplayCopNames: true
4
+ TargetRubyVersion: 2.6
5
+
6
+ Style/ClassAndModuleChildren:
7
+ Enabled: false
8
+
9
+ Style/SignalException:
10
+ EnforcedStyle: semantic
11
+
12
+ Naming/UncommunicativeMethodParamName:
13
+ AllowedNames:
14
+ - as
15
+
16
+ Metrics/LineLength:
17
+ Max: 120
18
+
19
+ Metrics/BlockLength:
20
+ Max: 120
21
+
22
+ SignalException:
23
+ EnforcedStyle: semantic
@@ -0,0 +1,13 @@
1
+ # This configuration was generated by
2
+ # `rubocop --auto-gen-config`
3
+ # on 2019-03-29 18:00:03 +0900 using RuboCop version 0.66.0.
4
+ # The point is for the user to remove these configuration records
5
+ # one by one as the offenses are removed from the code base.
6
+ # Note that changes in the inspected code, or installation of new
7
+ # versions of RuboCop, may require this file to be generated again.
8
+
9
+ # Offense count: 7
10
+ # Configuration parameters: AllowHeredoc, AllowURI, URISchemes, IgnoreCopDirectives, IgnoredPatterns.
11
+ # URISchemes: http, https
12
+ Metrics/LineLength:
13
+ Max: 96
@@ -0,0 +1 @@
1
+ 2.6.2
@@ -0,0 +1,7 @@
1
+ ---
2
+ sudo: false
3
+ language: ruby
4
+ cache: bundler
5
+ rvm:
6
+ - 2.6.2
7
+ before_install: gem install bundler -v 2.0.1
@@ -0,0 +1,9 @@
1
+ # Change Log
2
+ All notable changes to this project will be documented in this file.
3
+
4
+ The format is based on [Keep a Changelog](http://keepachangelog.com/)
5
+ and this project adheres to [Semantic Versioning](http://semver.org/).
6
+
7
+ ## [Unreleased]
8
+
9
+ ## [0.0.1]
@@ -0,0 +1,74 @@
1
+ # Contributor Covenant Code of Conduct
2
+
3
+ ## Our Pledge
4
+
5
+ In the interest of fostering an open and welcoming environment, we as
6
+ contributors and maintainers pledge to making participation in our project and
7
+ our community a harassment-free experience for everyone, regardless of age, body
8
+ size, disability, ethnicity, gender identity and expression, level of experience,
9
+ nationality, personal appearance, race, religion, or sexual identity and
10
+ orientation.
11
+
12
+ ## Our Standards
13
+
14
+ Examples of behavior that contributes to creating a positive environment
15
+ include:
16
+
17
+ * Using welcoming and inclusive language
18
+ * Being respectful of differing viewpoints and experiences
19
+ * Gracefully accepting constructive criticism
20
+ * Focusing on what is best for the community
21
+ * Showing empathy towards other community members
22
+
23
+ Examples of unacceptable behavior by participants include:
24
+
25
+ * The use of sexualized language or imagery and unwelcome sexual attention or
26
+ advances
27
+ * Trolling, insulting/derogatory comments, and personal or political attacks
28
+ * Public or private harassment
29
+ * Publishing others' private information, such as a physical or electronic
30
+ address, without explicit permission
31
+ * Other conduct which could reasonably be considered inappropriate in a
32
+ professional setting
33
+
34
+ ## Our Responsibilities
35
+
36
+ Project maintainers are responsible for clarifying the standards of acceptable
37
+ behavior and are expected to take appropriate and fair corrective action in
38
+ response to any instances of unacceptable behavior.
39
+
40
+ Project maintainers have the right and responsibility to remove, edit, or
41
+ reject comments, commits, code, wiki edits, issues, and other contributions
42
+ that are not aligned to this Code of Conduct, or to ban temporarily or
43
+ permanently any contributor for other behaviors that they deem inappropriate,
44
+ threatening, offensive, or harmful.
45
+
46
+ ## Scope
47
+
48
+ This Code of Conduct applies both within project spaces and in public spaces
49
+ when an individual is representing the project or its community. Examples of
50
+ representing a project or community include using an official project e-mail
51
+ address, posting via an official social media account, or acting as an appointed
52
+ representative at an online or offline event. Representation of a project may be
53
+ further defined and clarified by project maintainers.
54
+
55
+ ## Enforcement
56
+
57
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be
58
+ reported by contacting the project team at aileron.cc@gmail.com. All
59
+ complaints will be reviewed and investigated and will result in a response that
60
+ is deemed necessary and appropriate to the circumstances. The project team is
61
+ obligated to maintain confidentiality with regard to the reporter of an incident.
62
+ Further details of specific enforcement policies may be posted separately.
63
+
64
+ Project maintainers who do not follow or enforce the Code of Conduct in good
65
+ faith may face temporary or permanent repercussions as determined by other
66
+ members of the project's leadership.
67
+
68
+ ## Attribution
69
+
70
+ This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
71
+ available at [http://contributor-covenant.org/version/1/4][version]
72
+
73
+ [homepage]: http://contributor-covenant.org
74
+ [version]: http://contributor-covenant.org/version/1/4/
data/Gemfile ADDED
@@ -0,0 +1,6 @@
1
+ # frozen_string_literal: true
2
+
3
+ source 'https://rubygems.org'
4
+
5
+ # Specify your gem's dependencies in spidy.gemspec
6
+ gemspec
@@ -0,0 +1,87 @@
1
+ PATH
2
+ remote: .
3
+ specs:
4
+ spidy (0.0.1)
5
+ activemodel (~> 5.2)
6
+ activesupport (~> 5.2)
7
+ mechanize
8
+ pry
9
+
10
+ GEM
11
+ remote: https://rubygems.org/
12
+ specs:
13
+ activemodel (5.2.3)
14
+ activesupport (= 5.2.3)
15
+ activesupport (5.2.3)
16
+ concurrent-ruby (~> 1.0, >= 1.0.2)
17
+ i18n (>= 0.7, < 2)
18
+ minitest (~> 5.1)
19
+ tzinfo (~> 1.1)
20
+ coderay (1.1.2)
21
+ concurrent-ruby (1.1.5)
22
+ connection_pool (2.2.2)
23
+ diff-lcs (1.3)
24
+ domain_name (0.5.20190701)
25
+ unf (>= 0.0.5, < 1.0.0)
26
+ http-cookie (1.0.3)
27
+ domain_name (~> 0.5)
28
+ i18n (1.6.0)
29
+ concurrent-ruby (~> 1.0)
30
+ mechanize (2.7.6)
31
+ domain_name (~> 0.5, >= 0.5.1)
32
+ http-cookie (~> 1.0)
33
+ mime-types (>= 1.17.2)
34
+ net-http-digest_auth (~> 1.1, >= 1.1.1)
35
+ net-http-persistent (>= 2.5.2)
36
+ nokogiri (~> 1.6)
37
+ ntlm-http (~> 0.1, >= 0.1.1)
38
+ webrobots (>= 0.0.9, < 0.2)
39
+ method_source (0.9.2)
40
+ mime-types (3.2.2)
41
+ mime-types-data (~> 3.2015)
42
+ mime-types-data (3.2019.0331)
43
+ mini_portile2 (2.4.0)
44
+ minitest (5.11.3)
45
+ net-http-digest_auth (1.4.1)
46
+ net-http-persistent (3.1.0)
47
+ connection_pool (~> 2.2)
48
+ nokogiri (1.10.4)
49
+ mini_portile2 (~> 2.4.0)
50
+ ntlm-http (0.1.1)
51
+ pry (0.12.2)
52
+ coderay (~> 1.1.0)
53
+ method_source (~> 0.9.0)
54
+ rake (10.5.0)
55
+ rspec (3.8.0)
56
+ rspec-core (~> 3.8.0)
57
+ rspec-expectations (~> 3.8.0)
58
+ rspec-mocks (~> 3.8.0)
59
+ rspec-core (3.8.0)
60
+ rspec-support (~> 3.8.0)
61
+ rspec-expectations (3.8.2)
62
+ diff-lcs (>= 1.2.0, < 2.0)
63
+ rspec-support (~> 3.8.0)
64
+ rspec-mocks (3.8.0)
65
+ diff-lcs (>= 1.2.0, < 2.0)
66
+ rspec-support (~> 3.8.0)
67
+ rspec-support (3.8.0)
68
+ thread_safe (0.3.6)
69
+ tzinfo (1.2.5)
70
+ thread_safe (~> 0.1)
71
+ unf (0.1.4)
72
+ unf_ext
73
+ unf_ext (0.0.7.6)
74
+ webrobots (0.1.2)
75
+
76
+ PLATFORMS
77
+ ruby
78
+
79
+ DEPENDENCIES
80
+ bundler (~> 2.0)
81
+ pry
82
+ rake (~> 10.0)
83
+ rspec (~> 3.0)
84
+ spidy!
85
+
86
+ BUNDLED WITH
87
+ 2.0.2
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2019 aileron
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.
@@ -0,0 +1,43 @@
1
+ # Spidy
2
+
3
+ Welcome to your new gem! In this directory, you'll find the files you need to be able to package up your Ruby library into a gem. Put your Ruby code in the file `lib/spidy`. To experiment with that code, run `bin/console` for an interactive prompt.
4
+
5
+ TODO: Delete this and the text above, and describe your gem
6
+
7
+ ## Installation
8
+
9
+ Add this line to your application's Gemfile:
10
+
11
+ ```ruby
12
+ gem 'spidy'
13
+ ```
14
+
15
+ And then execute:
16
+
17
+ $ bundle
18
+
19
+ Or install it yourself as:
20
+
21
+ $ gem install spidy
22
+
23
+ ## Usage
24
+
25
+ TODO: Write usage instructions here
26
+
27
+ ## Development
28
+
29
+ After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
30
+
31
+ To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
32
+
33
+ ## Contributing
34
+
35
+ Bug reports and pull requests are welcome on GitHub at https://github.com/aileron-inc/spidy. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
36
+
37
+ ## License
38
+
39
+ The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
40
+
41
+ ## Code of Conduct
42
+
43
+ Everyone interacting in the Spidy project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/aileron-inc/spidy/blob/master/CODE_OF_CONDUCT.md).
@@ -0,0 +1,8 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'bundler/gem_tasks'
4
+ require 'rspec/core/rake_task'
5
+
6
+ RSpec::Core::RakeTask.new(:spec)
7
+
8
+ task default: :spec
@@ -0,0 +1,22 @@
1
+ #!/usr/bin/env ruby
2
+ # frozen_string_literal: true
3
+
4
+ require 'bundler/setup'
5
+ require 'spidy'
6
+
7
+ # You can add fixtures and/or initialization code here to make experimenting
8
+ # with your gem easier. You can also use a different console, if you like.
9
+
10
+ # (If you use this, don't forget to add pry to your Gemfile!)
11
+ require 'pry'
12
+ def reload!
13
+ ActiveSupport::Dependencies.clear
14
+ ActiveSupport::DescendantsTracker.clear
15
+ ActiveSupport::Reloader.reload!
16
+ end
17
+
18
+ if ARGV[0]
19
+ Spidy.open(ARGV[0]).console
20
+ else
21
+ Pry.start
22
+ end
@@ -0,0 +1,8 @@
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+ IFS=$'\n\t'
4
+ set -vx
5
+
6
+ bundle install
7
+
8
+ # Do any other automated setup that you need to do here
@@ -0,0 +1,17 @@
1
+ #!/usr/bin/env ruby
2
+ # frozen_string_literal: true
3
+
4
+ require 'spidy'
5
+
6
+ case ARGV[0]&.to_sym
7
+ when :spider then Spidy.open(ARGV[1]).shell.spider(ARGV[2])
8
+ when :scraper then Spidy.open(ARGV[1]).shell.scraper(ARGV[2])
9
+ when :shell then Spidy.open(ARGV[1]).shell.function
10
+ when :new then Spidy::Shell.new(nil).build(ARGV[1])
11
+ when :console
12
+ if ARGV[1].blank?
13
+ Spidy.console
14
+ else
15
+ Spidy.open(ARGV[1]).console
16
+ end
17
+ end
@@ -0,0 +1,47 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'spidy/version'
4
+ require 'active_support/all'
5
+ require 'active_model'
6
+ require 'mechanize'
7
+ require 'csv'
8
+ require 'open-uri'
9
+
10
+ #
11
+ # web spider dsl engine
12
+ #
13
+ module Spidy
14
+ extend ActiveSupport::Autoload
15
+ autoload :Shell
16
+ autoload :Console
17
+ autoload :Definition
18
+ autoload :DefinitionFile
19
+ autoload :Binder
20
+ autoload :Spider
21
+ autoload :Looper
22
+ autoload :Connector
23
+ autoload :Result
24
+
25
+ const_set(:Crawler, Module.new) unless const_defined?(:Crawler)
26
+
27
+ def self.console
28
+ require 'pry'
29
+ Pry.start(Spidy::Console.new)
30
+ end
31
+
32
+ def self.open(filepath)
33
+ ::Spidy::DefinitionFile.open(filepath)
34
+ end
35
+
36
+ def self.define(name = nil, domain: nil, &block)
37
+ crawler_definition = Class.new(::Spidy::Definition, &block)
38
+ crawler_definition.domain = domain
39
+
40
+ if name
41
+ crawler_class_name = name.to_s.camelize
42
+ Crawler.class_eval { remove_const(crawler_class_name) } if Crawler.const_defined?(crawler_class_name)
43
+ Crawler.const_set(crawler_class_name, crawler_definition)
44
+ end
45
+ crawler_definition
46
+ end
47
+ end
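The `define` method above builds an anonymous subclass from the block and then registers it under a constant, replacing any earlier constant of the same name. A minimal plain-Ruby sketch of that technique follows; `MiniDefinition`, `MiniRegistry`, and `mini_define` are hypothetical stand-ins, and the hand-written camelize replaces the ActiveSupport helper only for simple snake_case names.

```ruby
# Hypothetical stand-ins for Spidy::Definition and the Crawler namespace.
class MiniDefinition
  class << self
    attr_accessor :domain
  end
end

module MiniRegistry; end

# Build an anonymous subclass from the block, then register it under a
# constant, replacing any earlier definition with the same name.
def mini_define(name = nil, domain: nil, &block)
  definition = Class.new(MiniDefinition, &block)
  definition.domain = domain
  if name
    const_name = name.to_s.split('_').map(&:capitalize).join # plain-Ruby camelize
    MiniRegistry.send(:remove_const, const_name) if MiniRegistry.const_defined?(const_name, false)
    MiniRegistry.const_set(const_name, definition)
  end
  definition
end

first = mini_define(:news_site, domain: 'example.com') do
  def self.greeting
    'hello'
  end
end
second = mini_define(:news_site, domain: 'example.org')

puts MiniRegistry::NewsSite.equal?(second) # the constant now points at the later class
puts first.greeting                        # the earlier class object still works
```

Removing the old constant before setting the new one avoids Ruby's "already initialized constant" warning when a definition file is evaluated again.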
@@ -0,0 +1,77 @@
1
+ # frozen_string_literal: true
2
+
3
+ #
4
+ # Bind resource received from the connection to the result object
5
+ #
6
+ class Spidy::Binder
7
+ #
8
+ # binding multiple
9
+ #
10
+ class Multiple
11
+ def self.bind(connector:, binder:, query:, block:)
12
+ multiple_binding_class = self
13
+ connector.field.call(binder, query) do |elements|
14
+ multiple_binding_class.new(binder.class).instance_exec(elements, &block)
15
+ end
16
+ end
17
+
18
+ def initialize(binder)
19
+ @binder = binder
20
+ end
21
+
22
+ def field(name)
23
+ # Reassign instead of mutating: the default array is shared by all Binder subclasses
24
+ @binder.field_names |= [name]
25
+ @binder.result_class.define(name)
26
+ result = yield
27
+ @binder.define_method(name) { result }
28
+ end
29
+ end
30
+
31
+ class_attribute :field_names, default: []
32
+ attr_reader :resource
33
+
34
+ def initialize(resource)
35
+ @resource = resource
36
+ self.class.fields_call(self)
37
+ end
38
+
39
+ def result
40
+ definition = self
41
+ fetched_at = Time.current
42
+ result = self.class.result_class.new(fetched_at: fetched_at, fetched_on: fetched_at.beginning_of_day, **attributes)
43
+ result.define_singleton_method(:resource) { definition.resource }
44
+ result
45
+ end
46
+
47
+ def attributes_to_array
48
+ field_names.map { |field_name| send(field_name) }
49
+ end
50
+
51
+ def attributes
52
+ field_names.map { |field_name| [field_name, send(field_name)] }.to_h
53
+ end
54
+
55
+ def self.query(name, query = nil, &block)
56
+ define_method(name) do
57
+ connector.field.call(self, query, &block)
58
+ end
59
+ end
60
+
61
+ def self.field(name, query = nil, optional: false, &block)
62
+ # Reassign instead of mutating: the default array is shared by all Binder subclasses
63
+ self.field_names |= [name]
64
+ result_class.define(name, presence: !optional)
65
+ define_method(name) do
66
+ connector.field.call(self, query, &block)
67
+ end
68
+ end
69
+
70
+ def self.fields(query, &block)
71
+ @fields = { query: query, block: block }
72
+ end
73
+
74
+ def self.fields_call(binder)
75
+ Multiple.bind(connector: connector, binder: binder, query: @fields[:query], block: @fields[:block]) if @fields
76
+ end
77
+ end
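The field DSL above records each declared name on the class and defines a reader method that evaluates the query against the resource. A minimal sketch of that idea without ActiveSupport or a connector follows; `MiniBinder` is hypothetical, a Hash stands in for the parsed page, and the per-class name list is reassigned on each declaration so that sibling subclasses do not see each other's fields.

```ruby
# Hypothetical miniature of the field DSL; a Hash stands in for the parsed page.
class MiniBinder
  class << self
    # Look up the list on this class, falling back to the parent's list.
    def field_names
      @field_names || (superclass.respond_to?(:field_names) ? superclass.field_names : [])
    end

    # Declare a field: record its name and define a reader that runs the block.
    def field(name, &block)
      @field_names = field_names | [name] # reassign, never mutate the inherited array
      define_method(name) { instance_exec(resource, &block) }
    end
  end

  attr_reader :resource

  def initialize(resource)
    @resource = resource
  end

  # Collect every declared field into a Hash of name => value.
  def attributes
    self.class.field_names.map { |name| [name, send(name)] }.to_h
  end
end

article = Class.new(MiniBinder) do
  field(:title) { |page| page[:title].strip }
end

author = Class.new(MiniBinder) do
  field(:name) { |page| page[:name] }
end

puts article.new({ title: ' Hello ' }).attributes.inspect
puts author.field_names.inspect
```

Each subclass ends up with only its own field names, which is what `attributes` relies on when it calls `send` for every entry.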
@@ -0,0 +1,10 @@
1
+ # frozen_string_literal: true
2
+
3
+ #
4
+ # This module groups the connectors that make the network connection and download hypertext
5
+ #
6
+ module Spidy::Connector
7
+ extend ActiveSupport::Autoload
8
+ autoload :Html
9
+ autoload :Xml
10
+ end
@@ -0,0 +1,42 @@
1
+ # frozen_string_literal: true
2
+
3
+ #
4
+ # Mechanize wrapper
5
+ #
6
+ class Spidy::Connector::Html
7
+ class_attribute :field, default: (lambda { |object, query, &block|
8
+ node = object.resource.search(query)
9
+ fail "Could not be located #{query}" if node.empty?
10
+ return node.first.text if block.nil?
11
+
12
+ object.instance_exec(node, &block)
13
+ })
14
+
15
+ USER_AGENT = [
16
+ 'Mozilla/5.0',
17
+ '(Macintosh; Intel Mac OS X 10_12_6)',
18
+ 'AppleWebKit/537.36',
19
+ '(KHTML, like Gecko)',
20
+ 'Chrome/64.0.3282.186',
21
+ 'Safari/537.36'
22
+ ].join(' ')
23
+
24
+ attr_reader :start_url
25
+ attr_reader :agent
26
+
27
+ def initialize(start_url: nil, encoding: nil)
28
+ @start_url = start_url
29
+ @agent = Mechanize.new
30
+ if encoding
31
+ @agent.default_encoding = encoding
32
+ @agent.force_default_encoding = true
33
+ end
34
+ @agent.user_agent = USER_AGENT
35
+ end
36
+
37
+ def call(url = @start_url, &block)
38
+ fail 'URL is undefined' if url.blank?
39
+
40
+ agent.get(url, &block)
41
+ end
42
+ end
@@ -0,0 +1,31 @@
1
+ # frozen_string_literal: true
2
+
3
+ #
4
+ # XML connector: fetches with open-uri and parses with Nokogiri
5
+ #
6
+ class Spidy::Connector::Xml
7
+ class_attribute :field, default: (lambda { |object, query, optional: false, &block|
8
+ return object.instance_exec(object.resource, &block) if query.nil?
9
+
10
+ node = object.resource.search(query)
11
+ return if optional && node.empty?
12
+
13
+ fail "Could not be located #{query}" if node.empty?
14
+ return node.first.text if block.nil?
15
+
16
+ object.instance_exec(node, &block)
17
+ })
18
+
19
+ def initialize(start_url: nil, encoding: nil)
20
+ @start_url = start_url
21
+ @encoding = encoding
22
+ end
23
+
24
+ def call(url = @start_url)
25
+ fail 'URL is undefined' if url.blank?
26
+
27
+ xml =
28
+ Nokogiri::XML(OpenURI.open_uri(url).read.gsub(/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/, ''))
29
+ yield xml
30
+ end
31
+ end
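The `call` method above strips control characters from the response body before parsing it. A small sketch of what that character class removes follows, using a made-up input string. Note that the range `\x00-\x09` includes the tab character (0x09), so tabs are removed as well, while line feed (0x0A) and carriage return (0x0D) are kept.

```ruby
# Same character class as in the call method above: C0 controls except
# line feed (0x0A) and carriage return (0x0D), plus DEL (0x7F).
CONTROL_CHARS = /[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/

# Made-up input containing a NUL, a unit separator, a CRLF pair, and a tab.
raw = "<item>\x00bad\x1Fbytes</item>\r\n<item>\ttabbed</item>"
clean = raw.gsub(CONTROL_CHARS, '')

puts clean.inspect
```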
@@ -0,0 +1,21 @@
1
+ # frozen_string_literal: true
2
+
3
+ #
4
+ # spidy console
5
+ #
6
+ class Spidy::Console
7
+ attr_reader :definition_file
8
+ delegate :spiders, :scrapers, to: :definition_file
9
+
10
+ def initialize(definition_file = nil)
11
+ @definition_file = definition_file
12
+ end
13
+
14
+ def open(filepath)
15
+ @definition_file = Spidy::DefinitionFile.open(filepath)
16
+ end
17
+
18
+ def reload!
19
+ @definition_file&.eval_definition
20
+ end
21
+ end
@@ -0,0 +1,103 @@
1
+ # frozen_string_literal: true
2
+
3
+ #
4
+ # Class representing a website defined by DSL
5
+ #
6
+ class Spidy::Definition
7
+ class_attribute :domain
8
+ class_attribute :spiders, default: {}
9
+ class_attribute :scrapers, default: {}
10
+ class_attribute :output, default: ->(result) { STDOUT.puts(result.attributes.to_json) }
11
+
12
+ def output(&block)
13
+ self.output = block
14
+ end
15
+
16
+ # rubocop:disable Metrics/MethodLength, Metrics/AbcSize
17
+ class << self
18
+ def spider(name, start_url = nil, encoding: nil, as: :html, &block)
19
+ connector_class = Spidy::Connector.const_get(as.to_s.classify)
20
+ connector = connector_class.new(start_url: start_url, encoding: encoding)
21
+ spider = Spidy::Spider.new(&block)
22
+ spider_class = Class.new do
23
+ define_singleton_method(:connector) { connector }
24
+ define_singleton_method(:call) do |url = start_url, &spider_block|
25
+ connector.call(url) do |resource|
26
+ spider.call(resource, &spider_block)
27
+ end
28
+ end
29
+ end
30
+ const_set("#{name}_spider".classify, spider_class)
31
+ spiders[name] = spider_class
32
+ end
33
+
34
+ def scraper(name, options = {}, &block)
35
+ if options[:loop]
36
+ loop_scraper(name, options, &block)
37
+ else
38
+ normal_scraper(name, **options, &block)
39
+ end
40
+ end
41
+
42
+ private
43
+
44
+ def loop_scraper(name, options, &block)
45
+ options = { as: :html, start_url: nil, encoding: nil, loop: nil }.merge(options)
46
+ result_class = Class.new(Spidy::Result)
47
+
48
+ # connector
49
+ connector_class = Spidy::Connector.const_get(options[:as].to_s.classify)
50
+ connector = connector_class.new(encoding: options[:encoding])
51
+
52
+ namespace = Class.new do
53
+ binder = Class.new(Spidy::Binder) do
54
+ define_singleton_method(:connector) { connector }
55
+ define_singleton_method(:result_class) { result_class }
56
+ define_method(:connector) { connector }
57
+ instance_exec(&block)
58
+ end
59
+ define_singleton_method(:call) do |url = options[:start_url], &yielder|
60
+ connector.call(url) do |resource|
61
+ looper = Spidy::Looper.new(resource, binder, options[:loop])
62
+ looper.call(&yielder)
63
+ end
64
+ end
65
+ end
66
+ const_set("#{name}_scraper".classify, namespace)
67
+ scrapers[name] = namespace
68
+ end
69
+
70
+ def normal_scraper(name, encoding: nil, as: :html, &block)
71
+ # result
72
+ result_class = Class.new(Spidy::Result)
73
+
74
+ # connector
75
+ connector_class = Spidy::Connector.const_get(as.to_s.classify)
76
+ connector = connector_class.new(encoding: encoding)
77
+
78
+ # namespace
79
+ namespace = Class.new do
80
+ binder = Class.new(Spidy::Binder) do
81
+ define_singleton_method(:connector) { connector }
82
+ define_singleton_method(:result_class) { result_class }
83
+ define_method(:connector) { connector }
84
+ instance_exec(&block)
85
+ end
86
+ define_singleton_method(:bind) do |url|
87
+ connector.call(url) do |resource|
88
+ binder.new(resource)
89
+ end
90
+ end
91
+ define_singleton_method(:call) do |url, &output|
92
+ result = bind(url).result
93
+ fail "#{url}\n#{result.errors.full_messages}" if result.invalid?
94
+
95
+ output.call(result)
96
+ end
97
+ end
98
+ const_set("#{name}_scraper".classify, namespace)
99
+ scrapers[name] = namespace
100
+ end
101
+ end
102
+ # rubocop:enable Metrics/MethodLength, Metrics/AbcSize
103
+ end
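Both `spider` and the scraper builders above pick a connector class from the `as:` option by turning the symbol into a constant name and looking it up in a namespace. A minimal sketch of that lookup follows; `MiniConnector` and `connector_for` are hypothetical, and `capitalize` stands in for ActiveSupport's `classify` only for simple one-word names.

```ruby
# Hypothetical connector namespace; the real classes wrap Mechanize and Nokogiri.
module MiniConnector
  class Html; end
  class Xml; end
end

# Resolve a format symbol such as :html to a class inside the namespace.
def connector_for(format)
  MiniConnector.const_get(format.to_s.capitalize)
end

puts connector_for(:html)
puts connector_for(:xml)

begin
  connector_for(:json) # no such constant in the namespace
rescue NameError => e
  puts "unknown format: #{e.message}"
end
```

An unknown format surfaces as a `NameError` from `const_get`, which is also how the lookup in the listing above would fail for an unsupported `as:` value.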
@@ -0,0 +1,43 @@
1
+ # frozen_string_literal: true
2
+
3
+ #
4
+ # spidy interface binding
5
+ #
6
+ class Spidy::DefinitionFile
7
+ attr_reader :path
8
+ attr_reader :definition
9
+ delegate :spiders, :scrapers, :output, to: :definition
10
+
11
+ CSV = lambda do |result|
12
+ ::CSV.generate do |csv|
13
+ csv << result.definition.attributes_to_array
14
+ end
15
+ end
16
+
17
+ def self.open(filepath)
18
+ object = new(filepath)
19
+ object.eval_definition
20
+ object
21
+ end
22
+
23
+ # rubocop:disable Security/Eval
24
+ def eval_definition
25
+ @definition = eval(File.read(path))
26
+ end
27
+ # rubocop:enable Security/Eval
28
+
29
+ def shell
30
+ @shell ||= Spidy::Shell.new(self)
31
+ end
32
+
33
+ def console
34
+ require 'pry'
35
+ Pry.start(Spidy::Console.new(self))
36
+ end
37
+
38
+ private
39
+
40
+ def initialize(path)
41
+ @path = path
42
+ end
43
+ end
@@ -0,0 +1,22 @@
1
+ # frozen_string_literal: true
2
+
3
+ #
4
+ # Runs the loop block over a resource and yields one validated result per element
5
+ #
6
+ class Spidy::Looper
7
+ def initialize(resource, binder, loop_block)
8
+ @resource = resource
9
+ @binder = binder
10
+ @loop_block = loop_block
11
+ end
12
+
13
+ def call
14
+ yielder = lambda do |element|
15
+ result = @binder.new(element).result
16
+ fail "#{element}\n\n#{result.errors.full_messages}" if result.invalid?
17
+
18
+ yield result
19
+ end
20
+ @loop_block.call(@resource, yielder)
21
+ end
22
+ end
@@ -0,0 +1,23 @@
1
+ # frozen_string_literal: true
2
+
3
+ #
4
+ # Scrape results
5
+ #
6
+ class Spidy::Result
7
+ include ActiveModel::Model
8
+ include ActiveModel::Attributes
9
+
10
+ def self.define(name, presence: true)
11
+ case name
12
+ when /.*\?/
13
+ attribute name, :boolean
14
+ validates name, inclusion: { in: [true, false] } if presence
15
+ else
16
+ attribute name
17
+ validates name, presence: true, allow_blank: true if presence
18
+ end
19
+ end
20
+
21
+ attribute :fetched_at
22
+ attribute :fetched_on
23
+ end
@@ -0,0 +1,79 @@
1
+ # frozen_string_literal: true
2
+
3
+ #
4
+ # spidy shell interface
5
+ #
6
+ class Spidy::Shell
7
+ attr_reader :definition_file
8
+ delegate :spiders, :scrapers, to: :definition_file
9
+
10
+ def initialize(definition_file)
11
+ @definition_file = definition_file
12
+ end
13
+
14
+ # rubocop:disable Lint/AssignmentInCondition, Style/RescueStandardError
15
+ def scraper(name)
16
+ command = scrapers[name.to_sym]
17
+ fail "undefined command[#{name}]" if command.nil?
18
+
19
+ while line = STDIN.gets
20
+ url = line.strip
21
+ begin
22
+ command.call(url, &definition_file.output)
23
+ rescue => e
24
+ STDERR.puts "ERROR #{url}: #{e}\n#{e.backtrace}"
25
+ end
26
+ end
27
+ end
28
+ # rubocop:enable Lint/AssignmentInCondition, Style/RescueStandardError
29
+
30
+ def spider(name)
31
+ command = spiders[name.to_sym]
32
+ if File.pipe?(STDIN)
33
+ STDIN.each_line do |line|
34
+ start_url = line.strip
35
+ command.call(start_url) { |url| puts url }
36
+ end
37
+ else
38
+ command.call { |url| puts url }
39
+ end
40
+ end
41
+
42
+ def function
43
+ print <<~SHELL
44
+ function spider() {
45
+ spidy spider #{definition_file.path} $1
46
+ }
47
+ function scraper() {
48
+ spidy scraper #{definition_file.path} $1
49
+ }
50
+ SHELL
51
+ end
52
+
53
+ # rubocop:disable Metrics/MethodLength
54
+ def build(name)
55
+ File.open("#{name}.rb", 'w') do |f|
56
+ f.write <<~RUBY
57
+ # frozen_string_literal: true
58
+
59
+ Spidy.define(:#{name}) do
60
+ spider(:example, 'http://example.com') do |html, yielder|
61
+ # yielder.call(url or resource)
62
+ end
63
+
64
+ scraper(:example) do
65
+ end
66
+ end
67
+ RUBY
68
+ end
69
+
70
+ File.open("#{name}.sh", 'w') do |f|
71
+ f.write <<~SHELL
72
+ #!/bin/bash
73
+ eval "$(spidy shell $(dirname "${0}")/#{name}.rb)"
74
+ spider example
75
+ SHELL
76
+ end
77
+ end
78
+ # rubocop:enable Metrics/MethodLength
79
+ end
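The `function` and `build` methods above generate shell text from squiggly heredocs, where Ruby expands `#{...}` but leaves shell syntax such as `${0}` and `$(...)` untouched. A small sketch of that behavior follows; `launcher_script` is a hypothetical helper and its output is illustrative, not the exact file that `build` writes.

```ruby
# Hypothetical helper mirroring the shape of the launcher written by build.
def launcher_script(name)
  <<~SHELL
    #!/bin/bash
    # ${0} is left for bash; only the Ruby interpolation is expanded here.
    script_dir="$(dirname "${0}")"
    echo "definition: ${script_dir}/#{name}.rb"
  SHELL
end

puts launcher_script('example')
```

The `<<~` form also strips the common leading indentation, so the shebang line lands at the start of the generated file.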
@@ -0,0 +1,17 @@
1
+ # frozen_string_literal: true
2
+
3
+ #
4
+ # Spider
5
+ #
6
+ class Spidy::Spider
7
+ def initialize(&block)
8
+ define_singleton_method(:bind, &block)
9
+ end
10
+
11
+ def call(resource)
12
+ yielder = lambda do |url|
13
+ yield url if block_given?
14
+ end
15
+ bind(resource, yielder)
16
+ end
17
+ end
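The spider above turns its block into a `bind` method and hands it a `yielder` lambda that forwards each discovered URL to the caller's block. A minimal sketch of the same pattern with an array of strings as the resource follows; `MiniSpider` and the sample links are hypothetical.

```ruby
# Hypothetical miniature of the spider: the block becomes a method, and the
# yielder lambda forwards values to whatever block the caller passed to call.
class MiniSpider
  def initialize(&block)
    define_singleton_method(:bind, &block)
  end

  def call(resource)
    yielder = lambda do |url|
      yield url if block_given?
    end
    bind(resource, yielder)
  end
end

link_spider = MiniSpider.new do |links, yielder|
  links.each { |link| yielder.call(link) if link.start_with?('http') }
end

found = []
link_spider.call(['http://example.com/a', 'mailto:x@example.com', 'http://example.com/b']) do |url|
  found << url
end

puts found
```

Because the lambda checks `block_given?`, calling the spider without a block simply discards the yielded values.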
@@ -0,0 +1,5 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Spidy
4
+ VERSION = '0.0.1'
5
+ end
@@ -0,0 +1,36 @@
1
+ # frozen_string_literal: true
2
+
3
+ lib = File.expand_path('lib', __dir__)
4
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
5
+ require 'spidy/version'
6
+
7
+ Gem::Specification.new do |spec|
8
+ spec.name = 'spidy'
9
+ spec.version = Spidy::VERSION
10
+ spec.authors = ['aileron']
11
+ spec.email = ['aileron.cc@gmail.com']
12
+
13
+ spec.summary = 'web spider dsl'
14
+ # spec.description = 'TODO: Write a longer description or delete this line.'
15
+ spec.homepage = 'https://github.com/aileron-inc/spidy'
16
+ spec.license = 'MIT'
17
+
18
+ # Specify which files should be added to the gem when it is released.
19
+ # The `git ls-files -z` loads the files in the RubyGem that have been added into git.
20
+ spec.files = Dir.chdir(File.expand_path(__dir__)) do
21
+ `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
22
+ end
23
+ spec.bindir = 'exe'
24
+ spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
25
+ spec.require_paths = ['lib']
26
+
27
+ spec.add_development_dependency 'bundler', '~> 2.0'
28
+ spec.add_development_dependency 'pry'
29
+ spec.add_development_dependency 'rake', '~> 10.0'
30
+ spec.add_development_dependency 'rspec', '~> 3.0'
31
+
32
+ spec.add_runtime_dependency 'activemodel', '~> 5.2'
33
+ spec.add_runtime_dependency 'activesupport', '~> 5.2'
34
+ spec.add_runtime_dependency 'mechanize'
35
+ spec.add_runtime_dependency 'pry'
36
+ end
metadata ADDED
@@ -0,0 +1,186 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: spidy
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.0.1
5
+ platform: ruby
6
+ authors:
7
+ - aileron
8
+ autorequire:
9
+ bindir: exe
10
+ cert_chain: []
11
+ date: 2019-08-21 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: bundler
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - "~>"
18
+ - !ruby/object:Gem::Version
19
+ version: '2.0'
20
+ type: :development
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - "~>"
25
+ - !ruby/object:Gem::Version
26
+ version: '2.0'
27
+ - !ruby/object:Gem::Dependency
28
+ name: pry
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - ">="
32
+ - !ruby/object:Gem::Version
33
+ version: '0'
34
+ type: :development
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - ">="
39
+ - !ruby/object:Gem::Version
40
+ version: '0'
41
+ - !ruby/object:Gem::Dependency
42
+ name: rake
43
+ requirement: !ruby/object:Gem::Requirement
44
+ requirements:
45
+ - - "~>"
46
+ - !ruby/object:Gem::Version
47
+ version: '10.0'
48
+ type: :development
49
+ prerelease: false
50
+ version_requirements: !ruby/object:Gem::Requirement
51
+ requirements:
52
+ - - "~>"
53
+ - !ruby/object:Gem::Version
54
+ version: '10.0'
55
+ - !ruby/object:Gem::Dependency
56
+ name: rspec
57
+ requirement: !ruby/object:Gem::Requirement
58
+ requirements:
59
+ - - "~>"
60
+ - !ruby/object:Gem::Version
61
+ version: '3.0'
62
+ type: :development
63
+ prerelease: false
64
+ version_requirements: !ruby/object:Gem::Requirement
65
+ requirements:
66
+ - - "~>"
67
+ - !ruby/object:Gem::Version
68
+ version: '3.0'
69
+ - !ruby/object:Gem::Dependency
70
+ name: activemodel
71
+ requirement: !ruby/object:Gem::Requirement
72
+ requirements:
73
+ - - "~>"
74
+ - !ruby/object:Gem::Version
75
+ version: '5.2'
76
+ type: :runtime
77
+ prerelease: false
78
+ version_requirements: !ruby/object:Gem::Requirement
79
+ requirements:
80
+ - - "~>"
81
+ - !ruby/object:Gem::Version
82
+ version: '5.2'
83
+ - !ruby/object:Gem::Dependency
84
+ name: activesupport
85
+ requirement: !ruby/object:Gem::Requirement
86
+ requirements:
87
+ - - "~>"
88
+ - !ruby/object:Gem::Version
89
+ version: '5.2'
90
+ type: :runtime
91
+ prerelease: false
92
+ version_requirements: !ruby/object:Gem::Requirement
93
+ requirements:
94
+ - - "~>"
95
+ - !ruby/object:Gem::Version
96
+ version: '5.2'
97
+ - !ruby/object:Gem::Dependency
98
+ name: mechanize
99
+ requirement: !ruby/object:Gem::Requirement
100
+ requirements:
101
+ - - ">="
102
+ - !ruby/object:Gem::Version
103
+ version: '0'
104
+ type: :runtime
105
+ prerelease: false
106
+ version_requirements: !ruby/object:Gem::Requirement
107
+ requirements:
108
+ - - ">="
109
+ - !ruby/object:Gem::Version
110
+ version: '0'
111
+ - !ruby/object:Gem::Dependency
112
+ name: pry
113
+ requirement: !ruby/object:Gem::Requirement
114
+ requirements:
115
+ - - ">="
116
+ - !ruby/object:Gem::Version
117
+ version: '0'
118
+ type: :runtime
119
+ prerelease: false
120
+ version_requirements: !ruby/object:Gem::Requirement
121
+ requirements:
122
+ - - ">="
123
+ - !ruby/object:Gem::Version
124
+ version: '0'
125
+ description:
126
+ email:
127
+ - aileron.cc@gmail.com
128
+ executables:
129
+ - spidy
130
+ extensions: []
131
+ extra_rdoc_files: []
132
+ files:
133
+ - ".gitignore"
134
+ - ".rspec"
135
+ - ".rubocop.yml"
136
+ - ".rubocop_todo.yml"
137
+ - ".ruby-version"
138
+ - ".travis.yml"
139
+ - CHANGELOG.md
140
+ - CODE_OF_CONDUCT.md
141
+ - Gemfile
142
+ - Gemfile.lock
143
+ - LICENSE.txt
144
+ - README.md
145
+ - Rakefile
146
+ - bin/console
147
+ - bin/setup
148
+ - exe/spidy
149
+ - lib/spidy.rb
150
+ - lib/spidy/binder.rb
151
+ - lib/spidy/connector.rb
152
+ - lib/spidy/connector/html.rb
153
+ - lib/spidy/connector/xml.rb
154
+ - lib/spidy/console.rb
155
+ - lib/spidy/definition.rb
156
+ - lib/spidy/definition_file.rb
157
+ - lib/spidy/looper.rb
158
+ - lib/spidy/result.rb
159
+ - lib/spidy/shell.rb
160
+ - lib/spidy/spider.rb
161
+ - lib/spidy/version.rb
162
+ - spidy.gemspec
163
+ homepage: https://github.com/aileron-inc/spidy
164
+ licenses:
165
+ - MIT
166
+ metadata: {}
167
+ post_install_message:
168
+ rdoc_options: []
169
+ require_paths:
170
+ - lib
171
+ required_ruby_version: !ruby/object:Gem::Requirement
172
+ requirements:
173
+ - - ">="
174
+ - !ruby/object:Gem::Version
175
+ version: '0'
176
+ required_rubygems_version: !ruby/object:Gem::Requirement
177
+ requirements:
178
+ - - ">="
179
+ - !ruby/object:Gem::Version
180
+ version: '0'
181
+ requirements: []
182
+ rubygems_version: 3.0.3
183
+ signing_key:
184
+ specification_version: 4
185
+ summary: web spider dsl
186
+ test_files: []