Dynamised 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+   metadata.gz: b6a50d089e06a0aed555c420f05550140421382c
+   data.tar.gz: 1e3230c210211f9a1e300d2adafffda0c03f2ee3
+ SHA512:
+   metadata.gz: 23a5a47d22dfc676790017a97bde656a842988115722a079788a41f53399cbb8114a9781024b4d60fc32e5b0653209785cb4cdcb00758a0076f5fec71bdbf7b4
+   data.tar.gz: b654c59470598500794eab450194ebf93e72ec5ba607b94dac910b7ea3dfd1afa5b784304214b03942aea968c9329c5dc6a8632586ad6784f667dd921310f953
data/.gitignore ADDED
@@ -0,0 +1,13 @@
+ /.bundle/
+ /.yardoc
+ /Gemfile.lock
+ /_yardoc/
+ /coverage/
+ /doc/
+ /pkg/
+ /spec/reports/
+ /tmp/
+ tags
+ *.db
+ /*.rb
+ *.gem
data/CODE_OF_CONDUCT.md ADDED
@@ -0,0 +1,74 @@
+ # Contributor Covenant Code of Conduct
+
+ ## Our Pledge
+
+ In the interest of fostering an open and welcoming environment, we as
+ contributors and maintainers pledge to making participation in our project and
+ our community a harassment-free experience for everyone, regardless of age, body
+ size, disability, ethnicity, gender identity and expression, level of experience,
+ nationality, personal appearance, race, religion, or sexual identity and
+ orientation.
+
+ ## Our Standards
+
+ Examples of behavior that contributes to creating a positive environment
+ include:
+
+ * Using welcoming and inclusive language
+ * Being respectful of differing viewpoints and experiences
+ * Gracefully accepting constructive criticism
+ * Focusing on what is best for the community
+ * Showing empathy towards other community members
+
+ Examples of unacceptable behavior by participants include:
+
+ * The use of sexualized language or imagery and unwelcome sexual attention or
+   advances
+ * Trolling, insulting/derogatory comments, and personal or political attacks
+ * Public or private harassment
+ * Publishing others' private information, such as a physical or electronic
+   address, without explicit permission
+ * Other conduct which could reasonably be considered inappropriate in a
+   professional setting
+
+ ## Our Responsibilities
+
+ Project maintainers are responsible for clarifying the standards of acceptable
+ behavior and are expected to take appropriate and fair corrective action in
+ response to any instances of unacceptable behavior.
+
+ Project maintainers have the right and responsibility to remove, edit, or
+ reject comments, commits, code, wiki edits, issues, and other contributions
+ that are not aligned to this Code of Conduct, or to ban temporarily or
+ permanently any contributor for other behaviors that they deem inappropriate,
+ threatening, offensive, or harmful.
+
+ ## Scope
+
+ This Code of Conduct applies both within project spaces and in public spaces
+ when an individual is representing the project or its community. Examples of
+ representing a project or community include using an official project e-mail
+ address, posting via an official social media account, or acting as an appointed
+ representative at an online or offline event. Representation of a project may be
+ further defined and clarified by project maintainers.
+
+ ## Enforcement
+
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be
+ reported by contacting the project team at mbeckerwork@gmail.com. All
+ complaints will be reviewed and investigated and will result in a response that
+ is deemed necessary and appropriate to the circumstances. The project team is
+ obligated to maintain confidentiality with regard to the reporter of an incident.
+ Further details of specific enforcement policies may be posted separately.
+
+ Project maintainers who do not follow or enforce the Code of Conduct in good
+ faith may face temporary or permanent repercussions as determined by other
+ members of the project's leadership.
+
+ ## Attribution
+
+ This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
+ available at [http://contributor-covenant.org/version/1/4][version]
+
+ [homepage]: http://contributor-covenant.org
+ [version]: http://contributor-covenant.org/version/1/4/
data/Gemfile ADDED
@@ -0,0 +1,4 @@
+ source 'https://rubygems.org'
+
+ # Specify your gem's dependencies in dynamised.gemspec
+ gemspec
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
+ The MIT License (MIT)
+
+ Copyright (c) 2017 Martin Becker
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,41 @@
+ # Dynamised
+
+ Welcome to your new gem! In this directory, you'll find the files you need to be able to package up your Ruby library into a gem. Put your Ruby code in the file `lib/dynamised`. To experiment with that code, run `bin/console` for an interactive prompt.
+
+ TODO: Delete this and the text above, and describe your gem
+
+ ## Installation
+
+ Add this line to your application's Gemfile:
+
+ ```ruby
+ gem 'dynamised'
+ ```
+
+ And then execute:
+
+     $ bundle
+
+ Or install it yourself as:
+
+     $ gem install dynamised
+
+ ## Usage
+
+ TODO: Write usage instructions here
+
+ ## Development
+
+ After checking out the repo, run `bin/setup` to install dependencies. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+
+ To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `lib/dynamised/meta.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
+
+ ## Contributing
+
+ Bug reports and pull requests are welcome on GitHub at https://github.com/Thermatix/dynamised-rb. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
+
+ ## License
+
+ The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).
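The README's usage section is still a TODO. Based on the DSL defined in `lib/dynamised/scraper_dsl.rb` (`set_base_url`, `sub_page`, `set_field`, `writer`) and the `After_Scrape` hooks, a scraper script might look like the following. This is an illustrative sketch only: the XPaths, field names, and output file name are invented, not taken from the gem.

```ruby
# Hypothetical scraper script, run via `dynamised run products.rb`.
# The script body is eval'd into a Scraper subclass by bin/dynamised.
set_base_url "https://example.com/products"

# Write results to CSV (the writer hash maps type => target).
writer csv: "products.csv"

# Follow every product link on the listing page, then scrape fields
# from each product page. :after names an After_Scrape method.
sub_page "listing" => "//a[@class='product-link']" do
  set_field :title, "//h1[@class='product-title']"
  set_field :price, "//span[@class='price']", after: :scrub_tags
  set_field :url,   "//body", after: :page_url
end
```

This is a DSL fragment, not a standalone program; it only runs inside the CLI's `create_temp_class` wrapper.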
data/Rakefile ADDED
@@ -0,0 +1,2 @@
+ require "bundler/gem_tasks"
+ task :default => :spec
data/bin/console ADDED
@@ -0,0 +1,14 @@
+ #!/usr/bin/env ruby
+
+ require "bundler/setup"
+ require "dynamised"
+
+ # You can add fixtures and/or initialization code here to make experimenting
+ # with your gem easier. You can also use a different console, if you like.
+
+ # (If you use this, don't forget to add pry to your Gemfile!)
+ # require "pry"
+ # Pry.start
+
+ require "irb"
+ IRB.start
data/bin/dynamised ADDED
@@ -0,0 +1,72 @@
+ #!/usr/bin/env ruby
+ require_relative '../lib/dynamised'
+ require 'commander'
+
+ module Dynamised
+   class CLI
+     include Commander::Methods
+
+     def run
+       program :name, "Dynamised"
+       program :version, META::Version
+       program :description, META::Description
+
+       command :run do |c|
+         c.syntax = 'dynamised run <script>'
+         c.description = 'scrapes with given scraper'
+         c.action do |args,options|
+           script_path = check_and_convert(args.first)
+           class_name = get_class_name(args.first)
+           create_temp_class(class_name,File.read(script_path))
+           class_ref = Scraper.fetch(class_name)
+           spinner = TTY::Spinner.new("[:spinner] scraping with %s" % class_name)
+           class_ref.new.pull_and_store do
+             spinner.spin
+           end
+           spinner.success("(Successful)")
+         end
+       end
+
+       command :test do |c|
+         c.syntax = 'dynamised test <script>'
+         c.description = "tests given scraper"
+         c.action do |args,options|
+           script_path = check_and_convert(args.first)
+           class_name = get_class_name(args.first)
+           create_temp_class(class_name,File.read(script_path))
+           class_ref = Scraper.fetch(class_name)
+           class_ref.new.pull_and_check
+         end
+       end
+
+       alias_command :r, :run
+       alias_command :t, :test
+       default_command :help
+       run!
+     end
+
+     def check_and_convert(path)
+       script_path = File.expand_path(path, Dir.pwd)
+       abort("File name %s doesn't exist" % script_path) unless File.exist?(script_path)
+       script_path
+     end
+
+     def get_class_name(string)
+       string.split('/').last.split('.').first.gsub(/ /,'_').capitalize
+     end
+
+     def create_temp_class(class_name,script)
+       Dynamised.module_eval <<-RUBY
+         class #{class_name} < Scraper
+           #{script}
+         end
+       RUBY
+     end
+
+   end
+ end
+
+ Dynamised::CLI.new.run if $0 == __FILE__
data/bin/setup ADDED
@@ -0,0 +1,8 @@
+ #!/usr/bin/env bash
+ set -euo pipefail
+ IFS=$'\n\t'
+ set -vx
+
+ bundle install
+
+ # Do any other automated setup that you need to do here
data/dynamised.gemspec ADDED
@@ -0,0 +1,32 @@
+ # coding: utf-8
+ lib = File.expand_path('../lib', __FILE__)
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+ require 'dynamised/meta'
+
+ Gem::Specification.new do |spec|
+   spec.name          = "Dynamised"
+   spec.version       = Dynamised::META::Version
+   spec.authors       = ["Martin Becker"]
+   spec.email         = ["mbeckerwork@gmail.com"]
+
+   spec.summary       = %q{A tool to allow you to build site crawling page scrapers.}
+   spec.description   = Dynamised::META::Description
+   spec.homepage      = "https://github.com/Thermatix/dynamised-rb"
+   spec.license       = "MIT"
+
+   spec.files = `git ls-files -z`.split("\x0").reject do |f|
+     f.match(%r{^(test|spec|features)/})
+   end
+   spec.bindir        = "exe"
+   spec.executables   = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
+   spec.require_paths = ["lib"]
+
+   spec.add_runtime_dependency "tty-spinner", "~> 0.4"
+   spec.add_runtime_dependency "nokogiri", "~> 1.7"
+   spec.add_runtime_dependency "awesome_print", "~> 1.7"
+   spec.add_runtime_dependency "commander", "~> 4.4"
+
+   spec.add_development_dependency "bundler", "~> 1.13"
+   spec.add_development_dependency "rake", "~> 10.0"
+ end
data/lib/dynamised/after_scrape_methods.rb ADDED
@@ -0,0 +1,13 @@
+ module Dynamised
+   class Scraper
+     module After_Scrape
+       def scrub_tags(string,field_data)
+         string.gsub(/<\/?[^>]*>/, "").strip.gsub(/ ?\\r\\n/,'')
+       end
+
+       def page_url(string,field_data)
+         @current_url
+       end
+     end
+   end
+ end
data/lib/dynamised/before_scrape_methods.rb ADDED
@@ -0,0 +1,7 @@
+ module Dynamised
+   class Scraper
+     module Before_Scrape
+
+     end
+   end
+ end
data/lib/dynamised/curb_dsl.rb ADDED
@@ -0,0 +1,100 @@
+ require "curb"
+
+ module Dynamised
+   module Curb_DSL
+
+     def self.included(base)
+       base.extend Singleton
+       base.instance_eval do
+         attr_reader :curl, :headers, :payload, :username, :password, :auth_type, :uri, :ssl, :redirects, :type_converter
+
+         [:get, :post, :put, :delete, :head, :options, :patch, :link, :unlink].each do |func_name|
+           define_method func_name do |&block|
+             make_request_of func_name.to_s.upcase, &block
+           end
+         end
+
+         [:password, :username, :payload, :auth_type, :uri, :ssl, :redirects, :type_converter].each do |func_name|
+           define_method "set_#{func_name}" do |value|
+             self.instance_variable_set :"@#{func_name}", value
+           end
+         end
+       end
+     end
+
+     module Singleton
+       def request(&block)
+         self.new(&block).body
+       end
+
+       def query_params(value)
+         Curl::postalize(value)
+       end
+     end
+
+     def initialize(&block)
+       @headers = {}
+       instance_eval(&block) if block
+     end
+
+     def header(name, content)
+       @headers[name] = content
+     end
+
+     def make_request_of(request_method, &block)
+       @curl = Curl::Easy.new(@uri) do |http|
+         setup_request request_method, http
+       end
+       @curl.ssl_verify_peer = @ssl || false
+       # @curl.ignore_content_length = true
+       @curl.http request_method
+       if @curl.response_code == 301
+         @uri = @curl.redirect_url
+         make_request_of request_method
+       end
+     end
+
+     def status_code
+       @curl.response_code
+     end
+
+     def body
+       @curl.body
+     end
+
+     def query_params(value)
+       Curl::postalize(value)
+     end
+
+     private
+
+     def setup_request(method, http)
+       http.headers['request-method'] = method.to_s
+       http.headers.update(headers || {})
+       http.max_redirects = @redirects || 3
+       http.post_body = get_payload || nil
+       http.http_auth_types = @auth_type || nil
+       http.username = @username || nil
+       http.password = @password || nil
+       http.useragent = "curb"
+       http
+     end
+
+     def get_payload
+       if @type_converter
+         @type_converter.call(@payload)
+       else
+         @payload
+       end
+     end
+
+   end
+ end
data/lib/dynamised/dbm_wrapper.rb ADDED
@@ -0,0 +1,44 @@
+ # wrapper taken from: https://gist.github.com/stephan-nordnes-eriksen/6c9c56f63f36d5d100b2
+ class DBM_Wrapper
+   include Enumerable
+
+   def initialize(file_name)
+     @file_name = file_name
+     # @store = DBM.open("testDBM", 666, DBM::WRCREAT)
+     @store = DBM.new(file_name)
+   end
+
+   def []=(key,val)
+     @store[key] = val
+   end
+
+   def [](key)
+     @store[key]
+   end
+
+   def each(&block)
+     @store.each(&block)
+   end
+
+   def values
+     @store.values
+   end
+
+   def keys
+     @store.keys
+   end
+
+   def delete(key)
+     @store.delete(key)
+   end
+
+   def stop
+     @store.close unless @store.closed?
+   end
+
+   def destroy
+     stop
+     FileUtils.rm("%s.db" % @file_name)
+   end
+
+   def sync_lock
+   end
+ end
data/lib/dynamised/helpers.rb ADDED
@@ -0,0 +1,26 @@
+ module Dynamised
+   class Scraper
+     module Helpers
+       def to_doc(html)
+         Nokogiri::HTML(html)
+       end
+
+       def sub_page(html_listing)
+         html_listing.xpath(".%s" % get_sub_page_tag[:path]).attr('href').to_s
+       end
+
+       def mpc(doc)
+         get_mpc(doc.xpath(get_mpc_tag[:path]))
+       end
+
+       def get_mpc(doc)
+         doc[-2].respond_to?(:inner_text) ? doc[-2].inner_text.to_i : 0
+       end
+
+       def field_keys
+         @current_page.data[:fields].keys
+       end
+
+     end
+   end
+ end
data/lib/dynamised/meta.rb ADDED
@@ -0,0 +1,9 @@
+ module Dynamised
+   module META
+     Version = "0.1.0"
+     Description = <<-DESC.gsub(/^\s*/, '')
+       A tool that allows a user to build a web scraper that works by recursively crawling pages until
+       it finds the requested information.
+     DESC
+   end
+ end
data/lib/dynamised/node.rb ADDED
@@ -0,0 +1,46 @@
+ module Dynamised
+   class Node
+     include Enumerable
+
+     attr_accessor :childs, :init, :data, :ident, :siblings
+
+     def initialize(init={},ident=nil)
+       @ident = ident
+       @childs = {}
+       @siblings = {}
+       @init = init.clone
+       @data = init.clone
+     end
+
+     def each(&block)
+       block.call(self)
+       @childs.map do |key,child|
+         child.each(&block)
+       end
+     end
+
+     def <=>(other_node)
+       @data <=> other_node.data
+     end
+
+     def [](*keys)
+       return self if @childs.empty?
+       [*keys.flatten].inject(self) do |node,ident|
+         node.find {|n| n.ident == ident}
+       end
+     end
+
+     def new_child(ident,&block)
+       child = self.class.new(@init,ident)
+       child.siblings = self.childs
+       child.tap(&block) if block_given?
+       @childs[ident] = child
+     end
+
+     def pretty_print(pp)
+       self.each {|node| pp.text(node.ident || ""); puts "\n"; pp.pp_hash node.data}
+     end
+
+   end
+ end
data/lib/dynamised/scraper.rb ADDED
@@ -0,0 +1,227 @@
+ module Dynamised
+   class Scraper
+     XPATH_Anchor = ".%s"
+     extend DSL
+
+     class << self
+       def inherited(base)
+         @scrapers ||= {}
+         @scrapers[base.to_s.split('::').last.downcase] = base
+         base.instance_exec do
+           set_up_tree
+         end
+       end
+
+       def list
+         @scrapers ||= {}
+         @scrapers.map {|i,s| i}
+       end
+
+       def each(&block)
+         @scrapers ||= {}
+         @scrapers.each(&block)
+       end
+
+       def fetch(*args,&block)
+         @scrapers ||= {}
+         @scrapers.fetch(args.first.downcase) {|name| raise "No scraper called %s was found" % name }
+       end
+     end
+
+     include Curb_DSL
+     include Helpers
+     include Before_Scrape
+     include After_Scrape
+     include Writers
+
+     def initialize(args=[],&block)
+       @args = args
+       @tree_pointer = []
+       @use_store = false
+       @scraped_data = DBM_Wrapper.new("%s_scraped_data" % self.class.to_s)
+       [:inc,:uri,:tree,:tree_pointer,:base_url,:writer].each do |attr|
+         varb_name = "@%s" % attr
+         self.instance_variable_set(varb_name,self.class.instance_variable_get(varb_name))
+       end
+       super(&block)
+     end
+
+     def pull_and_store(&spinner)
+       raise "No writer detected" unless @writer
+       @use_store = true
+       scrape_data(&spinner)
+       write_data(&spinner)
+     end
+
+     def pull_and_check
+       doc = pull_initial
+       separator = "}#{'-' * 40}{"
+       ap separator
+       pull(doc,@tree) do |hash|
+         ap hash
+         ap separator
+         sleep 0.5
+       end
+     end
+
+     private
+
+     def scrape_data(&spinner)
+       pull(pull_initial,@tree) do |hash|
+         spinner.call
+       end
+     end
+
+     def write_data(&spinner)
+       parsed_data = @scraped_data.map {|_url,json| JSON.parse(json) }
+       @writer.each do |type,data|
+         case type
+         when :csv
+           write_csv(parsed_data, data, &spinner)
+         when :custom
+           data.call(parsed_data, &spinner)
+         else
+           raise '%s is not a supported writer type' % type
+         end
+       end
+     end
+
+     def pull(doc,tree,&block)
+       if fields?(tree)
+         scrape(doc,tree,&block)
+       end
+       childs(tree) do |pos,node,sub_tr|
+         @current_child = node
+         spt = node.data[:meta][:sub_page_tag]
+         scrape_tag_set(doc,spt[:xpath],spt[:meta]) do |url,i|
+           pull(get_doc(segment?(url)),sub_tr||node,&block)
+         end
+       end
+     end
+
+     def segment?(url)
+       url =~ /http/ ? url : "%s/%s" % [@base_url.gsub(/\/$|\z/,''), url.gsub(/\A\//,'')]
+     end
+
+     def tree_down(key,tree=false)
+       @tree_pointer << key
+       yield
+       @tree_pointer.pop
+     end
+
+     def fields?(tree)
+       not tree.data[:fields].empty?
+     end
+
+     def scrape(doc,tree,&block)
+       c_url = @current_url
+       if (@use_store ? !@scraped_data[c_url] : true) && can_scrape(doc,tree)
+         fields =
+           tree.data[:fields].each_with_object({}) do |(field,data),res_hash|
+             target = execute_method(data[:meta][:before],remove_style_tags(doc),res_hash)
+             value = scrape_tag(target,data[:xpath],data[:meta])
+             res_hash[field] = value ? execute_method(data[:meta][:after],value,res_hash) : data[:meta].fetch(:default,nil)
+           end
+         @scraped_data[c_url] = fields.to_json if @use_store
+         block.call(fields)
+       end
+     end
+
+     def remove_style_tags(doc)
+       doc.css("style").remove
+       doc
+     end
+
+     def get_by_ident(tree,ident)
+       return false unless tree
+       tree.find {|ch_i,ch| ch_i == ident}.last
+     end
+
+     def can_scrape(doc,tree)
+       scrape_if = tree.data[:scrape_if]
+       case true
+       when scrape_if.respond_to?(:call)
+         scrape_if.call(doc)
+       when scrape_if.respond_to?(:keys)
+         case true
+         when scrape_if.keys.include?(:fields)
+           check_for_fields(doc,tree,scrape_if)
+         end
+       else
+         @tree[@tree_pointer].data[:fields].length > 0
+       end
+     end
+
+     def check_for_fields(doc,tree,scrape_if)
+       [*scrape_if[:fields]].find do |field|
+         f = (tree || @tree[@tree_pointer]).data[:fields][field]
+         search_for_tag(doc,f[:xpath])
+       end
+     end
+
+     def execute_method(meth_name=nil,*args)
+       if meth_name
+         self.send(meth_name,*args)
+       else
+         args.first
+       end
+     end
+
+     def childs(node,tree=nil,&block)
+       if node.is_a? Array
+         tree.each do |child_node|
+           childs(child_node,tree,&block)
+         end
+       else
+         unless node.childs.empty? && node.siblings.empty?
+           (node.childs.empty? ? node.siblings : node.childs).each do |ident,child_node|
+             block.call(ident,child_node,tree)
+           end
+         end
+       end
+     end
+
+     def scrape_tag_set(doc,xpath,meta={})
+       (doc.xpath(xpath)).each_with_index do |node,i|
+         yield(pull_from_node(node,meta),i)
+       end
+     end
+
+     def search_for_tag(doc,xpath)
+       doc.at_xpath(XPATH_Anchor % xpath)
+     end
+
+     def scrape_tag(doc,xpath,meta={})
+       pull_from_node(doc.xpath(XPATH_Anchor % xpath),meta)
+     end
+
+     def pull_from_node(node,meta)
+       return nil if node.respond_to?(:empty?) && node.empty?
+       (node.respond_to?(:empty?) ? node.first : node).send(*meta.fetch(:attr,:inner_text))
+         .send(meta.fetch(:r_type,:to_s))
+     end
+
+     def get_doc(url)
+       @current_url = url
+       set_uri(url)
+       get
+       to_doc(body)
+     end
+
+     def pull_initial
+       @initial_pull ||= get_doc(@base_url)
+     end
+
+   end
+ end
data/lib/dynamised/scraper_dsl.rb ADDED
@@ -0,0 +1,109 @@
+ module Dynamised
+   class Scraper
+     module DSL
+
+       def set_up_tree
+         unless @tree
+           @tree = Node.new({
+             fields: {},
+             meta: {},
+             recursive_select: false,
+             select: false,
+             scrape_if: nil
+           })
+           @tree_pointer = []
+           @xpath_prefix = []
+           @useables = {}
+           @writer = nil
+           @base_url = ""
+           @inc = 1
+         end
+       end
+
+       def tree_down(key,childs=false)
+         @tree_pointer << key
+         yield
+         @tree_pointer.pop
+       end
+
+       def re_useable(name,&block)
+         check_for_block(&block)
+         @useables[name] = block
+       end
+
+       def use(name)
+         instance_exec(&@useables[name])
+       end
+
+       def set_base_url(url)
+         @base_url = url
+       end
+
+       def set_pag_increment(value)
+         @inc = value
+       end
+
+       def xpath_prefix(prefix,&block)
+         check_for_block(&block)
+         @xpath_prefix << prefix
+         yield
+         @xpath_prefix.pop
+       end
+
+       def scrape_here_if(args=nil,&block)
+         @tree[@tree_pointer].data[:scrape_if] = args || block
+       end
+
+       def select_sub_page
+         @tree[@tree_pointer].data[:select] = true
+       end
+
+       # recursively drill into the page
+       def sub_page(items,&block)
+         items.each do |item,path|
+           @tree[@tree_pointer].new_child(item)
+           tree_down(item) do
+             set_meta_tag(:sub_page_tag,join_xpath(path),{attr: [:attr,:href]})
+             block.call
+           end
+         end
+       end
+
+       def set_field(name,xpath,meta={})
+         set_info(:fields,name,xpath,meta)
+       end
+
+       def set_meta_tag(name,xpath,meta={})
+         set_info(:meta,name,xpath,meta)
+       end
+
+       def writer(writers)
+         @writer = writers
+       end
+
+       private
+
+       def check_for_block(&block)
+         raise "No block given for #%s" % caller[0][/`.*'/][1..-2] unless block_given?
+       end
+
+       def set_info(type,name,xpath,meta)
+         @tree[@tree_pointer].data[type] = @tree[@tree_pointer].data[type].merge({name => {
+           xpath: join_xpath(xpath),
+           meta: meta
+         }})
+       end
+
+       def join_xpath(tag)
+         tag.empty? ? tag : @xpath_prefix.join + tag
+       end
+
+     end
+   end
+ end
data/lib/dynamised/writers.rb ADDED
@@ -0,0 +1,24 @@
+ require "csv"
+ require "json"
+ module Dynamised
+   class Scraper
+     module Writers
+       def write_csv(scraped_data,file_name,&spinner)
+         CSV.open(file_name, "wb") do |csv|
+           headers_written = false
+           title = ""
+           scraped_data.each do |hash|
+             # skipping repeated titles is a temporary hack to solve the double scrape issue
+             next if title == hash["title"]
+             title = hash["title"]
+             unless headers_written
+               csv << hash.keys
+               headers_written = true
+             end
+             csv << hash.values
+             spinner.call
+           end
+         end
+       end
+     end
+
+   end
+ end
data/lib/dynamised.rb ADDED
@@ -0,0 +1,8 @@
+ %w{tty-spinner nokogiri awesome_print dbm json}.each {|lib| require lib}
+ %w{meta after_scrape_methods before_scrape_methods curb_dsl helpers node scraper_dsl writers dbm_wrapper scraper}
+   .each do |f|
+     require_relative "dynamised/%s" % f
+   end
+
+ module Dynamised
+   # Your code goes here...
+ end
metadata ADDED
@@ -0,0 +1,152 @@
+ --- !ruby/object:Gem::Specification
+ name: Dynamised
+ version: !ruby/object:Gem::Version
+   version: 0.1.0
+ platform: ruby
+ authors:
+ - Martin Becker
+ autorequire:
+ bindir: exe
+ cert_chain: []
+ date: 2017-03-08 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: tty-spinner
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.4'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.4'
+ - !ruby/object:Gem::Dependency
+   name: nokogiri
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.7'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.7'
+ - !ruby/object:Gem::Dependency
+   name: awesome_print
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.7'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.7'
+ - !ruby/object:Gem::Dependency
+   name: commander
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '4.4'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '4.4'
+ - !ruby/object:Gem::Dependency
+   name: bundler
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.13'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.13'
+ - !ruby/object:Gem::Dependency
+   name: rake
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '10.0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '10.0'
+ description: |
+   A tool that allows a user to build a web scraper that works by recursively crawling pages until
+   it finds the requested information.
+ email:
+ - mbeckerwork@gmail.com
+ executables: []
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - ".gitignore"
+ - CODE_OF_CONDUCT.md
+ - Gemfile
+ - LICENSE.txt
+ - README.md
+ - Rakefile
+ - bin/console
+ - bin/dynamised
+ - bin/setup
+ - dynamised.gemspec
+ - lib/dynamised.rb
+ - lib/dynamised/after_scrape_methods.rb
+ - lib/dynamised/before_scrape_methods.rb
+ - lib/dynamised/curb_dsl.rb
+ - lib/dynamised/dbm_wrapper.rb
+ - lib/dynamised/helpers.rb
+ - lib/dynamised/meta.rb
+ - lib/dynamised/node.rb
+ - lib/dynamised/scraper.rb
+ - lib/dynamised/scraper_dsl.rb
+ - lib/dynamised/writers.rb
+ homepage: https://github.com/Thermatix/dynamised-rb
+ licenses:
+ - MIT
+ metadata: {}
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubyforge_project:
+ rubygems_version: 2.5.1
+ signing_key:
+ specification_version: 4
+ summary: A tool to allow you to build site crawling page scrapers.
+ test_files: []
+ has_rdoc: