restaurant_crawler 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: a5ddd95835d7f83fd3369f41a88f02407b2cc7ae
4
+ data.tar.gz: 57717cb548b3ce0b4626714d15d64f434482fb7d
5
+ SHA512:
6
+ metadata.gz: fc50c5f9869ded2b6c8e2d6ee2f6e4c5718b8590d24feaa2f8e6ebd90944e0c028adb3e6f7ffd323bc147dc22a5bade7b15b4aaf1949f6c2b3130b5a401cf566
7
+ data.tar.gz: d39363a860485024fe20907f28214f8e7846a21d3cb4d1900c3d78ac3ffbc69806a8acbf0cb678598bd251db2241faea022fff5957990cd0efdcfc667ce06dc2
data/.gitignore ADDED
@@ -0,0 +1,14 @@
1
+ /.bundle/
2
+ /.yardoc
3
+ /Gemfile.lock
4
+ /_yardoc/
5
+ /coverage/
6
+ /doc/
7
+ /pkg/
8
+ /spec/reports/
9
+ /tmp/
10
+
11
+ # rspec failure tracking
12
+ .rspec_status
13
+
14
+ restaurants.sqlite3
data/.rspec ADDED
@@ -0,0 +1,2 @@
1
+ --format documentation
2
+ --color
data/.travis.yml ADDED
@@ -0,0 +1,5 @@
1
+ sudo: false
2
+ language: ruby
3
+ rvm:
4
+ - 2.3.1
5
+ before_install: gem install bundler -v 1.14.3
data/CODE_OF_CONDUCT.md ADDED
@@ -0,0 +1,74 @@
1
+ # Contributor Covenant Code of Conduct
2
+
3
+ ## Our Pledge
4
+
5
+ In the interest of fostering an open and welcoming environment, we as
6
+ contributors and maintainers pledge to making participation in our project and
7
+ our community a harassment-free experience for everyone, regardless of age, body
8
+ size, disability, ethnicity, gender identity and expression, level of experience,
9
+ nationality, personal appearance, race, religion, or sexual identity and
10
+ orientation.
11
+
12
+ ## Our Standards
13
+
14
+ Examples of behavior that contributes to creating a positive environment
15
+ include:
16
+
17
+ * Using welcoming and inclusive language
18
+ * Being respectful of differing viewpoints and experiences
19
+ * Gracefully accepting constructive criticism
20
+ * Focusing on what is best for the community
21
+ * Showing empathy towards other community members
22
+
23
+ Examples of unacceptable behavior by participants include:
24
+
25
+ * The use of sexualized language or imagery and unwelcome sexual attention or
26
+ advances
27
+ * Trolling, insulting/derogatory comments, and personal or political attacks
28
+ * Public or private harassment
29
+ * Publishing others' private information, such as a physical or electronic
30
+ address, without explicit permission
31
+ * Other conduct which could reasonably be considered inappropriate in a
32
+ professional setting
33
+
34
+ ## Our Responsibilities
35
+
36
+ Project maintainers are responsible for clarifying the standards of acceptable
37
+ behavior and are expected to take appropriate and fair corrective action in
38
+ response to any instances of unacceptable behavior.
39
+
40
+ Project maintainers have the right and responsibility to remove, edit, or
41
+ reject comments, commits, code, wiki edits, issues, and other contributions
42
+ that are not aligned to this Code of Conduct, or to ban temporarily or
43
+ permanently any contributor for other behaviors that they deem inappropriate,
44
+ threatening, offensive, or harmful.
45
+
46
+ ## Scope
47
+
48
+ This Code of Conduct applies both within project spaces and in public spaces
49
+ when an individual is representing the project or its community. Examples of
50
+ representing a project or community include using an official project e-mail
51
+ address, posting via an official social media account, or acting as an appointed
52
+ representative at an online or offline event. Representation of a project may be
53
+ further defined and clarified by project maintainers.
54
+
55
+ ## Enforcement
56
+
57
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be
58
+ reported by contacting the project team at arousseau@gac-technology.com. All
59
+ complaints will be reviewed and investigated and will result in a response that
60
+ is deemed necessary and appropriate to the circumstances. The project team is
61
+ obligated to maintain confidentiality with regard to the reporter of an incident.
62
+ Further details of specific enforcement policies may be posted separately.
63
+
64
+ Project maintainers who do not follow or enforce the Code of Conduct in good
65
+ faith may face temporary or permanent repercussions as determined by other
66
+ members of the project's leadership.
67
+
68
+ ## Attribution
69
+
70
+ This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
71
+ available at [http://contributor-covenant.org/version/1/4][version]
72
+
73
+ [homepage]: http://contributor-covenant.org
74
+ [version]: http://contributor-covenant.org/version/1/4/
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in restaurant_crawler.gemspec
4
+ gemspec
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2017 Alex Rousseau
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,62 @@
1
+ # RestaurantCrawler
2
+
3
+ I needed to collect the names, websites and email addresses of French restaurants for sales prospecting:
4
+
5
+ > Me: What? €1,500 to buy a simple list of restaurants?
6
+
7
+ > Ruby: Hold on, I'll help you!
8
+
9
+ A few hours later: more than 800 free results using just **Nokogiri** and **Anemone**.
10
+
11
+ In case it is useful to anyone, here is the source ;)
12
+
13
+ ## Usage
14
+
15
+ ### From installation
16
+
17
+ Add this line to your application's Gemfile:
18
+
19
+ ```ruby
20
+ gem 'restaurant_crawler'
21
+ ```
22
+
23
+ And then execute:
24
+
25
+ $ bundle
26
+
27
+ Or install it yourself as:
28
+
29
+ $ gem install restaurant_crawler
30
+
31
+ You can then run it from the console:
32
+
33
+ $ restaurant_crawler.rb --h
34
+ Usage: restaurant_crawler [options]
35
+ -c, --crawl Start to crawl restopolitan.com
36
+ -e, --email Start to fetch email from database (need to run crawl before)
37
+
38
+
39
+ ### From source
40
+
41
+ $ git clone https://github.com/madeindjs/restaurant_crawler.git
42
+ $ cd restaurant_crawler
43
+ $ bundle install
44
+ $ rake -T
45
+ rake crawl # start crawler
46
+ rake find_emails # find emails of restaurants crawled
47
+
48
+ ## Development
49
+
50
+ After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
51
+
52
+ To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
53
+
54
+ ## Contributing
55
+
56
+ Bug reports and pull requests are welcome on GitHub at https://github.com/madeindjs/restaurant_crawler. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
57
+
58
+
59
+ ## License
60
+
61
+ The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).
62
+
data/Rakefile ADDED
@@ -0,0 +1,20 @@
1
+ require "bundler/gem_tasks"
2
+ require "rspec/core/rake_task"
3
+ require "restaurant_crawler"
4
+
5
+
6
+ RSpec::Core::RakeTask.new(:spec)
7
+
8
+ task :default => :spec
9
+
10
+
11
+ desc "start crawler"
12
+ task :crawl do
13
+ RestaurantCrawler.crawl
14
+ end
15
+
16
+
17
+ desc "find emails of restaurants crawled"
18
+ task :find_emails do
19
+ RestaurantCrawler.find_emails
20
+ end
data/bin/console ADDED
@@ -0,0 +1,14 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require "bundler/setup"
4
+ require "restaurant_crawler"
5
+
6
+ # You can add fixtures and/or initialization code here to make experimenting
7
+ # with your gem easier. You can also use a different console, if you like.
8
+
9
+ # (If you use this, don't forget to add pry to your Gemfile!)
10
+ # require "pry"
11
+ # Pry.start
12
+
13
+ require "irb"
14
+ IRB.start(__FILE__)
data/bin/restaurant_crawler.rb ADDED
@@ -0,0 +1,10 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require "restaurant_crawler"
4
+ require "optparse"
5
+
6
+ # Parse options
7
+ OptionParser.new do |opts|
8
+ opts.on("-c", "--crawl", "Start to crawl restopolitan.com") { RestaurantCrawler.crawl }
9
+ opts.on("-e", "--email", "Start to fetch email from database (need to run crawl before)") { RestaurantCrawler.find_emails }
10
+ end.parse!
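The executable above registers its flags with `OptionParser`. As a minimal sketch of one alternative layout, the parser can first collect the requested actions and dispatch them only once parsing has succeeded. The helper name `parse_actions` is hypothetical and is not part of the gem; the flag names and descriptions are the ones from the script.

```ruby
require "optparse"

# Hypothetical helper (not part of the gem): returns the actions requested
# on the command line, in the order the flags were given.
def parse_actions(argv)
  actions = []
  OptionParser.new do |opts|
    opts.banner = "Usage: restaurant_crawler [options]"
    opts.on("-c", "--crawl", "Start to crawl restopolitan.com") { actions << :crawl }
    opts.on("-e", "--email", "Start to fetch email from database (need to run crawl before)") { actions << :find_emails }
  end.parse!(argv.dup) # parse! consumes its argument, so work on a copy
  actions
end
```

With the gem loaded, the result could then be dispatched with `parse_actions(ARGV).each { |action| RestaurantCrawler.public_send(action) }`, so an unknown flag raises before any crawl starts.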
data/bin/setup ADDED
@@ -0,0 +1,8 @@
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+ IFS=$'\n\t'
4
+ set -vx
5
+
6
+ bundle install
7
+
8
+ # Do any other automated setup that you need to do here
data/lib/restaurant_crawler/restaurant.rb ADDED
@@ -0,0 +1,67 @@
1
+ require 'nokogiri'
2
+ require 'sqlite3'
3
+
4
+ module RestaurantCrawler
5
+
6
+
7
+ class Restaurant
8
+
9
+ attr_reader :doc
10
+ attr_accessor :name, :website, :address
11
+
12
+ def initialize nokogiri_doc
13
+ @doc = nokogiri_doc
14
+ # find the name
15
+ if h1 = @doc.at_css("h1")
16
+ @name = sanitize h1.text
17
+ else
18
+ raise RuntimeError.new "Restaurant's name not found"
19
+ end
20
+ # find the website
21
+ @doc.css("a").each do |link|
22
+ if link.text.include? "Site du restaurant"
23
+ @website = sanitize link['href']
24
+ break
25
+ end
26
+ end
27
+
28
+ raise RuntimeError.new "Restaurant's website not found" unless @website
29
+
30
+ # find the address
31
+ if p = @doc.at_css("div.addressInfo")
32
+ @address = sanitize p.text.split('Cuisine').first
33
+ else
34
+ raise RuntimeError.new "Restaurant's address not found"
35
+ end
36
+ end
37
+
38
+
39
+ def to_s
40
+ "#{@name}: #{@website}"
41
+ end
42
+
43
+
44
+ def save database
45
+ database.execute "CREATE TABLE IF NOT EXISTS restaurants(Id INTEGER PRIMARY KEY, name TEXT, website TEXT, address TEXT)"
46
+ stm = database.prepare "INSERT INTO restaurants(name, website, address) VALUES(:name, :website, :address)"
47
+ stm.bind_param 'name', @name
48
+ stm.bind_param 'website', @website
49
+ stm.bind_param 'address', @address
50
+ stm.execute
51
+ end
52
+
53
+
54
+ private
55
+
56
+ def sanitize string
57
+ string.gsub!("\n", '')
58
+ string.gsub!("\r", '')
59
+ string.gsub!(" ", '')
60
+ return string
61
+ end
62
+
63
+
64
+ end
65
+
66
+
67
+ end
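The `sanitize` helper in this file edits its argument in place with `gsub!`. The sketch below shows a non-mutating variant, on the assumption that the goal is simply a tidy single-line string. It is a standalone function written for illustration, not the gem's method, and its behaviour differs from the original: line breaks become a single space instead of being deleted, and `nil` is turned into an empty string.

```ruby
# Hypothetical non-mutating variant of Restaurant#sanitize (illustration only).
def sanitize(string)
  string.to_s                 # tolerate nil, e.g. a link without an href
        .gsub(/[\r\n]+/, " ") # turn line breaks into a single space
        .squeeze(" ")         # collapse runs of spaces
        .strip                # drop leading and trailing whitespace
end
```

Because it never modifies its argument, this version also works on frozen strings.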
data/lib/restaurant_crawler/version.rb ADDED
@@ -0,0 +1,3 @@
1
+ module RestaurantCrawler
2
+ VERSION = "0.1.0"
3
+ end
data/lib/restaurant_crawler.rb ADDED
@@ -0,0 +1,78 @@
1
+ require "restaurant_crawler/version"
2
+ require "restaurant_crawler/restaurant"
3
+ require 'sqlite3'
4
+ require 'anemone'
5
+ require 'nokogiri'
6
+ require 'open_uri_redirections'
7
+
8
+ module RestaurantCrawler
9
+
10
+ RESTOPOLITAN_URL = 'http://www.restopolitan.com'
11
+
12
+ def self.crawl
13
+ database = SQLite3::Database.new "restaurants.sqlite3"
14
+ Anemone.crawl(RESTOPOLITAN_URL, delay: 0.5) do |anemone|
15
+ anemone.on_pages_like(/.*\/restaurant\/.*/) do |page|
16
+ begin
17
+ restaurant = Restaurant.new page.doc
18
+ if restaurant.save database
19
+ puts "[x] " + restaurant.to_s + " saved"
20
+ else
21
+ puts "[ ] failed to save " + restaurant.to_s
22
+ end
23
+ rescue RuntimeError => e
24
+ puts "[ ] #{e} : #{page.url} crawled"
25
+ end
26
+ end
27
+ end
28
+ end
29
+
30
+ def self.find_emails
31
+ database = SQLite3::Database.new "restaurants.sqlite3"
32
+
33
+ # add columns if needed
34
+ ['email', 'telephone', 'error'].each do |column|
35
+ begin
36
+ database.execute "ALTER TABLE restaurants ADD COLUMN #{column} TEXT"
37
+ rescue SQLite3::SQLException # the column already exists, nothing to do
38
+ end
39
+ end
40
+
41
+
42
+ database.execute("SELECT * FROM restaurants").each do |row|
43
+ id = row[0]
44
+ name = row[1]
45
+ website = row[2]
46
+ email = telephone = nil
47
+
48
+ begin
49
+ doc = Nokogiri::HTML(open website)
50
+ # scan every link
51
+ doc.css('a').each do |link|
52
+ # keep mailto: and tel: links
53
+ email = link['href'] if link['href'].to_s.start_with? 'mailto:'
54
+ telephone = link['href'] if link['href'].to_s.start_with? 'tel:'
55
+ end
56
+
57
+ if email || telephone
58
+ stm = database.prepare "UPDATE restaurants SET email = :email, telephone = :telephone WHERE id = :id"
59
+ stm.bind_param 'id', id
60
+ stm.bind_param 'email', email
61
+ stm.bind_param 'telephone', telephone
62
+ stm.execute
63
+ puts "[x] #{name} => #{email} / #{telephone}"
64
+ else
65
+ raise RuntimeError.new "Restaurant's email / telephone not found"
66
+ end
67
+ rescue StandardError => e
68
+ stm = database.prepare "UPDATE restaurants SET error = :error WHERE id = :id"
69
+ stm.bind_param 'id', id
70
+ stm.bind_param 'error', e.message
71
+ stm.execute
72
+ puts "[ ] #{name} => " + e.message
73
+ end
74
+ end
75
+ database.close
76
+ end
77
+
78
+ end
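The `find_emails` loop above scans every link on a restaurant's site for contact URIs. A minimal sketch of that step as a pure function follows. The helper name `extract_contacts` is hypothetical and is not part of the gem. It assumes the standard `tel:` scheme for phone links, skips anchors that have no `href`, and strips the scheme prefix, which the original code does not do.

```ruby
# Hypothetical helper (not part of the gem): picks the mailto: and tel:
# targets out of a list of href values. As in the loop above, the last
# matching link wins.
def extract_contacts(hrefs)
  email = telephone = nil
  hrefs.each do |href|
    next if href.nil? # <a> tags without an href attribute yield nil
    email = href.sub(/\Amailto:/, "") if href.start_with?("mailto:")
    telephone = href.sub(/\Atel:/, "") if href.start_with?("tel:")
  end
  { email: email, telephone: telephone }
end
```

With a Nokogiri document it could be called as `extract_contacts(doc.css('a').map { |link| link['href'] })`.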
data/restaurant_crawler.gemspec ADDED
@@ -0,0 +1,41 @@
1
+ # coding: utf-8
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'restaurant_crawler/version'
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = "restaurant_crawler"
8
+ spec.version = RestaurantCrawler::VERSION
9
+ spec.authors = ["Alexandre Rousseau"]
10
+ spec.email = ["madeindjs@gmail.com"]
11
+
12
+ spec.summary = %q{Find restaurant websites on http://www.restopolitan.com.}
13
+ spec.description = %q{A simple web crawler.}
14
+ spec.homepage = "https://github.com/madeindjs/restaurant_crawler"
15
+ spec.license = "MIT"
16
+
17
+ # Prevent pushing this gem to RubyGems.org. To allow pushes either set the 'allowed_push_host'
18
+ # to allow pushing to a single host or delete this section to allow pushing to any host.
19
+ if spec.respond_to?(:metadata)
20
+ spec.metadata['allowed_push_host'] = "https://rubygems.org"
21
+ else
22
+ raise "RubyGems 2.0 or newer is required to protect against " \
23
+ "public gem pushes."
24
+ end
25
+
26
+ spec.files = `git ls-files -z`.split("\x0").reject do |f|
27
+ f.match(%r{^(test|spec|features)/})
28
+ end
29
+ spec.bindir = "bin"
30
+ spec.executables = ['restaurant_crawler.rb']
31
+ spec.require_paths = ["lib"]
32
+
33
+
34
+ spec.add_runtime_dependency 'anemone', '~> 0'
35
+ spec.add_runtime_dependency 'nokogiri', '~> 0'
36
+ spec.add_runtime_dependency 'open_uri_redirections', '~> 0'
37
+
38
+ spec.add_development_dependency "bundler", "~> 1.14"
39
+ spec.add_development_dependency "rake", "~> 10.0"
40
+ spec.add_development_dependency "rspec", "~> 3.0"
41
+ end
metadata ADDED
@@ -0,0 +1,145 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: restaurant_crawler
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - Alexandre Rousseau
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2017-04-30 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: anemone
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - "~>"
18
+ - !ruby/object:Gem::Version
19
+ version: '0'
20
+ type: :runtime
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - "~>"
25
+ - !ruby/object:Gem::Version
26
+ version: '0'
27
+ - !ruby/object:Gem::Dependency
28
+ name: nokogiri
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - "~>"
32
+ - !ruby/object:Gem::Version
33
+ version: '0'
34
+ type: :runtime
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - "~>"
39
+ - !ruby/object:Gem::Version
40
+ version: '0'
41
+ - !ruby/object:Gem::Dependency
42
+ name: open_uri_redirections
43
+ requirement: !ruby/object:Gem::Requirement
44
+ requirements:
45
+ - - "~>"
46
+ - !ruby/object:Gem::Version
47
+ version: '0'
48
+ type: :runtime
49
+ prerelease: false
50
+ version_requirements: !ruby/object:Gem::Requirement
51
+ requirements:
52
+ - - "~>"
53
+ - !ruby/object:Gem::Version
54
+ version: '0'
55
+ - !ruby/object:Gem::Dependency
56
+ name: bundler
57
+ requirement: !ruby/object:Gem::Requirement
58
+ requirements:
59
+ - - "~>"
60
+ - !ruby/object:Gem::Version
61
+ version: '1.14'
62
+ type: :development
63
+ prerelease: false
64
+ version_requirements: !ruby/object:Gem::Requirement
65
+ requirements:
66
+ - - "~>"
67
+ - !ruby/object:Gem::Version
68
+ version: '1.14'
69
+ - !ruby/object:Gem::Dependency
70
+ name: rake
71
+ requirement: !ruby/object:Gem::Requirement
72
+ requirements:
73
+ - - "~>"
74
+ - !ruby/object:Gem::Version
75
+ version: '10.0'
76
+ type: :development
77
+ prerelease: false
78
+ version_requirements: !ruby/object:Gem::Requirement
79
+ requirements:
80
+ - - "~>"
81
+ - !ruby/object:Gem::Version
82
+ version: '10.0'
83
+ - !ruby/object:Gem::Dependency
84
+ name: rspec
85
+ requirement: !ruby/object:Gem::Requirement
86
+ requirements:
87
+ - - "~>"
88
+ - !ruby/object:Gem::Version
89
+ version: '3.0'
90
+ type: :development
91
+ prerelease: false
92
+ version_requirements: !ruby/object:Gem::Requirement
93
+ requirements:
94
+ - - "~>"
95
+ - !ruby/object:Gem::Version
96
+ version: '3.0'
97
+ description: A simple web crawler.
98
+ email:
99
+ - madeindjs@gmail.com
100
+ executables:
101
+ - restaurant_crawler.rb
102
+ extensions: []
103
+ extra_rdoc_files: []
104
+ files:
105
+ - ".gitignore"
106
+ - ".rspec"
107
+ - ".travis.yml"
108
+ - CODE_OF_CONDUCT.md
109
+ - Gemfile
110
+ - LICENSE.txt
111
+ - README.md
112
+ - Rakefile
113
+ - bin/console
114
+ - bin/restaurant_crawler.rb
115
+ - bin/setup
116
+ - lib/restaurant_crawler.rb
117
+ - lib/restaurant_crawler/restaurant.rb
118
+ - lib/restaurant_crawler/version.rb
119
+ - restaurant_crawler.gemspec
120
+ homepage: https://github.com/madeindjs/restaurant_crawler
121
+ licenses:
122
+ - MIT
123
+ metadata:
124
+ allowed_push_host: https://rubygems.org
125
+ post_install_message:
126
+ rdoc_options: []
127
+ require_paths:
128
+ - lib
129
+ required_ruby_version: !ruby/object:Gem::Requirement
130
+ requirements:
131
+ - - ">="
132
+ - !ruby/object:Gem::Version
133
+ version: '0'
134
+ required_rubygems_version: !ruby/object:Gem::Requirement
135
+ requirements:
136
+ - - ">="
137
+ - !ruby/object:Gem::Version
138
+ version: '0'
139
+ requirements: []
140
+ rubyforge_project:
141
+ rubygems_version: 2.6.11
142
+ signing_key:
143
+ specification_version: 4
144
+ summary: Find restaurant websites on http://www.restopolitan.com.
145
+ test_files: []