recipe_crawler 3.1.2 → 4.0.0
- checksums.yaml +5 -5
- data/README.md +6 -7
- data/bin/recipe_crawler +0 -0
- data/lib/recipe_crawler.rb +2 -2
- data/lib/recipe_crawler/crawler.rb +153 -168
- data/lib/recipe_crawler/version.rb +1 -1
- data/recipe_crawler.gemspec +15 -17
- metadata +26 -12
checksums.yaml
CHANGED

@@ -1,7 +1,7 @@
 ---
-
-  metadata.gz:
-  data.tar.gz:
+SHA256:
+  metadata.gz: d2185c1d31c0fd91ddf2df44770a587a5fa9b4bb3b12106f0dcb1c04b4ad0f94
+  data.tar.gz: 02b21cabf006eb6f6430d2a91f6ee4879077496abd6b4234638cc5c03dff1448
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e10ff78ee97a4e8bb830275768477cb77cc1441d0dbbcbce8008e18c79f0db85d6e97923140ee7cfb9483b09efe5b806dc2ed878d193723c0e7636a0bf0b989e
+  data.tar.gz: c947a04b528b40d5ab396bcc16d9e7ae8a5e21bc25d5295b339e97c23cb7132f3b7c543cae2f601bb5f1963503a73ba9661ba9cac31b50a307954075112558b4
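The values above are hex digests of the gem's `metadata.gz` and `data.tar.gz` archives. As a minimal sketch (hashing an arbitrary string rather than the actual gem files, which are not included here), Ruby's standard `Digest` library produces digests of the same shape:

```ruby
require 'digest'

# Hypothetical payload standing in for a gem's data.tar.gz bytes.
payload = 'example gem payload'

sha256 = Digest::SHA256.hexdigest(payload)
sha512 = Digest::SHA512.hexdigest(payload)

puts sha256.length  # 64 hex characters, like the SHA256 entries above
puts sha512.length  # 128 hex characters, like the SHA512 entries above
```

RubyGems compares these digests against checksums.yaml when verifying a downloaded gem.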
data/README.md
CHANGED

@@ -1,6 +1,6 @@
 # RecipeCrawler
 
-A **web crawler** to save recipes from [marmiton.org](http://www.marmiton.org/), [750g.com](http://www.750g.com) or [cuisineaz.com](http://www.cuisineaz.com) into an **SQlite3** database.
+A **web crawler** to save recipes from [marmiton.org](http://www.marmiton.org/), [750g.com](http://www.750g.com) or [cuisineaz.com](http://www.cuisineaz.com) into an **SQlite3** database.
 
 > For the moment, it works only with [cuisineaz.com](http://www.cuisineaz.com)
 
@@ -29,7 +29,7 @@ Or install it yourself as:
 
 ### Command line
 
-Install this gem and run
+Install this gem and run
 
     $ recipe_crawler -h
     Usage: recipe_crawler [options]
@@ -60,9 +60,9 @@ Then you just need to instanciate a `RecipeCrawler::Crawler` with url of a Cuisi
     url = 'http://www.cuisineaz.com/recettes/pate-a-pizza-legere-55004.aspx'
     r = RecipeCrawler::Crawler.new url
 
-Then you just need to run the crawl with a limit number of recipe to fetch. All recipes will be saved in a *export.sqlite3* file. You can pass a block to play with `
+Then you just need to run the crawl with a limit number of recipe to fetch. All recipes will be saved in a *export.sqlite3* file. You can pass a block to play with `RecipeScraper::Recipe` objects.
 
-    r.crawl!(10) do |recipe|
+    r.crawl!(limit: 10) do |recipe|
       puts recipe.to_hash
       # will return
       # --------------
@@ -91,7 +91,6 @@ Bug reports and pull requests are welcome on GitHub at https://github.com/[USERN
 
 The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).
 
-Author
-----------
+## Author
 
-[Rousseau Alexandre](https://github.com/madeindjs)
+[Rousseau Alexandre](https://github.com/madeindjs)
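The README change above documents the breaking API change of 4.0.0: `crawl!` now takes keyword arguments instead of a positional limit. A minimal sketch of the call-site migration, using a hypothetical `StubCrawler` stand-in (the real class needs network access and an SQLite database):

```ruby
# StubCrawler mimics only the Crawler#crawl! signature from 4.0.0;
# it yields placeholder hashes instead of scraped recipes.
class StubCrawler
  # 4.0.0 signature: keyword arguments with defaults.
  def crawl!(limit: 2, interval_sleep_time: 0)
    limit.times do |i|
      yield({ title: "recipe #{i}" }) if block_given?
      sleep interval_sleep_time
    end
  end
end

recipes = []
# 3.x call sites wrote `crawl!(10)`; under 4.0.0 the limit is a keyword:
StubCrawler.new.crawl!(limit: 3) { |recipe| recipes << recipe }
puts recipes.length  # 3
```

Old positional calls like `r.crawl!(10)` raise `ArgumentError` against the 4.0.0 signature, which is why this warrants a major version bump.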
data/bin/recipe_crawler
CHANGED
File without changes
data/lib/recipe_crawler.rb
CHANGED
[diff not captured in this view]

data/lib/recipe_crawler/crawler.rb
CHANGED

@@ -3,175 +3,160 @@ require 'nokogiri'
 require 'open-uri'
 require 'sqlite3'
 
-
 module RecipeCrawler
-  [old lines 8-50 not captured in this view]
-        ingredients TEXT,
-        steps TEXT,
+  # This is the main class to crawl recipes from a given url
+  # 1. Crawler will crawl url to find others recipes urls on the website
+  # 2. it will crawl urls founded to find other url again & again
+  # 3. it will scrape urls founded to get data
+  #
+  # @attr_reader url [String] first url parsed
+  # @attr_reader host [Symbol] of url's host
+  # @attr_reader scraped_urls [Array<String>] of url's host
+  # @attr_reader crawled_urls [Array<String>] of url's host
+  # @attr_reader to_crawl_urls [Array<String>] of url's host
+  # @attr_reader recipes [Array<RecipeScraper::Recipe>] recipes fetched
+  # @attr_reader db [SQLite3::Database] Sqlite database where recipe will be saved
+  class Crawler
+    # URL than crawler can parse
+    ALLOWED_URLS = {
+      cuisineaz: 'cuisineaz.com/recettes/',
+      marmiton: 'marmiton.org/recettes/',
+      g750: '750g.com/'
+    }.freeze
+
+    attr_reader :url, :host, :scraped_urls, :crawled_urls, :to_crawl_urls, :recipes
+    attr_accessor :interval_sleep_time
+
+    #
+    # Create a Crawler
+    # @param url [String] a url a recipe to scrawl other one
+    def initialize(url)
+      @url = url
+      if url_valid?
+        @recipes = []
+        @crawled_urls = []
+        @scraped_urls = []
+        @to_crawl_urls = []
+        @to_crawl_urls << url
+        @interval_sleep_time = 0
+        @db = SQLite3::Database.new 'results.sqlite3'
+        @db.execute "CREATE TABLE IF NOT EXISTS recipes(
+          Id INTEGER PRIMARY KEY,
+          title TEXT,
+          preptime INTEGER,
+          cooktime INTEGER,
+          ingredients TEXT,
+          steps TEXT,
           image TEXT
         )"
-  [old lines 55-151 not captured in this view]
-    #
-    # Save recipe
-    # @param recipe [RecipeSraper::Recipe] as recipe to save
-    #
-    # @return [Boolean] as true if success
-    def save recipe
-      begin
-        @db.execute "INSERT INTO recipes (title, preptime, cooktime, ingredients, steps, image)
+      else
+        raise ArgumentError, 'This url cannot be used'
+      end
+    end
+
+    #
+    # Check if the url can be parsed and set the host
+    #
+    # @return [Boolean] true if url can be parsed
+    def url_valid?
+      ALLOWED_URLS.each do |host, url_allowed|
+        if url.include? url_allowed
+          @host = host
+          return true
+        end
+      end
+      false
+    end
+
+    # Start the crawl
+    #
+    # @param limit [Integer] the maximum number of scraped recipes
+    # @param interval_sleep_time [Integer] waiting time between scraping
+    # @yield [RecipeScraper::Recipe] as recipe scraped
+    def crawl!(limit: 2, interval_sleep_time: 0)
+      recipes_returned = 0
+
+      if @host == :cuisineaz
+
+        while !@to_crawl_urls.empty? && (limit > @recipes.count)
+          # find all link on url given (and urls of theses)
+          url = @to_crawl_urls.first
+          next if url.nil?
+
+          get_links url
+          # now scrape an url
+          recipe = scrape url
+          yield recipe if recipe && block_given?
+          sleep interval_sleep_time
+        end
+
+      else
+        raise NotImplementedError
+      end
+    end
+
+    #
+    # Scrape given url
+    # param url [String] as url to scrape
+    #
+    # @return [RecipeScraper::Recipe] as recipe scraped
+    # @return [nil] if recipe connat be fetched
+    def scrape(url)
+      recipe = RecipeScraper::Recipe.new url
+      @scraped_urls << url
+      @recipes << recipe
+      if save recipe
+        return recipe
+      else
+        raise SQLite3::Exception, 'cannot save recipe'
+      end
+    rescue OpenURI::HTTPError
+      nil
+    end
+
+    #
+    # Get recipes links from the given url
+    # @param url [String] as url to scrape
+    #
+    # @return [void]
+    def get_links(url)
+      # catch 404 error from host
+
+      doc = Nokogiri::HTML(open(url))
+      # find internal links on page
+      doc.css('#tagCloud a').each do |link|
+        link = link.attr('href')
+        # If link correspond to a recipe we add it to recipe to scraw
+        if link.include?(ALLOWED_URLS[@host]) && !@crawled_urls.include?(url)
+          @to_crawl_urls << link
+        end
+      end
+      @to_crawl_urls.delete url
+      @crawled_urls << url
+      @to_crawl_urls.uniq!
+    rescue OpenURI::HTTPError
+      @to_crawl_urls.delete url
+      warn "#{url} cannot be reached"
+    end
+
+    #
+    # Save recipe
+    # @param recipe [RecipeScraper::Recipe] as recipe to save
+    #
+    # @return [Boolean] as true if success
+    def save(recipe)
+      @db.execute "INSERT INTO recipes (title, preptime, cooktime, ingredients, steps, image)
         VALUES (:title, :preptime, :cooktime, :ingredients, :steps, :image)",
-  [old lines 161-173 not captured in this view]
-    end
-  [old lines 175-176 not captured in this view]
-end
+        title: recipe.title,
+        preptime: recipe.preptime,
+        ingredients: recipe.ingredients.join("\n"),
+        steps: recipe.steps.join("\n"),
+        image: recipe.image
+
+      true
+    rescue SQLite3::Exception => e
+      puts "Exception occurred #{e}"
+      false
+    end
+  end
+end
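The rewritten `url_valid?` above both validates the URL and sets `@host` by matching the URL against `ALLOWED_URLS`. A standalone sketch of that host-detection logic, extracted so it runs without the gem, the network, or a database (`detect_host` is a hypothetical name; the real method is an instance predicate):

```ruby
# Same host table as Crawler::ALLOWED_URLS in the diff above.
ALLOWED_URLS = {
  cuisineaz: 'cuisineaz.com/recettes/',
  marmiton: 'marmiton.org/recettes/',
  g750: '750g.com/'
}.freeze

# Return the matching host symbol, or nil when the URL is unsupported
# (where the real initializer raises ArgumentError instead).
def detect_host(url)
  ALLOWED_URLS.each do |host, fragment|
    return host if url.include?(fragment)
  end
  nil
end

puts detect_host('http://www.cuisineaz.com/recettes/pate-a-pizza-legere-55004.aspx')  # cuisineaz
puts detect_host('http://example.com').inspect                                        # nil
```

Note that a substring check is a loose filter: it accepts any URL containing the fragment, which is why `crawl!` still guards with `raise NotImplementedError` for hosts other than `:cuisineaz`.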
data/recipe_crawler.gemspec
CHANGED

@@ -1,29 +1,27 @@
-
-lib = File.expand_path('../lib', __FILE__)
+lib = File.expand_path('lib', __dir__)
 $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
 require 'recipe_crawler/version'
 
 Gem::Specification.new do |spec|
-  spec.name =
+  spec.name = 'recipe_crawler'
   spec.version = RecipeCrawler::VERSION
-  spec.authors = [
-  spec.email = [
-
-  spec.summary = %q{Get all recipes from famous french cooking websites}
-  spec.description = %q{This crawler will use my personnal scraper named 'RecipeScraper' to dowload recipes data from Marmiton, 750g or cuisineaz}
-  spec.homepage = "https://github.com/madeindjs/recipe_crawler."
-  spec.license = "MIT"
+  spec.authors = ['Alexandre Rousseau']
+  spec.email = ['contact@rousseau-alexandre.fr']
 
+  spec.summary = 'Get all recipes from famous french cooking websites'
+  spec.description = "This crawler will use my personnal scraper named 'RecipeScraper' to dowload recipes data from Marmiton, 750g or cuisineaz"
+  spec.homepage = 'https://github.com/madeindjs/recipe_crawler'
+  spec.license = 'MIT'
 
   spec.files = `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
   spec.executables = ['recipe_crawler']
-  spec.require_paths = [
-
-  spec.add_dependency "recipe_scraper", '>= 2.2.0'
+  spec.require_paths = ['lib']
 
+  spec.add_dependency 'recipe_scraper', '~> 2.0'
+  spec.add_dependency 'sqlite3', '~> 1.3'
 
-  spec.add_development_dependency
-  spec.add_development_dependency
-  spec.add_development_dependency
-  spec.add_development_dependency
+  spec.add_development_dependency 'bundler', '~> 1.17'
+  spec.add_development_dependency 'rake', '~> 10.0'
+  spec.add_development_dependency 'rspec', '~> 3.0'
+  spec.add_development_dependency 'yard'
 end
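The recipe_scraper dependency moves from an open-ended `>= 2.2.0` to the pessimistic constraint `~> 2.0`, which caps it below the next major version. What `~> 2.0` admits can be checked directly with the `Gem::Requirement` API that RubyGems itself uses:

```ruby
require 'rubygems'  # provides Gem::Requirement and Gem::Version

# "~> 2.0" means ">= 2.0 and < 3.0": any 2.x release, no 3.x.
req = Gem::Requirement.new('~> 2.0')

puts req.satisfied_by?(Gem::Version.new('2.0.0'))  # true
puts req.satisfied_by?(Gem::Version.new('2.9.9'))  # true
puts req.satisfied_by?(Gem::Version.new('3.0.0'))  # false
```

The trade-off: the old `>= 2.2.0` would silently pick up a breaking 3.0 release of recipe_scraper, while `~> 2.0` will not, at the cost of needing a gemspec bump to adopt one.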
metadata
CHANGED

@@ -1,43 +1,57 @@
 --- !ruby/object:Gem::Specification
 name: recipe_crawler
 version: !ruby/object:Gem::Version
-  version:
+  version: 4.0.0
 platform: ruby
 authors:
--
+- Alexandre Rousseau
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2018-12-08 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: recipe_scraper
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - "
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: 2.
+        version: '2.0'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - "
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.0'
+- !ruby/object:Gem::Dependency
+  name: sqlite3
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.3'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
       - !ruby/object:Gem::Version
-        version:
+        version: '1.3'
 - !ruby/object:Gem::Dependency
   name: bundler
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.
+        version: '1.17'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.
+        version: '1.17'
 - !ruby/object:Gem::Dependency
   name: rake
   requirement: !ruby/object:Gem::Requirement

@@ -83,7 +97,7 @@ dependencies:
 description: This crawler will use my personnal scraper named 'RecipeScraper' to dowload
   recipes data from Marmiton, 750g or cuisineaz
 email:
--
+- contact@rousseau-alexandre.fr
 executables:
 - recipe_crawler
 extensions: []

@@ -104,7 +118,7 @@ files:
 - lib/recipe_crawler/crawler.rb
 - lib/recipe_crawler/version.rb
 - recipe_crawler.gemspec
-homepage: https://github.com/madeindjs/recipe_crawler.
+homepage: https://github.com/madeindjs/recipe_crawler
 licenses:
 - MIT
 metadata: {}

@@ -124,7 +138,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 2.
+rubygems_version: 2.7.8
 signing_key:
 specification_version: 4
 summary: Get all recipes from famous french cooking websites