recipe_crawler 3.1.2 → 4.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +5 -5
- data/README.md +6 -7
- data/bin/recipe_crawler +0 -0
- data/lib/recipe_crawler.rb +2 -2
- data/lib/recipe_crawler/crawler.rb +153 -168
- data/lib/recipe_crawler/version.rb +1 -1
- data/recipe_crawler.gemspec +15 -17
- metadata +26 -12
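For users upgrading across this major version bump, the most visible break is in `RecipeCrawler::Crawler#crawl!`, which moves from a positional limit to keyword arguments (see the README and crawler.rb diffs below). A minimal before/after sketch, assuming the gem is installed:

    require 'recipe_crawler'

    r = RecipeCrawler::Crawler.new 'http://www.cuisineaz.com/recettes/pate-a-pizza-legere-55004.aspx'

    # 3.1.2: positional argument
    #   r.crawl!(10) { |recipe| puts recipe.to_hash }

    # 4.0.0: keyword arguments; interval_sleep_time throttles requests
    r.crawl!(limit: 10, interval_sleep_time: 1) { |recipe| puts recipe.to_hash }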
checksums.yaml
CHANGED

@@ -1,7 +1,7 @@
 ---
-SHA256:
-  metadata.gz:
-  data.tar.gz:
+SHA256:
+  metadata.gz: d2185c1d31c0fd91ddf2df44770a587a5fa9b4bb3b12106f0dcb1c04b4ad0f94
+  data.tar.gz: 02b21cabf006eb6f6430d2a91f6ee4879077496abd6b4234638cc5c03dff1448
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e10ff78ee97a4e8bb830275768477cb77cc1441d0dbbcbce8008e18c79f0db85d6e97923140ee7cfb9483b09efe5b806dc2ed878d193723c0e7636a0bf0b989e
+  data.tar.gz: c947a04b528b40d5ab396bcc16d9e7ae8a5e21bc25d5295b339e97c23cb7132f3b7c543cae2f601bb5f1963503a73ba9661ba9cac31b50a307954075112558b4
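These digests cover the metadata.gz and data.tar.gz members inside the published .gem, which is a plain tar archive, so the 4.0.0 values can be re-checked locally. A sketch using only the Ruby standard library, assuming the package has been downloaded as recipe_crawler-4.0.0.gem in the working directory (the old hashes are not shown in this diff view, so only the new side is verifiable):

    require 'rubygems/package'
    require 'digest'

    # SHA256 values published in checksums.yaml above
    EXPECTED = {
      'metadata.gz' => 'd2185c1d31c0fd91ddf2df44770a587a5fa9b4bb3b12106f0dcb1c04b4ad0f94',
      'data.tar.gz' => '02b21cabf006eb6f6430d2a91f6ee4879077496abd6b4234638cc5c03dff1448'
    }.freeze

    File.open('recipe_crawler-4.0.0.gem', 'rb') do |io|
      Gem::Package::TarReader.new(io).each do |entry|
        next unless EXPECTED.key?(entry.full_name)
        actual = Digest::SHA256.hexdigest(entry.read)
        puts "#{entry.full_name}: #{actual == EXPECTED[entry.full_name] ? 'OK' : 'MISMATCH'}"
      end
    end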
data/README.md
CHANGED

@@ -1,6 +1,6 @@
 # RecipeCrawler
 
-A **web crawler** to save recipes from [marmiton.org](http://www.marmiton.org/), [750g.com](http://www.750g.com) or [cuisineaz.com](http://www.cuisineaz.com) into an **SQlite3** database.
+A **web crawler** to save recipes from [marmiton.org](http://www.marmiton.org/), [750g.com](http://www.750g.com) or [cuisineaz.com](http://www.cuisineaz.com) into an **SQlite3** database.
 
 > For the moment, it works only with [cuisineaz.com](http://www.cuisineaz.com)
 

@@ -29,7 +29,7 @@ Or install it yourself as:
 
 ### Command line
 
-Install this gem and run
+Install this gem and run
 
     $ recipe_crawler -h
     Usage: recipe_crawler [options]

@@ -60,9 +60,9 @@ Then you just need to instanciate a `RecipeCrawler::Crawler` with url of a Cuisi
     url = 'http://www.cuisineaz.com/recettes/pate-a-pizza-legere-55004.aspx'
     r = RecipeCrawler::Crawler.new url
 
-Then you just need to run the crawl with a limit number of recipe to fetch. All recipes will be saved in a *export.sqlite3* file. You can pass a block to play with `
+Then you just need to run the crawl with a limit number of recipe to fetch. All recipes will be saved in a *export.sqlite3* file. You can pass a block to play with `RecipeScraper::Recipe` objects.
 
-    r.crawl!(10) do |recipe|
+    r.crawl!(limit: 10) do |recipe|
       puts recipe.to_hash
       # will return
       # --------------

@@ -91,7 +91,6 @@ Bug reports and pull requests are welcome on GitHub at https://github.com/[USERN
 
 The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).
 
-Author
-----------
+## Author
 
-[Rousseau Alexandre](https://github.com/madeindjs)
+[Rousseau Alexandre](https://github.com/madeindjs)
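Putting the README fragments together, a complete 4.0.0 session looks roughly like the sketch below (one caveat: the README above still says *export.sqlite3*, while the crawler.rb diff further down shows this release writing to *results.sqlite3*):

    require 'recipe_crawler'

    begin
      r = RecipeCrawler::Crawler.new 'http://www.cuisineaz.com/recettes/pate-a-pizza-legere-55004.aspx'
      r.crawl!(limit: 10) do |recipe|
        puts recipe.to_hash
      end
    rescue ArgumentError => e
      # Crawler.new in 4.0.0 raises ArgumentError when the url matches none of ALLOWED_URLS
      warn e.message
    end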
data/bin/recipe_crawler
CHANGED
File without changes
data/lib/recipe_crawler.rb
CHANGED
(2-line change not captured in this view)

data/lib/recipe_crawler/crawler.rb
CHANGED

@@ -3,175 +3,160 @@ require 'nokogiri'
 require 'open-uri'
 require 'sqlite3'
 
-
 module RecipeCrawler
-[old lines 8 to 50: content not captured in this view]
-      ingredients TEXT,
-      steps TEXT,
+  # This is the main class to crawl recipes from a given url
+  # 1. Crawler will crawl url to find others recipes urls on the website
+  # 2. it will crawl urls founded to find other url again & again
+  # 3. it will scrape urls founded to get data
+  #
+  # @attr_reader url [String] first url parsed
+  # @attr_reader host [Symbol] of url's host
+  # @attr_reader scraped_urls [Array<String>] of url's host
+  # @attr_reader crawled_urls [Array<String>] of url's host
+  # @attr_reader to_crawl_urls [Array<String>] of url's host
+  # @attr_reader recipes [Array<RecipeScraper::Recipe>] recipes fetched
+  # @attr_reader db [SQLite3::Database] Sqlite database where recipe will be saved
+  class Crawler
+    # URL than crawler can parse
+    ALLOWED_URLS = {
+      cuisineaz: 'cuisineaz.com/recettes/',
+      marmiton: 'marmiton.org/recettes/',
+      g750: '750g.com/'
+    }.freeze
+
+    attr_reader :url, :host, :scraped_urls, :crawled_urls, :to_crawl_urls, :recipes
+    attr_accessor :interval_sleep_time
+
+    #
+    # Create a Crawler
+    # @param url [String] a url a recipe to scrawl other one
+    def initialize(url)
+      @url = url
+      if url_valid?
+        @recipes = []
+        @crawled_urls = []
+        @scraped_urls = []
+        @to_crawl_urls = []
+        @to_crawl_urls << url
+        @interval_sleep_time = 0
+        @db = SQLite3::Database.new 'results.sqlite3'
+        @db.execute "CREATE TABLE IF NOT EXISTS recipes(
+          Id INTEGER PRIMARY KEY,
+          title TEXT,
+          preptime INTEGER,
+          cooktime INTEGER,
+          ingredients TEXT,
+          steps TEXT,
           image TEXT
         )"
-[old lines 55 to 151: content not captured in this view]
-  #
-  # Save recipe
-  # @param recipe [RecipeSraper::Recipe] as recipe to save
-  #
-  # @return [Boolean] as true if success
-  def save recipe
-    begin
-      @db.execute "INSERT INTO recipes (title, preptime, cooktime, ingredients, steps, image)
+      else
+        raise ArgumentError, 'This url cannot be used'
+      end
+    end
+
+    #
+    # Check if the url can be parsed and set the host
+    #
+    # @return [Boolean] true if url can be parsed
+    def url_valid?
+      ALLOWED_URLS.each do |host, url_allowed|
+        if url.include? url_allowed
+          @host = host
+          return true
+        end
+      end
+      false
+    end
+
+    # Start the crawl
+    #
+    # @param limit [Integer] the maximum number of scraped recipes
+    # @param interval_sleep_time [Integer] waiting time between scraping
+    # @yield [RecipeScraper::Recipe] as recipe scraped
+    def crawl!(limit: 2, interval_sleep_time: 0)
+      recipes_returned = 0
+
+      if @host == :cuisineaz
+
+        while !@to_crawl_urls.empty? && (limit > @recipes.count)
+          # find all link on url given (and urls of theses)
+          url = @to_crawl_urls.first
+          next if url.nil?
+
+          get_links url
+          # now scrape an url
+          recipe = scrape url
+          yield recipe if recipe && block_given?
+          sleep interval_sleep_time
+        end
+
+      else
+        raise NotImplementedError
+      end
+    end
+
+    #
+    # Scrape given url
+    # param url [String] as url to scrape
+    #
+    # @return [RecipeScraper::Recipe] as recipe scraped
+    # @return [nil] if recipe connat be fetched
+    def scrape(url)
+      recipe = RecipeScraper::Recipe.new url
+      @scraped_urls << url
+      @recipes << recipe
+      if save recipe
+        return recipe
+      else
+        raise SQLite3::Exception, 'cannot save recipe'
+      end
+    rescue OpenURI::HTTPError
+      nil
+    end
+
+    #
+    # Get recipes links from the given url
+    # @param url [String] as url to scrape
+    #
+    # @return [void]
+    def get_links(url)
+      # catch 404 error from host
+
+      doc = Nokogiri::HTML(open(url))
+      # find internal links on page
+      doc.css('#tagCloud a').each do |link|
+        link = link.attr('href')
+        # If link correspond to a recipe we add it to recipe to scraw
+        if link.include?(ALLOWED_URLS[@host]) && !@crawled_urls.include?(url)
+          @to_crawl_urls << link
+        end
+      end
+      @to_crawl_urls.delete url
+      @crawled_urls << url
+      @to_crawl_urls.uniq!
+    rescue OpenURI::HTTPError
+      @to_crawl_urls.delete url
+      warn "#{url} cannot be reached"
+    end
+
+    #
+    # Save recipe
+    # @param recipe [RecipeScraper::Recipe] as recipe to save
+    #
+    # @return [Boolean] as true if success
+    def save(recipe)
+      @db.execute "INSERT INTO recipes (title, preptime, cooktime, ingredients, steps, image)
         VALUES (:title, :preptime, :cooktime, :ingredients, :steps, :image)",
-[old lines 161 to 173: content not captured in this view]
-  end
-
-
-end
+        title: recipe.title,
+        preptime: recipe.preptime,
+        ingredients: recipe.ingredients.join("\n"),
+        steps: recipe.steps.join("\n"),
+        image: recipe.image
+
+      true
+    rescue SQLite3::Exception => e
+      puts "Exception occurred #{e}"
+      false
+    end
+  end
+end
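Since everything lands in a local SQLite file, saved rows can be read back with the same sqlite3 gem the crawler already depends on. A small sketch against the schema created in `initialize` above, assuming results.sqlite3 sits in the working directory (note that the INSERT in `save` binds no :cooktime value, so that column stays NULL in this release):

    require 'sqlite3'

    db = SQLite3::Database.new 'results.sqlite3'
    db.results_as_hash = true

    # Columns come from the CREATE TABLE IF NOT EXISTS recipes(...) statement above
    db.execute('SELECT Id, title, preptime, cooktime FROM recipes ORDER BY Id') do |row|
      puts "#{row['Id']}. #{row['title']} (preptime: #{row['preptime']}, cooktime: #{row['cooktime']})"
    end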
data/recipe_crawler.gemspec
CHANGED

@@ -1,29 +1,27 @@
-[old line 1: content not captured in this view]
-lib = File.expand_path('../lib', __FILE__)
+lib = File.expand_path('lib', __dir__)
 $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
 require 'recipe_crawler/version'
 
 Gem::Specification.new do |spec|
-  spec.name =
+  spec.name = 'recipe_crawler'
   spec.version = RecipeCrawler::VERSION
-  spec.authors = [
-  spec.email = [
-
-  spec.summary = %q{Get all recipes from famous french cooking websites}
-  spec.description = %q{This crawler will use my personnal scraper named 'RecipeScraper' to dowload recipes data from Marmiton, 750g or cuisineaz}
-  spec.homepage = "https://github.com/madeindjs/recipe_crawler."
-  spec.license = "MIT"
+  spec.authors = ['Alexandre Rousseau']
+  spec.email = ['contact@rousseau-alexandre.fr']
 
+  spec.summary = 'Get all recipes from famous french cooking websites'
+  spec.description = "This crawler will use my personnal scraper named 'RecipeScraper' to dowload recipes data from Marmiton, 750g or cuisineaz"
+  spec.homepage = 'https://github.com/madeindjs/recipe_crawler'
+  spec.license = 'MIT'
 
   spec.files = `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
   spec.executables = ['recipe_crawler']
-  spec.require_paths = [
-
-  spec.add_dependency "recipe_scraper", '>= 2.2.0'
+  spec.require_paths = ['lib']
 
+  spec.add_dependency 'recipe_scraper', '~> 2.0'
+  spec.add_dependency 'sqlite3', '~> 1.3'
 
-  spec.add_development_dependency
-  spec.add_development_dependency
-  spec.add_development_dependency
-  spec.add_development_dependency
+  spec.add_development_dependency 'bundler', '~> 1.17'
+  spec.add_development_dependency 'rake', '~> 10.0'
+  spec.add_development_dependency 'rspec', '~> 3.0'
+  spec.add_development_dependency 'yard'
 end
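The runtime constraint on recipe_scraper changes character here: the old `>= 2.2.0` accepted any future major release, while the pessimistic `~> 2.0` pins resolution to the 2.x series (and also admits 2.0.x versions the old floor excluded). RubyGems itself can demonstrate the operator; a quick sketch:

    require 'rubygems'

    req = Gem::Requirement.new('~> 2.0')

    # '~> 2.0' means '>= 2.0 and < 3.0'
    ['2.0.0', '2.2.0', '2.9.9', '3.0.0'].each do |v|
      puts "#{v}: #{req.satisfied_by?(Gem::Version.new(v))}"
    end
    # => 2.0.0: true, 2.2.0: true, 2.9.9: true, 3.0.0: false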
metadata
CHANGED

@@ -1,43 +1,57 @@
 --- !ruby/object:Gem::Specification
 name: recipe_crawler
 version: !ruby/object:Gem::Version
-  version:
+  version: 4.0.0
 platform: ruby
 authors:
--
+- Alexandre Rousseau
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2018-12-08 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: recipe_scraper
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - "
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: 2.
+        version: '2.0'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - "
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.0'
+- !ruby/object:Gem::Dependency
+  name: sqlite3
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.3'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
       - !ruby/object:Gem::Version
-        version:
+        version: '1.3'
 - !ruby/object:Gem::Dependency
   name: bundler
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.
+        version: '1.17'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.
+        version: '1.17'
 - !ruby/object:Gem::Dependency
   name: rake
   requirement: !ruby/object:Gem::Requirement

@@ -83,7 +97,7 @@ dependencies:
 description: This crawler will use my personnal scraper named 'RecipeScraper' to dowload
   recipes data from Marmiton, 750g or cuisineaz
 email:
--
+- contact@rousseau-alexandre.fr
 executables:
 - recipe_crawler
 extensions: []

@@ -104,7 +118,7 @@ files:
 - lib/recipe_crawler/crawler.rb
 - lib/recipe_crawler/version.rb
 - recipe_crawler.gemspec
-homepage: https://github.com/madeindjs/recipe_crawler.
+homepage: https://github.com/madeindjs/recipe_crawler
 licenses:
 - MIT
 metadata: {}

@@ -124,7 +138,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 2.
+rubygems_version: 2.7.8
 signing_key:
 specification_version: 4
 summary: Get all recipes from famous french cooking websites