recipe_crawler 3.1.2 → 4.0.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
- SHA1:
-   metadata.gz: 3ba3aefb304f9ab25b0939192318bde3184d9ee9
-   data.tar.gz: 7f0f73cb2c3b29498fdc5eb47d62c862970ee813
+ SHA256:
+   metadata.gz: d2185c1d31c0fd91ddf2df44770a587a5fa9b4bb3b12106f0dcb1c04b4ad0f94
+   data.tar.gz: 02b21cabf006eb6f6430d2a91f6ee4879077496abd6b4234638cc5c03dff1448
  SHA512:
-   metadata.gz: a4849ca0b5ee1f5d2521d0d77f1308c9c11f56f2247a45d45defc4b12fa6ed9d2cb4a36f8831162221792b9a8d2b48eb57a8bfb9d05e4bcd18953d998f17f50f
-   data.tar.gz: 3124cbe521de9c8e45ea5f11f585600f38c9399d4301de6809e08eb9f9ce4768a0cbde3bfb4cd75e1205559dc3bc7cc4a33be725d871f33328ed394f91ea98d3
+   metadata.gz: e10ff78ee97a4e8bb830275768477cb77cc1441d0dbbcbce8008e18c79f0db85d6e97923140ee7cfb9483b09efe5b806dc2ed878d193723c0e7636a0bf0b989e
+   data.tar.gz: c947a04b528b40d5ab396bcc16d9e7ae8a5e21bc25d5295b339e97c23cb7132f3b7c543cae2f601bb5f1963503a73ba9661ba9cac31b50a307954075112558b4
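The checksums move from SHA1 to SHA256 alongside the existing SHA512 digests. A minimal Ruby sketch, not part of the gem, for reproducing the new digests locally, assuming the package has been extracted with `tar -xf recipe_crawler-4.0.0.gem`:

```ruby
require 'digest'

# metadata.gz and data.tar.gz sit at the top level of the extracted .gem archive;
# the hexdigests printed here should match the checksums.yaml entries above.
%w[metadata.gz data.tar.gz].each do |file|
  puts "#{file}: #{Digest::SHA256.file(file).hexdigest}"
end
```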
data/README.md CHANGED
@@ -1,6 +1,6 @@
  # RecipeCrawler
 
- A **web crawler** to save recipes from [marmiton.org](http://www.marmiton.org/), [750g.com](http://www.750g.com) or [cuisineaz.com](http://www.cuisineaz.com) into an **SQLite3** database.
+ A **web crawler** to save recipes from [marmiton.org](http://www.marmiton.org/), [750g.com](http://www.750g.com) or [cuisineaz.com](http://www.cuisineaz.com) into an **SQLite3** database.
 
  > For the moment, it works only with [cuisineaz.com](http://www.cuisineaz.com)
 
@@ -29,7 +29,7 @@ Or install it yourself as:
 
  ### Command line
 
- Install this gem and run
+ Install this gem and run
 
      $ recipe_crawler -h
      Usage: recipe_crawler [options]
@@ -60,9 +60,9 @@ Then you just need to instantiate a `RecipeCrawler::Crawler` with url of a Cuisi
      url = 'http://www.cuisineaz.com/recettes/pate-a-pizza-legere-55004.aspx'
      r = RecipeCrawler::Crawler.new url
 
- Then you just need to run the crawl with a limit number of recipe to fetch. All recipes will be saved in a *export.sqlite3* file. You can pass a block to play with `RecipeSraper::Recipe` objects.
+ Then you just need to run the crawl with a limited number of recipes to fetch. All recipes will be saved in an *export.sqlite3* file. You can pass a block to play with `RecipeScraper::Recipe` objects.
 
-     r.crawl!(10) do |recipe|
+     r.crawl!(limit: 10) do |recipe|
        puts recipe.to_hash
        # will return
        # --------------
@@ -91,7 +91,6 @@ Bug reports and pull requests are welcome on GitHub at https://github.com/[USERN
 
  The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).
 
- Author
- ----------
+ ## Author
 
- [Rousseau Alexandre](https://github.com/madeindjs)
+ [Rousseau Alexandre](https://github.com/madeindjs)
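The README hunks above document the new keyword-argument API. A short, runnable sketch of the 4.0.0 usage they describe, assuming the gem and its `recipe_scraper` dependency are installed:

```ruby
require 'recipe_crawler'

# Seed the crawler with a CuisineAZ recipe URL (the only supported host for now).
url = 'http://www.cuisineaz.com/recettes/pate-a-pizza-legere-55004.aspx'
crawler = RecipeCrawler::Crawler.new(url)

# crawl! now takes keyword arguments; the block receives each scraped recipe.
crawler.crawl!(limit: 10, interval_sleep_time: 1) do |recipe|
  puts recipe.to_hash
end
```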
data/bin/recipe_crawler CHANGED
File without changes
data/lib/recipe_crawler.rb CHANGED
@@ -1,5 +1,5 @@
- require "recipe_crawler/version"
- require "recipe_crawler/crawler"
+ require 'recipe_crawler/version'
+ require 'recipe_crawler/crawler'
 
  module RecipeCrawler
    # Your code goes here...
data/lib/recipe_crawler/crawler.rb CHANGED
@@ -3,175 +3,160 @@ require 'nokogiri'
  require 'open-uri'
  require 'sqlite3'
 
-
  module RecipeCrawler
-
-   # This is the main class to crawl recipes from a given url
-   # 1. Crawler will crawl url to find others recipes urls on the website
-   # 2. it will crawl urls founded to find other url again & again
-   # 3. it will scrape urls founded to get data
-   #
-   # @attr_reader url [String] first url parsed
-   # @attr_reader host [Symbol] of url's host
-   # @attr_reader scraped_urls [Array<String>] of url's host
-   # @attr_reader crawled_urls [Array<String>] of url's host
-   # @attr_reader to_crawl_urls [Array<String>] of url's host
-   # @attr_reader recipes [Array<RecipeSraper::Recipe>] recipes fetched
-   # @attr_reader db [SQLite3::Database] Sqlite database where recipe will be saved
-   class Crawler
-
-     # URL than crawler can parse
-     ALLOWED_URLS = {
-       cuisineaz: 'http://www.cuisineaz.com/recettes/',
-       marmiton: 'http://www.marmiton.org/recettes/',
-       g750: 'http://www.750g.com/'
-     }
-
-     attr_reader :url, :host, :scraped_urls, :crawled_urls, :to_crawl_urls, :recipes
-     attr_accessor :interval_sleep_time
-
-     #
-     # Create a Crawler
-     # @param url [String] a url a recipe to scrawl other one
-     def initialize url
-       @url = url
-       if url_valid?
-         @recipes = []
-         @crawled_urls = []
-         @scraped_urls = []
-         @to_crawl_urls = []
-         @to_crawl_urls << url
-         @interval_sleep_time = 0
-         @db = SQLite3::Database.new "results.sqlite3"
-         @db.execute "CREATE TABLE IF NOT EXISTS recipes(
-           Id INTEGER PRIMARY KEY,
-           title TEXT,
-           preptime INTEGER,
-           cooktime INTEGER,
-           ingredients TEXT,
-           steps TEXT,
+   # This is the main class to crawl recipes from a given url
+   # 1. Crawler will crawl the url to find other recipe urls on the website
+   # 2. it will crawl the urls found to find more urls, again & again
+   # 3. it will scrape the urls found to get data
+   #
+   # @attr_reader url [String] first url parsed
+   # @attr_reader host [Symbol] of url's host
+   # @attr_reader scraped_urls [Array<String>] urls already scraped
+   # @attr_reader crawled_urls [Array<String>] urls already crawled
+   # @attr_reader to_crawl_urls [Array<String>] urls still to crawl
+   # @attr_reader recipes [Array<RecipeScraper::Recipe>] recipes fetched
+   # @attr_reader db [SQLite3::Database] SQLite database where recipes will be saved
+   class Crawler
+     # URLs that the crawler can parse
+     ALLOWED_URLS = {
+       cuisineaz: 'cuisineaz.com/recettes/',
+       marmiton: 'marmiton.org/recettes/',
+       g750: '750g.com/'
+     }.freeze
+
+     attr_reader :url, :host, :scraped_urls, :crawled_urls, :to_crawl_urls, :recipes
+     attr_accessor :interval_sleep_time
+
+     #
+     # Create a Crawler
+     # @param url [String] url of a recipe from which to crawl other ones
+     def initialize(url)
+       @url = url
+       if url_valid?
+         @recipes = []
+         @crawled_urls = []
+         @scraped_urls = []
+         @to_crawl_urls = []
+         @to_crawl_urls << url
+         @interval_sleep_time = 0
+         @db = SQLite3::Database.new 'results.sqlite3'
+         @db.execute "CREATE TABLE IF NOT EXISTS recipes(
+           Id INTEGER PRIMARY KEY,
+           title TEXT,
+           preptime INTEGER,
+           cooktime INTEGER,
+           ingredients TEXT,
+           steps TEXT,
           image TEXT
         )"
-       else
-         raise ArgumentError , 'This url cannot be used'
-       end
-     end
-
-
-     #
-     # Check if the url can be parsed and set the host
-     #
-     # @return [Boolean] true if url can be parsed
-     def url_valid?
-       ALLOWED_URLS.each do |host, url_allowed|
-         if url.include? url_allowed
-           @host = host
-           return true
-         end
-       end
-       return false
-     end
-
-
-     #
-     # Start the crawl
-     # @param limit [Integer] the maximum number of scraped recipes
-     # @param interval_sleep_time [Integer] waiting time between scraping
-     #
-     # @yield [RecipeSraper::Recipe] as recipe scraped
-     def crawl! limit=2, interval_sleep_time=0
-       recipes_returned = 0
-
-       if @host == :cuisineaz
-
-         while !@to_crawl_urls.empty? and limit > @recipes.count
-           # find all link on url given (and urls of theses)
-           get_links @to_crawl_urls[0]
-           # now scrape an url
-           recipe = scrape @to_crawl_urls[0]
-           yield recipe if recipe and block_given?
-           sleep interval_sleep_time
-         end
-
-       else
-         raise NotImplementedError
-       end
-     end
-
-
-     #
-     # Scrape given url
-     # param url [String] as url to scrape
-     #
-     # @return [RecipeSraper::Recipe] as recipe scraped
-     # @return [nil] if recipe connat be fetched
-     def scrape url
-       begin
-         recipe = RecipeSraper::Recipe.new url
-         @scraped_urls << url
-         @recipes << recipe
-         if save recipe
-           return recipe
-         else
-           raise SQLite3::Exception, 'cannot save recipe'
-         end
-       rescue OpenURI::HTTPError
-         return nil
-       end
-     end
-
-
-     #
-     # Get recipes links from the given url
-     # @param url [String] as url to scrape
-     #
-     # @return [void]
-     def get_links url
-       # catch 404 error from host
-       begin
-         doc = Nokogiri::HTML(open(url))
-         # find internal links on page
-         doc.css('#tagCloud a').each do |link|
-           link = link.attr('href')
-           # If link correspond to a recipe we add it to recipe to scraw
-           if link.include?(ALLOWED_URLS[@host]) and !@crawled_urls.include?(url)
-             @to_crawl_urls << link
-           end
-         end
-         @to_crawl_urls.delete url
-         @crawled_urls << url
-         @to_crawl_urls.uniq!
-
-       rescue OpenURI::HTTPError
-         @to_crawl_urls.delete url
-         warn "#{url} cannot be reached"
-       end
-     end
-
-
-     #
-     # Save recipe
-     # @param recipe [RecipeSraper::Recipe] as recipe to save
-     #
-     # @return [Boolean] as true if success
-     def save recipe
-       begin
-         @db.execute "INSERT INTO recipes (title, preptime, cooktime, ingredients, steps, image)
+       else
+         raise ArgumentError, 'This url cannot be used'
+       end
+     end
+
+     #
+     # Check if the url can be parsed and set the host
+     #
+     # @return [Boolean] true if url can be parsed
+     def url_valid?
+       ALLOWED_URLS.each do |host, url_allowed|
+         if url.include? url_allowed
+           @host = host
+           return true
+         end
+       end
+       false
+     end
+
+     # Start the crawl
+     #
+     # @param limit [Integer] the maximum number of scraped recipes
+     # @param interval_sleep_time [Integer] waiting time between scrapes
+     # @yield [RecipeScraper::Recipe] as recipe scraped
+     def crawl!(limit: 2, interval_sleep_time: 0)
+       recipes_returned = 0
+
+       if @host == :cuisineaz
+
+         while !@to_crawl_urls.empty? && (limit > @recipes.count)
+           # find all links on the given url (and the urls of those)
+           url = @to_crawl_urls.first
+           next if url.nil?
+
+           get_links url
+           # now scrape a url
+           recipe = scrape url
+           yield recipe if recipe && block_given?
+           sleep interval_sleep_time
+         end
+
+       else
+         raise NotImplementedError
+       end
+     end
+
+     #
+     # Scrape given url
+     # @param url [String] as url to scrape
+     #
+     # @return [RecipeScraper::Recipe] as recipe scraped
+     # @return [nil] if recipe cannot be fetched
+     def scrape(url)
+       recipe = RecipeScraper::Recipe.new url
+       @scraped_urls << url
+       @recipes << recipe
+       if save recipe
+         return recipe
+       else
+         raise SQLite3::Exception, 'cannot save recipe'
+       end
+     rescue OpenURI::HTTPError
+       nil
+     end
+
+     #
+     # Get recipe links from the given url
+     # @param url [String] as url to scrape
+     #
+     # @return [void]
+     def get_links(url)
+       # catch 404 errors from host
+       doc = Nokogiri::HTML(open(url))
+       # find internal links on page
+       doc.css('#tagCloud a').each do |link|
+         link = link.attr('href')
+         # if the link corresponds to a recipe, add it to the urls to crawl
+         if link.include?(ALLOWED_URLS[@host]) && !@crawled_urls.include?(url)
+           @to_crawl_urls << link
+         end
+       end
+       @to_crawl_urls.delete url
+       @crawled_urls << url
+       @to_crawl_urls.uniq!
+     rescue OpenURI::HTTPError
+       @to_crawl_urls.delete url
+       warn "#{url} cannot be reached"
+     end
+
+     #
+     # Save recipe
+     # @param recipe [RecipeScraper::Recipe] as recipe to save
+     #
+     # @return [Boolean] as true if success
+     def save(recipe)
+       @db.execute "INSERT INTO recipes (title, preptime, cooktime, ingredients, steps, image)
                    VALUES (:title, :preptime, :cooktime, :ingredients, :steps, :image)",
-           title: recipe.title,
-           preptime: recipe.preptime,
-           ingredients: recipe.ingredients.join("\n"),
-           steps: recipe.steps.join("\n"),
-           image: recipe.image
-
-       return true
-
-       rescue SQLite3::Exception => e
-         puts "Exception occurred #{e}"
-         return false
-       end
-     end
-   end
-
-
- end
+                   title: recipe.title,
+                   preptime: recipe.preptime,
+                   ingredients: recipe.ingredients.join("\n"),
+                   steps: recipe.steps.join("\n"),
+                   image: recipe.image
+
+       true
+     rescue SQLite3::Exception => e
+       puts "Exception occurred #{e}"
+       false
+     end
+   end
+ end
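The signature change from positional to keyword arguments is the breaking part of this diff and explains the major version bump. A before/after sketch for callers upgrading (argument values are illustrative):

```ruby
r = RecipeCrawler::Crawler.new 'http://www.cuisineaz.com/recettes/pate-a-pizza-legere-55004.aspx'

# 3.1.2 — positional arguments:
#   r.crawl!(10, 2) { |recipe| puts recipe.title }

# 4.0.0 — keyword arguments:
r.crawl!(limit: 10, interval_sleep_time: 2) { |recipe| puts recipe.title }
```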
data/lib/recipe_crawler/version.rb CHANGED
@@ -1,3 +1,3 @@
  module RecipeCrawler
-   VERSION = "3.1.2"
+   VERSION = '4.0.0'.freeze
  end
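The version constant is now explicitly frozen, which is observable from the console (assuming the gem is on the load path):

```ruby
require 'recipe_crawler/version'

RecipeCrawler::VERSION          # => "4.0.0"
RecipeCrawler::VERSION.frozen?  # => true
```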
data/recipe_crawler.gemspec CHANGED
@@ -1,29 +1,27 @@
- # coding: utf-8
- lib = File.expand_path('../lib', __FILE__)
+ lib = File.expand_path('lib', __dir__)
  $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
  require 'recipe_crawler/version'
 
  Gem::Specification.new do |spec|
-   spec.name = "recipe_crawler"
+   spec.name = 'recipe_crawler'
    spec.version = RecipeCrawler::VERSION
-   spec.authors = ["madeindjs"]
-   spec.email = ["madeindjs@gmail.com"]
-
-   spec.summary = %q{Get all recipes from famous french cooking websites}
-   spec.description = %q{This crawler will use my personnal scraper named 'RecipeScraper' to dowload recipes data from Marmiton, 750g or cuisineaz}
-   spec.homepage = "https://github.com/madeindjs/recipe_crawler."
-   spec.license = "MIT"
+   spec.authors = ['Alexandre Rousseau']
+   spec.email = ['contact@rousseau-alexandre.fr']
 
+   spec.summary = 'Get all recipes from famous french cooking websites'
+   spec.description = "This crawler will use my personal scraper named 'RecipeScraper' to download recipes data from Marmiton, 750g or cuisineaz"
+   spec.homepage = 'https://github.com/madeindjs/recipe_crawler'
+   spec.license = 'MIT'
 
    spec.files = `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
    spec.executables = ['recipe_crawler']
-   spec.require_paths = ["lib"]
-
-   spec.add_dependency "recipe_scraper", '>= 2.2.0'
+   spec.require_paths = ['lib']
 
+   spec.add_dependency 'recipe_scraper', '~> 2.0'
+   spec.add_dependency 'sqlite3', '~> 1.3'
 
-   spec.add_development_dependency "bundler", "~> 1.11"
-   spec.add_development_dependency "rake", "~> 10.0"
-   spec.add_development_dependency "rspec", "~> 3.0"
-   spec.add_development_dependency "yard"
+   spec.add_development_dependency 'bundler', '~> 1.17'
+   spec.add_development_dependency 'rake', '~> 10.0'
+   spec.add_development_dependency 'rspec', '~> 3.0'
+   spec.add_development_dependency 'yard'
  end
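Since `sqlite3` is now declared as a runtime dependency (it was previously required by the code without being listed), a Gemfile needs only the gem itself; a sketch:

```ruby
# Gemfile — Bundler resolves recipe_scraper and sqlite3 transitively.
source 'https://rubygems.org'

gem 'recipe_crawler', '~> 4.0'
```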
metadata CHANGED
@@ -1,43 +1,57 @@
  --- !ruby/object:Gem::Specification
  name: recipe_crawler
  version: !ruby/object:Gem::Version
-   version: 3.1.2
+   version: 4.0.0
  platform: ruby
  authors:
- - madeindjs
+ - Alexandre Rousseau
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2016-12-05 00:00:00.000000000 Z
+ date: 2018-12-08 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: recipe_scraper
    requirement: !ruby/object:Gem::Requirement
      requirements:
-     - - ">="
+     - - "~>"
        - !ruby/object:Gem::Version
-         version: 2.2.0
+         version: '2.0'
    type: :runtime
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
      requirements:
-     - - ">="
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '2.0'
+ - !ruby/object:Gem::Dependency
+   name: sqlite3
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.3'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
        - !ruby/object:Gem::Version
-         version: 2.2.0
+         version: '1.3'
  - !ruby/object:Gem::Dependency
    name: bundler
    requirement: !ruby/object:Gem::Requirement
      requirements:
      - - "~>"
        - !ruby/object:Gem::Version
-         version: '1.11'
+         version: '1.17'
    type: :development
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
      requirements:
      - - "~>"
        - !ruby/object:Gem::Version
-         version: '1.11'
+         version: '1.17'
  - !ruby/object:Gem::Dependency
    name: rake
    requirement: !ruby/object:Gem::Requirement
@@ -83,7 +97,7 @@ dependencies:
  description: This crawler will use my personal scraper named 'RecipeScraper' to download
    recipes data from Marmiton, 750g or cuisineaz
  email:
- - madeindjs@gmail.com
+ - contact@rousseau-alexandre.fr
  executables:
  - recipe_crawler
  extensions: []
@@ -104,7 +118,7 @@ files:
  - lib/recipe_crawler/crawler.rb
  - lib/recipe_crawler/version.rb
  - recipe_crawler.gemspec
- homepage: https://github.com/madeindjs/recipe_crawler.
+ homepage: https://github.com/madeindjs/recipe_crawler
  licenses:
  - MIT
  metadata: {}
@@ -124,7 +138,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
    version: '0'
  requirements: []
  rubyforge_project:
- rubygems_version: 2.5.1
+ rubygems_version: 2.7.8
  signing_key:
  specification_version: 4
  summary: Get all recipes from famous french cooking websites
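A small sketch, assuming the gem is installed, for confirming the regenerated metadata above through the RubyGems API:

```ruby
spec = Gem::Specification.find_by_name('recipe_crawler')

puts spec.version              # => 4.0.0
puts spec.runtime_dependencies # => recipe_scraper (~> 2.0), sqlite3 (~> 1.3)
```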