recipe_crawler 3.1.2 → 4.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
- SHA1:
- metadata.gz: 3ba3aefb304f9ab25b0939192318bde3184d9ee9
- data.tar.gz: 7f0f73cb2c3b29498fdc5eb47d62c862970ee813
+ SHA256:
+ metadata.gz: d2185c1d31c0fd91ddf2df44770a587a5fa9b4bb3b12106f0dcb1c04b4ad0f94
+ data.tar.gz: 02b21cabf006eb6f6430d2a91f6ee4879077496abd6b4234638cc5c03dff1448
  SHA512:
- metadata.gz: a4849ca0b5ee1f5d2521d0d77f1308c9c11f56f2247a45d45defc4b12fa6ed9d2cb4a36f8831162221792b9a8d2b48eb57a8bfb9d05e4bcd18953d998f17f50f
- data.tar.gz: 3124cbe521de9c8e45ea5f11f585600f38c9399d4301de6809e08eb9f9ce4768a0cbde3bfb4cd75e1205559dc3bc7cc4a33be725d871f33328ed394f91ea98d3
+ metadata.gz: e10ff78ee97a4e8bb830275768477cb77cc1441d0dbbcbce8008e18c79f0db85d6e97923140ee7cfb9483b09efe5b806dc2ed878d193723c0e7636a0bf0b989e
+ data.tar.gz: c947a04b528b40d5ab396bcc16d9e7ae8a5e21bc25d5295b339e97c23cb7132f3b7c543cae2f601bb5f1963503a73ba9661ba9cac31b50a307954075112558b4
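The checksum manifest switches from SHA1 to SHA256 digests (the SHA512 entries remain, with new values for the new release). A minimal verification sketch, not part of the package: a `.gem` file is a tar archive whose `metadata.gz` and `data.tar.gz` members are exactly what checksums.yaml covers, so the digests can be recomputed after unpacking it, e.g. with `gem fetch recipe_crawler -v 4.0.0 && tar -xf recipe_crawler-4.0.0.gem`.

```ruby
require 'digest'

# Recompute the SHA256 digests of the unpacked archive members and compare
# them against the values in checksums.yaml above.
%w[metadata.gz data.tar.gz].each do |member|
  puts "#{member}: #{Digest::SHA256.file(member).hexdigest}"
end
```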
data/README.md CHANGED
@@ -1,6 +1,6 @@
  # RecipeCrawler
 
- A **web crawler** to save recipes from [marmiton.org](http://www.marmiton.org/), [750g.com](http://www.750g.com) or [cuisineaz.com](http://www.cuisineaz.com) into an **SQlite3** database.
+ A **web crawler** to save recipes from [marmiton.org](http://www.marmiton.org/), [750g.com](http://www.750g.com) or [cuisineaz.com](http://www.cuisineaz.com) into an **SQlite3** database.
 
  > For the moment, it works only with [cuisineaz.com](http://www.cuisineaz.com)
 
@@ -29,7 +29,7 @@ Or install it yourself as:
 
  ### Command line
 
- Install this gem and run
+ Install this gem and run
 
  $ recipe_crawler -h
  Usage: recipe_crawler [options]
@@ -60,9 +60,9 @@ Then you just need to instanciate a `RecipeCrawler::Crawler` with url of a Cuisi
  url = 'http://www.cuisineaz.com/recettes/pate-a-pizza-legere-55004.aspx'
  r = RecipeCrawler::Crawler.new url
 
- Then you just need to run the crawl with a limit number of recipe to fetch. All recipes will be saved in a *export.sqlite3* file. You can pass a block to play with `RecipeSraper::Recipe` objects.
+ Then you just need to run the crawl with a limit number of recipe to fetch. All recipes will be saved in a *export.sqlite3* file. You can pass a block to play with `RecipeScraper::Recipe` objects.
 
- r.crawl!(10) do |recipe|
+ r.crawl!(limit: 10) do |recipe|
  puts recipe.to_hash
  # will return
  # --------------
@@ -91,7 +91,6 @@ Bug reports and pull requests are welcome on GitHub at https://github.com/[USERN
 
  The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).
 
- Author
- ----------
+ ## Author
 
- [Rousseau Alexandre](https://github.com/madeindjs)
+ [Rousseau Alexandre](https://github.com/madeindjs)
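The README edits above track the breaking API change of this release: `Crawler#crawl!` now takes keyword arguments instead of positional ones (hence the major version bump). A short usage sketch assembled only from the README's own snippets; the URL and block body are its example, and the argument values are illustrative:

```ruby
require 'recipe_crawler'

# The README's example URL; limit/interval_sleep_time are the keyword
# arguments introduced in 4.0.0 (3.x used the positional form crawl!(10)).
crawler = RecipeCrawler::Crawler.new 'http://www.cuisineaz.com/recettes/pate-a-pizza-legere-55004.aspx'

crawler.crawl!(limit: 10, interval_sleep_time: 1) do |recipe|
  puts recipe.to_hash
end
```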
data/bin/recipe_crawler CHANGED
File without changes
data/lib/recipe_crawler.rb CHANGED
@@ -1,5 +1,5 @@
- require "recipe_crawler/version"
- require "recipe_crawler/crawler"
+ require 'recipe_crawler/version'
+ require 'recipe_crawler/crawler'
 
  module RecipeCrawler
  # Your code goes here...
data/lib/recipe_crawler/crawler.rb CHANGED
@@ -3,175 +3,160 @@ require 'nokogiri'
  require 'open-uri'
  require 'sqlite3'
 
-
  module RecipeCrawler
-
- # This is the main class to crawl recipes from a given url
- # 1. Crawler will crawl url to find others recipes urls on the website
- # 2. it will crawl urls founded to find other url again & again
- # 3. it will scrape urls founded to get data
- #
- # @attr_reader url [String] first url parsed
- # @attr_reader host [Symbol] of url's host
- # @attr_reader scraped_urls [Array<String>] of url's host
- # @attr_reader crawled_urls [Array<String>] of url's host
- # @attr_reader to_crawl_urls [Array<String>] of url's host
- # @attr_reader recipes [Array<RecipeSraper::Recipe>] recipes fetched
- # @attr_reader db [SQLite3::Database] Sqlite database where recipe will be saved
- class Crawler
-
- # URL than crawler can parse
- ALLOWED_URLS = {
- cuisineaz: 'http://www.cuisineaz.com/recettes/',
- marmiton: 'http://www.marmiton.org/recettes/',
- g750: 'http://www.750g.com/'
- }
-
- attr_reader :url, :host, :scraped_urls, :crawled_urls, :to_crawl_urls, :recipes
- attr_accessor :interval_sleep_time
-
- #
- # Create a Crawler
- # @param url [String] a url a recipe to scrawl other one
- def initialize url
- @url = url
- if url_valid?
- @recipes = []
- @crawled_urls = []
- @scraped_urls = []
- @to_crawl_urls = []
- @to_crawl_urls << url
- @interval_sleep_time = 0
- @db = SQLite3::Database.new "results.sqlite3"
- @db.execute "CREATE TABLE IF NOT EXISTS recipes(
- Id INTEGER PRIMARY KEY,
- title TEXT,
- preptime INTEGER,
- cooktime INTEGER,
- ingredients TEXT,
- steps TEXT,
+ # This is the main class to crawl recipes from a given url
+ # 1. Crawler will crawl url to find others recipes urls on the website
+ # 2. it will crawl urls founded to find other url again & again
+ # 3. it will scrape urls founded to get data
+ #
+ # @attr_reader url [String] first url parsed
+ # @attr_reader host [Symbol] of url's host
+ # @attr_reader scraped_urls [Array<String>] of url's host
+ # @attr_reader crawled_urls [Array<String>] of url's host
+ # @attr_reader to_crawl_urls [Array<String>] of url's host
+ # @attr_reader recipes [Array<RecipeScraper::Recipe>] recipes fetched
+ # @attr_reader db [SQLite3::Database] Sqlite database where recipe will be saved
+ class Crawler
+ # URL than crawler can parse
+ ALLOWED_URLS = {
+ cuisineaz: 'cuisineaz.com/recettes/',
+ marmiton: 'marmiton.org/recettes/',
+ g750: '750g.com/'
+ }.freeze
+
+ attr_reader :url, :host, :scraped_urls, :crawled_urls, :to_crawl_urls, :recipes
+ attr_accessor :interval_sleep_time
+
+ #
+ # Create a Crawler
+ # @param url [String] a url a recipe to scrawl other one
+ def initialize(url)
+ @url = url
+ if url_valid?
+ @recipes = []
+ @crawled_urls = []
+ @scraped_urls = []
+ @to_crawl_urls = []
+ @to_crawl_urls << url
+ @interval_sleep_time = 0
+ @db = SQLite3::Database.new 'results.sqlite3'
+ @db.execute "CREATE TABLE IF NOT EXISTS recipes(
+ Id INTEGER PRIMARY KEY,
+ title TEXT,
+ preptime INTEGER,
+ cooktime INTEGER,
+ ingredients TEXT,
+ steps TEXT,
  image TEXT
  )"
- else
- raise ArgumentError , 'This url cannot be used'
- end
- end
-
-
- #
- # Check if the url can be parsed and set the host
- #
- # @return [Boolean] true if url can be parsed
- def url_valid?
- ALLOWED_URLS.each do |host, url_allowed|
- if url.include? url_allowed
- @host = host
- return true
- end
- end
- return false
- end
-
-
- #
- # Start the crawl
- # @param limit [Integer] the maximum number of scraped recipes
- # @param interval_sleep_time [Integer] waiting time between scraping
- #
- # @yield [RecipeSraper::Recipe] as recipe scraped
- def crawl! limit=2, interval_sleep_time=0
- recipes_returned = 0
-
- if @host == :cuisineaz
-
- while !@to_crawl_urls.empty? and limit > @recipes.count
- # find all link on url given (and urls of theses)
- get_links @to_crawl_urls[0]
- # now scrape an url
- recipe = scrape @to_crawl_urls[0]
- yield recipe if recipe and block_given?
- sleep interval_sleep_time
- end
-
- else
- raise NotImplementedError
- end
- end
-
-
- #
- # Scrape given url
- # param url [String] as url to scrape
- #
- # @return [RecipeSraper::Recipe] as recipe scraped
- # @return [nil] if recipe connat be fetched
- def scrape url
- begin
- recipe = RecipeSraper::Recipe.new url
- @scraped_urls << url
- @recipes << recipe
- if save recipe
- return recipe
- else
- raise SQLite3::Exception, 'cannot save recipe'
- end
- rescue OpenURI::HTTPError
- return nil
- end
- end
-
-
- #
- # Get recipes links from the given url
- # @param url [String] as url to scrape
- #
- # @return [void]
- def get_links url
- # catch 404 error from host
- begin
- doc = Nokogiri::HTML(open(url))
- # find internal links on page
- doc.css('#tagCloud a').each do |link|
- link = link.attr('href')
- # If link correspond to a recipe we add it to recipe to scraw
- if link.include?(ALLOWED_URLS[@host]) and !@crawled_urls.include?(url)
- @to_crawl_urls << link
- end
- end
- @to_crawl_urls.delete url
- @crawled_urls << url
- @to_crawl_urls.uniq!
-
- rescue OpenURI::HTTPError
- @to_crawl_urls.delete url
- warn "#{url} cannot be reached"
- end
- end
-
-
- #
- # Save recipe
- # @param recipe [RecipeSraper::Recipe] as recipe to save
- #
- # @return [Boolean] as true if success
- def save recipe
- begin
- @db.execute "INSERT INTO recipes (title, preptime, cooktime, ingredients, steps, image)
+ else
+ raise ArgumentError, 'This url cannot be used'
+ end
+ end
+
+ #
+ # Check if the url can be parsed and set the host
+ #
+ # @return [Boolean] true if url can be parsed
+ def url_valid?
+ ALLOWED_URLS.each do |host, url_allowed|
+ if url.include? url_allowed
+ @host = host
+ return true
+ end
+ end
+ false
+ end
+
+ # Start the crawl
+ #
+ # @param limit [Integer] the maximum number of scraped recipes
+ # @param interval_sleep_time [Integer] waiting time between scraping
+ # @yield [RecipeScraper::Recipe] as recipe scraped
+ def crawl!(limit: 2, interval_sleep_time: 0)
+ recipes_returned = 0
+
+ if @host == :cuisineaz
+
+ while !@to_crawl_urls.empty? && (limit > @recipes.count)
+ # find all link on url given (and urls of theses)
+ url = @to_crawl_urls.first
+ next if url.nil?
+
+ get_links url
+ # now scrape an url
+ recipe = scrape url
+ yield recipe if recipe && block_given?
+ sleep interval_sleep_time
+ end
+
+ else
+ raise NotImplementedError
+ end
+ end
+
+ #
+ # Scrape given url
+ # param url [String] as url to scrape
+ #
+ # @return [RecipeScraper::Recipe] as recipe scraped
+ # @return [nil] if recipe connat be fetched
+ def scrape(url)
+ recipe = RecipeScraper::Recipe.new url
+ @scraped_urls << url
+ @recipes << recipe
+ if save recipe
+ return recipe
+ else
+ raise SQLite3::Exception, 'cannot save recipe'
+ end
+ rescue OpenURI::HTTPError
+ nil
+ end
+
+ #
+ # Get recipes links from the given url
+ # @param url [String] as url to scrape
+ #
+ # @return [void]
+ def get_links(url)
+ # catch 404 error from host
+
+ doc = Nokogiri::HTML(open(url))
+ # find internal links on page
+ doc.css('#tagCloud a').each do |link|
+ link = link.attr('href')
+ # If link correspond to a recipe we add it to recipe to scraw
+ if link.include?(ALLOWED_URLS[@host]) && !@crawled_urls.include?(url)
+ @to_crawl_urls << link
+ end
+ end
+ @to_crawl_urls.delete url
+ @crawled_urls << url
+ @to_crawl_urls.uniq!
+ rescue OpenURI::HTTPError
+ @to_crawl_urls.delete url
+ warn "#{url} cannot be reached"
+ end
+
+ #
+ # Save recipe
+ # @param recipe [RecipeScraper::Recipe] as recipe to save
+ #
+ # @return [Boolean] as true if success
+ def save(recipe)
+ @db.execute "INSERT INTO recipes (title, preptime, cooktime, ingredients, steps, image)
  VALUES (:title, :preptime, :cooktime, :ingredients, :steps, :image)",
- title: recipe.title,
- preptime: recipe.preptime,
- ingredients: recipe.ingredients.join("\n"),
- steps: recipe.steps.join("\n"),
- image: recipe.image
-
- return true
-
- rescue SQLite3::Exception => e
- puts "Exception occurred #{e}"
- return false
- end
- end
- end
-
-
- end
+ title: recipe.title,
+ preptime: recipe.preptime,
+ ingredients: recipe.ingredients.join("\n"),
+ steps: recipe.steps.join("\n"),
+ image: recipe.image
+
+ true
+ rescue SQLite3::Exception => e
+ puts "Exception occurred #{e}"
+ false
+ end
+ end
+ end
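For context, `Crawler#save` writes into the `results.sqlite3` database created in `initialize` above (note the README still refers to *export.sqlite3*). A minimal inspection sketch, not part of the gem, using the `sqlite3` dependency that this release now declares explicitly:

```ruby
require 'sqlite3'

# Assumes a crawl has already populated results.sqlite3 in the working directory.
db = SQLite3::Database.new 'results.sqlite3'
db.results_as_hash = true

# Columns come from the CREATE TABLE statement in Crawler#initialize.
db.execute('SELECT title, preptime, cooktime FROM recipes').each do |row|
  puts "#{row['title']} (preptime: #{row['preptime']}, cooktime: #{row['cooktime']})"
end
```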
data/lib/recipe_crawler/version.rb CHANGED
@@ -1,3 +1,3 @@
  module RecipeCrawler
- VERSION = "3.1.2"
+ VERSION = '4.0.0'.freeze
  end
data/recipe_crawler.gemspec CHANGED
@@ -1,29 +1,27 @@
- # coding: utf-8
- lib = File.expand_path('../lib', __FILE__)
+ lib = File.expand_path('lib', __dir__)
  $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
  require 'recipe_crawler/version'
 
  Gem::Specification.new do |spec|
- spec.name = "recipe_crawler"
+ spec.name = 'recipe_crawler'
  spec.version = RecipeCrawler::VERSION
- spec.authors = ["madeindjs"]
- spec.email = ["madeindjs@gmail.com"]
-
- spec.summary = %q{Get all recipes from famous french cooking websites}
- spec.description = %q{This crawler will use my personnal scraper named 'RecipeScraper' to dowload recipes data from Marmiton, 750g or cuisineaz}
- spec.homepage = "https://github.com/madeindjs/recipe_crawler."
- spec.license = "MIT"
+ spec.authors = ['Alexandre Rousseau']
+ spec.email = ['contact@rousseau-alexandre.fr']
 
+ spec.summary = 'Get all recipes from famous french cooking websites'
+ spec.description = "This crawler will use my personnal scraper named 'RecipeScraper' to dowload recipes data from Marmiton, 750g or cuisineaz"
+ spec.homepage = 'https://github.com/madeindjs/recipe_crawler'
+ spec.license = 'MIT'
 
  spec.files = `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
  spec.executables = ['recipe_crawler']
- spec.require_paths = ["lib"]
-
- spec.add_dependency "recipe_scraper", '>= 2.2.0'
+ spec.require_paths = ['lib']
 
+ spec.add_dependency 'recipe_scraper', '~> 2.0'
+ spec.add_dependency 'sqlite3', '~> 1.3'
 
- spec.add_development_dependency "bundler", "~> 1.11"
- spec.add_development_dependency "rake", "~> 10.0"
- spec.add_development_dependency "rspec", "~> 3.0"
- spec.add_development_dependency "yard"
+ spec.add_development_dependency 'bundler', '~> 1.17'
+ spec.add_development_dependency 'rake', '~> 10.0'
+ spec.add_development_dependency 'rspec', '~> 3.0'
+ spec.add_development_dependency 'yard'
  end
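Dependency-wise, the gemspec now declares `sqlite3` as a runtime dependency and relaxes `recipe_scraper` to a pessimistic `'~> 2.0'` constraint. As a rough illustration (not part of the package), an application consuming this release only needs to list the gem itself; Bundler resolves both runtime dependencies from the gemspec above:

```ruby
# Hypothetical Gemfile for an application using recipe_crawler 4.x.
source 'https://rubygems.org'

gem 'recipe_crawler', '~> 4.0'
```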
metadata CHANGED
@@ -1,43 +1,57 @@
  --- !ruby/object:Gem::Specification
  name: recipe_crawler
  version: !ruby/object:Gem::Version
- version: 3.1.2
+ version: 4.0.0
  platform: ruby
  authors:
- - madeindjs
+ - Alexandre Rousseau
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2016-12-05 00:00:00.000000000 Z
+ date: 2018-12-08 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: recipe_scraper
  requirement: !ruby/object:Gem::Requirement
  requirements:
- - - ">="
+ - - "~>"
  - !ruby/object:Gem::Version
- version: 2.2.0
+ version: '2.0'
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
- - - ">="
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '2.0'
+ - !ruby/object:Gem::Dependency
+ name: sqlite3
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '1.3'
+ type: :runtime
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
  - !ruby/object:Gem::Version
- version: 2.2.0
+ version: '1.3'
  - !ruby/object:Gem::Dependency
  name: bundler
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '1.11'
+ version: '1.17'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '1.11'
+ version: '1.17'
  - !ruby/object:Gem::Dependency
  name: rake
  requirement: !ruby/object:Gem::Requirement
@@ -83,7 +97,7 @@ dependencies:
  description: This crawler will use my personnal scraper named 'RecipeScraper' to dowload
  recipes data from Marmiton, 750g or cuisineaz
  email:
- - madeindjs@gmail.com
+ - contact@rousseau-alexandre.fr
  executables:
  - recipe_crawler
  extensions: []
@@ -104,7 +118,7 @@ files:
  - lib/recipe_crawler/crawler.rb
  - lib/recipe_crawler/version.rb
  - recipe_crawler.gemspec
- homepage: https://github.com/madeindjs/recipe_crawler.
+ homepage: https://github.com/madeindjs/recipe_crawler
  licenses:
  - MIT
  metadata: {}
@@ -124,7 +138,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  version: '0'
  requirements: []
  rubyforge_project:
- rubygems_version: 2.5.1
+ rubygems_version: 2.7.8
  signing_key:
  specification_version: 4
  summary: Get all recipes from famous french cooking websites