alexrabarts-big_sitemap 0.1.3

data/README.markdown ADDED
@@ -0,0 +1,90 @@
1
+ # BigSitemap
2
+
3
+ ## DESCRIPTION:
4
+
5
+ BigSitemap is a Sitemap generator specifically designed for large sites (although it works equally well with small sites). It splits large Sitemaps into multiple files, gzips the files to minimize bandwidth usage, batches database queries so it doesn't take your site down, can be set up with just a few lines of code and is compatible with just about any framework.
6
+
7
+ ## INSTALL:
8
+
9
+ * Via git: git clone git://github.com/alexrabarts/big_sitemap.git
10
+ * Via gem: gem install alexrabarts-big_sitemap -s http://gems.github.com
11
+
12
+ ## SYNOPSIS:
13
+
14
+ The minimum required to generate a sitemap is:
15
+
16
+ <pre>
17
+ sitemap = BigSitemap.new(:base_url => 'http://example.com')
18
+ sitemap.add(:model => MyModel, :path => 'my_controller')
19
+ sitemap.generate
20
+ </pre>
21
+
22
+ You can put this in a rake/thor task and create a cron job to run it periodically. It should be enough for most Rails/Merb applications.
23
+
24
+ Your models must provide either a <code>find_for_sitemap</code> or <code>all</code> class method that returns the instances that are to be included in the sitemap. Additionally, your models must provide a <code>count_for_sitemap</code> or <code>count</code> class method that returns a count of the instances to be included. If you're using ActiveRecord (Rails) or DataMapper then <code>all</code> and <code>count</code> are already provided and you don't need to do anything unless you want to include a subset of records. If you provide your own <code>find_for_sitemap</code> or <code>all</code> method then it should be able to handle the <code>:offset</code> and <code>:limit</code> options, in the same way that ActiveRecord and DataMapper handle them. This is especially important if you have more than 50,000 URLs.
25
+
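As a rough illustration of the contract described above, here is a minimal sketch of a plain Ruby model that honours <code>:offset</code> and <code>:limit</code>. The <code>Article</code> class and its in-memory <code>STORE</code> are hypothetical stand-ins for a real database-backed model and are not part of BigSitemap.

```ruby
# Hypothetical model, not part of BigSitemap: an in-memory stand-in for a table.
class Article
  attr_reader :id

  def initialize(id)
    @id = id
  end

  # Pretend table holding 25 records with ids 1..25.
  STORE = (1..25).map { |i| new(i) }

  class << self
    # Number of records that should appear in the sitemap.
    def count_for_sitemap
      STORE.size
    end

    # Returns one window of records, mirroring the :offset/:limit
    # options that ActiveRecord and DataMapper finders accept.
    def find_for_sitemap(options = {})
      offset = options[:offset] || 0
      limit  = options[:limit]  || STORE.size
      STORE[offset, limit] || []
    end
  end
end
```

With a model shaped like this, <code>Article.find_for_sitemap(:offset => 20, :limit => 10)</code> returns only the five remaining records rather than raising.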
26
+ To generate the URLs, BigSitemap will combine the constructor arguments with the <code>to_param</code> method of each instance returned (provided by ActiveRecord but not DataMapper). If this method is not present, <code>id</code> will be used. The URL is constructed as:
27
+
28
+ <pre>
29
+ ":base_url/:path/:to_param" # (if to_param exists)
30
+ ":base_url/:path/:id" # (if to_param does not exist)
31
+ </pre>
32
+
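The URL rule above can be sketched in a few lines of plain Ruby. The helper name <code>sitemap_url</code> and the two <code>Struct</code> classes are made up for this illustration; the fallback from <code>to_param</code> to <code>id</code> follows the description above.

```ruby
# Hypothetical helper illustrating the ":base_url/:path/:to_param" rule.
def sitemap_url(base_url, path, record)
  # Prefer to_param when the record provides it, otherwise fall back to id.
  param = record.respond_to?(:to_param) ? record.to_param : record.id
  "#{base_url}/#{path}/#{param}"
end

# A record that defines to_param (as ActiveRecord models do).
WithParam = Struct.new(:id) do
  def to_param
    "#{id}-hello"
  end
end

# A record that only has an id.
WithoutParam = Struct.new(:id)
```

Calling <code>sitemap_url('http://example.com', 'posts', WithoutParam.new(7))</code> gives <code>http://example.com/posts/7</code>.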
33
+ BigSitemap knows about the document root of Rails and Merb. If you are using another framework then you can specify the document root with the <code>:document_root</code> option, e.g.:
34
+
35
+ <pre>
36
+ BigSitemap.new(:base_url => 'http://example.com', :document_root => "#{FOO_ROOT}/httpdocs")
37
+ </pre>
38
+
39
+ By default, the sitemap files are created under <code>/sitemaps</code>. You can modify this with the <code>:path</code> option:
40
+
41
+ <pre>
42
+ BigSitemap.new(:base_url => 'http://example.com', :path => 'google-sitemaps') # places Sitemaps under /google-sitemaps
43
+ </pre>
44
+
45
+ Sitemaps will be split across several files if more than 50,000 records are returned. You can customize this limit with the <code>:max_per_sitemap</code> option:
46
+
47
+ <pre>
48
+ BigSitemap.new(:base_url => 'http://example.com', :max_per_sitemap => 1000) # Max of 1000 URLs per Sitemap
49
+ </pre>
50
+
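To make the splitting rule concrete, the following sketch works out how many files a given record count would produce and what they would be called. The helper names are hypothetical; the file naming (a numeric suffix only when there is more than one file) follows what the library source in this release does.

```ruby
# Hypothetical helpers illustrating how records are spread across files.
def sitemap_file_count(record_count, max_per_sitemap = 50_000)
  # Round up, and always produce at least one file.
  [(record_count.to_f / max_per_sitemap).ceil, 1].max
end

def sitemap_filenames(model_name, record_count, max_per_sitemap = 50_000)
  n = sitemap_file_count(record_count, max_per_sitemap)
  # A single file carries no numeric suffix.
  return ["sitemap_#{model_name}.xml.gz"] if n == 1
  (1..n).map { |i| "sitemap_#{model_name}_#{i}.xml.gz" }
end
```

For example, 120,000 records with the default limit would be spread over three files, <code>sitemap_my_model_1.xml.gz</code> to <code>sitemap_my_model_3.xml.gz</code>.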
51
+ The database is queried in batches to prevent large SQL select statements from locking the database for too long. By default, the batch size is 1001 (not 1000 due to an obscure bug in DataMapper that appears when an offset of 37000 is used). You can customize the batch size with the <code>:batch_size</code> option:
52
+
53
+ <pre>
54
+ BigSitemap.new(:base_url => 'http://example.com', :batch_size => 5000) # Database is queried in batches of 5,000
55
+ </pre>
56
+
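The idea of batching can be sketched as a list of <code>:offset</code>/<code>:limit</code> windows that together cover every record exactly once. This is a simplified illustration under that assumption, not the library's actual loop; <code>batch_windows</code> is a hypothetical name.

```ruby
# Hypothetical helper: the :offset/:limit pairs needed to read `count`
# records in batches of at most `batch_size`.
def batch_windows(count, batch_size = 1001)
  windows = []
  offset = 0
  while offset < count
    # The final window is clamped to the number of records left.
    windows << { :offset => offset, :limit => [batch_size, count - offset].min }
    offset += batch_size
  end
  windows
end
```

Each hash could then be passed to a finder such as <code>find_for_sitemap</code>, so that no single query asks for more than <code>batch_size</code> rows.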
57
+ Google, Yahoo!, MSN and Ask are pinged once the Sitemap files are generated. You can turn one or more of these off:
58
+
59
+ <pre>
60
+ BigSitemap.new(
61
+ :base_url => 'http://example.com',
62
+ :ping_google => false,
63
+ :ping_yahoo => false,
64
+ :ping_msn => false,
65
+ :ping_ask => false
66
+ )
67
+ </pre>
68
+
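A ping is just an HTTP GET that carries the URL of the sitemap index. The sketch below builds the request path for the Google host and path used in this release's source (<code>www.google.com</code>, <code>/webmasters/tools/ping</code>). Note one deliberate difference: the library uses <code>URI.escape</code>, while this sketch swaps in <code>URI.encode_www_form_component</code>, so the encoded output differs. The helper name is hypothetical, and the sketch does not check whether the endpoint still responds.

```ruby
require 'uri'

# Hypothetical helper: builds the path and query string for a sitemap ping.
# Defaults mirror the library's 'sitemaps' directory and index file name.
def google_ping_path(base_url, web_path = 'sitemaps', index_file = 'sitemap_index.xml.gz')
  sitemap = "#{base_url}/#{web_path}/#{index_file}"
  # Percent-encode the sitemap URL so it is safe inside a query string.
  "/webmasters/tools/ping?sitemap=#{URI.encode_www_form_component(sitemap)}"
end
```

The resulting path could be handed to <code>Net::HTTP.get('www.google.com', path)</code>, which is the call the library itself makes.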
69
+ You must provide an App ID in order to ping Yahoo! (more info at http://developer.yahoo.com/search/siteexplorer/V1/updateNotification.html):
70
+
71
+ <pre>
72
+ BigSitemap.new(:base_url => 'http://example.com', :yahoo_app_id => 'myYahooAppId') # Yahoo! will now be pinged
73
+ </pre>
74
+
75
+ ## LIMITATIONS:
76
+
77
+ If your database is likely to shrink during the time it takes to create the sitemap then you might run into problems (the final, batched SQL select will overrun by setting a limit that is too large since it is calculated from the count, which is queried at the very beginning). Patches welcome!
78
+
79
+ ## TODO
80
+
81
+ * Support for priority and changefreq (currently hard-coded to 'weekly')
82
+
83
+ ## CREDITS
84
+
85
+ Thanks to Alastair Brunton and Harry Love, whose work provided a starting point for this library.
86
+ http://scoop.cheerfactory.co.uk/2008/02/26/google-sitemap-generator/
87
+
88
+ ## COPYRIGHT
89
+
90
+ Copyright (c) 2009 Stateless Systems (http://statelesssystems.com). See LICENSE for details.
data/VERSION.yml ADDED
@@ -0,0 +1,4 @@
1
+ ---
2
+ :minor: 1
3
+ :patch: 3
4
+ :major: 0
data/lib/big_sitemap.rb ADDED
@@ -0,0 +1,200 @@
1
+ require 'net/http'
2
+ require 'uri'
3
+ require 'zlib'
4
+ require 'builder'
5
+ require 'extlib'
6
+
7
+ class BigSitemap
8
+ def initialize(options)
9
+ document_root = options.delete(:document_root)
10
+
11
+ if document_root.nil?
12
+ if defined? RAILS_ROOT
13
+ document_root = "#{RAILS_ROOT}/public"
14
+ elsif defined? Merb
15
+ document_root = "#{Merb.root}/public"
16
+ end
17
+ end
18
+
19
+ raise ArgumentError, 'Document root must be specified with the :document_root option' if document_root.nil?
20
+
21
+ @base_url = options.delete(:base_url)
22
+ @max_per_sitemap = options.delete(:max_per_sitemap) || 50000
23
+ @batch_size = options.delete(:batch_size) || 1001 # TODO: Set this to 1000 once DM offset 37000 bug is fixed
24
+ @web_path = options.delete(:path) || 'sitemaps'
25
+ @ping_google = options[:ping_google].nil? ? true : options.delete(:ping_google)
26
+ @ping_yahoo = options[:ping_yahoo].nil? ? true : options.delete(:ping_yahoo)
27
+ @yahoo_app_id = options.delete(:yahoo_app_id)
28
+ @ping_msn = options[:ping_msn].nil? ? true : options.delete(:ping_msn)
29
+ @ping_ask = options[:ping_ask].nil? ? true : options.delete(:ping_ask)
30
+ @file_path = "#{document_root}/#{@web_path}"
31
+ @sources = []
32
+
33
+ raise ArgumentError, "Base URL must be specified with the :base_url option" if @base_url.nil?
34
+
35
+ raise(
36
+ ArgumentError,
37
+ 'Batch size (:batch_size) must be less than or equal to maximum URLs per sitemap (:max_per_sitemap)'
38
+ ) if @batch_size > @max_per_sitemap
39
+
40
+ unless File.exist? @file_path
41
+ Dir.mkdir(@file_path)
42
+ end
43
+ end
44
+
45
+ def add(options)
46
+ raise ArgumentError, ':model and :path options must be provided' unless options[:model] && options[:path]
47
+ @sources << options
48
+ end
49
+
50
+ def generate
51
+ paths = []
52
+ sitemaps = []
53
+
54
+ @sources.each do |source|
55
+ klass = source[:model]
56
+
57
+ count_method = pick_method(klass, [:count_for_sitemap, :count])
58
+ find_method = pick_method(klass, [:find_for_sitemap, :all])
59
+ raise ArgumentError, "#{klass} must provide a count_for_sitemap class method" if count_method.nil?
60
+ raise ArgumentError, "#{klass} must provide a find_for_sitemap class method" if find_method.nil?
61
+
62
+ count = klass.send(count_method)
63
+ num_sitemaps = 1
64
+ num_batches = 1
65
+
66
+ if count > @batch_size
67
+ num_batches = (count.to_f / @batch_size.to_f).ceil
68
+ num_sitemaps = (count.to_f / @max_per_sitemap.to_f).ceil
69
+ end
70
+ batches_per_sitemap = num_batches.to_f / num_sitemaps.to_f
71
+
72
+ # Update the @sources hash so that the index file knows how many sitemaps to link to
73
+ source[:num_sitemaps] = num_sitemaps
74
+
75
+ for sitemap_num in 1..num_sitemaps
76
+ # Work out the start and end batch numbers for this sitemap
77
+ batch_num_start = sitemap_num == 1 ? 1 : ((sitemap_num * batches_per_sitemap).ceil - batches_per_sitemap + 1).to_i
78
+ batch_num_end = (batch_num_start + [batches_per_sitemap, num_batches].min).floor - 1
79
+
80
+ # Stream XML output to a file
81
+ filename = "sitemap_#{Extlib::Inflection::underscore(klass.to_s)}"
82
+ filename << "_#{sitemap_num}" if num_sitemaps > 1
83
+
84
+ gz = gz_writer("#{filename}.xml.gz")
85
+
86
+ xml = Builder::XmlMarkup.new(:target => gz)
87
+ xml.instruct!
88
+ xml.urlset(:xmlns => 'http://www.sitemaps.org/schemas/sitemap/0.9') do
89
+ for batch_num in batch_num_start..batch_num_end
90
+ offset = ((batch_num - 1) * @batch_size)
91
+ limit = (count - offset) < @batch_size ? (count - offset) : @batch_size # final batch takes all remaining records
92
+ find_options = num_batches > 1 ? {:limit => limit, :offset => offset} : {}
93
+
94
+ klass.send(find_method, find_options).each do |r|
95
+ last_mod_method = pick_method(
96
+ r,
97
+ [:updated_at, :updated_on, :updated, :created_at, :created_on, :created]
98
+ )
99
+ last_mod = last_mod_method.nil? ? Time.now : r.send(last_mod_method)
100
+
101
+ param_method = pick_method(r, [:to_param, :id])
102
+ raise ArgumentError, "#{klass} must provide a to_param instance method" if param_method.nil?
103
+
104
+ path = {:url => "#{source[:path]}/#{r.send(param_method)}", :last_mod => last_mod}
105
+
106
+ xml.url do
107
+ xml.loc("#{@base_url}/#{path[:url]}")
108
+ xml.lastmod(path[:last_mod].strftime('%Y-%m-%d')) unless path[:last_mod].nil?
109
+ xml.changefreq('weekly')
110
+ end
111
+ end
112
+ end
113
+ end
114
+
115
+ gz.close
116
+ end
117
+
118
+ end
119
+
120
+ generate_sitemap_index
121
+ ping_search_engines
122
+ end
123
+
124
+ private
125
+ def pick_method(klass, candidates)
126
+ method = nil
127
+ candidates.each do |candidate|
128
+ if klass.respond_to? candidate
129
+ method = candidate
130
+ break
131
+ end
132
+ end
133
+ method
134
+ end
135
+
136
+ def gz_writer(filename)
137
+ Zlib::GzipWriter.new(File.open("#{@file_path}/#{filename}", 'w+'))
138
+ end
139
+
140
+ def sitemap_index_filename
141
+ 'sitemap_index.xml.gz'
142
+ end
143
+
144
+ # Create a sitemap index document
145
+ def generate_sitemap_index
146
+ xml = ''
147
+ builder = Builder::XmlMarkup.new(:target => xml)
148
+ builder.instruct!
149
+ builder.sitemapindex(:xmlns => 'http://www.sitemaps.org/schemas/sitemap/0.9') do
150
+ @sources.each do |source|
151
+ num_sitemaps = source[:num_sitemaps]
152
+ for i in 1..num_sitemaps
153
+ loc = "#{@base_url}/#{@web_path}/sitemap_#{Extlib::Inflection::underscore(source[:model].to_s)}"
154
+ loc << "_#{i}" if num_sitemaps > 1
155
+ loc << '.xml.gz'
156
+
157
+ builder.sitemap do
158
+ builder.loc(loc)
159
+ builder.lastmod(Time.now.strftime('%Y-%m-%d'))
160
+ end
161
+ end
162
+ end
163
+ end
164
+
165
+ gz = gz_writer(sitemap_index_filename)
166
+ gz.write(xml)
167
+ gz.close
168
+ end
169
+
170
+ def sitemap_uri
171
+ URI.escape("#{@base_url}/#{@web_path}/#{sitemap_index_filename}")
172
+ end
173
+
174
+ # Notify Google of the new sitemap index file
175
+ def ping_google
176
+ Net::HTTP.get('www.google.com', "/webmasters/tools/ping?sitemap=#{sitemap_uri}")
177
+ end
178
+
179
+ # Notify Yahoo! of the new sitemap index file
180
+ def ping_yahoo
181
+ Net::HTTP.get('search.yahooapis.com', "/SiteExplorerService/V1/updateNotification?appid=#{@yahoo_app_id}&url=#{sitemap_uri}")
182
+ end
183
+
184
+ # Notify MSN of the new sitemap index file
185
+ def ping_msn
186
+ Net::HTTP.get('webmaster.live.com', "/ping.aspx?siteMap=#{sitemap_uri}")
187
+ end
188
+
189
+ # Notify Ask of the new sitemap index file
190
+ def ping_ask
191
+ Net::HTTP.get('submissions.ask.com', "/ping?sitemap=#{sitemap_uri}")
192
+ end
193
+
194
+ def ping_search_engines
195
+ ping_google if @ping_google
196
+ ping_yahoo if @ping_yahoo && @yahoo_app_id
197
+ ping_msn if @ping_msn
198
+ ping_ask if @ping_ask
199
+ end
200
+ end
data/test/big_sitemap_test.rb ADDED
@@ -0,0 +1,177 @@
1
+ require File.dirname(__FILE__) + '/test_helper'
2
+ require 'nokogiri'
3
+
4
+ class BigSitemapTest < Test::Unit::TestCase
5
+ def setup
6
+ delete_tmp_files
7
+ end
8
+
9
+ def teardown
10
+ delete_tmp_files
11
+ end
12
+
13
+ should 'raise an error if the :base_url option is not specified' do
14
+ assert_nothing_raised { BigSitemap.new(:base_url => 'http://example.com', :document_root => tmp_dir) }
15
+ assert_raise(ArgumentError) { BigSitemap.new(:document_root => tmp_dir) }
16
+ end
17
+
18
+ should 'generate a sitemap index file' do
19
+ generate_sitemap_files
20
+ assert File.exists?(sitemaps_index_file)
21
+ end
22
+
23
+ should 'generate a single sitemap model file' do
24
+ create_sitemap
25
+ add_model
26
+ @sitemap.generate
27
+ assert File.exists?(single_sitemaps_model_file), "#{single_sitemaps_model_file} exists"
28
+ end
29
+
30
+ should 'generate exactly two sitemap model files' do
31
+ generate_exactly_two_model_sitemap_files
32
+ assert File.exists?(first_sitemaps_model_file), "#{first_sitemaps_model_file} exists"
33
+ assert File.exists?(second_sitemaps_model_file), "#{second_sitemaps_model_file} exists"
34
+ third_sitemaps_model_file = "#{sitemaps_dir}/sitemap_test_model_3.xml.gz"
35
+ assert !File.exists?(third_sitemaps_model_file), "#{third_sitemaps_model_file} does not exist"
36
+ end
37
+
38
+ context 'Sitemap index file' do
39
+ should 'contain one sitemapindex element' do
40
+ generate_sitemap_files
41
+ assert_equal 1, num_elements(sitemaps_index_file, 'sitemapindex')
42
+ end
43
+
44
+ should 'contain one sitemap element' do
45
+ generate_sitemap_files
46
+ assert_equal 1, num_elements(sitemaps_index_file, 'sitemap')
47
+ end
48
+
49
+ should 'contain one loc element' do
50
+ generate_sitemap_files
51
+ assert_equal 1, num_elements(sitemaps_index_file, 'loc')
52
+ end
53
+
54
+ should 'contain one lastmod element' do
55
+ generate_sitemap_files
56
+ assert_equal 1, num_elements(sitemaps_index_file, 'lastmod')
57
+ end
58
+
59
+ should 'contain two loc elements' do
60
+ generate_exactly_two_model_sitemap_files
61
+ assert_equal 2, num_elements(sitemaps_index_file, 'loc')
62
+ end
63
+
64
+ should 'contain two lastmod elements' do
65
+ generate_exactly_two_model_sitemap_files
66
+ assert_equal 2, num_elements(sitemaps_index_file, 'lastmod')
67
+ end
68
+ end
69
+
70
+ context 'Sitemap model file' do
71
+ should 'contain one urlset element' do
72
+ generate_sitemap_files
73
+ assert_equal 1, num_elements(single_sitemaps_model_file, 'urlset')
74
+ end
75
+
76
+ should 'contain several loc elements' do
77
+ generate_sitemap_files
78
+ assert_equal default_num_items, num_elements(single_sitemaps_model_file, 'loc')
79
+ end
80
+
81
+ should 'contain several lastmod elements' do
82
+ generate_sitemap_files
83
+ assert_equal default_num_items, num_elements(single_sitemaps_model_file, 'lastmod')
84
+ end
85
+
86
+ should 'contain several changefreq elements' do
87
+ generate_sitemap_files
88
+ assert_equal default_num_items, num_elements(single_sitemaps_model_file, 'changefreq')
89
+ end
90
+
91
+ should 'contain one loc element' do
92
+ generate_exactly_two_model_sitemap_files
93
+ assert_equal 1, num_elements(first_sitemaps_model_file, 'loc')
94
+ assert_equal 1, num_elements(second_sitemaps_model_file, 'loc')
95
+ end
96
+
97
+ should 'contain one lastmod element' do
98
+ generate_exactly_two_model_sitemap_files
99
+ assert_equal 1, num_elements(first_sitemaps_model_file, 'lastmod')
100
+ assert_equal 1, num_elements(second_sitemaps_model_file, 'lastmod')
101
+ end
102
+
103
+ should 'contain one changefreq element' do
104
+ generate_exactly_two_model_sitemap_files
105
+ assert_equal 1, num_elements(first_sitemaps_model_file, 'changefreq')
106
+ assert_equal 1, num_elements(second_sitemaps_model_file, 'changefreq')
107
+ end
108
+ end
109
+
110
+ private
111
+ def delete_tmp_files
112
+ FileUtils.rm_rf(sitemaps_dir)
113
+ end
114
+
115
+ def create_sitemap(options={})
116
+ @sitemap = BigSitemap.new({
117
+ :base_url => 'http://example.com',
118
+ :document_root => tmp_dir,
119
+ :ping_google => false
120
+ }.update(options))
121
+ end
122
+
123
+ def generate_sitemap_files
124
+ create_sitemap
125
+ add_model
126
+ @sitemap.generate
127
+ end
128
+
129
+ def generate_exactly_two_model_sitemap_files
130
+ create_sitemap(:max_per_sitemap => 1, :batch_size => 1)
131
+ add_model(:num_items => 2)
132
+ @sitemap.generate
133
+ end
134
+
135
+ def add_model(options={})
136
+ num_items = options.delete(:num_items) || default_num_items
137
+ TestModel.stubs(:num_items).returns(num_items)
138
+ @sitemap.add({:model => TestModel, :path => 'test_controller'}.update(options))
139
+ end
140
+
141
+ def default_num_items
142
+ 10
143
+ end
144
+
145
+ def sitemaps_index_file
146
+ "#{sitemaps_dir}/sitemap_index.xml.gz"
147
+ end
148
+
149
+ def single_sitemaps_model_file
150
+ "#{sitemaps_dir}/sitemap_test_model.xml.gz"
151
+ end
152
+
153
+ def first_sitemaps_model_file
154
+ "#{sitemaps_dir}/sitemap_test_model_1.xml.gz"
155
+ end
156
+
157
+ def second_sitemaps_model_file
158
+ "#{sitemaps_dir}/sitemap_test_model_2.xml.gz"
159
+ end
160
+
161
+ def sitemaps_dir
162
+ "#{tmp_dir}/sitemaps"
163
+ end
164
+
165
+ def tmp_dir
166
+ '/tmp'
167
+ end
168
+
169
+ def ns
170
+ {'s' => 'http://www.sitemaps.org/schemas/sitemap/0.9'}
171
+ end
172
+
173
+ def num_elements(filename, el)
174
+ data = Nokogiri::XML.parse(Zlib::GzipReader.open(filename).read)
175
+ data.search("//s:#{el}", ns).size
176
+ end
177
+ end
data/test/fixtures/test_model.rb ADDED
@@ -0,0 +1,18 @@
1
+ class TestModel
2
+ def to_param
3
+ object_id
4
+ end
5
+
6
+ class << self
7
+ def count_for_sitemap
8
+ self.find_for_sitemap.size
9
+ end
10
+
11
+ def find_for_sitemap(options={})
12
+ instances = []
13
+ num_times = options.delete(:limit) || self.num_items
14
+ num_times.times { instances.push(self.new) }
15
+ instances
16
+ end
17
+ end
18
+ end
data/test/test_helper.rb ADDED
@@ -0,0 +1,11 @@
1
+ require 'rubygems'
2
+ require 'test/unit'
3
+ require 'shoulda'
4
+ require 'mocha'
5
+ require 'test/fixtures/test_model'
6
+
7
+ $LOAD_PATH.unshift(File.dirname(__FILE__))
8
+ require 'big_sitemap'
9
+
10
+ class Test::Unit::TestCase
11
+ end
metadata ADDED
@@ -0,0 +1,79 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: alexrabarts-big_sitemap
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.3
5
+ platform: ruby
6
+ authors:
7
+ - Alex Rabarts
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+
12
+ date: 2009-03-10 00:00:00 -07:00
13
+ default_executable:
14
+ dependencies:
15
+ - !ruby/object:Gem::Dependency
16
+ name: builder
17
+ type: :runtime
18
+ version_requirement:
19
+ version_requirements: !ruby/object:Gem::Requirement
20
+ requirements:
21
+ - - ">="
22
+ - !ruby/object:Gem::Version
23
+ version: 2.1.2
24
+ version:
25
+ - !ruby/object:Gem::Dependency
26
+ name: extlib
27
+ type: :runtime
28
+ version_requirement:
29
+ version_requirements: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - ">="
32
+ - !ruby/object:Gem::Version
33
+ version: 0.9.9
34
+ version:
35
+ description: A Sitemap generator specifically designed for large sites (although it works equally well with small sites)
36
+ email: alexrabarts@gmail.com
37
+ executables: []
38
+
39
+ extensions: []
40
+
41
+ extra_rdoc_files: []
42
+
43
+ files:
44
+ - VERSION.yml
45
+ - README.markdown
46
+ - lib/big_sitemap.rb
47
+ - test/fixtures
48
+ - test/fixtures/test_model.rb
49
+ - test/big_sitemap_test.rb
50
+ - test/test_helper.rb
51
+ has_rdoc: true
52
+ homepage: http://github.com/alexrabarts/big_sitemap
53
+ post_install_message:
54
+ rdoc_options:
55
+ - --inline-source
56
+ - --charset=UTF-8
57
+ require_paths:
58
+ - lib
59
+ required_ruby_version: !ruby/object:Gem::Requirement
60
+ requirements:
61
+ - - ">="
62
+ - !ruby/object:Gem::Version
63
+ version: "0"
64
+ version:
65
+ required_rubygems_version: !ruby/object:Gem::Requirement
66
+ requirements:
67
+ - - ">="
68
+ - !ruby/object:Gem::Version
69
+ version: "0"
70
+ version:
71
+ requirements: []
72
+
73
+ rubyforge_project:
74
+ rubygems_version: 1.2.0
75
+ signing_key:
76
+ specification_version: 2
77
+ summary: A Sitemap generator specifically designed for large sites (although it works equally well with small sites)
78
+ test_files: []
79
+