documentcloud-cloud-crowd 0.0.4 → 0.0.5

@@ -0,0 +1,17 @@
1
+ The crowd, suddenly there where there was nothing before, is a mysterious and
2
+ universal phenomenon. A few people may have been standing together -- five, ten
3
+ or twelve, not more; nothing has been announced, nothing is expected. Suddenly
4
+ everywhere is black with people and more come streaming from all sides as though
5
+ streets had only one direction. Most of them do not know what has happened and,
6
+ if questioned, have no answer; but they hurry to be there where most other
7
+ people are. There is a determination in their movement which is quite different
8
+ from the expression of ordinary curiosity. It seems as though the movement of
9
+ some of them transmits itself to all the others. But that is not all; they have
10
+ a goal which is there before they can find words for it. -p 16
11
+
12
+ Crowd crystals are the small, rigid groups of men, strictly delimited and of
13
+ great constancy, which serve to precipitate crowds. Their structure is such
14
+ that they can be comprehended and taken in at a glance. Their unity is more
15
+ important than their size. -p 73
16
+
17
+ From Elias Canetti's "Crowds and Power" (1962).
data/LICENSE ADDED
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2009 Jeremy Ashkenas, DocumentCloud
2
+
3
+ Permission is hereby granted, free of charge, to any person
4
+ obtaining a copy of this software and associated documentation
5
+ files (the "Software"), to deal in the Software without
6
+ restriction, including without limitation the rights to use,
7
+ copy, modify, merge, publish, distribute, sublicense, and/or sell
8
+ copies of the Software, and to permit persons to whom the
9
+ Software is furnished to do so, subject to the following
10
+ conditions:
11
+
12
+ The above copyright notice and this permission notice shall be
13
+ included in all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
16
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
17
+ OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
18
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
19
+ HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
20
+ WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
21
+ FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
22
+ OTHER DEALINGS IN THE SOFTWARE.
data/README ADDED
@@ -0,0 +1,75 @@
1
+
2
+ _ _
3
+ ( ` )_
4
+ ( ) `)
5
+ (_ (_ . _) _)
6
+ _
7
+ ( )
8
+ _ . ( ` ) . )
9
+ ( _ )_ (_, _( ,_)_)
10
+ (_ _(_ ,)
11
+
12
+ _ _ ___ _ _ ___ _
13
+ ( ` )_ / __| |___ _ _ __| |/ __|_ _ _____ __ ____| |
14
+ ( ) `) | (__| / _ \ || / _` | (__| '_/ _ \ V V / _` |
15
+ (_ (_ . _) _) \___|_\___/\_,_\__,_|\___|_| \___/\_/\_/\__,_|
16
+
17
+ _
18
+ ( )
19
+ _, _ . ( ` ) . )
20
+ ( ( _ )_ (_, _( ,_)_)
21
+ (_(_ _(_ ,)
22
+
23
+
24
+
25
+ ~ CloudCrowd ~
26
+
27
+ * A batch-processing system, map-reduce style
28
+ * Write your scripts in Ruby
29
+ * Built for Amazon EC2 and S3
30
+ * split -> process -> merge
31
+ * As easy as `gem install cloud-crowd`
32
+
33
+
34
+ ~ Getting started ~
35
+
36
+ # Install the gem (documentcloud-cloud-crowd until the first official release).
37
+
38
+ >> sudo gem install cloud-crowd
39
+
40
+ # Install the CloudCrowd configuration files to a location of your choosing.
41
+
42
+ >> crowd install ~/config/cloud-crowd
43
+
44
+ # Now, you can use the full complement of `crowd` commands from inside of
45
+ # this configuration directory. To see the available commands:
46
+
47
+ >> crowd --help
48
+
49
+ # Edit the configuration files to your satisfaction, and add AWS credentials.
50
+
51
+ >> mate ~/config/cloud-crowd/config.yml
52
+ >> mate ~/config/cloud-crowd/database.yml
53
+
54
+ # Write your actions, and install them into the 'actions' subdirectory.
55
+ # CloudCrowd comes with some default actions as an example.
56
+
57
+ # To spin up the central server (make sure that you include its location
58
+ # in config.yml), either:
59
+
60
+ >> crowd server
61
+
62
+ # or:
63
+
64
+ >> thin -R config.ru --servers 3 -e production start
65
+
66
+ # Any server that supports Rack should work with the rackup file.
67
+
68
+ # Then, to spin up 10 workers:
69
+
70
+ >> crowd workers start -n 10
71
+
72
+ # To spin up workers remotely, install the 'cloud-crowd' gem, and copy over
73
+ # your configuration directory.
74
+
75
+
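The README's "split -> process -> merge" flow can be pictured with a small stand-alone sketch. Everything here is hypothetical: `WordCountSketch` and its `run` method are illustration only, and a real action would subclass `CloudCrowd::Action` and be driven by the workers rather than called directly.

```ruby
# Hypothetical illustration of split -> process -> merge, independent of
# CloudCrowd itself. It counts the words in a text two lines at a time.
class WordCountSketch
  # split: break the input into independent chunks of work.
  def split(text)
    text.lines.each_slice(2).map(&:join)
  end

  # process: handle one chunk; in a cluster this could run on any worker.
  def process(chunk)
    chunk.split.length
  end

  # merge: combine the per-chunk results into the final answer.
  def merge(counts)
    counts.inject(0) { |total, count| total + count }
  end

  # Run all three phases locally, in order.
  def run(text)
    merge(split(text).map { |chunk| process(chunk) })
  end
end

puts WordCountSketch.new.run("a b\nc d e\nf\n")
```

Each `process` call depends only on its own chunk, which is what lets the middle phase be spread across many workers.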
@@ -12,9 +12,7 @@ class GraphicsMagick < CloudCrowd::Action
12
12
  # Download the initial image, and run each of the specified GraphicsMagick
13
13
  # commands against it, returning the aggregate output.
14
14
  def process
15
- result = {}
16
- options['steps'].each {|step| result[step['name']] = run_step(step) }
17
- result.to_json
15
+ options['steps'].inject({}) {|h, step| h[step['name']] = run_step(step); h }
18
16
  end
19
17
 
20
18
  # Run an individual step (single GraphicsMagick command) in a shell-injection
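The hunk above replaces an explicit accumulator with `inject`. A minimal sketch of the two forms follows; `run_step` is stubbed with a hypothetical lambda, since the real method shells out to GraphicsMagick. Both forms build the same Hash; per the diff, the earlier version also serialized it with `to_json`, while the new line returns the Hash itself.

```ruby
# Stand-in data and a stubbed run_step (both hypothetical).
steps    = [{ 'name' => 'blurred' }, { 'name' => 'bw' }]
run_step = lambda { |step| "#{step['name']}.out" }

# Old form: build the result hash with an explicit accumulator.
explicit = {}
steps.each { |step| explicit[step['name']] = run_step.call(step) }

# New form: fold the steps into a hash. The trailing `h` matters, because
# the block's return value becomes the accumulator for the next step.
folded = steps.inject({}) { |h, step| h[step['name']] = run_step.call(step); h }

puts explicit == folded
```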
@@ -0,0 +1,92 @@
1
+ # Depends on working pdftk, gm (GraphicsMagick), and pdftotext (Poppler) commands.
2
+ # Splits a pdf into batches of N pages, creates their thumbnails and icons,
3
+ # as specified in the Job options, gets the text for every page, and merges
4
+ # it all back into a tar archive for convenient download.
5
+ #
6
+ # See <tt>examples/process_pdfs_example.rb</tt> for more information.
7
+ class ProcessPdfs < CloudCrowd::Action
8
+
9
+ # Split up a large pdf into single-page pdfs.
10
+ # The double pdftk shuffle fixes the document xrefs.
11
+ def split
12
+ `pdftk #{input_path} burst output "#{file_name}_%05d.pdf_temp"`
13
+ FileUtils.rm input_path
14
+ pdfs = Dir["*.pdf_temp"]
15
+ pdfs.each {|pdf| `pdftk #{pdf} output #{File.basename(pdf, '.pdf_temp')}.pdf`}
16
+ pdfs = Dir["*.pdf"]
17
+ batch_size = options['batch_size']
18
+ batches = (pdfs.length / batch_size.to_f).ceil
19
+ batches.times do |batch_num|
20
+ tar_path = "#{sprintf('%05d', batch_num)}.tar"
21
+ batch_pdfs = pdfs[batch_num*batch_size...(batch_num + 1)*batch_size]
22
+ `tar -czf #{tar_path} #{batch_pdfs.join(' ')}`
23
+ end
24
+ Dir["*.tar"].map {|tar| save(tar) }.to_json
25
+ end
26
+
27
+ # Convert a pdf page into different-sized thumbnails. Grab the text.
28
+ def process
29
+ `tar -xzf #{input_path}`
30
+ FileUtils.rm input_path
31
+ cmds = []
32
+ generate_images_commands(cmds)
33
+ generate_text_commands(cmds)
34
+ system cmds.join(' && ')
35
+ FileUtils.rm Dir['*.pdf']
36
+ `tar -czf #{file_name}.tar *`
37
+ save("#{file_name}.tar")
38
+ end
39
+
40
+ # Merge all of the resulting images, all of the resulting text files, and
41
+ # the concatenated merge of the full-text into a single tar archive, ready
42
+ # for download.
43
+ def merge
44
+ JSON.parse(input).each do |batch_url|
45
+ batch_path = File.basename(batch_url)
46
+ download(batch_url, batch_path)
47
+ `tar -xzf #{batch_path}`
48
+ FileUtils.rm batch_path
49
+ end
50
+
51
+ names = Dir['*.txt'].map {|fn| fn.sub(/_\d+(_\w+)?\.txt\Z/, '') }.uniq
52
+ dirs = names.map {|n| ["#{n}/text/full", "#{n}/text/pages"] + options['images'].map {|i| "#{n}/images/#{i['name']}" } }.flatten
53
+ FileUtils.mkdir_p(dirs)
54
+
55
+ Dir['*.*'].each do |file|
56
+ ext = File.extname(file)
57
+ name = file.sub(/_\d+(_\w+)?#{ext}\Z/, '')
58
+ if ext == '.txt'
59
+ FileUtils.mv(file, "#{name}/text/pages/#{file}")
60
+ else
61
+ suffix = file.match(/_([^_]+)#{ext}\Z/)[1]
62
+ sans_suffix = file.sub(/_([^_]+)#{ext}\Z/, ext)
63
+ FileUtils.mv(file, "#{name}/images/#{suffix}/#{sans_suffix}")
64
+ end
65
+ end
66
+
67
+ names.each {|n| `cat #{n}/text/pages/*.txt > #{n}/text/full/#{n}.txt` }
68
+
69
+ `tar -czf processed_pdfs.tar *`
70
+ save("processed_pdfs.tar")
71
+ end
72
+
73
+
74
+ private
75
+
76
+ def generate_images_commands(command_list)
77
+ Dir["*.pdf"].each do |pdf|
78
+ name = File.basename(pdf, File.extname(pdf))
79
+ options['images'].each do |i|
80
+ command_list << "gm convert #{i['options']} #{pdf} #{name}_#{i['name']}.#{i['extension']}"
81
+ end
82
+ end
83
+ end
84
+
85
+ def generate_text_commands(command_list)
86
+ Dir["*.pdf"].each do |pdf|
87
+ name = File.basename(pdf, File.extname(pdf))
88
+ command_list << "pdftotext -enc UTF-8 -layout -q #{pdf} #{name}.txt"
89
+ end
90
+ end
91
+
92
+ end
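Two pieces of the action above are easy to check in isolation: the batch arithmetic in `split` and the file routing in `merge`. The sketch below mirrors both with hypothetical helper names (`batch_slices`, `destination_for`) that are not part of CloudCrowd. One deliberate difference: the extension is passed through `Regexp.escape`, whereas the action interpolates it directly.

```ruby
# Hypothetical helpers that mirror ProcessPdfs#split and ProcessPdfs#merge.

# Group page files into batches of at most batch_size, as split does.
def batch_slices(files, batch_size)
  batches = (files.length / batch_size.to_f).ceil
  (0...batches).map { |n| files[n * batch_size...(n + 1) * batch_size] }
end

# Work out where merge would move a processed file.
def destination_for(file)
  ext     = File.extname(file)
  escaped = Regexp.escape(ext)
  # Strip the page number (and optional image suffix) to get the document name.
  name = file.sub(/_\d+(_\w+)?#{escaped}\z/, '')
  return "#{name}/text/pages/#{file}" if ext == '.txt'
  # Images are filed under their size name, with that suffix removed.
  suffix      = file[/_([^_]+)#{escaped}\z/, 1]
  sans_suffix = file.sub(/_[^_]+#{escaped}\z/, ext)
  "#{name}/images/#{suffix}/#{sans_suffix}"
end

pages = (1..16).map { |n| format('doc_%05d.pdf', n) }
puts batch_slices(pages, 7).map(&:length).inspect
puts destination_for('doc_00001.txt')
puts destination_for('doc_00001_700.gif')
```

With 16 pages and a batch size of 7, the last batch simply holds the 2 leftover pages; `Array#each_slice` would give the same grouping.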
@@ -1,7 +1,7 @@
1
1
  Gem::Specification.new do |s|
2
2
  s.name = 'cloud-crowd'
3
- s.version = '0.0.4' # Keep version in sync with cloud-crowd.rb
4
- s.date = '2009-08-23'
3
+ s.version = '0.0.5' # Keep version in sync with cloud-crowd.rb
4
+ s.date = '2009-09-01'
5
5
 
6
6
  s.homepage = "http://documentcloud.org" # wiki page on github?
7
7
  s.summary = "Better living through Map --> Ruby --> Reduce"
@@ -15,13 +15,19 @@ Gem::Specification.new do |s|
15
15
 
16
16
  s.authors = ['Jeremy Ashkenas']
17
17
  s.email = 'jeremy@documentcloud.org'
18
+ s.rubyforge_project = 'cloud-crowd'
18
19
 
19
20
  s.require_paths = ['lib']
20
21
  s.executables = ['crowd']
21
22
 
22
23
  # s.post_install_message = "Run `crowd --help` for information on using CloudCrowd."
23
- s.rubyforge_project = 'cloud-crowd'
24
- s.has_rdoc = true
24
+
25
+ s.has_rdoc = true
26
+ s.extra_rdoc_files = ['README']
27
+ s.rdoc_options << '--title' << 'CloudCrowd | Better Living through Map --> Ruby --> Reduce' <<
28
+ '--exclude' << 'test' <<
29
+ '--main' << 'README' <<
30
+ '--all'
25
31
 
26
32
  s.add_dependency 'sinatra', ['>= 0.9.4']
27
33
  s.add_dependency 'activerecord', ['>= 2.3.3']
@@ -40,16 +46,21 @@ Gem::Specification.new do |s|
40
46
 
41
47
  s.files = %w(
42
48
  actions/graphics_magick.rb
49
+ actions/process_pdfs.rb
43
50
  cloud-crowd.gemspec
44
51
  config/config.example.ru
45
52
  config/config.example.yml
46
53
  config/database.example.yml
54
+ EPIGRAPHS
55
+ examples/graphics_magick_example.rb
56
+ examples/process_pdfs_example.rb
47
57
  lib/cloud-crowd.rb
48
58
  lib/cloud_crowd/action.rb
49
59
  lib/cloud_crowd/app.rb
50
60
  lib/cloud_crowd/asset_store.rb
51
61
  lib/cloud_crowd/command_line.rb
52
62
  lib/cloud_crowd/daemon.rb
63
+ lib/cloud_crowd/exceptions.rb
53
64
  lib/cloud_crowd/helpers/authorization.rb
54
65
  lib/cloud_crowd/helpers/resources.rb
55
66
  lib/cloud_crowd/helpers.rb
@@ -60,13 +71,22 @@ lib/cloud_crowd/models.rb
60
71
  lib/cloud_crowd/runner.rb
61
72
  lib/cloud_crowd/schema.rb
62
73
  lib/cloud_crowd/worker.rb
74
+ LICENSE
75
+ public/css/admin_console.css
76
+ public/css/reset.css
77
+ public/images/queue_fill.png
78
+ public/js/admin_console.js
79
+ public/js/jquery-1.3.2.js
80
+ README
63
81
  test/acceptance/test_failing_work_units.rb
64
82
  test/blueprints.rb
83
+ test/config/config.ru
65
84
  test/config/config.yml
66
85
  test/config/database.yml
67
86
  test/config/actions/failure_testing.rb
68
87
  test/test_helper.rb
69
88
  test/unit/test_job.rb
70
89
  test/unit/test_work_unit.rb
90
+ views/index.erb
71
91
  )
72
92
  end
@@ -19,6 +19,11 @@
19
19
  :login: [your login name]
20
20
  :password: [your password]
21
21
 
22
+ # By default, CloudCrowd looks for installed actions inside the 'actions'
23
+ # subdirectory of this configuration folder. 'actions_path' allows you to install
24
+ # them in a different location.
25
+ # :actions_path: /path/to/actions
26
+
22
27
  # Set the following numbers to tweak the configuration of your worker daemons.
23
28
  # Optimum results will depend on proportion of the Memory/CPU/IO bottlenecks
24
29
  # in your actions, the number of central servers you have running, and your
@@ -0,0 +1,48 @@
1
+ # Inside of a restclient session:
2
+ # This is a fancy example that produces black and white, annotated, and blurred
3
+ # versions of a list of URLs downloaded from the web.
4
+
5
+ require 'json'
6
+
7
+ RestClient.post(
8
+ 'http://localhost:9173/jobs',
9
+ {:job => {
10
+
11
+ 'action' => 'graphics_magick',
12
+
13
+ 'inputs' => [
14
+ 'http://www.sci-fi-o-rama.com/wp-content/uploads/2008/10/dan_mcpharlin_the_land_of_sleeping_things.jpg',
15
+ 'http://www.sci-fi-o-rama.com/wp-content/uploads/2009/07/dan_mcpharlin_wired_spread01.jpg',
16
+ 'http://www.sci-fi-o-rama.com/wp-content/uploads/2009/07/dan_mcpharlin_wired_spread03.jpg',
17
+ 'http://www.sci-fi-o-rama.com/wp-content/uploads/2009/07/dan_mcpharlin_wired_spread02.jpg',
18
+ 'http://www.sci-fi-o-rama.com/wp-content/uploads/2009/02/dan_mcpharlin_untitled.jpg'
19
+ ],
20
+
21
+ 'options' => {
22
+ 'steps' => [{
23
+ 'name' => 'annotated',
24
+ 'command' => 'convert',
25
+ 'options' => '-font helvetica -fill red -draw "font-size 35; text 75,75 CloudCrowd!"',
26
+ 'extension' => 'jpg'
27
+ },{
28
+ 'name' => 'blurred',
29
+ 'command' => 'convert',
30
+ 'options' => '-blur 10x5',
31
+ 'extension' => 'png'
32
+ },{
33
+ 'name' => 'bw',
34
+ 'input' => 'blurred',
35
+ 'command' => 'convert',
36
+ 'options' => '-monochrome',
37
+ 'extension' => 'jpg'
38
+ }]
39
+ }
40
+
41
+ }.to_json}
42
+ )
43
+
44
+ # status = RestClient.get('http://localhost:9173/jobs/[job_id]')
45
+
46
+ # puts JSON.parse(RestClient.get('http://localhost:9173/jobs/[job_id]'))['outputs'].values.map {|v|
47
+ # JSON.parse(v).map {|v| v['url']}
48
+ # }.flatten.join("\n")
@@ -0,0 +1,30 @@
1
+ RestClient.post(
2
+ 'http://localhost:9173/jobs',
3
+ {:job => {
4
+
5
+ 'action' => 'process_pdfs',
6
+
7
+ 'inputs' => [
8
+ 'http://tigger.uic.edu/~victor/personal/futurism.pdf',
9
+ 'http://www.jonasmekas.com/Catalog_excerpt/The%20Avant-Garde%20From%20Futurism%20to%20Fluxus.pdf',
10
+ 'http://www.dzignism.com/articles/Futurist.Manifesto.pdf'
11
+ ],
12
+
13
+ 'options' => {
14
+
15
+ 'batch_size' => 7,
16
+
17
+ 'images' => [{
18
+ 'name' => '700',
19
+ 'options' => '-resize 700x -density 220 -depth 4 -unsharp 0.5x0.5+0.5+0.03',
20
+ 'extension' => 'gif'
21
+ },{
22
+ 'name' => '1000',
23
+ 'options' => '-resize 1000x -density 220 -depth 4 -unsharp 0.5x0.5+0.5+0.03',
24
+ 'extension' => 'gif'
25
+ }]
26
+
27
+ }
28
+
29
+ }.to_json}
30
+ )
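Both example scripts send the job as a single form field, `job`, whose value is a JSON string. A minimal sketch of building and sanity-checking that payload before posting it is shown below. The input URL and the list of required keys are assumptions made for illustration, and the final `RestClient.post` call is left commented out because it needs the rest-client gem and a running central server.

```ruby
require 'json'

# Job description in the same shape as the examples above.
job = {
  'action'  => 'process_pdfs',
  'inputs'  => ['http://example.com/sample.pdf'], # placeholder URL
  'options' => {
    'batch_size' => 7,
    'images'     => [
      { 'name' => '700', 'options' => '-resize 700x', 'extension' => 'gif' }
    ]
  }
}

# The job travels as one form field whose value is a JSON string.
payload = { :job => job.to_json }

# Round-trip the JSON locally so that mistakes surface before the request.
# The required keys listed here are an assumption, not a documented contract.
decoded = JSON.parse(payload[:job])
missing = %w(action inputs options) - decoded.keys
raise ArgumentError, "job is missing: #{missing.join(', ')}" unless missing.empty?

# RestClient.post('http://localhost:9173/jobs', payload)
```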
@@ -15,6 +15,7 @@ gem 'sinatra'
15
15
  autoload :ActiveRecord, 'activerecord'
16
16
  autoload :Benchmark, 'benchmark'
17
17
  autoload :Daemons, 'daemons'
18
+ autoload :Digest, 'digest'
18
19
  autoload :ERB, 'erb'
19
20
  autoload :FileUtils, 'fileutils'
20
21
  autoload :JSON, 'json'
@@ -39,7 +40,7 @@ module CloudCrowd
39
40
  ROOT = File.expand_path(File.dirname(__FILE__) + '/..')
40
41
 
41
42
  # Keep the version in sync with the gemspec.
42
- VERSION = '0.0.4'
43
+ VERSION = '0.0.5'
43
44
 
44
45
  # A Job is processing if its WorkUnits are in the queue to be handled by workers.
45
46
  PROCESSING = 1
@@ -74,21 +75,22 @@ module CloudCrowd
74
75
  class << self
75
76
  attr_reader :config
76
77
 
77
- # Configure CloudCrowd by passing in the path to +config.yml+.
78
+ # Configure CloudCrowd by passing in the path to <tt>config.yml</tt>.
78
79
  def configure(config_path)
79
80
  @config_path = File.expand_path(File.dirname(config_path))
80
81
  @config = YAML.load_file(config_path)
81
82
  end
82
83
 
83
84
  # Configure the CloudCrowd central database (and connect to it), by passing
84
- # in a path to +database.yml+.
85
+ # in a path to <tt>database.yml</tt>. The file should use the standard
86
+ # ActiveRecord connection format.
85
87
  def configure_database(config_path)
86
88
  configuration = YAML.load_file(config_path)
87
89
  ActiveRecord::Base.establish_connection(configuration)
88
90
  end
89
91
 
90
- # Keep an authenticated (if configured to enable authentication) resource
91
- # for the central server.
92
+ # Get a reference to the central server, including authentication,
93
+ # if configured.
92
94
  def central_server
93
95
  return @central_server if @central_server
94
96
  params = [CloudCrowd.config[:central_server]]
@@ -96,26 +98,29 @@ module CloudCrowd
96
98
  @central_server = RestClient::Resource.new(*params)
97
99
  end
98
100
 
99
- # Return the readable status name of an internal CloudCrowd status number.
101
+ # Return the displayable status name of an internal CloudCrowd status number.
102
+ # (See the above constants).
100
103
  def display_status(status)
101
104
  DISPLAY_STATUS_MAP[status]
102
105
  end
103
106
 
104
- # Some workers might not ever need to load all the installed actions,
105
- # so we lazy-load them. Think about a variant of this for installing and
106
- # loading actions into a running CloudCrowd cluster on the fly.
107
- def actions(name)
108
- action_class = Inflector.camelize(name)
109
- begin
110
- raise NameError, "can't find the #{action_class} Action" unless Module.constants.include?(action_class)
111
- Module.const_get(action_class)
112
- rescue NameError => e
113
- user_action = "#{@config_path}/actions/#{name}"
114
- default_action = "#{ROOT}/actions/#{name}"
115
- require user_action and retry if File.exists? "#{user_action}.rb"
116
- require default_action and retry if File.exists? "#{default_action}.rb"
117
- raise e
107
+ # CloudCrowd::Actions are requested dynamically by name. Access them through
108
+ # this actions property, which behaves like a hash. At load time, we
109
+ # load all installed Actions and CloudCrowd's default Actions into it.
110
+ # If you wish to have certain workers be specialized to only handle certain
111
+ # Actions, then install only those into the actions directory.
112
+ def actions
113
+ return @actions if @actions
114
+ @actions = {}
115
+ default_actions = Dir["#{ROOT}/actions/*.rb"]
116
+ actions_dir = CloudCrowd.config[:actions_path] || "#{@config_path}/actions"
117
+ custom_actions = Dir["#{actions_dir}/*.rb"]
118
+ (default_actions + custom_actions).each do |path|
119
+ name = File.basename(path, File.extname(path))
120
+ require path
121
+ @actions[name] = Module.const_get(Inflector.camelize(name))
118
122
  end
123
+ @actions
119
124
  end
120
125
  end
121
126
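The eager-loading approach in the new `actions` method can be tried outside CloudCrowd with a short sketch. The names `camelize`, `custom_actions_dir` and `load_actions` are hypothetical stand-ins (the real code uses `Inflector.camelize` and `CloudCrowd.config`), and the action file is written to a temporary directory purely for the demonstration. Note that `Dir[]` returns an empty array when nothing matches, and an empty array is truthy in Ruby, so the fallback directory has to be chosen before globbing.

```ruby
require 'tmpdir'

# Minimal stand-in for Inflector.camelize: snake_case in, CamelCase out.
def camelize(name)
  name.split('_').map(&:capitalize).join
end

# Prefer an explicit :actions_path, else fall back to <config dir>/actions.
def custom_actions_dir(config, config_path)
  config[:actions_path] || "#{config_path}/actions"
end

# Require every *.rb file in the given directories and index the resulting
# classes by file name, so they can be looked up like a hash.
def load_actions(*dirs)
  dirs.flat_map { |dir| Dir["#{dir}/*.rb"] }.each_with_object({}) do |path, actions|
    name = File.basename(path, File.extname(path))
    require path
    actions[name] = Object.const_get(camelize(name))
  end
end

# Demonstration with a throwaway action file.
actions = Dir.mktmpdir do |dir|
  File.write("#{dir}/word_count.rb", "class WordCount; end\n")
  load_actions(dir)
end

puts actions.keys.inspect
puts custom_actions_dir({}, '/etc/crowd')
```

Loading everything up front means an unknown action name is just a missing hash key, rather than a `NameError` raised in the middle of a job.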