documentcloud-cloud-crowd 0.0.4 → 0.0.5

@@ -0,0 +1,17 @@
1
+ The crowd, suddenly there where there was nothing before, is a mysterious and
2
+ universal phenomenon. A few people may have been standing together -- five, ten
3
+ or twelve, not more; nothing has been announced, nothing is expected. Suddenly
4
+ everywhere is black with people and more come streaming from all sides as though
5
+ streets had only one direction. Most of them do not know what has happened and,
6
+ if questioned, have no answer; but they hurry to be there where most other
7
+ people are. There is a determination in their movement which is quite different
8
+ from the expression of ordinary curiosity. It seems as though the movement of
9
+ some of them transmits itself to all the others. But that is not all; they have
10
+ a goal which is there before they can find words for it. -p 16
11
+
12
+ Crowd crystals are the small, rigid groups of men, strictly delimited and of
13
+ great constancy, which serve to precipitate crowds. Their structure is such
14
+ that they can be comprehended and taken in at a glance. Their unity is more
15
+ important than their size. -p 73
16
+
17
+ From Elias Canetti's "Crowds and Power" (1962).
data/LICENSE ADDED
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2009 Jeremy Ashkenas, DocumentCloud
2
+
3
+ Permission is hereby granted, free of charge, to any person
4
+ obtaining a copy of this software and associated documentation
5
+ files (the "Software"), to deal in the Software without
6
+ restriction, including without limitation the rights to use,
7
+ copy, modify, merge, publish, distribute, sublicense, and/or sell
8
+ copies of the Software, and to permit persons to whom the
9
+ Software is furnished to do so, subject to the following
10
+ conditions:
11
+
12
+ The above copyright notice and this permission notice shall be
13
+ included in all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
16
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
17
+ OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
18
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
19
+ HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
20
+ WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
21
+ FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
22
+ OTHER DEALINGS IN THE SOFTWARE.
data/README ADDED
@@ -0,0 +1,75 @@
1
+
2
+ _ _
3
+ ( ` )_
4
+ ( ) `)
5
+ (_ (_ . _) _)
6
+ _
7
+ ( )
8
+ _ . ( ` ) . )
9
+ ( _ )_ (_, _( ,_)_)
10
+ (_ _(_ ,)
11
+
12
+ _ _ ___ _ _ ___ _
13
+ ( ` )_ / __| |___ _ _ __| |/ __|_ _ _____ __ ____| |
14
+ ( ) `) | (__| / _ \ || / _` | (__| '_/ _ \ V V / _` |
15
+ (_ (_ . _) _) \___|_\___/\_,_\__,_|\___|_| \___/\_/\_/\__,_|
16
+
17
+ _
18
+ ( )
19
+ _, _ . ( ` ) . )
20
+ ( ( _ )_ (_, _( ,_)_)
21
+ (_(_ _(_ ,)
22
+
23
+
24
+
25
+ ~ CloudCrowd ~
26
+
27
+ * A batch-processing system, map-reduce style
28
+ * Write your scripts in Ruby
29
+ * Built for Amazon EC2 and S3
30
+ * split -> process -> merge
31
+ * As easy as `gem install cloud-crowd`
32
+
33
+
34
+ ~ Getting started ~
35
+
36
+ # Install the gem (documentcloud-cloud-crowd until the first official release).
37
+
38
+ >> sudo gem install cloud-crowd
39
+
40
+ # Install the CloudCrowd configuration files to a location of your choosing.
41
+
42
+ >> crowd install ~/config/cloud-crowd
43
+
44
+ # Now, you can use the full complement of `crowd` commands from inside of
45
+ # this configuration directory. To see the available commands:
46
+
47
+ >> crowd --help
48
+
49
+ # Edit the configuration files to your satisfaction, and add AWS credentials.
50
+
51
+ >> mate ~/config/cloud-crowd/config.yml
52
+ >> mate ~/config/cloud-crowd/database.yml
53
+
54
+ # Write your actions, and install them into the 'actions' subdirectory.
55
+ # CloudCrowd comes with some default actions as an example.
56
+
57
+ # To spin up the central server (make sure that you include its location
58
+ # in config.yml), either:
59
+
60
+ >> crowd server
61
+
62
+ # or:
63
+
64
+ >> thin -R config.ru --servers 3 -e production start
65
+
66
+ # Any server that supports Rack should work with the rackup file.
67
+
68
+ # Then, to spin up 10 workers:
69
+
70
+ >> crowd workers start -n 10
71
+
72
+ # To spin up workers remotely, install the 'cloud-crowd' gem, and copy over
73
+ # your configuration directory.
74
+
75
+
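The README's "split -> process -> merge" bullet describes the shape of every job. As a rough, plain-Ruby illustration of that data flow (none of the method names below are CloudCrowd API; they are hypothetical stand-ins, and the word-count task is made up for the example):

```ruby
# Plain-Ruby illustration of the split -> process -> merge shape.
# These are hypothetical free-standing methods, not CloudCrowd::Action hooks.

# split: break one large input into independent batches of lines.
def split(text, batch_size)
  text.lines.each_slice(batch_size).to_a
end

# process: do the real work on a single batch (here, count words).
def process(lines)
  lines.sum { |line| line.split.size }
end

# merge: combine the per-batch results into one final answer.
def merge(counts)
  counts.sum
end

text     = "the crowd\nis a mysterious\nand universal\nphenomenon\n"
batches  = split(text, 2)
partials = batches.map { |batch| process(batch) }
total    = merge(partials)
```

In CloudCrowd each batch would be handed to a separate worker; here `map` simply plays that role in a single process.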
@@ -12,9 +12,7 @@ class GraphicsMagick < CloudCrowd::Action
12
12
  # Download the initial image, and run each of the specified GraphicsMagick
13
13
  # commands against it, returning the aggregate output.
14
14
  def process
15
- result = {}
16
- options['steps'].each {|step| result[step['name']] = run_step(step) }
17
- result.to_json
15
+ options['steps'].inject({}) {|h, step| h[step['name']] = run_step(step); h }
18
16
  end
19
17
 
20
18
  # Run an individual step (single GraphicsMagick command) in a shell-injection
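The hunk above folds the explicit accumulator into a single `inject` call. It also drops the trailing `.to_json`, so `process` now returns a Hash rather than a JSON string; where serialization happens instead is not shown in this diff. A minimal sketch of the equivalence, using a stand-in `run_step` (hypothetical; the real one shells out to GraphicsMagick):

```ruby
# Stand-in for the real run_step. Hypothetical: it only returns a fake
# output file name for the step instead of running a command.
def run_step(step)
  "#{step['name']}.#{step['extension']}"
end

steps = [
  { 'name' => 'annotated', 'extension' => 'jpg' },
  { 'name' => 'blurred',   'extension' => 'png' }
]

# 0.0.4 style: build the hash with an explicit accumulator.
explicit = {}
steps.each { |step| explicit[step['name']] = run_step(step) }

# 0.0.5 style: inject threads the hash through the block. The block has to
# return the hash (the trailing `h`) so the next iteration receives it.
folded = steps.inject({}) { |h, step| h[step['name']] = run_step(step); h }
```

Forgetting the trailing `h` is the usual pitfall with this idiom, since the assignment expression would otherwise become the accumulator.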
@@ -0,0 +1,92 @@
1
+ # Depends on working pdftk, gm (GraphicsMagick), and pdftotext (Poppler) commands.
2
+ # Splits a pdf into batches of N pages, creates their thumbnails and icons,
3
+ # as specified in the Job options, gets the text for every page, and merges
4
+ # it all back into a tar archive for convenient download.
5
+ #
6
+ # See <tt>examples/process_pdfs_example.rb</tt> for more information.
7
+ class ProcessPdfs < CloudCrowd::Action
8
+
9
+ # Split up a large pdf into single-page pdfs.
10
+ # The double pdftk shuffle fixes the document xrefs.
11
+ def split
12
+ `pdftk #{input_path} burst output "#{file_name}_%05d.pdf_temp"`
13
+ FileUtils.rm input_path
14
+ pdfs = Dir["*.pdf_temp"]
15
+ pdfs.each {|pdf| `pdftk #{pdf} output #{File.basename(pdf, '.pdf_temp')}.pdf`}
16
+ pdfs = Dir["*.pdf"]
17
+ batch_size = options['batch_size']
18
+ batches = (pdfs.length / batch_size.to_f).ceil
19
+ batches.times do |batch_num|
20
+ tar_path = "#{sprintf('%05d', batch_num)}.tar"
21
+ batch_pdfs = pdfs[batch_num*batch_size...(batch_num + 1)*batch_size]
22
+ `tar -czf #{tar_path} #{batch_pdfs.join(' ')}`
23
+ end
24
+ Dir["*.tar"].map {|tar| save(tar) }.to_json
25
+ end
26
+
27
+ # Convert a pdf page into different-sized thumbnails. Grab the text.
28
+ def process
29
+ `tar -xzf #{input_path}`
30
+ FileUtils.rm input_path
31
+ cmds = []
32
+ generate_images_commands(cmds)
33
+ generate_text_commands(cmds)
34
+ system cmds.join(' && ')
35
+ FileUtils.rm Dir['*.pdf']
36
+ `tar -czf #{file_name}.tar *`
37
+ save("#{file_name}.tar")
38
+ end
39
+
40
+ # Merge all of the resulting images, all of the resulting text files, and
41
+ # the concatenated merge of the full-text into a single tar archive, ready
42
+ # for download.
43
+ def merge
44
+ JSON.parse(input).each do |batch_url|
45
+ batch_path = File.basename(batch_url)
46
+ download(batch_url, batch_path)
47
+ `tar -xzf #{batch_path}`
48
+ FileUtils.rm batch_path
49
+ end
50
+
51
+ names = Dir['*.txt'].map {|fn| fn.sub(/_\d+(_\w+)?\.txt\Z/, '') }.uniq
52
+ dirs = names.map {|n| ["#{n}/text/full", "#{n}/text/pages"] + options['images'].map {|i| "#{n}/images/#{i['name']}" } }.flatten
53
+ FileUtils.mkdir_p(dirs)
54
+
55
+ Dir['*.*'].each do |file|
56
+ ext = File.extname(file)
57
+ name = file.sub(/_\d+(_\w+)?#{ext}\Z/, '')
58
+ if ext == '.txt'
59
+ FileUtils.mv(file, "#{name}/text/pages/#{file}")
60
+ else
61
+ suffix = file.match(/_([^_]+)#{ext}\Z/)[1]
62
+ sans_suffix = file.sub(/_([^_]+)#{ext}\Z/, ext)
63
+ FileUtils.mv(file, "#{name}/images/#{suffix}/#{sans_suffix}")
64
+ end
65
+ end
66
+
67
+ names.each {|n| `cat #{n}/text/pages/*.txt > #{n}/text/full/#{n}.txt` }
68
+
69
+ `tar -czf processed_pdfs.tar *`
70
+ save("processed_pdfs.tar")
71
+ end
72
+
73
+
74
+ private
75
+
76
+ def generate_images_commands(command_list)
77
+ Dir["*.pdf"].each do |pdf|
78
+ name = File.basename(pdf, File.extname(pdf))
79
+ options['images'].each do |i|
80
+ command_list << "gm convert #{i['options']} #{pdf} #{name}_#{i['name']}.#{i['extension']}"
81
+ end
82
+ end
83
+ end
84
+
85
+ def generate_text_commands(command_list)
86
+ Dir["*.pdf"].each do |pdf|
87
+ name = File.basename(pdf, File.extname(pdf))
88
+ command_list << "pdftotext -enc UTF-8 -layout -q #{pdf} #{name}.txt"
89
+ end
90
+ end
91
+
92
+ end
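Two pieces of `ProcessPdfs` are easy to check in isolation: the batch arithmetic in `split` and the filename routing in `merge`. The sketch below reproduces both under assumptions: the file names are invented, `destination_for` is a hypothetical helper that does not exist in the action, and it escapes the extension with `Regexp.escape`, which the original regexes do not do.

```ruby
# Page files as the burst step would name them, e.g. "doc_00001.pdf".
# The document name and page count are made up for illustration.
pdfs       = (1..16).map { |n| format('doc_%05d.pdf', n) }
batch_size = 7

# Same arithmetic as ProcessPdfs#split: round up so that the final,
# partial batch is not dropped.
batches = (pdfs.length / batch_size.to_f).ceil
sliced  = (0...batches).map do |batch_num|
  pdfs[batch_num * batch_size...(batch_num + 1) * batch_size]
end

# Hypothetical helper mirroring the regexes in ProcessPdfs#merge: work out
# where a processed page file belongs inside the final archive.
def destination_for(file)
  ext  = File.extname(file)
  dot  = Regexp.escape(ext) # assumption: escaped here, unlike the original
  name = file.sub(/_\d+(_\w+)?#{dot}\z/, '')
  return "#{name}/text/pages/#{file}" if ext == '.txt'

  suffix      = file.match(/_([^_]+)#{dot}\z/)[1]
  sans_suffix = file.sub(/_([^_]+)#{dot}\z/, ext)
  "#{name}/images/#{suffix}/#{sans_suffix}"
end

text_dest  = destination_for('doc_00001.txt')
image_dest = destination_for('doc_00001_700.gif')
```

`Array#each_slice` produces the same grouping as the manual range slicing, which may be a simpler way to express the batching.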
@@ -1,7 +1,7 @@
1
1
  Gem::Specification.new do |s|
2
2
  s.name = 'cloud-crowd'
3
- s.version = '0.0.4' # Keep version in sync with cloud-crowd.rb
4
- s.date = '2009-08-23'
3
+ s.version = '0.0.5' # Keep version in sync with cloud-crowd.rb
4
+ s.date = '2009-09-01'
5
5
 
6
6
  s.homepage = "http://documentcloud.org" # wiki page on github?
7
7
  s.summary = "Better living through Map --> Ruby --> Reduce"
@@ -15,13 +15,19 @@ Gem::Specification.new do |s|
15
15
 
16
16
  s.authors = ['Jeremy Ashkenas']
17
17
  s.email = 'jeremy@documentcloud.org'
18
+ s.rubyforge_project = 'cloud-crowd'
18
19
 
19
20
  s.require_paths = ['lib']
20
21
  s.executables = ['crowd']
21
22
 
22
23
  # s.post_install_message = "Run `crowd --help` for information on using CloudCrowd."
23
- s.rubyforge_project = 'cloud-crowd'
24
- s.has_rdoc = true
24
+
25
+ s.has_rdoc = true
26
+ s.extra_rdoc_files = ['README']
27
+ s.rdoc_options << '--title' << 'CloudCrowd | Better Living through Map --> Ruby --> Reduce' <<
28
+ '--exclude' << 'test' <<
29
+ '--main' << 'README' <<
30
+ '--all'
25
31
 
26
32
  s.add_dependency 'sinatra', ['>= 0.9.4']
27
33
  s.add_dependency 'activerecord', ['>= 2.3.3']
@@ -40,16 +46,21 @@ Gem::Specification.new do |s|
40
46
 
41
47
  s.files = %w(
42
48
  actions/graphics_magick.rb
49
+ actions/process_pdfs.rb
43
50
  cloud-crowd.gemspec
44
51
  config/config.example.ru
45
52
  config/config.example.yml
46
53
  config/database.example.yml
54
+ EPIGRAPHS
55
+ examples/graphics_magick_example.rb
56
+ examples/process_pdfs_example.rb
47
57
  lib/cloud-crowd.rb
48
58
  lib/cloud_crowd/action.rb
49
59
  lib/cloud_crowd/app.rb
50
60
  lib/cloud_crowd/asset_store.rb
51
61
  lib/cloud_crowd/command_line.rb
52
62
  lib/cloud_crowd/daemon.rb
63
+ lib/cloud_crowd/exceptions.rb
53
64
  lib/cloud_crowd/helpers/authorization.rb
54
65
  lib/cloud_crowd/helpers/resources.rb
55
66
  lib/cloud_crowd/helpers.rb
@@ -60,13 +71,22 @@ lib/cloud_crowd/models.rb
60
71
  lib/cloud_crowd/runner.rb
61
72
  lib/cloud_crowd/schema.rb
62
73
  lib/cloud_crowd/worker.rb
74
+ LICENSE
75
+ public/css/admin_console.css
76
+ public/css/reset.css
77
+ public/images/queue_fill.png
78
+ public/js/admin_console.js
79
+ public/js/jquery-1.3.2.js
80
+ README
63
81
  test/acceptance/test_failing_work_units.rb
64
82
  test/blueprints.rb
83
+ test/config/config.ru
65
84
  test/config/config.yml
66
85
  test/config/database.yml
67
86
  test/config/actions/failure_testing.rb
68
87
  test/test_helper.rb
69
88
  test/unit/test_job.rb
70
89
  test/unit/test_work_unit.rb
90
+ views/index.erb
71
91
  )
72
92
  end
@@ -19,6 +19,11 @@
19
19
  :login: [your login name]
20
20
  :password: [your password]
21
21
 
22
+ # By default, CloudCrowd looks for installed actions inside the 'actions'
23
+ # subdirectory of this configuration folder. 'actions_path' allows you to install
24
+ # them in a different location.
25
+ # :actions_path: /path/to/actions
26
+
22
27
  # Set the following numbers to tweak the configuration of your worker daemons.
23
28
  # Optimum results will depend on proportion of the Memory/CPU/IO bottlenecks
24
29
  # in your actions, the number of central servers you have running, and your
@@ -0,0 +1,48 @@
1
+ # Inside of a restclient session:
2
+ # This is a fancy example that produces black and white, annotated, and blurred
3
+ # versions of a list of URLs downloaded from the web.
4
+
5
+ require 'json'
6
+
7
+ RestClient.post(
8
+ 'http://localhost:9173/jobs',
9
+ {:job => {
10
+
11
+ 'action' => 'graphics_magick',
12
+
13
+ 'inputs' => [
14
+ 'http://www.sci-fi-o-rama.com/wp-content/uploads/2008/10/dan_mcpharlin_the_land_of_sleeping_things.jpg',
15
+ 'http://www.sci-fi-o-rama.com/wp-content/uploads/2009/07/dan_mcpharlin_wired_spread01.jpg',
16
+ 'http://www.sci-fi-o-rama.com/wp-content/uploads/2009/07/dan_mcpharlin_wired_spread03.jpg',
17
+ 'http://www.sci-fi-o-rama.com/wp-content/uploads/2009/07/dan_mcpharlin_wired_spread02.jpg',
18
+ 'http://www.sci-fi-o-rama.com/wp-content/uploads/2009/02/dan_mcpharlin_untitled.jpg'
19
+ ],
20
+
21
+ 'options' => {
22
+ 'steps' => [{
23
+ 'name' => 'annotated',
24
+ 'command' => 'convert',
25
+ 'options' => '-font helvetica -fill red -draw "font-size 35; text 75,75 CloudCrowd!"',
26
+ 'extension' => 'jpg'
27
+ },{
28
+ 'name' => 'blurred',
29
+ 'command' => 'convert',
30
+ 'options' => '-blur 10x5',
31
+ 'extension' => 'png'
32
+ },{
33
+ 'name' => 'bw',
34
+ 'input' => 'blurred',
35
+ 'command' => 'convert',
36
+ 'options' => '-monochrome',
37
+ 'extension' => 'jpg'
38
+ }]
39
+ }
40
+
41
+ }.to_json}
42
+ )
43
+
44
+ # status = RestClient.get('http://localhost:9173/jobs/[job_id]')
45
+
46
+ # puts JSON.parse(RestClient.get('http://localhost:9173/jobs/[job_id]'))['outputs'].values.map {|v|
47
+ # JSON.parse(v).map {|v| v['url']}
48
+ # }.flatten.join("\n")
@@ -0,0 +1,30 @@
1
+ RestClient.post(
2
+ 'http://localhost:9173/jobs',
3
+ {:job => {
4
+
5
+ 'action' => 'process_pdfs',
6
+
7
+ 'inputs' => [
8
+ 'http://tigger.uic.edu/~victor/personal/futurism.pdf',
9
+ 'http://www.jonasmekas.com/Catalog_excerpt/The%20Avant-Garde%20From%20Futurism%20to%20Fluxus.pdf',
10
+ 'http://www.dzignism.com/articles/Futurist.Manifesto.pdf'
11
+ ],
12
+
13
+ 'options' => {
14
+
15
+ 'batch_size' => 7,
16
+
17
+ 'images' => [{
18
+ 'name' => '700',
19
+ 'options' => '-resize 700x -density 220 -depth 4 -unsharp 0.5x0.5+0.5+0.03',
20
+ 'extension' => 'gif'
21
+ },{
22
+ 'name' => '1000',
23
+ 'options' => '-resize 1000x -density 220 -depth 4 -unsharp 0.5x0.5+0.5+0.03',
24
+ 'extension' => 'gif'
25
+ }]
26
+
27
+ }
28
+
29
+ }.to_json}
30
+ )
@@ -15,6 +15,7 @@ gem 'sinatra'
15
15
  autoload :ActiveRecord, 'activerecord'
16
16
  autoload :Benchmark, 'benchmark'
17
17
  autoload :Daemons, 'daemons'
18
+ autoload :Digest, 'digest'
18
19
  autoload :ERB, 'erb'
19
20
  autoload :FileUtils, 'fileutils'
20
21
  autoload :JSON, 'json'
@@ -39,7 +40,7 @@ module CloudCrowd
39
40
  ROOT = File.expand_path(File.dirname(__FILE__) + '/..')
40
41
 
41
42
  # Keep the version in sync with the gemspec.
42
- VERSION = '0.0.4'
43
+ VERSION = '0.0.5'
43
44
 
44
45
  # A Job is processing if its WorkUnits are in the queue to be handled by workers.
45
46
  PROCESSING = 1
@@ -74,21 +75,22 @@ module CloudCrowd
74
75
  class << self
75
76
  attr_reader :config
76
77
 
77
- # Configure CloudCrowd by passing in the path to +config.yml+.
78
+ # Configure CloudCrowd by passing in the path to <tt>config.yml</tt>.
78
79
  def configure(config_path)
79
80
  @config_path = File.expand_path(File.dirname(config_path))
80
81
  @config = YAML.load_file(config_path)
81
82
  end
82
83
 
83
84
  # Configure the CloudCrowd central database (and connect to it), by passing
84
- # in a path to +database.yml+.
85
+ # in a path to <tt>database.yml</tt>. The file should use the standard
86
+ # ActiveRecord connection format.
85
87
  def configure_database(config_path)
86
88
  configuration = YAML.load_file(config_path)
87
89
  ActiveRecord::Base.establish_connection(configuration)
88
90
  end
89
91
 
90
- # Keep an authenticated (if configured to enable authentication) resource
91
- # for the central server.
92
+ # Get a reference to the central server, including authentication,
93
+ # if configured.
92
94
  def central_server
93
95
  return @central_server if @central_server
94
96
  params = [CloudCrowd.config[:central_server]]
@@ -96,26 +98,29 @@ module CloudCrowd
96
98
  @central_server = RestClient::Resource.new(*params)
97
99
  end
98
100
 
99
- # Return the readable status name of an internal CloudCrowd status number.
101
+ # Return the displayable status name of an internal CloudCrowd status number.
102
+ # (See the above constants).
100
103
  def display_status(status)
101
104
  DISPLAY_STATUS_MAP[status]
102
105
  end
103
106
 
104
- # Some workers might not ever need to load all the installed actions,
105
- # so we lazy-load them. Think about a variant of this for installing and
106
- # loading actions into a running CloudCrowd cluster on the fly.
107
- def actions(name)
108
- action_class = Inflector.camelize(name)
109
- begin
110
- raise NameError, "can't find the #{action_class} Action" unless Module.constants.include?(action_class)
111
- Module.const_get(action_class)
112
- rescue NameError => e
113
- user_action = "#{@config_path}/actions/#{name}"
114
- default_action = "#{ROOT}/actions/#{name}"
115
- require user_action and retry if File.exists? "#{user_action}.rb"
116
- require default_action and retry if File.exists? "#{default_action}.rb"
117
- raise e
107
+ # CloudCrowd::Actions are requested dynamically by name. Access them through
108
+ # this actions property, which behaves like a hash. At load time, we
109
+ # load all installed Actions and CloudCrowd's default Actions into it.
110
+ # If you wish to have certain workers be specialized to only handle certain
111
+ # Actions, then install only those into the actions directory.
112
+ def actions
113
+ return @actions if @actions
114
+ @actions = {}
115
+ default_actions = Dir["#{ROOT}/actions/*.rb"]
116
+ actions_dir = CloudCrowd.config[:actions_path] || "#{@config_path}/actions"
117
+ custom_actions = Dir["#{actions_dir}/*.rb"]
118
+ (default_actions + custom_actions).each do |path|
119
+ name = File.basename(path, File.extname(path))
120
+ require path
121
+ @actions[name] = Module.const_get(Inflector.camelize(name))
118
122
  end
123
+ @actions
119
124
  end
120
125
  end
121
126
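One detail worth keeping in mind when resolving the custom actions directory: `Dir[]` always returns an array, and in Ruby even an empty array is truthy, so a fallback to the default `actions` folder has to be applied to the path before globbing, not to the glob result. A minimal sketch, assuming a hypothetical helper `custom_action_files` whose arguments stand in for `CloudCrowd.config[:actions_path]` and the configuration directory:

```ruby
require 'tmpdir'

# Hypothetical helper: choose the directory to load custom actions from.
# `configured` stands in for CloudCrowd.config[:actions_path] (may be nil);
# `config_dir` stands in for the directory that holds config.yml.
def custom_action_files(configured, config_dir)
  dir = configured || File.join(config_dir, 'actions')
  Dir[File.join(dir, '*.rb')]
end

# An empty glob result is still truthy, so `||` never reaches its right side.
empty_glob_wins = ([] || :fallback) == []

# With no configured path, actions are found in <config_dir>/actions.
found_names = Dir.mktmpdir do |config_dir|
  actions = File.join(config_dir, 'actions')
  Dir.mkdir(actions)
  File.write(File.join(actions, 'word_count.rb'), "# placeholder action\n")
  custom_action_files(nil, config_dir).map { |f| File.basename(f, '.rb') }
end
```

The `word_count` action name and the temporary directory are only there to make the sketch self-contained.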