RubyGems - documentcloud-cloud-crowd - Versions diffs - 0.0.4 → 0.0.5 - Mend

documentcloud-cloud-crowd 0.0.4 → 0.0.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (36)

data/EPIGRAPHS +17 -0
data/LICENSE +22 -0
data/README +75 -0
data/actions/graphics_magick.rb +1 -3
data/actions/process_pdfs.rb +92 -0
data/cloud-crowd.gemspec +24 -4
data/config/config.example.yml +5 -0
data/examples/graphics_magick_example.rb +48 -0
data/examples/process_pdfs_example.rb +30 -0
data/lib/cloud-crowd.rb +25 -20
data/lib/cloud_crowd/action.rb +29 -24
data/lib/cloud_crowd/app.rb +40 -13
data/lib/cloud_crowd/asset_store.rb +13 -6
data/lib/cloud_crowd/command_line.rb +11 -5
data/lib/cloud_crowd/daemon.rb +7 -2
data/lib/cloud_crowd/exceptions.rb +17 -0
data/lib/cloud_crowd/helpers.rb +1 -1
data/lib/cloud_crowd/helpers/authorization.rb +7 -3
data/lib/cloud_crowd/helpers/resources.rb +12 -3
data/lib/cloud_crowd/inflector.rb +1 -1
data/lib/cloud_crowd/models/job.rb +75 -38
data/lib/cloud_crowd/models/work_unit.rb +14 -8
data/lib/cloud_crowd/schema.rb +3 -1
data/lib/cloud_crowd/worker.rb +32 -15
data/public/css/admin_console.css +51 -0
data/public/css/reset.css +52 -0
data/public/images/queue_fill.png +0 -0
data/public/js/admin_console.js +51 -0
data/public/js/jquery-1.3.2.js +4376 -0
data/test/acceptance/test_failing_work_units.rb +2 -2
data/test/blueprints.rb +1 -0
data/test/config/config.ru +17 -0
data/test/unit/test_job.rb +5 -5
data/test/unit/test_work_unit.rb +1 -1
data/views/index.erb +22 -0
metadata +27 -8

data/EPIGRAPHS ADDED

@@ -0,0 +1,17 @@
+The crowd, suddenly there where there was nothing before, is a mysterious and
+universal phenomenon. A few people may have been standing together -- five, ten
+or twelve, not more; nothing has been announced, nothing is expected. Suddenly
+everywhere is black with people and more come streaming from all sides as though
+streets had only one direction. Most of them do not know what has happened and,
+if questioned, have no answer; but they hurry to be there where most other
+people are. There is a determination in their movement which is quite different
+from the expression of ordinary curiosity. It seems as though the movement of
+some of them transmits itself to all the others. But that is not all; they have
+a goal which is there before they can find words for it. -p 16
+Crowd crystals are the small, rigid groups of men, strictly delimited and of
+great constancy, which serve to precipitate crowds. Their structure is such
+that they can be comprehended and taken in at a glance. Their unity is more
+important than their size. -p 73
+From Elias Canetti's "Crowds and Power" (1962).

data/LICENSE ADDED

@@ -0,0 +1,22 @@
+Copyright (c) 2009 Jeremy Ashkenas, DocumentCloud
+Permission is hereby granted, free of charge, to any person
+obtaining a copy of this software and associated documentation
+files (the "Software"), to deal in the Software without
+restriction, including without limitation the rights to use,
+copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the
+Software is furnished to do so, subject to the following
+conditions:
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
+OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
+HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
+WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+OTHER DEALINGS IN THE SOFTWARE.

data/README ADDED

@@ -0,0 +1,75 @@
+           _  _
+          ( `   )_
+         (    )    `)
+       (_   (_ .  _) _)
+                                      _
+                                     (  )
+      _ .                         ( `  ) . )
+    (  _ )_                      (_, _(  ,_)_)
+  (_  _(_ ,)
+           _  _               ___ _             _  ___                   _
+          ( `   )_           / __| |___ _  _ __| |/ __|_ _ _____ __ ____| |
+         (    )    `)       | (__| / _ \ || / _` | (__| '_/ _ \ V  V / _` |
+       (_   (_ .  _) _)      \___|_\___/\_,_\__,_|\___|_| \___/\_/\_/\__,_|
+                                                     _
+                                                    (  )
+                  _, _ .                         ( `  ) . )
+                 ( (  _ )_                      (_, _(  ,_)_)
+               (_(_  _(_ ,)
+	~ CloudCrowd ~
+		* A batch-processing system, map-reduce style
+		* Write your scripts in Ruby
+		* Built for Amazon EC2 and S3
+		* split -> process -> merge
+		* As easy as `gem install cloud-crowd`
+	~ Getting started ~
+		# Install the gem (documentcloud-cloud-crowd until the first official release).
+		>> sudo gem install cloud-crowd
+		# Install the CloudCrowd configuration files to a location of your choosing.
+		>> crowd install ~/config/cloud-crowd
+		# Now, you can use the full complement of `crowd` commands from inside of
+		# this configuration directory. To see the available commands:
+		>> crowd --help
+		# Edit the configuration files to your satisfaction, and add AWS credentials.
+		>> mate ~/config/cloud-crowd/config.yml
+		>> mate ~/config/cloud-crowd/database.yml
+		# Write your actions, and install them into the 'actions' subdirectory.
+		# CloudCrowd comes with some default actions as an example.
+		# To spin up the central server (make sure that you include its location
+		# in config.yml), either:
+		>> crowd server
+		# or:
+		>> thin -R config.ru --servers 3 -e production start
+		# Any server that supports Rack should work with the rackup file.
+		# Then, to spin up 10 workers:
+		>> crowd workers start -n 10
+		# To spin up workers remotely, install the 'cloud-crowd' gem, and copy over
+		# your configuration directory.

data/actions/graphics_magick.rb CHANGED

@@ -12,9 +12,7 @@ class GraphicsMagick < CloudCrowd::Action
   # Download the initial image, and run each of the specified GraphicsMagick
   # commands against it, returning the aggregate output.
   def process
-    result = {}
-    options['steps'].each {|step| result[step['name']] = run_step(step) }
-    result.to_json
+    options['steps'].inject({}) {|h, step| h[step['name']] = run_step(step); h }
   end
   # Run an individual step (single GraphicsMagick command) in a shell-injection
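The rewritten `process` builds its result hash with `inject`, which works only because the block ends by returning the accumulator `h`. The sketch below shows that accumulation in isolation, next to the equivalent `each_with_object` form (Ruby 1.9+). The `run_step` here is a hypothetical stand-in; the real method runs a GraphicsMagick command.

```ruby
# Hypothetical stand-in for the action's run_step. The real method shells out
# to GraphicsMagick; this one only builds a file name for illustration.
def run_step(step)
  "#{step['name']}.#{step['extension']}"
end

steps = [
  { 'name' => 'annotated', 'extension' => 'jpg' },
  { 'name' => 'blurred',   'extension' => 'png' }
]

# inject feeds the block's return value into the next iteration, so the block
# must finish with the accumulator itself (the trailing `; h`).
via_inject = steps.inject({}) { |h, step| h[step['name']] = run_step(step); h }

# each_with_object yields the same object every time, so no trailing `h`.
via_each = steps.each_with_object({}) { |step, h| h[step['name']] = run_step(step) }

puts via_inject.inspect
puts via_each.inspect
```

Dropping the trailing `; h` from the `inject` form would pass the assigned string, not the hash, to the next iteration and raise on the second step.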

data/actions/process_pdfs.rb ADDED

@@ -0,0 +1,92 @@
+# Depends on working pdftk, gm (GraphicsMagick), and pdftotext (Poppler) commands.
+# Splits a pdf into batches of N pages, creates their thumbnails and icons,
+# as specified in the Job options, gets the text for every page, and merges
+# it all back into a tar archive for convenient download.
+#
+# See <tt>examples/process_pdfs_example.rb</tt> for more information.
+class ProcessPdfs < CloudCrowd::Action
+  # Split up a large pdf into single-page pdfs.
+  # The double pdftk shuffle fixes the document xrefs.
+  def split
+    `pdftk #{input_path} burst output "#{file_name}_%05d.pdf_temp"`
+    FileUtils.rm input_path
+    pdfs = Dir["*.pdf_temp"]
+    pdfs.each {|pdf| `pdftk #{pdf} output #{File.basename(pdf, '.pdf_temp')}.pdf`}
+    pdfs = Dir["*.pdf"]
+    batch_size = options['batch_size']
+    batches = (pdfs.length / batch_size.to_f).ceil
+    batches.times do |batch_num|
+      tar_path = "#{sprintf('%05d', batch_num)}.tar"
+      batch_pdfs = pdfs[batch_num*batch_size...(batch_num + 1)*batch_size]
+      `tar -czf #{tar_path} #{batch_pdfs.join(' ')}`
+    end
+    Dir["*.tar"].map {|tar| save(tar) }.to_json
+  end
+  # Convert a pdf page into different-sized thumbnails. Grab the text.
+  def process
+    `tar -xzf #{input_path}`
+    FileUtils.rm input_path
+    cmds = []
+    generate_images_commands(cmds)
+    generate_text_commands(cmds)
+    system cmds.join(' && ')
+    FileUtils.rm Dir['*.pdf']
+    `tar -czf #{file_name}.tar *`
+    save("#{file_name}.tar")
+  end
+  # Merge all of the resulting images, all of the resulting text files, and
+  # the concatenated merge of the full-text into a single tar archive, ready
+  # for download.
+  def merge
+    JSON.parse(input).each do |batch_url|
+      batch_path = File.basename(batch_url)
+      download(batch_url, batch_path)
+      `tar -xzf #{batch_path}`
+      FileUtils.rm batch_path
+    end
+    names = Dir['*.txt'].map {|fn| fn.sub(/_\d+(_\w+)?\.txt\Z/, '') }.uniq
+    dirs = names.map {|n| ["#{n}/text/full", "#{n}/text/pages"] + options['images'].map {|i| "#{n}/images/#{i['name']}" } }.flatten
+    FileUtils.mkdir_p(dirs)
+    Dir['*.*'].each do |file|
+      ext = File.extname(file)
+      name = file.sub(/_\d+(_\w+)?#{ext}\Z/, '')
+      if ext == '.txt'
+        FileUtils.mv(file, "#{name}/text/pages/#{file}")
+      else
+        suffix      = file.match(/_([^_]+)#{ext}\Z/)[1]
+        sans_suffix = file.sub(/_([^_]+)#{ext}\Z/, ext)
+        FileUtils.mv(file, "#{name}/images/#{suffix}/#{sans_suffix}")
+      end
+    end
+    names.each {|n| `cat #{n}/text/pages/*.txt > #{n}/text/full/#{n}.txt` }
+    `tar -czf processed_pdfs.tar *`
+    save("processed_pdfs.tar")
+  end
+  private
+  def generate_images_commands(command_list)
+    Dir["*.pdf"].each do |pdf|
+      name = File.basename(pdf, File.extname(pdf))
+      options['images'].each do |i|
+        command_list << "gm convert #{i['options']} #{pdf} #{name}_#{i['name']}.#{i['extension']}"
+      end
+    end
+  end
+  def generate_text_commands(command_list)
+    Dir["*.pdf"].each do |pdf|
+      name = File.basename(pdf, File.extname(pdf))
+      command_list << "pdftotext -enc UTF-8 -layout -q #{pdf} #{name}.txt"
+    end
+  end
+end
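The batching arithmetic in `split` can be checked without pdftk or tar. The following is a minimal sketch using made-up page file names; it compares the index-based slicing from the action with `each_slice`, which produces the same groups. The file names and the 16-page count are assumptions for illustration only.

```ruby
# Hypothetical page files, standing in for the output of a pdftk burst.
pdfs = (1..16).map { |n| format('doc_%05d.pdf', n) }
batch_size = 7

# Same arithmetic as the split step: ceil(pages / batch_size) batches.
batches = (pdfs.length / batch_size.to_f).ceil

# Index-based slicing with an exclusive range, as in the action.
manual = (0...batches).map do |batch_num|
  pdfs[batch_num * batch_size...(batch_num + 1) * batch_size]
end

# each_slice yields the same groups without the index arithmetic. Sorting
# first makes the page order explicit rather than relying on listing order.
sliced = pdfs.sort.each_slice(batch_size).to_a

manual.each_with_index do |batch, i|
  puts "#{format('%05d', i)}.tar holds #{batch.length} pages"
end
```

On older Rubies the order of `Dir[]` results can depend on the filesystem, so sorting before slicing is a cheap way to keep consecutive pages in the same batch.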

data/cloud-crowd.gemspec CHANGED

@@ -1,7 +1,7 @@
 Gem::Specification.new do |s|
   s.name      = 'cloud-crowd'
-  s.version   = '0.0.4'         # Keep version in sync with cloud-crowd.rb
-  s.date      = '2009-08-23'
+  s.version   = '0.0.5'         # Keep version in sync with cloud-crowd.rb
+  s.date      = '2009-09-01'
   s.homepage    = "http://documentcloud.org" # wiki page on github?
   s.summary     = "Better living through Map --> Ruby --> Reduce"
@@ -15,13 +15,19 @@ Gem::Specification.new do |s|
   s.authors     = ['Jeremy Ashkenas']
   s.email       = 'jeremy@documentcloud.org'
+  s.rubyforge_project    = 'cloud-crowd'
   s.require_paths = ['lib']
   s.executables   = ['crowd']
   # s.post_install_message = "Run `crowd --help` for information on using CloudCrowd."
-  s.rubyforge_project    = 'cloud-crowd'
-  s.has_rdoc             = true
+  s.has_rdoc          = true
+  s.extra_rdoc_files  = ['README']
+  s.rdoc_options      << '--title'    << 'CloudCrowd | Better Living through Map --> Ruby --> Reduce' <<
+                         '--exclude'  << 'test' <<
+                         '--main'     << 'README' <<
+                         '--all'
   s.add_dependency 'sinatra',       ['>= 0.9.4']
   s.add_dependency 'activerecord',  ['>= 2.3.3']
@@ -40,16 +46,21 @@ Gem::Specification.new do |s|
   s.files = %w(
 actions/graphics_magick.rb
+actions/process_pdfs.rb
 cloud-crowd.gemspec
 config/config.example.ru
 config/config.example.yml
 config/database.example.yml
+EPIGRAPHS
+examples/graphics_magick_example.rb
+examples/process_pdfs_example.rb
 lib/cloud-crowd.rb
 lib/cloud_crowd/action.rb
 lib/cloud_crowd/app.rb
 lib/cloud_crowd/asset_store.rb
 lib/cloud_crowd/command_line.rb
 lib/cloud_crowd/daemon.rb
+lib/cloud_crowd/exceptions.rb
 lib/cloud_crowd/helpers/authorization.rb
 lib/cloud_crowd/helpers/resources.rb
 lib/cloud_crowd/helpers.rb
@@ -60,13 +71,22 @@ lib/cloud_crowd/models.rb
 lib/cloud_crowd/runner.rb
 lib/cloud_crowd/schema.rb
 lib/cloud_crowd/worker.rb
+LICENSE
+public/css/admin_console.css
+public/css/reset.css
+public/images/queue_fill.png
+public/js/admin_console.js
+public/js/jquery-1.3.2.js
+README
 test/acceptance/test_failing_work_units.rb
 test/blueprints.rb
+test/config/config.ru
 test/config/config.yml
 test/config/database.yml
 test/config/actions/failure_testing.rb
 test/test_helper.rb
 test/unit/test_job.rb
 test/unit/test_work_unit.rb
+views/index.erb
 )
 end

data/config/config.example.yml CHANGED

@@ -19,6 +19,11 @@
 :login:                   [your login name]
 :password:                [your password]
+# By default, CloudCrowd looks for installed actions inside the 'actions'
+# subdirectory of this configuration folder. 'actions_path' allows you to install
+# them in a different location.
+# :actions_path: /path/to/actions
 # Set the following numbers to tweak the configuration of your worker daemons.
 # Optimum results will depend on proportion of the Memory/CPU/IO bottlenecks
 # in your actions, the number of central servers you have running, and your

data/examples/graphics_magick_example.rb ADDED

@@ -0,0 +1,48 @@
+# Inside of a restclient session:
+# This is a fancy example that produces black and white, annotated, and blurred
+# versions of a list of URLs downloaded from the web.
+require 'json'
+RestClient.post(
+	'http://localhost:9173/jobs',
+	{:job => {
+		'action' => 'graphics_magick',
+		'inputs' => [
+			'http://www.sci-fi-o-rama.com/wp-content/uploads/2008/10/dan_mcpharlin_the_land_of_sleeping_things.jpg',
+			'http://www.sci-fi-o-rama.com/wp-content/uploads/2009/07/dan_mcpharlin_wired_spread01.jpg',
+			'http://www.sci-fi-o-rama.com/wp-content/uploads/2009/07/dan_mcpharlin_wired_spread03.jpg',
+			'http://www.sci-fi-o-rama.com/wp-content/uploads/2009/07/dan_mcpharlin_wired_spread02.jpg',
+			'http://www.sci-fi-o-rama.com/wp-content/uploads/2009/02/dan_mcpharlin_untitled.jpg'
+		],
+		'options' => {
+			'steps' => [{
+				'name' 			=> 'annotated',
+				'command' 	=> 'convert',
+				'options'		=> '-font helvetica -fill red -draw "font-size 35; text 75,75 CloudCrowd!"',
+				'extension' => 'jpg'
+			},{
+				'name'			=> 'blurred',
+				'command' 	=> 'convert',
+				'options'		=> '-blur 10x5',
+				'extension' => 'png'
+			},{
+				'name' 			=> 'bw',
+				'input'			=> 'blurred',
+				'command' 	=> 'convert',
+				'options' 	=> '-monochrome',
+				'extension' => 'jpg'
+			}]
+		}
+	}.to_json}
+)
+# status = RestClient.get('http://localhost:9173/jobs/[job_id]')
+# puts JSON.parse(RestClient.get('http://localhost:9173/jobs/[job_id]'))['outputs'].values.map {|v|
+#		JSON.parse(v).map {|v| v['url']}
+#	}.flatten.join("\n")

data/examples/process_pdfs_example.rb ADDED

@@ -0,0 +1,30 @@
+RestClient.post(
+	'http://localhost:9173/jobs',
+	{:job => {
+		'action' => 'process_pdfs',
+		'inputs' => [
+		  'http://tigger.uic.edu/~victor/personal/futurism.pdf',
+		  'http://www.jonasmekas.com/Catalog_excerpt/The%20Avant-Garde%20From%20Futurism%20to%20Fluxus.pdf',
+		  'http://www.dzignism.com/articles/Futurist.Manifesto.pdf'
+		],
+		'options' => {
+		  'batch_size' => 7,
+		  'images' => [{
+				'name' 			=> '700',
+				'options'		=> '-resize 700x -density 220 -depth 4 -unsharp 0.5x0.5+0.5+0.03',
+				'extension' => 'gif'
+			},{
+				'name' 			=> '1000',
+				'options'		=> '-resize 1000x -density 220 -depth 4 -unsharp 0.5x0.5+0.5+0.03',
+				'extension' => 'gif'
+			}]
+		}
+	}.to_json}
+)
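To see how the `images` options in this job turn into shell commands, here is a rough sketch modeled on `generate_images_commands` from `process_pdfs.rb`. The helper name `image_commands` and the page file name are assumptions; the option strings are copied from the example job.

```ruby
# Image settings copied from the example job's 'options' hash.
images = [
  { 'name'      => '700',
    'options'   => '-resize 700x -density 220 -depth 4 -unsharp 0.5x0.5+0.5+0.03',
    'extension' => 'gif' },
  { 'name'      => '1000',
    'options'   => '-resize 1000x -density 220 -depth 4 -unsharp 0.5x0.5+0.5+0.03',
    'extension' => 'gif' }
]

# Hypothetical helper: one `gm convert` command per page per image size.
def image_commands(pdfs, images)
  pdfs.flat_map do |pdf|
    name = File.basename(pdf, File.extname(pdf))
    images.map do |i|
      "gm convert #{i['options']} #{pdf} #{name}_#{i['name']}.#{i['extension']}"
    end
  end
end

# 'futurism_00001.pdf' is an assumed single-page file name.
commands = image_commands(['futurism_00001.pdf'], images)
puts commands
```

Each page yields one output file per entry in `images`, with the entry's `name` appended to the page's base name. That suffix is what the merge step later uses to sort files into per-size directories.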

data/lib/cloud-crowd.rb CHANGED

@@ -15,6 +15,7 @@ gem 'sinatra'
 autoload :ActiveRecord, 'activerecord'
 autoload :Benchmark,    'benchmark'
 autoload :Daemons,      'daemons'
+autoload :Digest,       'digest'
 autoload :ERB,          'erb'
 autoload :FileUtils,    'fileutils'
 autoload :JSON,         'json'
@@ -39,7 +40,7 @@ module CloudCrowd
   ROOT        = File.expand_path(File.dirname(__FILE__) + '/..')
   # Keep the version in sync with the gemspec.
-  VERSION     = '0.0.4'
+  VERSION     = '0.0.5'
   # A Job is processing if its WorkUnits are in the queue to be handled by workers.
   PROCESSING  = 1
@@ -74,21 +75,22 @@ module CloudCrowd
   class << self
     attr_reader :config
-    # Configure CloudCrowd by passing in the path to +config.yml+.
+    # Configure CloudCrowd by passing in the path to <tt>config.yml</tt>.
     def configure(config_path)
       @config_path = File.expand_path(File.dirname(config_path))
       @config = YAML.load_file(config_path)
     end
     # Configure the CloudCrowd central database (and connect to it), by passing
-    # in a path to +database.yml+.
+    # in a path to <tt>database.yml</tt>. The file should use the standard
+    # ActiveRecord connection format.
     def configure_database(config_path)
       configuration = YAML.load_file(config_path)
       ActiveRecord::Base.establish_connection(configuration)
     end
-    # Keep an authenticated (if configured to enable authentication) resource
-    # for the central server.
+    # Get a reference to the central server, including authentication,
+    # if configured.
     def central_server
       return @central_server if @central_server
       params = [CloudCrowd.config[:central_server]]
@@ -96,26 +98,29 @@ module CloudCrowd
       @central_server = RestClient::Resource.new(*params)
     end
-    # Return the readable status name of an internal CloudCrowd status number.
+    # Return the displayable status name of an internal CloudCrowd status number.
+    # (See the above constants).
     def display_status(status)
       DISPLAY_STATUS_MAP[status]
     end
-    # Some workers might not ever need to load all the installed actions,
-    # so we lazy-load them. Think about a variant of this for installing and
-    # loading actions into a running CloudCrowd cluster on the fly.
-    def actions(name)
-      action_class = Inflector.camelize(name)
-      begin
-        raise NameError, "can't find the #{action_class} Action" unless Module.constants.include?(action_class)
-        Module.const_get(action_class)
-      rescue NameError => e
-        user_action     = "#{@config_path}/actions/#{name}"
-        default_action  = "#{ROOT}/actions/#{name}"
-        require user_action and retry    if File.exists? "#{user_action}.rb"
-        require default_action and retry if File.exists? "#{default_action}.rb"
-        raise e
+    # CloudCrowd::Actions are requested dynamically by name. Access them through
+    # this actions property, which behaves like a hash. At load time, we
+    # load all installed Actions and CloudCrowd's default Actions into it.
+    # If you wish to have certain workers be specialized to only handle certain
+    # Actions, then install only those into the actions directory.
+    def actions
+      return @actions if @actions
+      @actions = {}
+      default_actions = Dir["#{ROOT}/actions/*.rb"]
+      custom_path     = CloudCrowd.config[:actions_path] || "#{@config_path}/actions"
+      custom_actions  = Dir["#{custom_path}/*.rb"]
+      (default_actions + custom_actions).each do |path|
+        name = File.basename(path, File.extname(path))
+        require path
+        @actions[name] = Module.const_get(Inflector.camelize(name))
       end
+      @actions
     end
   end
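The name-to-action lookup that `actions` builds can be sketched without requiring any real action files. The version below maps names to paths only, using temporary directories. The helpers `custom_actions_dir` and `action_paths` are hypothetical names, not part of CloudCrowd. One point worth noting when choosing between two directories: an empty array is truthy in Ruby, so the fallback has to be decided on the path, not on the result of the glob.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: use :actions_path when configured, otherwise fall back
# to the 'actions' subdirectory of the configuration folder.
def custom_actions_dir(config, config_dir)
  config[:actions_path] || File.join(config_dir, 'actions')
end

# Hypothetical helper: map action names (file basenames) to their paths.
# Defaults come first, so a custom file with the same name replaces them.
def action_paths(default_dir, custom_dir)
  paths = Dir[File.join(default_dir, '*.rb')] + Dir[File.join(custom_dir, '*.rb')]
  paths.each_with_object({}) do |path, found|
    found[File.basename(path, File.extname(path))] = path
  end
end

registry = nil
Dir.mktmpdir do |root|
  defaults   = File.join(root, 'gem', 'actions')
  config_dir = File.join(root, 'config')
  FileUtils.mkdir_p([defaults, File.join(config_dir, 'actions')])
  FileUtils.touch(File.join(defaults, 'graphics_magick.rb'))
  FileUtils.touch(File.join(config_dir, 'actions', 'word_count.rb'))

  # No :actions_path configured, so the config folder's 'actions' is used.
  custom   = custom_actions_dir({}, config_dir)
  registry = action_paths(defaults, custom)
  puts registry.keys.sort.inspect
end
```

The real method goes one step further and requires each file, then stores the class found via `Module.const_get(Inflector.camelize(name))` instead of the path.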