mapredus 0.0.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/LICENSE +20 -0
- data/README.md +227 -0
- data/lib/mapredus/filesystem.rb +43 -0
- data/lib/mapredus/finalizer.rb +33 -0
- data/lib/mapredus/inputter.rb +31 -0
- data/lib/mapredus/keys.rb +86 -0
- data/lib/mapredus/mapper.rb +27 -0
- data/lib/mapredus/master.rb +182 -0
- data/lib/mapredus/outputter.rb +42 -0
- data/lib/mapredus/process.rb +366 -0
- data/lib/mapredus/reducer.rb +39 -0
- data/lib/mapredus/support.rb +56 -0
- data/lib/mapredus.rb +106 -0
- data/spec/helper.rb +47 -0
- data/spec/helper_classes.rb +102 -0
- data/spec/mapredus_spec.rb +295 -0
- metadata +144 -0
data/LICENSE
ADDED
@@ -0,0 +1,20 @@
Copyright (c) 2010 Dolores Labs

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,227 @@
MapRedus
=========

Simple MapReduce-type framework using redis and resque.

Overview
--------

This is an experimental implementation of MapReduce using Ruby for
process definition, Resque for work execution, and Redis for data
storage.

Goals:

* simple M/R-style programming for existing Ruby projects
* low cost of entry (no need for a dedicated cluster)

If you are looking for a high-performance MapReduce implementation
that can meet your big data needs, try Hadoop.


Using MapRedus
---------------

MapRedus uses Resque to handle the processes that it runs, and redis
as the store for the values/data produced.

Workers for a MapRedus process are Resque workers. Refer to the
Resque worker documentation to see how to load the necessary
environment for your worker to be able to run mapreduce processes; a
sketch follows, and an example is also located in the tests.
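With Resque's standard rake tasks, the worker environment can be loaded
in a `resque:setup` hook. This is only a sketch: `resque/tasks` and the
`resque:setup` task come from Resque itself, while the required file
name is an assumption about your project layout.

    # Rakefile (illustrative sketch)
    require 'resque/tasks'

    task "resque:setup" do
      require 'mapredus'
      require './app'  # assumed file that defines your MapRedus process classes
    end

The worker can then be started with `QUEUE=mapredus rake resque:work`
(`:mapredus` being the default MapRedus queue).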
### Attaching a mapreduce process to a class

Often you'll want to define a mapreduce process that operates on data
within a class. Here is how that looks (there is also an example of
this in the tests):

    class GetWordCount < MapRedus::Process
      def self.specification
        {
          :inputter => WordStream,
          :mapper => WordCounter,
          :reducer => Adder,
          :finalizer => ToRedisHash,
          :outputter => MapRedus::RedisHasher,
          :ordered => false
        }
      end
    end

    class Job
      mapreduce_process :word_count, GetWordCount, "job:store:result"
    end

mapreduce_process takes a name for the operation, a process class
whose specification names the inputter, mapper, reducer, finalizer,
and outputter, and a key under which to store the result. The
operation is then run on a job by calling the following:

    job = Job.new
    job.mapreduce.word_count( data )

The data argument specifies the data on which this operation is to
run. We are currently working on a way to allow the result_store_key
to change depending on class properties. For instance, in the above
example, if the Job class had an id attribute, we may want to store
the final mapreduce result in "job:store:result:#{id}".
### Inputters, Mappers, Reducers, Finalizers

MapRedus needs an input stream, mapper, reducer, and finalizer to be
defined in order to run. The input stream defines how a block of your
data gets divided so that a mapper can work on a small portion to
map. For example:

    class InputStream < MapRedus::InputStream
      def self.scan(data_object)
        # your data object is a reference to a block of text in redis
        text_block = MapRedus.redis.get(data_object)
        text_block.each_line.each_with_index do |line, i|
          yield(i, line)
        end
      end
    end

    class Mapper < MapRedus::Mapper
      def self.map(data_to_map)
        data_to_map.each do |data|
          key = data
          value = 1
          yield( key, value )
        end
      end
    end

In this example, the input stream calls yield to output a mapredus
file number and the value that is saved to file (in redis). The
mapper's map function calls yield to emit a key-value pair for
storage in redis. The reducer's reduce function acts similarly.
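As a minimal sketch, here is an Adder reducer for the word-count
example; it assumes (per the Notes section below) that reduce receives
the list of values collected for one key and yields the reduced value:

    class Adder < MapRedus::Reducer
      def self.reduce(value_list)
        # sum the mapped counts for this key
        yield( value_list.inject(0) { |sum, v| sum + v.to_i } )
      end
    end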
The finalizer runs whatever needs to be run when a process completes.
An example:

    class Finalizer < MapRedus::Finalizer
      def self.finalize(process)
        process.each_key_reduced_value do |key, value|
          process.outputter.encode(process.keyname, key, value)
        end
        ...
        < set off a new mapredus process to use this stored data >
      end
    end

process.keyname refers to the final result key that is stored in
redis. The outputter defines how exactly that encoding is done. We
provide an outputter that encodes your data into a redis hash:

    class RedisHasher < MapRedus::Outputter
      def self.encode(result_key, k, v)
        MapRedus::FileSystem.hset(result_key, k, v)
      end

      def self.decode(result_key, k)
        MapRedus::FileSystem.hget(result_key, k)
      end
    end

The default Outputter makes no changes to the original result, and
tries to store it directly in redis as a string.
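As an illustration, after the word-count process above completes, a
single count can be read back through the outputter's decode (the
result key is the one given to mapreduce_process earlier; the word is
hypothetical):

    MapRedus::RedisHasher.decode("job:store:result", "apple")
    # => the stored count for "apple" (a string, since redis stores strings)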
Running Tests
-------------

Run the tests, which exercise the word-counter example and some other
cases (you'll need to have bundler installed):

    rake

Requirements
------------

* Bundler (this will install all the requirements below)
* Redis
* RedisSupport
* Resque
* Resque-scheduler

### Notes
Instead of calling "emit_intermediate"/"emit" in your map/reduce to
produce a key-value pair/value, you call yield, which will call
emit_intermediate/emit for you. This gives flexibility in using
Mapper/Reducer classes, especially in testing.

TODO
----
not necessarily in the given order

* if a process fails, do what we are supposed to do, i.e. add a
  failure_hook which does something if your process fails

* include functionality for a partitioner, input reader, combiner

* implement registering of the environment in resque so that we can
  run mapreduce commands from the command line, defining any
  arbitrary mapper and reducer

* implement redundant workers (workers doing the same work in case
  one of them fails)

* if a reducer hits a recoverable failure, make sure that the attempt
  to re-enslave the worker is delayed by some fixed interval

* edit emit for when we have multiple workers doing the same reduce
  (redundant workers for fault tolerance might need to change the
  rpush to a lock and the setting of just a value); even if other
  workers do work on the same answer, we want to make sure that the
  final reduced thing is the same every time

* add fault tolerance, better tracking of which workers fail,
  especially when we have multiple workers doing the same work
  ... currently this is handled by Resque failure auto retry

* if a perform operation fails then we need to have the worker recover

* make use of finish_metrics somewhere so that we can have statistics
  on how long map reduce processes take

* better tracking of work being assigned, so we can know when a
  process is finished or in progress and have a trigger to do things
  when it finishes

  in resque there is functionality for an after hook which performs
  something after your process does its work

  might also check out the resque-status plugin for a cheap and easy
  way to plug status and completion-rate into existing resque jobs

* ensure reducers only do a fixed amount of work? See section 3.2 of
  the MapReduce paper: bookkeeping that tells the master when tasks
  are in progress or completed. This will be important for better
  parallelization of tasks

* think about the following logic:

  if a reducer starts working on a key after all maps have finished,
  then when it is done the work on that key is finished forever

  this would imply a process finishes when all map tasks have
  finished and all reduce tasks that started after the map tasks
  have finished

  if a reducer started before all map tasks were finished, then load
  its reduced result back onto the value list

  if the reducer started after all map tasks finished, then emit the
  result

Note on Patches/Pull Requests
-----------------------------

* Fork the project.
* Make your feature addition or bug fix.
* Add tests for it. This is important so I don't break it in a
  future version unintentionally.
* Commit, but do not mess with the rakefile, version, or history (if
  you want to have your own version, that is fine, but bump the
  version in a commit by itself that I can ignore when I pull).
* Send me a pull request. Bonus points for topic branches.

## Copyright
Copyright (c) 2010 Dolores Labs. See LICENSE for details.
data/lib/mapredus/filesystem.rb
ADDED
@@ -0,0 +1,43 @@
module MapRedus
  # Manages the bookkeeping of redis keys and redis usage.
  # Provides the data storage for process information through redis.
  # All interaction with redis should go through this class.
  #
  class FileSystem
    def self.storage
      MapRedus.redis
    end

    # Save/Read functions to save/read values for a redis key
    #
    # Examples
    #   FileSystem.save( key, value )
    def self.save(key, value, time = nil)
      storage.set(key, value)
      storage.expire(key, time) if time
    end

    def self.method_missing(method, *args, &block)
      storage.send(method, *args)
    end

    # Setup locks on results using RedisSupport lock functionality
    #
    # Examples
    #   FileSystem::has_lock?(keyname)
    #   # => true or false
    #
    # Returns true if there's a lock
    def self.has_lock?(keyname)
      MapRedus.has_redis_lock?( RedisKey.result_cache(keyname) )
    end

    def self.acquire_lock(keyname)
      MapRedus.acquire_redis_lock_nonblock( RedisKey.result_cache(keyname), 60 * 60 )
    end

    def self.release_lock(keyname)
      MapRedus.release_redis_lock( RedisKey.result_cache(keyname) )
    end
  end
end
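A brief illustration of FileSystem as defined above (the key and value
here are hypothetical): save writes a value with an optional expiry,
and any other redis command is forwarded to the underlying connection
through method_missing:

    MapRedus::FileSystem.save("mapredus:example", "42", 60)  # SET followed by EXPIRE 60s
    MapRedus::FileSystem.get("mapredus:example")             # forwarded to redis, => "42"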
data/lib/mapredus/finalizer.rb
ADDED
@@ -0,0 +1,33 @@
module MapRedus
  # Run the stuff you want to run at the end of the process.
  # Define a subclass which defines self.finalize and self.serialize
  # to do what is needed when you want to get the final output
  # out of redis and into ruby.
  #
  # This is basically the message back to the user program that a
  # process is completed storing the necessary info.
  #
  class Finalizer < QueueProcess

    # The default finalizer is to notify of process completion
    #
    # Example
    #   Finalizer::finalize(pid)
    #   # => "MapRedus Process : 111 : has completed"
    #
    # Returns a message notification
    def self.finalize(pid)
      "MapRedus Process : #{pid} : has completed"
    end

    def self.perform(pid)
      process = Process.open(pid)
      result = finalize(process)
      Master.finish_metrics(pid)
      result
    ensure
      Master.free_slave(pid)
      process.next_state
    end
  end
end
data/lib/mapredus/inputter.rb
ADDED
@@ -0,0 +1,31 @@
module MapRedus
  class InputStream < QueueProcess
    #
    # An InputStream needs to implement a way to scan through the
    # data_object (the object data that is sent to the MapRedus
    # process). The scan function implements how the data object is
    # broken into sizable pieces for the mappers to operate on.
    #
    # It does this by yielding a <key, map_data> pair. The key
    # specifies the storage location in redis. map_data is string
    # data that will be written to redis.
    #
    # Example
    #   scan(data_object) do |key, map_data|
    #     ...
    #   end
    def self.scan(data_object)
      raise InvalidInputStream
    end

    def self.perform(pid, data_object)
      process = Process.open(pid)
      scan(data_object) do |key, map_data|
        FileSystem.hset(ProcessInfo.input(pid), key, map_data)
        Master.enslave_map(process, key)
      end
    ensure
      Master.free_slave(pid)
    end
  end
end
data/lib/mapredus/keys.rb
ADDED
@@ -0,0 +1,86 @@
module MapRedus
  RedisKey = MapRedus::Keys
  ProcessInfo = RedisKey

  #### USED WITHIN process.rb ####

  # Holds the current map reduce processes that are either running or which still have data lying around
  #
  redis_key :processes, "mapredus:processes"
  redis_key :processes_count, "mapredus:processes:count"

  # Holds the information (mapper, reducer, etc.) in json format for a map reduce process with pid PID
  #
  redis_key :pid, "mapredus:process:PID"

  # The input blocks broken down by the InputStream
  redis_key :input, "mapredus:process:PID:input"

  # All the keys that the map produced
  #
  redis_key :keys, "mapredus:process:PID:keys"

  # The hashed key to actual string value of key
  #
  redis_key :hash_to_key, "mapredus:process:PID:keys:HASHED_KEY" # to ACTUAL KEY

  # The list of values for a given key generated by our map function.
  # When a reduce is run it takes elements from this key and pushes them to :reduce
  #
  # key - list of values
  #
  redis_key :map, "mapredus:process:PID:map_key:HASHED_KEY"
  redis_key :reduce, "mapredus:process:PID:map_key:HASHED_KEY:reduce"

  # Temporary redis space for reduce functions to use
  #
  redis_key :temp, "mapredus:process:PID:temp_reduce_key:HASHED_KEY:UNIQUE_REDUCE_HOSTNAME:UNIQUE_REDUCE_PROCESS_ID"

  # If we want to hold on to our final data we have a key to put that data in
  # In normal map reduce we would just be outputting files
  #
  redis_key :result, "mapredus:process:PID:result"
  redis_key :result_cache, "mapredus:result:KEYNAME"


  #### USED WITHIN master.rb ####

  # Keeps track of the current slaves (by appending "1" to a redis list)
  #
  # TODO: should append some sort of proper process id so we can explicitly keep track
  # of processes
  #
  redis_key :slaves, "mapredus:process:PID:master:slaves"

  #
  # Use these constants to keep track of the progress of a process
  #
  # Example
  #   state => map_in_progress
  #            reduce_in_progress
  #            finalize_in_progress
  #            complete
  #            failed
  #            not_started
  #
  # contained in the ProcessInfo hash (redis_key :state, "mapredus:process:PID:master:state")
  #
  NOT_STARTED = "not_started"
  INPUT_MAP_IN_PROGRESS = "mappers"
  REDUCE_IN_PROGRESS = "reducers"
  FINALIZER_IN_PROGRESS = "finalizer"
  COMPLETE = "complete"
  FAILED = "failed"
  STATE_MACHINE = { nil => NOT_STARTED,
                    NOT_STARTED => INPUT_MAP_IN_PROGRESS,
                    INPUT_MAP_IN_PROGRESS => REDUCE_IN_PROGRESS,
                    REDUCE_IN_PROGRESS => FINALIZER_IN_PROGRESS,
                    FINALIZER_IN_PROGRESS => COMPLETE }

  # These keep track of timing information for a map reduce process of pid PID
  #
  redis_key :requested_at, "mapredus:process:PID:request_at"
  redis_key :started_at, "mapredus:process:PID:started_at"
  redis_key :finished_at, "mapredus:process:PID:finished_at"
  redis_key :recent_time_to_complete, "mapredus:process:recent_time_to_complete"
end
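For illustration, redis_key comes from RedisSupport and, judging from
how the generated helpers are called elsewhere in this gem, substitutes
the uppercase placeholders with the arguments passed in (the pid below
is hypothetical):

    MapRedus::ProcessInfo.pid(7)    # => "mapredus:process:7" (assumed substitution)
    MapRedus::ProcessInfo.input(7)  # => "mapredus:process:7:input"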
data/lib/mapredus/mapper.rb
ADDED
@@ -0,0 +1,27 @@
module MapRedus
  # Map is a function that takes a data chunk
  # where each data chunk is a list of pieces of your raw data
  # and emits a list of key, value pairs.
  #
  # The output of the map shall always be
  #   [ [key, value], [key, value], ... ]
  #
  # Note: Values must be strings, integers, booleans, or floats.
  # i.e., They must be primitive types since these are the only
  # types that redis supports and since anything inputted into
  # redis becomes a string.
  class Mapper < QueueProcess
    def self.map(data_chunk); raise InvalidMapper; end

    def self.perform(pid, data_key)
      process = Process.open(pid)
      data_chunk = FileSystem.hget(ProcessInfo.input(pid), data_key)
      map( data_chunk ) do |*key_value|
        process.emit_intermediate(*key_value)
      end
    ensure
      Master.free_slave(pid)
      process.next_state
    end
  end
end
data/lib/mapredus/master.rb
ADDED
@@ -0,0 +1,182 @@
module MapRedus
  # Note: Instead of using Resque directly within the process, we implement
  # a master interface with Resque
  #
  # Does bookkeeping to keep track of how many slaves are doing work. If we have
  # no slaves doing work for a process then the process is done. While there is work available
  # the slaves will always be doing work.
  #
  class Master < QueueProcess
    # Check whether there are still workers working on process PID's processes
    #
    # In the synchronous condition, master is always working since nothing is going to
    # the queue.
    def self.working?(pid)
      0 < FileSystem.llen(ProcessInfo.slaves(pid))
    end

    #
    # Master performs the work that it needs to do:
    #   it must free itself as a slave from Resque
    #   enslave mappers
    #
    def self.perform( pid, data_object )
      process = Process.open(pid)
      enslave_inputter(process, data_object)
      process.update(:state => INPUT_MAP_IN_PROGRESS)
    end

    #
    # The order of operations that occur in the mapreduce process
    #
    # The inputter sets off the mapper processes
    #
    def self.mapreduce( process, data_object )
      start_metrics(process.pid)
      if process.synchronous
        process.update(:state => INPUT_MAP_IN_PROGRESS)
        enslave_inputter(process, data_object)
        process.update(:state => REDUCE_IN_PROGRESS)
        enslave_reducers(process)
        process.update(:state => FINALIZER_IN_PROGRESS)
        enslave_finalizer(process)
      else
        Resque.push(QueueProcess.queue, {:class => MapRedus::Master, :args => [process.pid, data_object]} )
      end
    end

    def self.enslave_inputter(process, data_object)
      enslave( process, process.inputter, process.pid, data_object )
    end

    def self.enslave_reducers( process )
      process.map_keys.each do |key|
        enslave_reduce( process, key )
      end
    end

    def self.enslave_finalizer( process )
      enslave( process, process.finalizer, process.pid )
    end

    # Have these to match what the Mapper/Reducer perform function expects to see as arguments
    #
    # though instead of process the perform function will receive the pid
    def self.enslave_map(process, data_chunk)
      enslave( process, process.mapper, process.pid, data_chunk )
    end

    def self.enslave_reduce(process, key)
      enslave( process, process.reducer, process.pid, key )
    end

    def self.enslave_later_reduce(process, key)
      enslave_later( process.reducer.wait, process, process.reducer, process.pid, key )
    end

    # The current default (QUEUE) that we push on to is
    #   :mapredus
    #
    def self.enslave( process, klass, *args )
      FileSystem.rpush(ProcessInfo.slaves(process.pid), 1)

      if( process.synchronous )
        klass.perform(*args)
      else
        Resque.push( klass.queue, { :class => klass.to_s, :args => args } )
      end
    end

    def self.enslave_later( delay_in_seconds, process, klass, *args)
      FileSystem.rpush(ProcessInfo.slaves(process.pid), 1)

      if( process.synchronous )
        klass.perform(*args)
      else
        #
        # TODO: I cannot get enqueue_in to work with my tests
        # there seems to be a silent failure somewhere
        # in the tests such that it never calls the function
        # and the queue gets emptied
        #
        # Resque.enqueue_in(delay_in_seconds, klass, *args)

        ##
        ## Temporary, immediately just push process back onto the resque queue
        Resque.push( klass.queue, { :class => klass.to_s, :args => args } )
      end
    end

    def self.slaves(pid)
      FileSystem.lrange(ProcessInfo.slaves(pid), 0, -1)
    end

    def self.free_slave(pid)
      FileSystem.lpop(ProcessInfo.slaves(pid))
    end

    def self.emancipate(pid)
      process = Process.open(pid)
      return unless process

      # Working on resque directly seems dangerous
      #
      # Warning: this is supposed to be used as a debugging operation
      # and isn't intended for normal use. It is potentially very expensive.
      #
      destroyed = 0
      qs = [queue, process.mapper.queue, process.reducer.queue, process.finalizer.queue].uniq
      qs.each do |q|
        q_key = "queue:#{q}"
        Resque.redis.lrange(q_key, 0, -1).each do |string|
          json = Helper.decode(string)
          match = json['class'] == "MapRedus::Master"
          match |= json['class'] == process.inputter.to_s
          match |= json['class'] == process.mapper.to_s
          match |= json['class'] == process.reducer.to_s
          match |= json['class'] == process.finalizer.to_s
          match &= json['args'].first.to_s == process.pid.to_s
          if match
            destroyed += Resque.redis.lrem(q_key, 0, string).to_i
          end
        end
      end

      #
      # our slave information is kept track of on file and not in Resque
      #
      FileSystem.del(ProcessInfo.slaves(pid))
      destroyed
    end

    # Time metrics for measuring how long it takes map reduce to do a process
    #
    def self.set_request_time(pid)
      FileSystem.set( ProcessInfo.requested_at(pid), Time.now.to_i )
    end

    def self.start_metrics(pid)
      started = ProcessInfo.started_at( pid )
      FileSystem.set started, Time.now.to_i
    end

    def self.finish_metrics(pid)
      started = ProcessInfo.started_at( pid )
      finished = ProcessInfo.finished_at( pid )
      requested = ProcessInfo.requested_at( pid )

      completion_time = Time.now.to_i

      FileSystem.set finished, completion_time
      time_to_complete = completion_time - FileSystem.get(started).to_i

      recent_ttcs = ProcessInfo.recent_time_to_complete
      FileSystem.lpush( recent_ttcs, time_to_complete )
      FileSystem.ltrim( recent_ttcs, 0, 30 - 1 )

      FileSystem.expire finished, 60 * 60
      FileSystem.expire started, 60 * 60
      FileSystem.expire requested, 60 * 60
    end
  end
end
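As the warning inside emancipate says, it is a debugging operation; a
hypothetical session for clearing out a stuck process might look like:

    pid = "111"                        # hypothetical process id
    MapRedus::Master.working?(pid)     # => true while slave entries remain
    MapRedus::Master.emancipate(pid)   # => number of queued jobs removed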
data/lib/mapredus/outputter.rb
ADDED
@@ -0,0 +1,42 @@
module MapRedus
  #
  # Standard readers for the input and output of Files coming out
  # of the FileSystem.
  #
  class Outputter < QueueProcess
    def self.decode(result_key)
      FileSystem.get(result_key)
    end

    def self.encode(result_key, o)
      FileSystem.set(result_key, o)
    end

    #
    # type should either be "decode" or "encode"
    #
    def self.perform(type, o)
      send(type, o)
    end
  end

  class JsonOutputter < Outputter
    def self.decode(result_key)
      Helper.decode(FileSystem.get(result_key))
    end

    def self.encode(result_key, o)
      FileSystem.set(result_key, Helper.encode(o))
    end
  end

  class RedisHasher < Outputter
    def self.encode(result_key, k, v)
      FileSystem.hset(result_key, k, v)
    end

    def self.decode(result_key, k)
      FileSystem.hget(result_key, k)
    end
  end
end
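For illustration, a round trip through JsonOutputter, assuming
Helper.encode/Helper.decode wrap JSON serialization (their use on
Resque payloads in master.rb suggests this, but Helper is not shown in
this diff); the key is hypothetical:

    MapRedus::JsonOutputter.encode("mapredus:example:result", { "apple" => 3 })
    MapRedus::JsonOutputter.decode("mapredus:example:result")  # => {"apple"=>3}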