finite_mdp 0.0.1

data/README.rdoc ADDED
@@ -0,0 +1,229 @@
= finite_mdp

* https://github.com/jdleesmiller/finite_mdp

== SYNOPSIS

Solve small, finite Markov Decision Process (MDP) models.

This library provides several ways of describing an MDP model (see
{FiniteMDP::Model}) and some reasonably efficient implementations of policy
iteration and value iteration to solve it (see {FiniteMDP::Solver}).

=== Usage

==== Example 1: Recycling Robot

The following shows how to solve the recycling robot model (example 3.7) from
<cite>Sutton and Barto (1998). Reinforcement Learning: An Introduction</cite>.

<blockquote>
At each time step, the robot decides whether it should (1) actively search for a
can, (2) remain stationary and wait for someone to bring it a can, or (3) go
back to home base to recharge its battery. The best way to find cans is to
actively search for them, but this runs down the robot's battery, whereas
waiting does not. Whenever the robot is searching, the possibility exists that
its battery will become depleted. In this case the robot must shut down and wait
to be rescued (producing a low reward). The agent makes its decisions solely as
a function of the energy level of the battery. It can distinguish two levels,
high and low.
</blockquote>

The transition model is described in Table 3.1, which can be fed directly into
FiniteMDP using the {FiniteMDP::TableModel}, as follows.

  require 'finite_mdp'

  alpha    =  0.1 # Pr(stay at high charge if searching | now have high charge)
  beta     =  0.1 # Pr(stay at low charge if searching | now have low charge)
  r_search =  2   # reward for searching
  r_wait   =  1   # reward for waiting
  r_rescue = -3   # reward (actually penalty) for running out of charge

  model = FiniteMDP::TableModel.new [
    [:high, :search,   :high, alpha,   r_search],
    [:high, :search,   :low,  1-alpha, r_search],
    [:low,  :search,   :high, 1-beta,  r_rescue],
    [:low,  :search,   :low,  beta,    r_search],
    [:high, :wait,     :high, 1,       r_wait],
    [:high, :wait,     :low,  0,       r_wait],
    [:low,  :wait,     :high, 0,       r_wait],
    [:low,  :wait,     :low,  1,       r_wait],
    [:low,  :recharge, :high, 1,       0],
    [:low,  :recharge, :low,  0,       0]]

  solver = FiniteMDP::Solver.new(model, 0.95) # discount factor 0.95
  solver.policy_iteration 1e-4
  solver.policy #=> {:high=>:search, :low=>:recharge}

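The same model can also be written as a {FiniteMDP::HashModel}, which stores
the probabilities and rewards in nested hashes keyed on state, action and
successor state. The following is only a sketch, reusing the constants defined
above; zero-probability transitions are simply omitted, like the sparse
representation produced by {FiniteMDP::HashModel.from_model}.

  hash_model = FiniteMDP::HashModel.new({
    :high => {
      :search   => {:high => [alpha,  r_search], :low => [1-alpha, r_search]},
      :wait     => {:high => [1,      r_wait]}},
    :low  => {
      :search   => {:high => [1-beta, r_rescue], :low => [beta,    r_search]},
      :wait     => {:low  => [1,      r_wait]},
      :recharge => {:high => [1,      0]}}})

  solver = FiniteMDP::Solver.new(hash_model, 0.95)
  solver.policy_iteration 1e-4
  solver.policy #=> {:high=>:search, :low=>:recharge}
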
==== Example 2: Grid Worlds

A more complicated example: the grid world from
<cite>Russell and Norvig (2003). Artificial Intelligence: A Modern
Approach</cite>, Chapter 17.

Here we describe the model as a class that implements the {FiniteMDP::Model}
interface. The model contains terminal states, which we represent with a special
absorbing state with zero reward, called :stop.

  require 'finite_mdp'

  class AIMAGridModel
    include FiniteMDP::Model

    #
    # @param [Array<Array<Float, nil>>] grid rewards at each point, or nil if a
    #   grid square is an obstacle
    #
    # @param [Array<[i, j]>] terminii coordinates of the terminal states
    #
    def initialize grid, terminii
      @grid, @terminii = grid, terminii
    end

    attr_reader :grid, :terminii

    # every position on the grid is a state, except for obstacles, which are
    # indicated by a nil in the grid
    def states
      is, js = (0...grid.size).to_a, (0...grid.first.size).to_a
      is.product(js).select {|i, j| grid[i][j]} + [:stop]
    end

    # can move north, east, south or west on the grid
    MOVES = {
      '^' => [-1,  0],
      '>' => [ 0,  1],
      'v' => [ 1,  0],
      '<' => [ 0, -1]}

    # agent can move north, south, east or west (unless it's in the :stop
    # state); if it tries to move off the grid or into an obstacle, it stays
    # where it is
    def actions state
      if state == :stop || terminii.member?(state)
        [:stop]
      else
        MOVES.keys
      end
    end

    # define the transition model
    def transition_probability state, action, next_state
      if state == :stop || terminii.member?(state)
        (action == :stop && next_state == :stop) ? 1 : 0
      else
        # agent usually succeeds in moving forward, but sometimes it ends up
        # moving left or right
        move = case action
               when '^' then [['^', 0.8], ['<', 0.1], ['>', 0.1]]
               when '>' then [['>', 0.8], ['^', 0.1], ['v', 0.1]]
               when 'v' then [['v', 0.8], ['<', 0.1], ['>', 0.1]]
               when '<' then [['<', 0.8], ['^', 0.1], ['v', 0.1]]
               end
        move.map {|m, pr|
          m_state = [state[0] + MOVES[m][0], state[1] + MOVES[m][1]]
          m_state = state unless states.member?(m_state) # stay in bounds
          pr if m_state == next_state
        }.compact.inject(:+) || 0
      end
    end

    # reward is given by the grid cells; zero reward for the :stop state
    def reward state, action, next_state
      state == :stop ? 0 : grid[state[0]][state[1]]
    end

    # helper for functions below
    def hash_to_grid hash
      0.upto(grid.size-1).map{|i| 0.upto(grid[i].size-1).map{|j| hash[[i,j]]}}
    end

    # print the values in a grid
    def pretty_value value
      hash_to_grid(Hash[value.map {|s, v| [s, "%+.3f" % v]}]).map{|row|
        row.map{|cell| cell || ' '}.join(' ')}
    end

    # print the policy using ASCII arrows
    def pretty_policy policy
      hash_to_grid(policy).map{|row| row.map{|cell|
        (cell.nil? || cell == :stop) ? ' ' : cell}.join(' ')}
    end
  end

  # the grid from Figures 17.1, 17.2(a) and 17.3
  model = AIMAGridModel.new(
    [[-0.04, -0.04, -0.04,    +1],
     [-0.04,   nil, -0.04,    -1],
     [-0.04, -0.04, -0.04, -0.04]],
    [[0, 3], [1, 3]]) # terminals (the +1 and -1 states)

  # sanity check: probabilities in a row must sum to 1
  model.check_transition_probabilities_sum

  solver = FiniteMDP::Solver.new(model, 1) # discount factor 1
  solver.value_iteration(1e-5, 100) #=> true if converged

  puts model.pretty_policy(solver.policy)
  # output: (matches Figure 17.2(a))
  # > > >
  # ^   ^
  # ^ < < <

  puts model.pretty_value(solver.value)
  # output: (matches Figure 17.3)
  # 0.812 0.868 0.918  1.000
  # 0.762       0.660 -1.000
  # 0.705 0.655 0.611  0.388

  FiniteMDP::TableModel.from_model(model)
  #=> [[0, 0], "v", [0, 0], 0.1, -0.04]
  #   [[0, 0], "v", [0, 1], 0.1, -0.04]
  #   [[0, 0], "v", [1, 0], 0.8, -0.04]
  #   [[0, 0], "<", [0, 0], 0.9, -0.04]
  #   [[0, 0], "<", [1, 0], 0.1, -0.04]
  #   [[0, 0], ">", [0, 0], 0.1, -0.04]
  #   [[0, 0], ">", [0, 1], 0.8, -0.04]
  #   [[0, 0], ">", [1, 0], 0.1, -0.04]
  #   ...
  #   [:stop, :stop, :stop, 1, 0]

Note that Python code for this model is also available from the book's authors
at http://aima.cs.berkeley.edu/python/mdp.html

== REQUIREMENTS

Tested on
* ruby 1.8.7 (2010-06-23 patchlevel 299) [i686-linux]
* ruby 1.9.2p0 (2010-08-18 revision 29036) [i686-linux]

== INSTALLATION

  gem install finite_mdp

== LICENSE

(The MIT License)

Copyright (c) 2011 John Lees-Miller

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
'Software'), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

@@ -0,0 +1,123 @@
#
# A finite Markov decision process model for which the transition
# probabilities and rewards are specified using nested hash tables.
#
# The structure of the nested hash is as follows:
#   hash[:s]          #=> a Hash that maps actions to successor states
#   hash[:s][:a]      #=> a Hash from successor states to pairs (see next)
#   hash[:s][:a][:t]  #=> an Array [probability, reward] for transition (s,a,t)
#
# The states and actions can be arbitrary objects; see notes for {Model}.
#
# The {TableModel} is an alternative way of storing these data.
#
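# For example, a fragment of the recycling robot model from the README might
# be written as follows (only a sketch, to show how the accessors read the
# nested hash):
#
#   model = FiniteMDP::HashModel.new({
#     :high => {:search   => {:high => [0.1, 2], :low => [0.9, 2]}},
#     :low  => {:recharge => {:high => [1,   0]}}})
#
#   model.states                                        #=> [:high, :low]
#   model.actions(:high)                                #=> [:search]
#   model.next_states(:high, :search)                   #=> [:high, :low]
#   model.transition_probability(:high, :search, :low)  #=> 0.9
#   model.reward(:high, :search, :low)                  #=> 2
#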
class FiniteMDP::HashModel
  include FiniteMDP::Model

  #
  # @param [Hash<state, Hash<action, Hash<state, [Float, Float]>>>] hash see
  #   notes for {HashModel} for an explanation of this structure
  #
  def initialize hash
    @hash = hash
  end

  #
  # @return [Hash<state, Hash<action, Hash<state, [Float, Float]>>>] see notes
  #   for {HashModel} for an explanation of this structure
  #
  attr_accessor :hash

  #
  # States in this model; see {Model#states}.
  #
  # @return [Array<state>] not empty; no duplicate states
  #
  def states
    hash.keys
  end

  #
  # Actions that are valid for the given state; see {Model#actions}.
  #
  # @param [state] state
  #
  # @return [Array<action>] not empty; no duplicate actions
  #
  def actions state
    hash[state].keys
  end

  #
  # Possible successor states after taking the given action in the given state;
  # see {Model#next_states}.
  #
  # @param [state] state
  #
  # @param [action] action
  #
  # @return [Array<state>] not empty; no duplicate states
  #
  def next_states state, action
    hash[state][action].keys
  end

  #
  # Probability of the given transition; see {Model#transition_probability}.
  #
  # @param [state] state
  #
  # @param [action] action
  #
  # @param [state] next_state
  #
  # @return [Float] in [0, 1]; zero if the transition is not in the hash
  #
  def transition_probability state, action, next_state
    probability, reward = hash[state][action][next_state]
    probability || 0
  end

  #
  # Reward for a given transition; see {Model#reward}.
  #
  # @param [state] state
  #
  # @param [action] action
  #
  # @param [state] next_state
  #
  # @return [Float, nil] nil if the transition is not in the hash
  #
  def reward state, action, next_state
    probability, reward = hash[state][action][next_state]
    reward
  end

  #
  # Convert a generic model into a hash model.
  #
  # @param [Model] model
  #
  # @param [Boolean] sparse do not store entries for transitions with zero
  #   probability
  #
  # @return [HashModel] not nil
  #
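  # For example (a sketch): converting the recycling robot table model from
  # the README gives nested hashes like
  #
  #   hash_model = FiniteMDP::HashModel.from_model(model)
  #   hash_model.hash[:low][:recharge] #=> {:high=>[1, 0]}
  #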
  def self.from_model model, sparse=true
    hash = {}
    model.states.each do |state|
      hash[state] ||= {}
      model.actions(state).each do |action|
        hash[state][action] ||= {}
        model.next_states(state, action).each do |next_state|
          pr = model.transition_probability(state, action, next_state)
          hash[state][action][next_state] = [pr,
            model.reward(state, action, next_state)] if pr > 0 || !sparse
        end
      end
    end
    FiniteMDP::HashModel.new(hash)
  end
end

@@ -0,0 +1,195 @@
#
# Interface that defines a finite Markov decision process model.
#
# There are several approaches to describing the state, action, transition
# probability and reward data for use with this library.
#
# 1. Write the data directly into a {TableModel} or {HashModel}. This is usually
#    the way to go for small models, such as examples from text books.
#
# 1. Write a procedure that generates the data and stores them in a
#    {TableModel} or {HashModel}. This gives the most flexibility in how the
#    data are generated.
#
# 1. Write a class that implements the methods in this module. The methods in
#    this module are a fairly close approximation to the usual way of defining
#    an MDP mathematically, so it can be a useful way of structuring the
#    definition. It can then be converted to one of the other representations
#    (see {TableModel.from_model}) or passed directly to a {Solver}.
#
# The discussion below applies to all of these approaches.
#
# Note that there is no special treatment for terminal states, but they can be
# modeled by including a dummy state (a state with zero reward and one action
# that brings the process back to the dummy state with probability 1).
#
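# For example, in a {TableModel}, such a dummy state (called :stop, as in the
# README examples) could be given a single row like the following (a sketch;
# the columns are state, action, next state, probability and reward):
#
#   [:stop, :stop, :stop, 1, 0]
#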
# The states and actions can be arbitrary objects. The only requirement is that
# they support hashing and equality (in the sense of <tt>eql?</tt>), which all
# ruby objects do. Built-in types, such as symbols, arrays and Structs, will
# work as expected. Note, however, that the default hashing and equality
# semantics for custom classes may not be what you want. The following example
# illustrates this:
#
#   class BadGridState
#     def initialize x, y
#       @x, @y = x, y
#     end
#     attr_accessor :x, :y
#   end
#
#   BadGridState.new(1, 1) == BadGridState.new(1, 2) #=> false
#   BadGridState.new(1, 1) == BadGridState.new(1, 1) #=> false (!!!)
#
# This is because, by default, hashing and equality are defined in terms of
# object identifiers, not the 'content' of the objects.
# The preferred solution is to define the state as a <tt>Struct</tt>:
#
#   GoodGridState = Struct.new(:x, :y)
#
#   GoodGridState.new(1, 1) == GoodGridState.new(1, 2) #=> false
#   GoodGridState.new(1, 1) == GoodGridState.new(1, 1) #=> true
#
# <tt>Struct</tt> is part of the ruby standard library, and it implements
# hashing and equality based on object content rather than identity.
#
# Alternatively, if you cannot derive your state class from <tt>Struct</tt>, you
# can define your own hash code and equality check. An easy way to do this is to
# include the {VectorValued} mix-in. It is also notable that you can make the
# default semantics work; you just have to make sure that there is only one
# instance of your state class per state, as in the following example:
#
#   g11 = BadGridState.new(1, 1)
#   g12 = BadGridState.new(1, 2)
#   g21 = BadGridState.new(2, 1)
#   model = FiniteMDP::TableModel.new([
#     [g11, :up,    g12, 0.9, 0],
#     [g11, :up,    g21, 0.1, 0],
#     [g11, :right, g21, 0.9, 0],
#     # ...
#     ]) # this will work as expected
#
# Note that the {Solver} will convert the model to its own internal
# representation. The efficiency of the methods that define the model is
# important while the solver is building its internal representation, but it
# does not affect the performance of the iterative algorithm used after that.
# Also note that the solver handles state and action numbering internally, so it
# is not necessary to use numbers for the states.
#
module FiniteMDP::Model
  #
  # States in this model.
  #
  # @return [Array<state>] not empty; no duplicate states
  #
  # @abstract
  #
  def states
    raise NotImplementedError
  end

  #
  # Actions that are valid for the given state.
  #
  # All states must have at least one valid action; see notes for {Model}
  # regarding how to encode a terminal state.
  #
  # @param [state] state
  #
  # @return [Array<action>] not empty; no duplicate actions
  #
  # @abstract
  #
  def actions state
    raise NotImplementedError
  end

  #
  # Successor states after taking the given action in the given state. Note that
  # the returned states may occur with zero probability.
  #
  # The default behavior is to return all states as candidate successor states
  # and let {#transition_probability} determine which ones are possible. It can
  # be overridden in sparse models to avoid storing or computing lots of zeros.
  # Also note that {TableModel.from_model} and {HashModel.from_model} can be
  # told to ignore transitions with zero probability, and that the {Solver}
  # ignores them in its internal representation, so you can usually forget about
  # this method.
  #
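  # As a rough sketch (not part of the library), the grid world model in the
  # README could override this to return only the current square and its
  # in-bounds neighbours, using that model's MOVES table:
  #
  #   def next_states state, action
  #     return [:stop] if state == :stop || terminii.member?(state)
  #     [state] + MOVES.values.map {|di, dj|
  #       [state[0] + di, state[1] + dj]}.select {|s| states.member?(s)}
  #   end
  #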
  # @param [state] state
  #
  # @param [action] action
  #
  # @return [Array<state>] not empty; no duplicate states
  #
  def next_states state, action
    states
  end

  #
  # Probability of the given transition.
  #
  # If the transition is not in the model, in the sense that it would never
  # arise from {#states}, {#actions} and {#next_states}, the result is
  # undefined. Note that {HashModel#transition_probability} and
  # {TableModel#transition_probability} return zero in this case, but this is
  # not part of the contract.
  #
  # @param [state] state
  #
  # @param [action] action
  #
  # @param [state] next_state
  #
  # @return [Float] in [0, 1]; undefined if the transition is not in the model
  #   (see notes above)
  #
  # @abstract
  #
  def transition_probability state, action, next_state
    raise NotImplementedError
  end

  #
  # Reward for a given transition.
  #
  # If the transition is not in the model, in the sense that it would never
  # arise from {#states}, {#actions} and {#next_states}, the result is
  # undefined. Note that {HashModel#reward} and {TableModel#reward} return
  # <tt>nil</tt> in this case, but this is not part of the contract.
  #
  # @param [state] state
  #
  # @param [action] action
  #
  # @param [state] next_state
  #
  # @return [Float, nil] nil only if the transition is not in the model (but the
  #   result is undefined in this case -- it need not be nil; see notes above)
  #
  # @abstract
  #
  def reward state, action, next_state
    raise NotImplementedError
  end

  #
  # Raise an error if the sum of the transition probabilities for any (state,
  # action) pair is not sufficiently close to 1.
  #
  # @param [Float] tol numerical tolerance
  #
  # @return [nil]
  #
  def check_transition_probabilities_sum tol=1e-6
    states.each do |state|
      actions(state).each do |action|
        pr = next_states(state, action).map{|next_state|
          transition_probability(state, action, next_state)}.inject(:+)
        raise "transition probabilities for state #{state.inspect} and " \
          "action #{action.inspect} sum to #{pr}" if (pr - 1).abs > tol
      end
    end
    nil
  end
end
