pyRDDLGym-jax 2.1.tar.gz → 2.2.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/PKG-INFO +25 -22
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/README.md +24 -21
- pyrddlgym_jax-2.2/pyRDDLGym_jax/__init__.py +1 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/core/planner.py +159 -76
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax.egg-info/PKG-INFO +25 -22
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/setup.py +1 -1
- pyrddlgym_jax-2.1/pyRDDLGym_jax/__init__.py +0 -1
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/LICENSE +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/core/__init__.py +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/core/assets/__init__.py +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/core/assets/favicon.ico +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/core/compiler.py +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/core/logic.py +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/core/simulator.py +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/core/tuning.py +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/core/visualization.py +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/entry_point.py +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/__init__.py +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/configs/Cartpole_Continuous_gym_drp.cfg +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/configs/Cartpole_Continuous_gym_replan.cfg +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/configs/Cartpole_Continuous_gym_slp.cfg +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/configs/HVAC_ippc2023_drp.cfg +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/configs/HVAC_ippc2023_slp.cfg +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/configs/MountainCar_Continuous_gym_slp.cfg +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/configs/MountainCar_ippc2023_slp.cfg +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/configs/PowerGen_Continuous_drp.cfg +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/configs/PowerGen_Continuous_replan.cfg +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/configs/PowerGen_Continuous_slp.cfg +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/configs/Quadcopter_drp.cfg +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/configs/Quadcopter_slp.cfg +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/configs/Reservoir_Continuous_drp.cfg +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/configs/Reservoir_Continuous_replan.cfg +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/configs/Reservoir_Continuous_slp.cfg +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/configs/UAV_Continuous_slp.cfg +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/configs/Wildfire_MDP_ippc2014_drp.cfg +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/configs/Wildfire_MDP_ippc2014_replan.cfg +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/configs/Wildfire_MDP_ippc2014_slp.cfg +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/configs/__init__.py +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/configs/default_drp.cfg +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/configs/default_replan.cfg +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/configs/default_slp.cfg +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/configs/tuning_drp.cfg +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/configs/tuning_replan.cfg +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/configs/tuning_slp.cfg +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/run_gradient.py +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/run_gym.py +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/run_plan.py +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/run_scipy.py +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/examples/run_tune.py +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax.egg-info/SOURCES.txt +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax.egg-info/dependency_links.txt +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax.egg-info/entry_points.txt +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax.egg-info/requires.txt +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax.egg-info/top_level.txt +0 -0
- {pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/setup.cfg +0 -0
{pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.2
 Name: pyRDDLGym-jax
-Version: 2.1
+Version: 2.2
 Summary: pyRDDLGym-jax: automatic differentiation for solving sequential planning problems in JAX.
 Home-page: https://github.com/pyrddlgym-project/pyRDDLGym-jax
 Author: Michael Gimelfarb, Ayal Taitler, Scott Sanner
@@ -58,8 +58,11 @@ Dynamic: summary
 
 Purpose:
 
-1. automatic translation of
-2.
+1. automatic translation of RDDL description files into differentiable JAX simulators
+2. implementation of (highly configurable) operator relaxations for working in discrete and hybrid domains
+3. flexible policy representations and automated Bayesian hyper-parameter tuning
+4. interactive dashboard for dyanmic visualization and debugging
+5. hybridization with parameter-exploring policy gradients.
 
 Some demos of solved problems by JaxPlan:
 
@@ -235,8 +238,23 @@ More documentation about this and other new features will be coming soon.
 
 ## Tuning the Planner
 
-
-
+A basic run script is provided to run automatic Bayesian hyper-parameter tuning for the most sensitive parameters of JaxPlan:
+
+```shell
+jaxplan tune <domain> <instance> <method> <trials> <iters> <workers> <dashboard>
+```
+
+where:
+- ``domain`` is the domain identifier as specified in rddlrepository
+- ``instance`` is the instance identifier
+- ``method`` is the planning method to use (i.e. drp, slp, replan)
+- ``trials`` is the (optional) number of trials/episodes to average in evaluating each hyper-parameter setting
+- ``iters`` is the (optional) maximum number of iterations/evaluations of Bayesian optimization to perform
+- ``workers`` is the (optional) number of parallel evaluations to be done at each iteration, e.g. the total evaluations = ``iters * workers``
+- ``dashboard`` is whether the optimizations are tracked in the dashboard application.
+
+It is easy to tune a custom range of the planner's hyper-parameters efficiently.
+First create a config file template with patterns replacing concrete parameter values that you want to tune, e.g.:
 
 ```ini
 [Model]
@@ -260,7 +278,7 @@ train_on_reset=True
 
 would allow to tune the sharpness of model relaxations, and the learning rate of the optimizer.
 
-Next, you must link the patterns in the config with concrete hyper-parameter ranges the tuner will understand:
+Next, you must link the patterns in the config with concrete hyper-parameter ranges the tuner will understand, and run the optimizer:
 
 ```python
 import pyRDDLGym
@@ -292,22 +310,7 @@ tuning = JaxParameterTuning(env=env,
                             gp_iters=iters)
 tuning.tune(key=42, log_file='path/to/log.csv')
 ```
-
-A basic run script is provided to run the automatic hyper-parameter tuning for the most sensitive parameters of JaxPlan:
-
-```shell
-jaxplan tune <domain> <instance> <method> <trials> <iters> <workers> <dashboard>
-```
-
-where:
-- ``domain`` is the domain identifier as specified in rddlrepository
-- ``instance`` is the instance identifier
-- ``method`` is the planning method to use (i.e. drp, slp, replan)
-- ``trials`` is the (optional) number of trials/episodes to average in evaluating each hyper-parameter setting
-- ``iters`` is the (optional) maximum number of iterations/evaluations of Bayesian optimization to perform
-- ``workers`` is the (optional) number of parallel evaluations to be done at each iteration, e.g. the total evaluations = ``iters * workers``
-- ``dashboard`` is whether the optimizations are tracked in the dashboard application.
-
+
 
 ## Simulation
 
{pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/README.md

@@ -12,8 +12,11 @@
@@ -189,8 +192,23 @@ More documentation about this and other new features will be coming soon.
@@ -214,7 +232,7 @@ train_on_reset=True
@@ -246,22 +264,7 @@ tuning = JaxParameterTuning(env=env,

The README.md hunks are textually identical to the corresponding hunks of the PKG-INFO diff above (PKG-INFO embeds the README, offset by the metadata header): the Purpose list is expanded from two truncated items to five, the `jaxplan tune` run-script documentation is moved to the head of the "Tuning the Planner" section, and the sentence introducing the tuning example gains the clause ", and run the optimizer".
pyrddlgym_jax-2.2/pyRDDLGym_jax/__init__.py

@@ -0,0 +1 @@
+__version__ = '2.2'
{pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax/core/planner.py

@@ -47,7 +47,9 @@ import jax.random as random
 import numpy as np
 import optax
 import termcolor
-from tqdm import tqdm
+from tqdm import tqdm, TqdmWarning
+import warnings
+warnings.filterwarnings("ignore", category=TqdmWarning)
 
 from pyRDDLGym.core.compiler.model import RDDLPlanningModel, RDDLLiftedModel
 from pyRDDLGym.core.debug.logger import Logger
@@ -1212,17 +1214,22 @@ class GaussianPGPE(PGPE):
                  init_sigma: float=1.0,
                  sigma_range: Tuple[float, float]=(1e-5, 1e5),
                  scale_reward: bool=True,
+                 min_reward_scale: float=1e-5,
                  super_symmetric: bool=True,
                  super_symmetric_accurate: bool=True,
                  optimizer: Callable[..., optax.GradientTransformation]=optax.adam,
                  optimizer_kwargs_mu: Optional[Kwargs]=None,
-                 optimizer_kwargs_sigma: Optional[Kwargs]=None) -> None:
+                 optimizer_kwargs_sigma: Optional[Kwargs]=None,
+                 start_entropy_coeff: float=1e-3,
+                 end_entropy_coeff: float=1e-8,
+                 max_kl_update: Optional[float]=None) -> None:
         '''Creates a new Gaussian PGPE planner.
         
         :param batch_size: how many policy parameters to sample per optimization step
         :param init_sigma: initial standard deviation of Gaussian
         :param sigma_range: bounds to constrain standard deviation
         :param scale_reward: whether to apply reward scaling as in the paper
+        :param min_reward_scale: minimum reward scaling to avoid underflow
         :param super_symmetric: whether to use super-symmetric sampling as in the paper
         :param super_symmetric_accurate: whether to use the accurate formula for super-
             symmetric sampling or the simplified but biased formula
@@ -1231,6 +1238,9 @@ class GaussianPGPE(PGPE):
             factory for the mean optimizer
         :param optimizer_kwargs_sigma: a dictionary of parameters to pass to the SGD
             factory for the standard deviation optimizer
+        :param start_entropy_coeff: starting entropy regularization coeffient for Gaussian
+        :param end_entropy_coeff: ending entropy regularization coeffient for Gaussian
+        :param max_kl_update: bound on kl-divergence between parameter updates
         '''
         super().__init__()
@@ -1238,8 +1248,13 @@ class GaussianPGPE(PGPE):
         self.init_sigma = init_sigma
         self.sigma_range = sigma_range
         self.scale_reward = scale_reward
+        self.min_reward_scale = min_reward_scale
         self.super_symmetric = super_symmetric
         self.super_symmetric_accurate = super_symmetric_accurate
+        
+        # entropy regularization penalty is decayed exponentially between these values
+        self.start_entropy_coeff = start_entropy_coeff
+        self.end_entropy_coeff = end_entropy_coeff
         
         # set optimizers
         if optimizer_kwargs_mu is None:
@@ -1249,36 +1264,62 @@ class GaussianPGPE(PGPE):
             optimizer_kwargs_sigma = {'learning_rate': 0.1}
         self.optimizer_kwargs_sigma = optimizer_kwargs_sigma
         self.optimizer_name = optimizer
-        
-        
+        try:
+            mu_optimizer = optax.inject_hyperparams(optimizer)(**optimizer_kwargs_mu)
+            sigma_optimizer = optax.inject_hyperparams(optimizer)(**optimizer_kwargs_sigma)
+        except Exception as _:
+            raise_warning(
+                f'Failed to inject hyperparameters into optax optimizer for PGPE, '
+                'rolling back to safer method: please note that kl-divergence '
+                'constraints will be disabled.', 'red')
+            mu_optimizer = optimizer(**optimizer_kwargs_mu)
+            sigma_optimizer = optimizer(**optimizer_kwargs_sigma)
+            max_kl_update = None
         self.optimizers = (mu_optimizer, sigma_optimizer)
+        self.max_kl = max_kl_update
     
     def __str__(self) -> str:
         return (f'PGPE hyper-parameters:\n'
-                f'    method
-                f'    batch_size
-                f'    init_sigma
-                f'    sigma_range
-                f'    scale_reward
-                f'
-                f'
-                f'
+                f'    method           ={self.__class__.__name__}\n'
+                f'    batch_size       ={self.batch_size}\n'
+                f'    init_sigma       ={self.init_sigma}\n'
+                f'    sigma_range      ={self.sigma_range}\n'
+                f'    scale_reward     ={self.scale_reward}\n'
+                f'    min_reward_scale ={self.min_reward_scale}\n'
+                f'    super_symmetric  ={self.super_symmetric}\n'
+                f'    accurate         ={self.super_symmetric_accurate}\n'
+                f'    optimizer        ={self.optimizer_name}\n'
                 f'    optimizer_kwargs:\n'
                 f'        mu   ={self.optimizer_kwargs_mu}\n'
                 f'        sigma={self.optimizer_kwargs_sigma}\n'
+                f'    start_entropy_coeff={self.start_entropy_coeff}\n'
+                f'    end_entropy_coeff  ={self.end_entropy_coeff}\n'
+                f'    max_kl_update      ={self.max_kl}\n'
                 )
     
     def compile(self, loss_fn: Callable, projection: Callable, real_dtype: Type) -> None:
-        MIN_NORM = 1e-5
         sigma0 = self.init_sigma
         sigma_range = self.sigma_range
         scale_reward = self.scale_reward
+        min_reward_scale = self.min_reward_scale
         super_symmetric = self.super_symmetric
         super_symmetric_accurate = self.super_symmetric_accurate
         batch_size = self.batch_size
         optimizers = (mu_optimizer, sigma_optimizer) = self.optimizers
-        
-        
+        max_kl = self.max_kl
+        
+        # entropy regularization penalty is decayed exponentially by elapsed budget
+        start_entropy_coeff = self.start_entropy_coeff
+        if start_entropy_coeff == 0:
+            entropy_coeff_decay = 0
+        else:
+            entropy_coeff_decay = (self.end_entropy_coeff / start_entropy_coeff) ** 0.01
+        
+        # ***********************************************************************
+        # INITIALIZATION OF POLICY
+        #
+        # ***********************************************************************
+        
         def _jax_wrapped_pgpe_init(key, policy_params):
             mu = policy_params
             sigma = jax.tree_map(lambda x: sigma0 * jnp.ones_like(x), mu)
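Two implementation notes on the hunk above. First, `optax.inject_hyperparams` is what later lets the KL-constrained update read and rescale the learning rate through the optimizer state (hence the fallback path disables the KL constraint when injection fails). A minimal standalone optax sketch of that mechanism, using toy parameters rather than planner code:

```python
import jax.numpy as jnp
import optax

# wrapping the factory makes its hyper-parameters part of the optimizer state
opt = optax.inject_hyperparams(optax.adam)(learning_rate=0.1)

params = {'w': jnp.ones(3)}
state = opt.init(params)

# the learning rate is now visible on the state and can be rescaled in place,
# which is how the PGPE update below enforces its KL bound
state.hyperparams['learning_rate'] = 0.05
grads = {'w': jnp.full(3, 0.5)}
updates, state = opt.update(grads, state, params=params)
params = optax.apply_updates(params, updates)
```

Second, the decay constant `(end / start) ** 0.01` makes the entropy coefficient interpolate exponentially from `start_entropy_coeff` at 0% of elapsed budget down to `end_entropy_coeff` at 100%, since `decay ** 100 == end / start`. A quick sketch of the schedule, assuming `progress` is the elapsed-budget percentage in [0, 100]:

```python
def entropy_coeff(progress: float, start: float = 1e-3, end: float = 1e-8) -> float:
    # exponential interpolation: start at progress=0, end at progress=100
    if start == 0:
        return 0.0
    decay = (end / start) ** 0.01
    return start * decay ** progress

assert abs(entropy_coeff(0.0) - 1e-3) < 1e-15
assert abs(entropy_coeff(100.0) - 1e-8) < 1e-12
```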
@@ -1289,7 +1330,11 @@ class GaussianPGPE(PGPE):
 
         self._initializer = jax.jit(_jax_wrapped_pgpe_init)
         
-        #
+        # ***********************************************************************
+        # PARAMETER SAMPLING FUNCTIONS
+        #
+        # ***********************************************************************
+        
         def _jax_wrapped_mu_noise(key, sigma):
             return sigma * random.normal(key, shape=jnp.shape(sigma), dtype=real_dtype)
 
@@ -1299,19 +1344,20 @@ class GaussianPGPE(PGPE):
             a = (sigma - jnp.abs(epsilon)) / sigma
             if super_symmetric_accurate:
                 aa = jnp.abs(a)
+                aa3 = jnp.power(aa, 3)
                 epsilon_star = jnp.sign(epsilon) * phi * jnp.where(
                     a <= 0,
-                    jnp.exp(c1 *
-                    jnp.exp(aa - c3 * aa * jnp.log(1.0 -
+                    jnp.exp(c1 * (aa3 - aa) / jnp.log(aa + 1e-10) + c2 * aa),
+                    jnp.exp(aa - c3 * aa * jnp.log(1.0 - aa3 + 1e-10))
                 )
             else:
                 epsilon_star = jnp.sign(epsilon) * phi * jnp.exp(a)
             return epsilon_star
 
         def _jax_wrapped_sample_params(key, mu, sigma):
-            
-            
-            
+            treedef = jax.tree_util.tree_structure(sigma)
+            keys = random.split(key, num=treedef.num_leaves)
+            keys_pytree = jax.tree_util.tree_unflatten(treedef=treedef, leaves=keys)
             epsilon = jax.tree_map(_jax_wrapped_mu_noise, keys_pytree, sigma)
             p1 = jax.tree_map(jnp.add, mu, epsilon)
             p2 = jax.tree_map(jnp.subtract, mu, epsilon)
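The three added lines in `_jax_wrapped_sample_params` are the usual JAX recipe for drawing independent noise for every leaf of a parameter pytree: split one key into `num_leaves` keys, then rebuild them into a pytree with the same structure as the parameters. A self-contained sketch of the idiom, with toy shapes rather than planner code:

```python
import jax
import jax.numpy as jnp
import jax.random as random

sigma = {'b': jnp.ones(3), 'w': jnp.ones((2, 3))}  # toy parameter pytree

# one fresh PRNG key per leaf, arranged into a pytree of the same structure
treedef = jax.tree_util.tree_structure(sigma)
keys = random.split(random.PRNGKey(0), num=treedef.num_leaves)
keys_pytree = jax.tree_util.tree_unflatten(treedef, list(keys))

# each leaf now draws its noise with its own independent key
epsilon = jax.tree_util.tree_map(
    lambda k, s: s * random.normal(k, jnp.shape(s)), keys_pytree, sigma)
```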
@@ -1321,14 +1367,18 @@ class GaussianPGPE(PGPE):
                 p4 = jax.tree_map(jnp.subtract, mu, epsilon_star)
             else:
                 epsilon_star, p3, p4 = epsilon, p1, p2
-            return
+            return p1, p2, p3, p4, epsilon, epsilon_star
         
-        #
+        # ***********************************************************************
+        # POLICY GRADIENT CALCULATION
+        #
+        # ***********************************************************************
+        
         def _jax_wrapped_mu_grad(epsilon, epsilon_star, r1, r2, r3, r4, m):
             if super_symmetric:
                 if scale_reward:
-                    scale1 = jnp.maximum(
-                    scale2 = jnp.maximum(
+                    scale1 = jnp.maximum(min_reward_scale, m - (r1 + r2) / 2)
+                    scale2 = jnp.maximum(min_reward_scale, m - (r3 + r4) / 2)
                 else:
                     scale1 = scale2 = 1.0
                 r_mu1 = (r1 - r2) / (2 * scale1)
@@ -1336,37 +1386,37 @@ class GaussianPGPE(PGPE):
                 grad = -(r_mu1 * epsilon + r_mu2 * epsilon_star)
             else:
                 if scale_reward:
-                    scale = jnp.maximum(
+                    scale = jnp.maximum(min_reward_scale, m - (r1 + r2) / 2)
                 else:
                     scale = 1.0
                 r_mu = (r1 - r2) / (2 * scale)
                 grad = -r_mu * epsilon
             return grad
         
-        def _jax_wrapped_sigma_grad(epsilon, epsilon_star, sigma, r1, r2, r3, r4, m):
+        def _jax_wrapped_sigma_grad(epsilon, epsilon_star, sigma, r1, r2, r3, r4, m, ent):
             if super_symmetric:
                 mask = r1 + r2 >= r3 + r4
                 epsilon_tau = mask * epsilon + (1 - mask) * epsilon_star
-                s = epsilon_tau
+                s = jnp.square(epsilon_tau) / sigma - sigma
                 if scale_reward:
-                    scale = jnp.maximum(
+                    scale = jnp.maximum(min_reward_scale, m - (r1 + r2 + r3 + r4) / 4)
                 else:
                     scale = 1.0
                 r_sigma = ((r1 + r2) - (r3 + r4)) / (4 * scale)
             else:
-                s = epsilon
+                s = jnp.square(epsilon) / sigma - sigma
                 if scale_reward:
-                    scale = jnp.maximum(
+                    scale = jnp.maximum(min_reward_scale, jnp.abs(m))
                 else:
                     scale = 1.0
                 r_sigma = (r1 + r2) / (2 * scale)
-            grad = -r_sigma * s
+            grad = -(r_sigma * s + ent / sigma)
             return grad
         
-        def _jax_wrapped_pgpe_grad(key, mu, sigma, r_max,
+        def _jax_wrapped_pgpe_grad(key, mu, sigma, r_max, ent,
                                    policy_hyperparams, subs, model_params):
             key, subkey = random.split(key)
-            
+            p1, p2, p3, p4, epsilon, epsilon_star = _jax_wrapped_sample_params(
                 key, mu, sigma)
             r1 = -loss_fn(subkey, p1, policy_hyperparams, subs, model_params)[0]
             r2 = -loss_fn(subkey, p2, policy_hyperparams, subs, model_params)[0]
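A note on the math in this hunk, following the standard PGPE derivation rather than anything stated in the diff itself: the rewritten `s` is the `(epsilon**2 - sigma**2) / sigma` form of the σ-update from the PGPE paper (the exact log-likelihood derivative scaled by σ²), and the added `ent / sigma` term is the σ-derivative of the Gaussian entropy, so descending `grad` also ascends entropy and keeps σ from collapsing early in training:

```latex
\frac{\partial}{\partial\sigma}\log\mathcal{N}(\mu+\epsilon \mid \mu,\sigma^{2})
  = \frac{\epsilon^{2}-\sigma^{2}}{\sigma^{3}},
\qquad
H\!\left(\mathcal{N}(\mu,\sigma^{2})\right) = \tfrac{1}{2}\log\!\left(2\pi e\,\sigma^{2}\right),
\qquad
\frac{\partial H}{\partial\sigma} = \frac{1}{\sigma}.
```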
@@ -1384,42 +1434,76 @@ class GaussianPGPE(PGPE):
                 epsilon, epsilon_star
             )
             grad_sigma = jax.tree_map(
-                partial(_jax_wrapped_sigma_grad,
+                partial(_jax_wrapped_sigma_grad,
+                        r1=r1, r2=r2, r3=r3, r4=r4, m=r_max, ent=ent),
                 epsilon, epsilon_star, sigma
             )
             return grad_mu, grad_sigma, r_max
         
-        def _jax_wrapped_pgpe_grad_batched(key, pgpe_params, r_max,
+        def _jax_wrapped_pgpe_grad_batched(key, pgpe_params, r_max, ent,
                                            policy_hyperparams, subs, model_params):
             mu, sigma = pgpe_params
             if batch_size == 1:
                 mu_grad, sigma_grad, new_r_max = _jax_wrapped_pgpe_grad(
-                    key, mu, sigma, r_max, policy_hyperparams, subs, model_params)
+                    key, mu, sigma, r_max, ent, policy_hyperparams, subs, model_params)
             else:
                 keys = random.split(key, num=batch_size)
                 mu_grads, sigma_grads, r_maxs = jax.vmap(
                     _jax_wrapped_pgpe_grad,
-                    in_axes=(0, None, None, None, None, None, None)
-                )(keys, mu, sigma, r_max, policy_hyperparams, subs, model_params)
+                    in_axes=(0, None, None, None, None, None, None, None)
+                )(keys, mu, sigma, r_max, ent, policy_hyperparams, subs, model_params)
                 mu_grad, sigma_grad = jax.tree_map(
                     partial(jnp.mean, axis=0), (mu_grads, sigma_grads))
                 new_r_max = jnp.max(r_maxs)
             return mu_grad, sigma_grad, new_r_max
+        
+        # ***********************************************************************
+        # PARAMETER UPDATE
+        #
+        # ***********************************************************************
         
-        def
+        def _jax_wrapped_pgpe_kl_term(mu, sigma, old_mu, old_sigma):
+            return 0.5 * jnp.sum(2 * jnp.log(sigma / old_sigma) +
+                                 jnp.square(old_sigma / sigma) +
+                                 jnp.square((mu - old_mu) / sigma) - 1)
+        
+        def _jax_wrapped_pgpe_update(key, pgpe_params, r_max, progress,
                                      policy_hyperparams, subs, model_params,
                                      pgpe_opt_state):
+            # regular update
             mu, sigma = pgpe_params
             mu_state, sigma_state = pgpe_opt_state
+            ent = start_entropy_coeff * jnp.power(entropy_coeff_decay, progress)
             mu_grad, sigma_grad, new_r_max = _jax_wrapped_pgpe_grad_batched(
-                key, pgpe_params, r_max, policy_hyperparams, subs, model_params)
+                key, pgpe_params, r_max, ent, policy_hyperparams, subs, model_params)
             mu_updates, new_mu_state = mu_optimizer.update(mu_grad, mu_state, params=mu)
             sigma_updates, new_sigma_state = sigma_optimizer.update(
                 sigma_grad, sigma_state, params=sigma)
             new_mu = optax.apply_updates(mu, mu_updates)
-            new_mu, converged = projection(new_mu, policy_hyperparams)
             new_sigma = optax.apply_updates(sigma, sigma_updates)
             new_sigma = jax.tree_map(lambda x: jnp.clip(x, *sigma_range), new_sigma)
+            
+            # respect KL divergence contraint with old parameters
+            if max_kl is not None:
+                old_mu_lr = new_mu_state.hyperparams['learning_rate']
+                old_sigma_lr = new_sigma_state.hyperparams['learning_rate']
+                kl_terms = jax.tree_map(
+                    _jax_wrapped_pgpe_kl_term, new_mu, new_sigma, mu, sigma)
+                total_kl = jax.tree_util.tree_reduce(jnp.add, kl_terms)
+                kl_reduction = jnp.minimum(1.0, jnp.sqrt(max_kl / total_kl))
+                mu_state.hyperparams['learning_rate'] = old_mu_lr * kl_reduction
+                sigma_state.hyperparams['learning_rate'] = old_sigma_lr * kl_reduction
+                mu_updates, new_mu_state = mu_optimizer.update(mu_grad, mu_state, params=mu)
+                sigma_updates, new_sigma_state = sigma_optimizer.update(
+                    sigma_grad, sigma_state, params=sigma)
+                new_mu = optax.apply_updates(mu, mu_updates)
+                new_sigma = optax.apply_updates(sigma, sigma_updates)
+                new_sigma = jax.tree_map(lambda x: jnp.clip(x, *sigma_range), new_sigma)
+                new_mu_state.hyperparams['learning_rate'] = old_mu_lr
+                new_sigma_state.hyperparams['learning_rate'] = old_sigma_lr
+            
+            # apply projection step and finalize results
+            new_mu, converged = projection(new_mu, policy_hyperparams)
             new_pgpe_params = (new_mu, new_sigma)
             new_pgpe_opt_state = (new_mu_state, new_sigma_state)
             policy_params = new_mu
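For reference, `_jax_wrapped_pgpe_kl_term` is the closed-form KL divergence from the old to the new diagonal Gaussian search distribution, standard Gaussian algebra rearranged exactly as the code computes it:

```latex
\mathrm{KL}\!\left(\mathcal{N}(\mu_{\text{old}},\sigma_{\text{old}}^{2}) \,\middle\|\, \mathcal{N}(\mu,\sigma^{2})\right)
  = \sum_{i}\left[\log\frac{\sigma_{i}}{\sigma_{\text{old},i}}
  + \frac{\sigma_{\text{old},i}^{2}+(\mu_{i}-\mu_{\text{old},i})^{2}}{2\sigma_{i}^{2}}
  - \frac{1}{2}\right]
```

When the summed KL exceeds `max_kl`, both learning rates are shrunk by `min(1, sqrt(max_kl / total_kl))` and the update is redone, a trust-region-style bound on the update size; the original learning rates are then restored on the new optimizer states.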
@@ -1462,14 +1546,14 @@ def mean_deviation_utility(returns: jnp.ndarray, beta: float) -> float:
 @jax.jit
 def mean_semideviation_utility(returns: jnp.ndarray, beta: float) -> float:
     mu = jnp.mean(returns)
-    msd = jnp.sqrt(jnp.mean(jnp.minimum(0.0, returns - mu)
+    msd = jnp.sqrt(jnp.mean(jnp.square(jnp.minimum(0.0, returns - mu))))
     return mu - 0.5 * beta * msd
 
 
 @jax.jit
 def mean_semivariance_utility(returns: jnp.ndarray, beta: float) -> float:
     mu = jnp.mean(returns)
-    msv = jnp.mean(jnp.minimum(0.0, returns - mu)
+    msv = jnp.mean(jnp.square(jnp.minimum(0.0, returns - mu)))
     return mu - 0.5 * beta * msv
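In 2.2 the downside deviations are squared before averaging, so the two risk-averse utilities compute, with R-bar the mean return:

```latex
U_{\text{msd}}(R) = \bar{R} - \tfrac{\beta}{2}\sqrt{\mathbb{E}\!\left[\min(0,\,R-\bar{R})^{2}\right]},
\qquad
U_{\text{msv}}(R) = \bar{R} - \tfrac{\beta}{2}\,\mathbb{E}\!\left[\min(0,\,R-\bar{R})^{2}\right]
```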
@@ -1768,7 +1852,6 @@ r"""
|
|
|
1768
1852
|
|
|
1769
1853
|
# optimization
|
|
1770
1854
|
self.update = self._jax_update(train_loss)
|
|
1771
|
-
self.check_zero_grad = self._jax_check_zero_gradients()
|
|
1772
1855
|
|
|
1773
1856
|
# pgpe option
|
|
1774
1857
|
if self.use_pgpe:
|
|
@@ -1831,6 +1914,12 @@ r"""
|
|
|
1831
1914
|
projection = self.plan.projection
|
|
1832
1915
|
use_ls = self.line_search_kwargs is not None
|
|
1833
1916
|
|
|
1917
|
+
# check if the gradients are all zeros
|
|
1918
|
+
def _jax_wrapped_zero_gradients(grad):
|
|
1919
|
+
leaves, _ = jax.tree_util.tree_flatten(
|
|
1920
|
+
jax.tree_map(lambda g: jnp.allclose(g, 0), grad))
|
|
1921
|
+
return jnp.all(jnp.asarray(leaves))
|
|
1922
|
+
|
|
1834
1923
|
# calculate the plan gradient w.r.t. return loss and update optimizer
|
|
1835
1924
|
# also perform a projection step to satisfy constraints on actions
|
|
1836
1925
|
def _jax_wrapped_loss_swapped(policy_params, key, policy_hyperparams,
|
|
@@ -1855,23 +1944,12 @@ r"""
|
|
|
1855
1944
|
policy_params, converged = projection(policy_params, policy_hyperparams)
|
|
1856
1945
|
log['grad'] = grad
|
|
1857
1946
|
log['updates'] = updates
|
|
1947
|
+
zero_grads = _jax_wrapped_zero_gradients(grad)
|
|
1858
1948
|
return policy_params, converged, opt_state, opt_aux, \
|
|
1859
|
-
loss_val, log, model_params
|
|
1949
|
+
loss_val, log, model_params, zero_grads
|
|
1860
1950
|
|
|
1861
1951
|
return jax.jit(_jax_wrapped_plan_update)
|
|
1862
1952
|
|
|
1863
|
-
def _jax_check_zero_gradients(self):
|
|
1864
|
-
|
|
1865
|
-
def _jax_wrapped_zero_gradient(grad):
|
|
1866
|
-
return jnp.allclose(grad, 0)
|
|
1867
|
-
|
|
1868
|
-
def _jax_wrapped_zero_gradients(grad):
|
|
1869
|
-
leaves, _ = jax.tree_util.tree_flatten(
|
|
1870
|
-
jax.tree_map(_jax_wrapped_zero_gradient, grad))
|
|
1871
|
-
return jnp.all(jnp.asarray(leaves))
|
|
1872
|
-
|
|
1873
|
-
return jax.jit(_jax_wrapped_zero_gradients)
|
|
1874
|
-
|
|
1875
1953
|
def _batched_init_subs(self, subs):
|
|
1876
1954
|
rddl = self.rddl
|
|
1877
1955
|
n_train, n_test = self.batch_size_train, self.batch_size_test
|
|
@@ -2175,11 +2253,12 @@ r"""
|
|
|
2175
2253
|
# ======================================================================
|
|
2176
2254
|
|
|
2177
2255
|
# initialize running statistics
|
|
2178
|
-
best_params, best_loss, best_grad = policy_params, jnp.inf,
|
|
2256
|
+
best_params, best_loss, best_grad = policy_params, jnp.inf, None
|
|
2179
2257
|
last_iter_improve = 0
|
|
2180
2258
|
rolling_test_loss = RollingMean(test_rolling_window)
|
|
2181
2259
|
log = {}
|
|
2182
2260
|
status = JaxPlannerStatus.NORMAL
|
|
2261
|
+
progress_percent = 0
|
|
2183
2262
|
|
|
2184
2263
|
# initialize stopping criterion
|
|
2185
2264
|
if stopping_rule is not None:
|
|
@@ -2191,18 +2270,19 @@ r"""
|
|
|
2191
2270
|
dashboard_id, dashboard.get_planner_info(self),
|
|
2192
2271
|
key=dash_key, viz=self.dashboard_viz)
|
|
2193
2272
|
|
|
2273
|
+
# progress bar
|
|
2274
|
+
if print_progress:
|
|
2275
|
+
progress_bar = tqdm(None, total=100, position=tqdm_position,
|
|
2276
|
+
bar_format='{l_bar}{bar}| {elapsed} {postfix}')
|
|
2277
|
+
else:
|
|
2278
|
+
progress_bar = None
|
|
2279
|
+
position_str = '' if tqdm_position is None else f'[{tqdm_position}]'
|
|
2280
|
+
|
|
2194
2281
|
# ======================================================================
|
|
2195
2282
|
# MAIN TRAINING LOOP BEGINS
|
|
2196
2283
|
# ======================================================================
|
|
2197
2284
|
|
|
2198
|
-
|
|
2199
|
-
if print_progress:
|
|
2200
|
-
iters = tqdm(iters, total=100,
|
|
2201
|
-
bar_format='{l_bar}{bar}| {elapsed} {postfix}',
|
|
2202
|
-
position=tqdm_position)
|
|
2203
|
-
position_str = '' if tqdm_position is None else f'[{tqdm_position}]'
|
|
2204
|
-
|
|
2205
|
-
for it in iters:
|
|
2285
|
+
for it in range(epochs):
|
|
2206
2286
|
|
|
2207
2287
|
# ==================================================================
|
|
2208
2288
|
# NEXT GRADIENT DESCENT STEP
|
|
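As the progress-bar hunks show, 2.2 stops wrapping the epoch iterator in tqdm and instead drives a manual bar whose unit is percent of elapsed budget, advancing it by `progress_percent - progress_bar.n`. A standalone sketch of that tqdm idiom, with a toy loop and timing in place of the planner's budget logic:

```python
import time
from tqdm import tqdm

# a bar that tracks a 0-100 progress percentage instead of the loop counter
progress_bar = tqdm(None, total=100, bar_format='{l_bar}{bar}| {elapsed} {postfix}')
start = time.time()
budget = 2.0  # seconds of compute budget, for the toy example

for it in range(10 ** 6):
    time.sleep(0.01)  # stand-in for one gradient step
    elapsed = time.time() - start
    progress_percent = min(100, int(100 * elapsed / budget))
    progress_bar.set_postfix_str(f'{(it + 1) / (elapsed + 1e-6):.2f}it/s', refresh=False)
    progress_bar.update(progress_percent - progress_bar.n)  # jump to the absolute percent
    if progress_percent >= 100:
        break

progress_bar.close()
```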
@@ -2213,8 +2293,9 @@ r"""
|
|
|
2213
2293
|
# update the parameters of the plan
|
|
2214
2294
|
key, subkey = random.split(key)
|
|
2215
2295
|
(policy_params, converged, opt_state, opt_aux, train_loss, train_log,
|
|
2216
|
-
model_params) = self.update(
|
|
2217
|
-
|
|
2296
|
+
model_params, zero_grads) = self.update(
|
|
2297
|
+
subkey, policy_params, policy_hyperparams, train_subs, model_params,
|
|
2298
|
+
opt_state, opt_aux)
|
|
2218
2299
|
test_loss, (test_log, model_params_test) = self.test_loss(
|
|
2219
2300
|
subkey, policy_params, policy_hyperparams, test_subs, model_params_test)
|
|
2220
2301
|
test_loss_smooth = rolling_test_loss.update(test_loss)
|
|
@@ -2224,8 +2305,9 @@ r"""
|
|
|
2224
2305
|
if self.use_pgpe:
|
|
2225
2306
|
key, subkey = random.split(key)
|
|
2226
2307
|
pgpe_params, r_max, pgpe_opt_state, pgpe_param, pgpe_converged = \
|
|
2227
|
-
self.pgpe.update(subkey, pgpe_params, r_max,
|
|
2228
|
-
test_subs,
|
|
2308
|
+
self.pgpe.update(subkey, pgpe_params, r_max, progress_percent,
|
|
2309
|
+
policy_hyperparams, test_subs, model_params_test,
|
|
2310
|
+
pgpe_opt_state)
|
|
2229
2311
|
pgpe_loss, _ = self.test_loss(
|
|
2230
2312
|
subkey, pgpe_param, policy_hyperparams, test_subs, model_params_test)
|
|
2231
2313
|
pgpe_loss_smooth = rolling_pgpe_loss.update(pgpe_loss)
|
|
@@ -2252,7 +2334,7 @@ r"""
|
|
|
2252
2334
|
# ==================================================================
|
|
2253
2335
|
|
|
2254
2336
|
# no progress
|
|
2255
|
-
if (not pgpe_improve) and
|
|
2337
|
+
if (not pgpe_improve) and zero_grads:
|
|
2256
2338
|
status = JaxPlannerStatus.NO_PROGRESS
|
|
2257
2339
|
|
|
2258
2340
|
# constraint satisfaction problem
|
|
@@ -2311,14 +2393,15 @@ r"""
|
|
|
2311
2393
|
|
|
2312
2394
|
# if the progress bar is used
|
|
2313
2395
|
if print_progress:
|
|
2314
|
-
|
|
2315
|
-
iters.set_description(
|
|
2396
|
+
progress_bar.set_description(
|
|
2316
2397
|
f'{position_str} {it:6} it / {-train_loss:14.5f} train / '
|
|
2317
2398
|
f'{-test_loss_smooth:14.5f} test / {-best_loss:14.5f} best / '
|
|
2318
2399
|
f'{status.value} status / {total_pgpe_it:6} pgpe',
|
|
2319
2400
|
refresh=False
|
|
2320
2401
|
)
|
|
2321
|
-
|
|
2402
|
+
progress_bar.set_postfix_str(
|
|
2403
|
+
f"{(it + 1) / (elapsed + 1e-6):.2f}it/s", refresh=False)
|
|
2404
|
+
progress_bar.update(progress_percent - progress_bar.n)
|
|
2322
2405
|
|
|
2323
2406
|
# dash-board
|
|
2324
2407
|
if dashboard is not None:
|
|
@@ -2339,7 +2422,7 @@ r"""
|
|
|
2339
2422
|
|
|
2340
2423
|
# release resources
|
|
2341
2424
|
if print_progress:
|
|
2342
|
-
|
|
2425
|
+
progress_bar.close()
|
|
2343
2426
|
|
|
2344
2427
|
# validate the test return
|
|
2345
2428
|
if log:
|
|
{pyrddlgym_jax-2.1 → pyrddlgym_jax-2.2}/pyRDDLGym_jax.egg-info/PKG-INFO

@@ -1,6 +1,6 @@
@@ -58,8 +58,11 @@ Dynamic: summary
@@ -235,8 +238,23 @@ More documentation about this and other new features will be coming soon.
@@ -260,7 +278,7 @@ train_on_reset=True
@@ -292,22 +310,7 @@ tuning = JaxParameterTuning(env=env,

The egg-info copy of PKG-INFO receives exactly the same changes as the top-level PKG-INFO diff above: the version bump from 2.1 to 2.2, the expanded five-item Purpose list, and the relocated "Tuning the Planner" documentation.
@@ -19,7 +19,7 @@ long_description = (Path(__file__).parent / "README.md").read_text()
|
|
|
19
19
|
|
|
20
20
|
setup(
|
|
21
21
|
name='pyRDDLGym-jax',
|
|
22
|
-
version='2.
|
|
22
|
+
version='2.2',
|
|
23
23
|
author="Michael Gimelfarb, Ayal Taitler, Scott Sanner",
|
|
24
24
|
author_email="mike.gimelfarb@mail.utoronto.ca, ataitler@gmail.com, ssanner@mie.utoronto.ca",
|
|
25
25
|
description="pyRDDLGym-jax: automatic differentiation for solving sequential planning problems in JAX.",
|
|
pyrddlgym_jax-2.1/pyRDDLGym_jax/__init__.py

@@ -1 +0,0 @@
-__version__ = '2.1'