Distributional Successor Features Enable Zero-Shot Policy Optimization

University of Washington
Given an unlabeled offline dataset, Distributional Successor Features model both "what outcomes can happen?" and "how to achieve a particular outcome?", enabling quick adaptation to new downstream tasks without re-planning or expensive test-time policy optimization.

Abstract

Intelligent agents must be generalists, capable of quickly adapting to various tasks. In reinforcement learning (RL), model-based RL learns a dynamics model of the world, in principle enabling transfer to arbitrary reward functions through planning. However, autoregressive model rollouts suffer from compounding error, making model-based RL ineffective for long-horizon problems. Successor features offer an alternative by modeling a policy's long-term state occupancy, reducing policy evaluation under new rewards to linear regression. Yet, policy optimization with successor features can be challenging. This work proposes a novel class of models, i.e., Distributional Successor Features for Zero-Shot Policy Optimization (DiSPOs), that learn a distribution of successor features of a stationary dataset's behavior policy, along with a policy that acts to realize different successor features within the dataset. By directly modeling long-term outcomes in the dataset, DiSPOs avoid compounding error while enabling a simple scheme for zero-shot policy optimization across reward functions. We present a practical instantiation of DiSPOs using diffusion models and show their efficacy as a new class of transferable models, both theoretically and empirically across various simulated robotics problems.

Method

Learning Distributional Successor Features

Learning DiSPOs.

Distributional Successor Features (DiSPOs) consist of two components:

  1. A distribution of all possible outcomes in the dataset, represented by discounted sums of state features along trajectories (successor features).
  2. A readout policy that generates an action to achieve a particular outcome.

Modeling outcomes as successor features enables the quick evaluation of outcomes under arbitrary rewards, while modeling the distribution of all possible outcomes enables the extraction of optimal policies. Hence, DiSPOs retain the generality of model-based RL while avoiding compounding error.
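To make this structure concrete, below is a minimal PyTorch sketch of the two components. The class names (OutcomeModel, ReadoutPolicy) are illustrative rather than the paper's released code, and a plain Gaussian MLP stands in for the diffusion model used in the paper; the outcome psi is the discounted sum of state features phi(s) along a trajectory.

import torch
import torch.nn as nn

class OutcomeModel(nn.Module):
    # Models p(psi | s): the distribution of successor features (outcomes)
    # achievable from state s under the dataset's behavior policy.
    def __init__(self, state_dim, feat_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * feat_dim),  # mean and log-std of psi
        )

    def forward(self, s):
        mean, log_std = self.net(s).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.exp())

class ReadoutPolicy(nn.Module):
    # Models pi(a | s, psi): the action that realizes a chosen outcome psi.
    def __init__(self, state_dim, feat_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, s, psi):
        return self.net(torch.cat([s, psi], dim=-1))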

Zero-Shot Policy Optimization via Distributional Successor Features


DiSPOs enable zero-shot policy optimization for arbitrary rewards with no further policy training at test time. Assuming rewards depend linearly on the cumulants (state features), transferring to a downstream task reduces to a linear regression for the reward weights and a simple optimization over possible outcomes for the best achievable one, which is then passed to the readout policy to generate an action.
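As an illustration of this test-time recipe (reusing the hypothetical OutcomeModel and ReadoutPolicy sketched above, with phi a fixed state featurizer), a minimal version of the procedure regresses rewards onto cumulants and then, among outcomes sampled from the dataset distribution, realizes the one with the highest predicted return.

import torch

def fit_reward_weights(feats, rewards):
    # Linear regression r ~ w^T phi(s) via least squares on labeled samples.
    w, *_ = torch.linalg.lstsq(feats, rewards.unsqueeze(-1))
    return w.squeeze(-1)

@torch.no_grad()
def act(s, w, outcome_model, readout_policy, num_samples=64):
    # Sample candidate outcomes psi ~ p(psi | s), score each by its
    # predicted return w^T psi, and pass the best one to the readout policy.
    s_batch = s.unsqueeze(0).repeat(num_samples, 1)
    psi = outcome_model(s_batch).sample()
    returns = psi @ w
    best = psi[returns.argmax()]
    return readout_policy(s.unsqueeze(0), best.unsqueeze(0)).squeeze(0)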

Experiments

Multitask Transfer

We evaluate DiSPOs' ability to transfer to challenging downstream tasks on the D4RL benchmark. On the hardest tasks, DiSPOs transfer more effectively than model-based RL, successor feature, and goal-conditioned baselines with misspecified goal distributions.

D4RL experiments.

To demonstrate DiSPOs' broad transferability, we plot the normalized returns for reaching various goals in antmaze, where each tile corresponds to the task of navigating the robot to reach that particular tile. DiSPOs successfully transfer across a majority of tasks, whereas model-based RL struggles on longer-horizon tasks.

D4RL transfer experiments.

Transferring to Arbitrary Rewards

We show that DiSPOs can transfer to arbitrary rewards beyond goal-reaching in a preference antmaze environment, where the agent must take a particular path to the goal according to a human preference (specified as a reward function). DiSPOs and model-based RL complete the task in accordance with the preference, whereas goal-conditioned RL baselines do not conform to it.

Preference antmaze experiments.

We further demonstrate DiSPOs' ability to transfer to arbitrary rewards by tasking the agent with tracking various trajectories, denoted by the colored cells. All of these runs share the same outcome model and readout policy, differing only in the reward regression weights.

Trajectory Stitching

Since DiSPOs are trained with distributional Bellman backup, they are able to perform "trajectory stitching," i.e. recovering optimal trajectories by combining suboptimal trajectories. We validate DiSPOs' stitching capability on the roboverse benchmark, where each task consists of two subtasks, but the dataset only contains trajectories for each individual subtask. DiSPOs can complete the tasks by stitching subtrajectories, whereas Monte-Carlo style baselines cannot.
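For intuition, here is a hedged sketch of the bootstrapped target behind this backup, again using the illustrative OutcomeModel and featurizer phi from the Method section: the regression target for an outcome at state s uses a sample from a target model at the next state rather than a full Monte-Carlo rollout, which is what allows partial trajectories to be recombined.

import torch

@torch.no_grad()
def outcome_target(s, s_next, phi, target_outcome_model, gamma=0.99):
    # Bootstrapped target: psi(s) = phi(s) + gamma * psi',
    # with psi' ~ p_target(. | s_next).
    # Because psi' may come from a different trajectory passing through
    # s_next, training on these targets stitches suboptimal trajectories.
    psi_next = target_outcome_model(s_next).sample()
    return phi(s) + gamma * psi_next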

Roboverse experiments.

BibTeX

@article{zhu2024dispo,
    author    = {Zhu, Chuning and Wang, Xinqi and Han, Tyler and Du, Simon Shaolei and Gupta, Abhishek},
    title     = {Distributional Successor Features Enable Zero-Shot Policy Optimization},
    journal   = {arXiv preprint},
    year      = {2024},
}