A generalized occupancy model consists of two components: a distributional outcome model that captures the discounted sums of cumulants achievable from a given state, and a readout policy that maps a desired outcome to an action.
Assuming rewards depend linearly on the cumulants, transferring to a downstream task reduces to linear regression (to recover the reward weights) followed by a simple optimization over achievable outcomes for the best possible one. The selected outcome is passed to the readout policy to generate an action.
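The transfer step above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the dataset arrays, the candidate outcomes (which stand in for samples from the learned outcome model), and all variable names are assumptions made for the example.

```python
import numpy as np

# Illustrative offline data: per-state outcome features psi (discounted
# cumulant sums) and the task rewards observed for those states.
rng = np.random.default_rng(0)
psi_data = rng.normal(size=(1000, 8))   # outcomes (summed cumulants)
w_true = rng.normal(size=8)
rewards = psi_data @ w_true             # rewards assumed linear in cumulants

# 1) Linear regression: recover the reward weights w from (psi, r) pairs.
w, *_ = np.linalg.lstsq(psi_data, rewards, rcond=None)

# 2) Simple optimization: among outcomes deemed achievable from the current
#    state (here, illustrative stand-in samples), pick the highest-value one.
candidate_outcomes = rng.normal(size=(64, 8))
best_outcome = candidate_outcomes[np.argmax(candidate_outcomes @ w)]

# 3) The chosen outcome would then condition the readout policy, e.g.
#    action = readout_policy(state, best_outcome)   # not implemented here
```

Because the rewards are exactly linear in the cumulants in this toy setup, the regression recovers the true weights; only the regression weights change across tasks, matching the transfer recipe described above.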
We evaluate GOMs' ability to transfer to challenging downstream tasks on the D4RL benchmark. On the hardest tasks, GOMs outperform model-based RL, successor features, and goal-conditioned baselines with misspecified goal distributions.
To demonstrate GOMs' broad transferability, we plot the normalized returns for reaching various goals in antmaze, where each tile corresponds to the task of navigating the robot to reach that particular tile. GOMs successfully transfer across a majority of tasks, whereas model-based RL struggles on longer-horizon tasks.
We show that GOMs can adapt to arbitrary rewards beyond goal-reaching in an antmaze preference environment, where the agent has to take a particular path to the goal according to a human preference (specified as a reward function). GOMs and model-based RL complete the task according to the human preference, whereas goal-conditioned RL baselines do not conform to the preference.
We further demonstrate GOMs' arbitrary transfer capability by training an agent to track various trajectories as denoted by the colored cells. All these runs share the same outcome model and policy, only differing in the reward regression weights.
Since GOMs are trained with a distributional Bellman backup, they are able to perform "trajectory stitching," i.e., recovering optimal trajectories by combining suboptimal ones. We validate GOMs' stitching capability on the roboverse benchmark, where each task consists of two subtasks but the dataset only contains trajectories for each individual subtask. GOMs complete the tasks by stitching subtrajectories, whereas Monte Carlo-style baselines cannot.
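The mechanism behind stitching can be sketched with the one-step target used by a distributional Bellman backup. This is a hedged illustration, assuming the function and variable names; the key point is that the bootstrap term is sampled from the learned outcome model at the next state, not read off the same trajectory as the transition.

```python
import numpy as np

def td_outcome_target(cumulant, next_outcome_sample, gamma=0.99):
    """One-step distributional Bellman target for an outcome psi:
    psi(s) ~ c(s) + gamma * psi(s'), where psi(s') is sampled from the
    learned outcome distribution at s'. Bootstrapping from the model at s'
    is what permits stitching: the target can combine a transition taken
    from one trajectory with outcomes achieved in another."""
    return cumulant + gamma * next_outcome_sample

# Illustrative values: a cumulant from one trajectory combined with an
# outcome sample that could originate from a different trajectory.
c = np.array([1.0, 0.0])
psi_next = np.array([0.0, 2.0])
target = td_outcome_target(c, psi_next)
```

A Monte Carlo-style estimate, by contrast, sums cumulants along a single logged trajectory end to end, so it can never produce an outcome that no single trajectory in the dataset achieves.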
@article{zhu2024gom,
author = {Zhu, Chuning and Wang, Xinqi and Han, Tyler and Du, Simon Shaolei and Gupta, Abhishek},
title = {Transferable Reinforcement Learning via Generalized Occupancy Models},
journal = {arXiv preprint},
year = {2024},
}