Reinforcement Learning from Human Feedback (RLHF) is a powerful paradigm for aligning foundation models to human values and preferences. However, current RLHF techniques cannot account for the naturally occurring differences in individual human preferences across a diverse population. When these differences arise, traditional RLHF frameworks simply average over them, leading to inaccurate rewards and poor performance for individual subgroups. To address the need for pluralistic alignment, we develop a class of multimodal RLHF methods.
Our proposed techniques are based on a latent variable formulation: inferring a novel user-specific latent and learning reward models and policies conditioned on this latent, without additional user-specific data. While conceptually simple, we show that in practice, this reward modeling requires careful algorithmic considerations around model architecture and reward scaling. To empirically validate our proposed technique, we first show that it can provide a way to combat under-specification in simulated control problems, inferring and optimizing user-specific reward functions. Next, we conduct experiments on pluralistic language datasets representing diverse user preferences and demonstrate improved reward function accuracy. We additionally show the benefits of this probabilistic framework in terms of measuring uncertainty, and actively learning user preferences. This work enables learning from diverse populations of users with divergent preferences, an important challenge that naturally occurs in problems from robot learning to foundation model alignment.
The standard Bradley-Terry-Luce (BTL) model assumes that all annotators share a single reward function, which is unrealistic for a diverse population of annotators. To model pluralistic preferences, we frame multi-modal reward learning as a latent variable problem, where the latent variable represents the hidden context affecting the underlying reward function. The preference likelihood can then be expressed with a latent-conditional BTL model.
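As a concrete sketch, written for trajectory segments $\sigma^A, \sigma^B$ with a latent-conditional reward $r_\theta(s, a; z)$ (this notation is ours rather than copied from the paper):

$$
P\!\left(\sigma^A \succ \sigma^B \mid z\right) = \frac{\exp\left(\sum_{t} r_\theta(s_t^A, a_t^A; z)\right)}{\exp\left(\sum_{t} r_\theta(s_t^A, a_t^A; z)\right) + \exp\left(\sum_{t} r_\theta(s_t^B, a_t^B; z)\right)}
$$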
We present an evidence lower bound (ELBO) for the variational preference learning (VPL) objective.
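A sketch of this bound in the standard variational form, where $q_\phi$ is the learned encoder over a user's annotated context $\{(\sigma^A_i, \sigma^B_i, y_i)\}$, $p_\theta$ is the latent-conditional preference likelihood above, and $\beta$ weights the KL term (the notation and weighting here are ours and may differ in detail from the paper):

$$
\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q_\phi\left(z \mid \{(\sigma^A_i, \sigma^B_i, y_i)\}\right)}\!\left[\sum_i \log p_\theta\!\left(y_i \mid \sigma^A_i, \sigma^B_i, z\right)\right] - \beta\, D_{\mathrm{KL}}\!\left(q_\phi\!\left(z \mid \{(\sigma^A_i, \sigma^B_i, y_i)\}\right) \,\big\|\, p(z)\right)
$$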
Intuitively, this objective encodes a set of user-provided annotations into a latent distribution using the encoder, and then learns a latent-conditional reward function that best explains the annotated preference data.
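To make the encoder-decoder structure concrete, here is a minimal PyTorch sketch under simplifying assumptions: each comparison is treated as a pair of feature vectors rather than a full trajectory segment, and the names `SetEncoder`, `LatentRewardModel`, and `vpl_loss` are our own illustration, not the authors' code.

```python
# Minimal sketch of a VPL-style latent-variable reward model (illustrative, not the official implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SetEncoder(nn.Module):
    """Encodes a set of annotated preference pairs into a Gaussian q(z | context)."""
    def __init__(self, obs_dim, latent_dim, hidden=128):
        super().__init__()
        # Each context item: features of both segments plus the preference label.
        self.item_net = nn.Sequential(
            nn.Linear(2 * obs_dim + 1, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        self.mean_head = nn.Linear(hidden, latent_dim)
        self.logvar_head = nn.Linear(hidden, latent_dim)

    def forward(self, seg_a, seg_b, labels):
        # seg_a, seg_b: (N, obs_dim); labels: (N, 1), 1.0 if seg_a is preferred.
        items = self.item_net(torch.cat([seg_a, seg_b, labels], dim=-1))
        pooled = items.mean(dim=0)  # permutation-invariant aggregation over the annotation set
        return self.mean_head(pooled), self.logvar_head(pooled)

class LatentRewardModel(nn.Module):
    """Reward r(s; z) conditioned on the inferred user latent."""
    def __init__(self, obs_dim, latent_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, obs, z):
        z = z.expand(obs.shape[0], -1)
        return self.net(torch.cat([obs, z], dim=-1))

def vpl_loss(encoder, reward_model, seg_a, seg_b, labels, beta=0.1):
    mean, logvar = encoder(seg_a, seg_b, labels)
    z = mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)   # reparameterization trick
    logits = reward_model(seg_a, z) - reward_model(seg_b, z)      # latent-conditional BTL logit
    pref_nll = F.binary_cross_entropy_with_logits(logits, labels)
    kl = -0.5 * torch.sum(1 + logvar - mean.pow(2) - logvar.exp())
    return pref_nll + beta * kl
```

The mean-pooling in the encoder keeps it permutation-invariant, so it can condition on a variable number of annotated comparisons from a given user.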
We show that VPL can learn a multi-modal reward function, while the baseline BTL model averages over the different modes arising from diverse users.
We then learn a policy conditioned on the latent variable, trained to maximize the expected return under the inferred reward function. This allows us to personalize the policy to the user's preferences without requiring additional user-specific data.
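Written as an equation, this corresponds to the standard latent-conditioned policy objective (notation is ours, reusing $r_\theta$ and $q_\phi$ from the sketches above):

$$
\pi^* = \arg\max_{\pi} \; \mathbb{E}_{z \sim q_\phi\left(z \mid \{(\sigma^A_i, \sigma^B_i, y_i)\}\right)} \, \mathbb{E}_{\pi(\cdot \mid s, z)}\!\left[\sum_{t} r_\theta\!\left(s_t, a_t; z\right)\right]
$$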
We evaluate VPL's ability to adapt to user preferences in multiple simulated control tasks. We show that VPL is able to infer the user's preference and steer downstream behavior accordingly.
Using the probabilistic nature of VPL, we demonstrate the ability to actively query user preferences. This active query selection procedure can be expressed as an optimization problem that maximizes the mutual information between the query labels and the latent distribution.
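Concretely, in our notation, where $\mathcal{Q}$ is the candidate query set, $y_q$ is the user's label for query $q$, and $H$ denotes entropy, one standard way to write this is:

$$
q^* = \arg\max_{q \in \mathcal{Q}} \; I\!\left(y_q; z\right) = \arg\max_{q \in \mathcal{Q}} \; \Big[ H\!\left(y_q\right) - \mathbb{E}_{z \sim q_\phi(z)}\!\left[ H\!\left(y_q \mid z\right) \right] \Big]
$$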
We show that active learning can significantly reduce the number of queries required to learn the user's preferences.
In the Habitat environment, with ~100 users, we show that VPL scales, inferring and adapting to diverse user preferences.
We scale VPL for pluralistic alignment of LLM-based reward models. We compare the reward modeling performance of our method against the baselines on GPT-2 and Llama2-7b across a synthetic dataset of diverse user preferences ("Pets") and the widely available UltraFeedback dataset with four distinct user groups. Please refer to the paper for more details on the datasets.
In the table below, we see that VPL learns a more accurate reward model across all the datasets, capturing the multi-modality in the language preference data. This indicates that VPL can infer the latent representation z of the user's preferences from a few annotated samples and successfully adapt the reward model. In contrast, the baselines, including the BTL model typically used in widely deployed RLHF pipelines, fail to fit the datasets because they cannot account for divergent preferences.
In this plot, we visualize a t-SNE embedding of the latent distribution z produced by the encoder on a set of annotated prompts and responses from the two users in the dataset. The encoder clusters the users in latent space, allowing the decoder to personalize the reward model to the distinct objectives preferred by the users in each cluster.
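A visualization of this kind can be produced with a few lines of scikit-learn; the snippet below is a hypothetical stand-in (not the authors' plotting code), with `latents` and `user_ids` as placeholders for the encoder outputs and user identities.

```python
# Hypothetical example of embedding per-user latents with t-SNE.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
latents = rng.normal(size=(200, 32))      # stand-in for encoder means, one row per annotated context
user_ids = rng.integers(0, 2, size=200)   # stand-in for which user each context came from

embedded = TSNE(n_components=2, perplexity=30).fit_transform(latents)
# Scatter-plot `embedded` colored by `user_ids` to inspect per-user clustering in latent space.
```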
Overall, we present a novel method for personalizing reinforcement learning from human feedback via variational preference learning. We demonstrate that our method learns a multi-modal reward function capturing the diverse preferences of a user population, adapts to user preferences in simulated control tasks, and scales to large language models. We believe this work takes a step toward learning from diverse populations of users with divergent preferences.
@article{poddar2024vpl,
  author  = {Poddar, Sriyash and Wan, Yanming and Ivison, Hamish and Gupta, Abhishek and Jaques, Natasha},
  title   = {Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning},
  journal = {arXiv preprint},
  year    = {2024},
}