Rapidly Adapting Policies to the Real-World via Simulation-Guided Fine-Tuning

1University of Washington, 2Microsoft Research
*Equal Contribution
ICLR 2025


Learned with less than 5 minutes of real-world fine-tuning data

Abstract

Robot learning requires a considerable amount of data to realize the promise of generalization. However, collecting the volume of high-quality data necessary for generalization entirely in the real world is challenging. Simulation can serve as a source of plentiful data, where techniques such as reinforcement learning can obtain broad coverage over states and actions. However, even high-fidelity physics simulators are fundamentally misspecified approximations of reality, making direct zero-shot transfer challenging, especially in tasks where precise and forceful manipulation is necessary. This makes real-world fine-tuning of policies pretrained in simulation an attractive approach to robot learning. However, exploring real-world dynamics with standard RL fine-tuning techniques is too inefficient for many real-world applications. This paper introduces Simulation-Guided Fine-Tuning (SGFT), a general framework which leverages the structure of the simulator to guide exploration, substantially accelerating adaptation to the real world. We demonstrate our approach across several manipulation tasks in the real world, learning successful policies for problems that are challenging to learn using purely real-world data. We further provide theoretical backing for the paradigm.

Progression Video of Hammering

Method


How do we solve contact-rich manipulation in situations where sim2real transfer fails?

Problem:

  1. Simulation provides extensive data coverage, but misspecified physics prevents zero-shot transfer
  2. RL fine-tuning with tabula rasa exploration is sample-inefficient because the search space grows exponentially with the time horizon

Main Idea:

  1. We demonstrate theoretically that value functions define an ordering of states which is robust to low-level dynamics gaps
  2. Using a simulation-learned value function (Vsim) for potential-based reward shaping provides dense rewards to guide real-world exploration (see the sketch after this list)
  3. SGFT also integrates nicely with Model-Based RL (MBRL) by making short-horizon predictions with a dynamics model and bootstrapping with Vsim to shorten the horizon of the search problem*. This side-steps the core challenge of compounding errors faced by MBRL, allowing us to use MBRL to speed up fine-tuning even more.

*Note: The RL objective is now biased. Theoretical analysis in the paper shows this bias is acceptable.
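
To make point 2 above concrete, here is a minimal sketch of potential-based reward shaping with a simulation-learned value function. The names (`v_sim`, `shaped_reward`) and the exact interface are illustrative assumptions, not the released implementation.

```python
def shaped_reward(v_sim, s, s_next, env_reward, gamma=0.99):
    """Potential-based reward shaping using V_sim as the potential.

    The shaping term gamma * V_sim(s') - V_sim(s) is positive when a
    transition moves toward states that the simulation-learned value
    function rates highly, giving a dense signal for real-world RL.
    """
    return env_reward + gamma * v_sim(s_next) - v_sim(s)

# Example usage (hypothetical): v_sim is a frozen value network trained
# with RL in simulation, queried on real-world transitions (s, s_next).
# r_tilde = shaped_reward(v_sim, s, s_next, r)
```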


SGFT guides future policies to produce trajectories which move in the directions suggested by the simulation-learned value function and short hallucinated rollouts.
The value function provides a strong reward signal for real-world RL fine-tuning, while hallucinated rollouts enable an agent to train on transitions not seen in the dataset.
Hallucinated states with high value estimates (green) are labeled with high reward during RL fine-tuning, and therefore more likely to be explored than states with low value estimates (red).
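
Below is a rough sketch of how hallucinated rollouts can be generated and labeled. The dynamics-model interface (`dyn_model(s, a) -> (s_next, r)`) and all names here are assumptions for illustration, not the exact API of our code.

```python
def hallucinated_transitions(policy, dyn_model, v_sim, s0, horizon=5, gamma=0.99):
    """Roll the current policy out for a few steps inside a learned dynamics
    model, labeling each imagined transition with the shaped reward.

    Imagined states that V_sim rates highly receive larger rewards, so the
    fine-tuning agent is steered toward them in the real world.
    """
    transitions, s = [], s0
    for _ in range(horizon):
        a = policy(s)
        s_next, r = dyn_model(s, a)                      # one-step model prediction
        r_shaped = r + gamma * v_sim(s_next) - v_sim(s)  # same shaping as above
        transitions.append((s, a, r_shaped, s_next))
        s = s_next
    return transitions
```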


SGFT shortens the horizon of the search problem by changing the infinite horizon RL objective to a finite H-step RL objective with a terminal simulation-learned value function.
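
Schematically (our notation here, not copied verbatim from the paper), the fine-tuning objective becomes an H-step return bootstrapped with the simulation-learned value function:

$$ J_H(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{H-1} \gamma^{t}\, r(s_t, a_t) \;+\; \gamma^{H}\, V_{\mathrm{sim}}(s_H)\right] $$

compared with the standard infinite-horizon objective $J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]$.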

Real World Experiments

Standard fine-tuning methods typically have an unlearning phase where the policy gets worse before it gets better. Our method makes consistent, rapid progress during fine-tuning.

Hammering

Simulation-Trained Policy

Pretrained Policy

Fine-tuned Policy

Insertion

Pushing

*For the fine-tuned insertion policy video, we roll in with the pretrained policy to grasp, then switch to the fine-tuned insertion policy.


Above is a comparison of the time to learn each task with our method versus existing baselines that use sim2real transfer, RL fine-tuning, and/or model-based RL. In each case, our method outperforms the baselines in sample efficiency by at least 2x!

BibTeX

@inproceedings{yin2025sgft,
  author    = {Yin, Patrick and Westenbroek, Tyler and Bagaria, Simran and Huang, Kevin and Cheng, Ching-An and Kolobov, Andrey and Gupta, Abhishek},
  title     = {Rapidly Adapting Policies to the Real-World via Simulation-Guided Fine-Tuning},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2025},
}