RFS

Reinforcement Learning with Residual Flow Steering for Dexterous Manipulation

Abstract

We propose an efficient reinforcement learning (RL) framework for fast adaptation of pretrained generative policies. Our method, residual flow steering (RFS), quickly adapts a pretrained flow-matching model by jointly optimizing a steering policy that selects both a latent noise input and a residual action. This enables local exploration (through residual actions) and global exploration (through latent noise), yielding data-efficient adaptation. We demonstrate that this technique is effective for dexterous manipulation, serving both as a tool to pretrain behaviors in simulation and to efficiently finetune them in the real world.
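A compact formalization, sketched under standard flow-matching assumptions (the base policy decodes an action by integrating a learned velocity field $v_\theta$ starting from the latent; the symbols follow the pipeline overview below):

$$
\begin{aligned}
(w_0, a_r) &\sim \pi_{\text{RFS}}(\cdot \mid s), \\
a_b &= w_0 + \int_0^1 v_\theta(w_t, t \mid s)\, dt \qquad \text{(base action decoded by } \pi_{\text{FM}} \text{ from } w_0\text{)}, \\
a &= a_b + a_r .
\end{aligned}
$$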

Motivation
Approach
RFS Pipeline
Overview of Residual Flow Steering (RFS). Given a state $s$, the RFS policy $\pi_{\text{RFS}}$ outputs a latent flow variable $w_0$ and a residual action $a_r$. The latent $w_0$ steers the pretrained base policy $\pi_{\text{FM}}$, which decodes it into a base action $a_b$, and the executed action is $a_b + a_r$. RFS thus combines global mode shifting with fine-grained residual correction, allowing the policy to expand beyond the demonstration data manifold.
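A minimal sketch of this action composition in code, assuming Euler integration of the base policy's velocity field; `rfs_policy`, `velocity_field`, and `num_steps` are illustrative placeholders rather than the released implementation:

```python
import torch

def rfs_action(state, rfs_policy, velocity_field, num_steps=10):
    """Compose the executed action: latent w_0 and residual a_r come from the
    RFS policy, the base action a_b is decoded by the frozen flow-matching policy."""
    # The RFS policy proposes a latent flow variable and a residual action.
    w0, a_r = rfs_policy(state)

    # Decode the base action by Euler-integrating the (frozen) velocity field
    # of the flow-matching base policy from t = 0 to t = 1, starting at w_0.
    a, dt = w0, 1.0 / num_steps
    for k in range(num_steps):
        t = torch.full((a.shape[0],), k * dt, device=a.device)
        a = a + dt * velocity_field(a, t, state)
    a_b = a

    # Final executed action: base action plus the fine-grained residual correction.
    return a_b + a_r
```

Because the base policy stays frozen, RL only has to explore over $(w_0, a_r)$: the latent shifts which mode of the pretrained distribution gets decoded, while the residual makes small local corrections around it.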
Sim-to-Real Pipeline
Overview of our sim-to-real pipeline. (1) VR teleoperation is used to collect demonstrations across multiple manipulation tasks and train task-specific flow-matching base policies. (2) In simulation, the RFS policy $\pi_{\text{RFS}}$ is fine-tuned on top of each base policy and distilled into task-specific visuomotor policies to improve sim-to-real transfer. (3) During zero-shot real-world deployment, a human provides corrective actions to recover from execution failures such as unstable grasps and misplacement. (4) These corrected transitions are used for offline fine-tuning of $\pi_{\text{RFS}}$ on a Franka-LEAP Hand system, improving real-world grasping and pick-and-place performance.
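One way to read the distillation in step (2), sketched under the assumption of simple behavior cloning onto teacher-labeled simulated rollouts; the environment interface, `teacher_action`, and `student` network are hypothetical placeholders, not the paper's implementation:

```python
import torch
import torch.nn as nn

def distill_visuomotor(env, teacher_action, student, episodes=100, epochs=10, lr=3e-4):
    """Distill a state-based RFS teacher into an image-conditioned student policy
    by supervised regression on simulated rollouts."""
    # Roll out the teacher in simulation and record (image, teacher action) pairs.
    dataset = []
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            action = teacher_action(obs["state"])      # label from the fine-tuned RFS policy
            dataset.append((obs["image"], action))     # image tensor paired with teacher action
            obs, _, done, _ = env.step(action)

    # Behavior-clone the student on the collected rollouts.
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for image, action in dataset:
            loss = nn.functional.mse_loss(student(image.unsqueeze(0)), action.unsqueeze(0))
            opt.zero_grad(); loss.backward(); opt.step()
```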
🎥 Simulation Rollout Results
Rollout videos of the RFS policy across simulation tasks: Overall, Grasping, Pick & Place, Pour, Stack, Packing, and Push-to-Grasp.
| Category | Method | Success Rate (mean ± std) |
|---|---|---|
| Base Policy | Flow Matching | 0.251 ± 0.151 |
| Diffusion / Flow RL Finetuning | DPPO | 0.178 ± 0.183 |
| | ReinFlow | 0.409 ± 0.168 |
| Offline-to-Online RL | IQL | 0.488 ± 0.184 |
| | AWAC | 0.355 ± 0.299 |
| | Flow Q-Learning | 0.153 ± 0.202 |
| RL with Demonstrations | RLPD | 0.343 ± 0.327 |
| | IBRL | 0.199 ± 0.200 |
| Residual RL (State-of-the-Art) | Policy Decorator | 0.286 ± 0.194 |
| | ResiP | 0.433 ± 0.203 |
| Strong Baselines | DSRL | 0.483 ± 0.224 |
| Ours | RFS (Ours) | 0.861 ± 0.083 |
📚 Baseline Methods & References
  1. Diffusion Policy Policy Optimization (DPPO)
    Allen Z. Ren, Justin Lidard, Lars L. Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion Policy Policy Optimization. ICLR 2025.

  2. ReinFlow
    Tonghe Zhang, Chao Yu, Sichang Su, and Yu Wang. ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning. NeurIPS 2025.

  3. IQL
    Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline Reinforcement Learning with Implicit Q-Learning. ICLR 2022.

  4. AWAC
    Ashvin Nair, Abhishek Gupta, Murtaza Dalal, and Sergey Levine. AWAC: Accelerating Online Reinforcement Learning with Offline Datasets. arXiv 2021.

  5. Flow Q-Learning (FQL)
    Seohong Park, Qiyang Li, and Sergey Levine. Flow Q-Learning. ICML 2025.

  6. RLPD
    Philip J. Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient Online Reinforcement Learning with Offline Data. ICML 2023.

  7. IBRL
    Hengyuan Hu, Suvir Mirchandani, and Dorsa Sadigh. Imitation Bootstrapped Reinforcement Learning. RSS 2024.

  8. Policy Decorator
    Xiu Yuan, Tongzhou Mu, Stone Tao, Yunhao Fang, Mengke Zhang, and Hao Su. Policy Decorator: Model-Agnostic Online Refinement for Large Policy Models. ICLR 2025.

  9. ResiP
    Lars Ankile, Anthony Simeonov, Idan Shenfeld, Marcel Torne, and Pulkit Agrawal. From Imitation to Refinement – Residual RL for Precise Assembly. ICRA 2025.

  10. DSRL
    Andrew Wagenmaker, Mitsuhiko Nakamoto, Yunchu Zhang, Seohong Park, Waleed Yagoub, Anusha Nagabandi, Abhishek Gupta, and Sergey Levine. Steering Your Diffusion Policy with Latent Space Reinforcement Learning. CoRL 2025.
🤖 Real Robot Results
Rollout videos for each task: 🦾 RFS Rollout and 💡 Zero-shot Sim2Real.

Real-world success rates (%, mean ± std) on the 📦 Pick & Place and 🖐️ Grasp tasks:

| Method | Pick & Place | Grasp |
|---|---|---|
| RFS (Ours) | 90.0 ± 0.0 | 70.0 ± 9.0 |
| DSRL | 80.0 ± 0.0 | 60.0 ± 14.8 |
| Residual RL | 50.0 ± 0.0 | 36.0 ± 4.7 |
| BC | 40.0 ± 0.0 | 23.0 ± 9.0 |
| Co-training | 60.0 ± 0.0 | 46.0 ± 12.0 |
| Zero-shot Sim2Real | 50.0 ± 0.0 | 40.0 ± 9.0 |
🌟 Last but not least
BibTeX
      

@misc{su2026rfsreinforcementlearningresidual,
  title={RFS: Reinforcement Learning with Residual Flow Steering for Dexterous Manipulation},
  author={Entong Su and Tyler Westenbroek and Anusha Nagabandi and Abhishek Gupta},
  year={2026},
  eprint={2602.01789},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2602.01789}
}