RFS

Reinforcement Learning with Residual Flow Steering for Dexterous Manipulation

Abstract

We propose an efficient reinforcement learning (RL) framework for fast adaptation of pretrained generative policies. Our method, residual flow steering (RFS), quickly adapts a pretrained flow-matching model by jointly optimizing a steering policy that selects both a latent noise input and a residual action. This enables local exploration (through residual actions) and global exploration (through latent noise), yielding data-efficient adaptation. We demonstrate that this technique is effective for dexterous manipulation, serving both as a tool to pretrain behaviors in simulation and to efficiently finetune them in the real world.
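A compact formalization, sketched under standard flow-matching assumptions (the base policy decodes an action by integrating a learned velocity field $v_\theta$ starting from the latent; the symbols follow the pipeline overview below):

$$
\begin{aligned}
(w_0, a_r) &\sim \pi_{\text{RFS}}(\cdot \mid s), \\
a_b &= w_0 + \int_0^1 v_\theta(w_t, t \mid s)\, dt \qquad \text{(base action decoded by } \pi_{\text{FM}} \text{ from } w_0\text{)}, \\
a &= a_b + a_r .
\end{aligned}
$$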

Motivation
Approach
RFS Pipeline
Overview of Residual Flow Steering (RFS). Given a state $s$, the RFS policy $\pi_{\text{RFS}}$ outputs a latent flow variable $w_0$ and a residual action $a_r$. The latent $w_0$ steers the pretrained base policy $\pi_{\text{FM}}$, which decodes it into a base action $a_b$, and the executed action is $a_b + a_r$. RFS thus combines global mode shifting with fine-grained residual correction, allowing the policy to expand beyond the demonstration data manifold.
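A minimal sketch of this action composition in code, assuming Euler integration of the base policy's velocity field; `rfs_policy`, `velocity_field`, and `num_steps` are illustrative placeholders rather than the released implementation:

```python
import torch

def rfs_action(state, rfs_policy, velocity_field, num_steps=10):
    """Compose the executed action: latent w_0 and residual a_r come from the
    RFS policy, the base action a_b is decoded by the frozen flow-matching policy."""
    # The RFS policy proposes a latent flow variable and a residual action.
    w0, a_r = rfs_policy(state)

    # Decode the base action by Euler-integrating the (frozen) velocity field
    # of the flow-matching base policy from t = 0 to t = 1, starting at w_0.
    a, dt = w0, 1.0 / num_steps
    for k in range(num_steps):
        t = torch.full((a.shape[0],), k * dt, device=a.device)
        a = a + dt * velocity_field(a, t, state)
    a_b = a

    # Final executed action: base action plus the fine-grained residual correction.
    return a_b + a_r
```

Because the base policy stays frozen, RL only has to explore over $(w_0, a_r)$: the latent shifts which mode of the pretrained distribution gets decoded, while the residual makes small local corrections around it.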
Sim-to-Real Pipeline
Overview of our sim-to-real pipeline. (1) VR teleoperation is used to collect demonstrations across multiple manipulation tasks and train task-specific flow-matching base policies. (2) In simulation, the RFS policy $\pi_{\text{RFS}}$ is fine-tuned on top of each base policy and distilled into task-specific visuomotor policies to improve sim-to-real transfer. (3) During zero-shot real-world deployment, a human provides corrective actions to recover from execution failures such as unstable grasps and misplacement. (4) These corrected transitions are used for offline fine-tuning of $\pi_{\text{RFS}}$ on a Franka-LEAP Hand system, improving real-world grasping and pick-and-place performance.
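One way to read the distillation in step (2), sketched under the assumption of simple behavior cloning onto teacher-labeled simulated rollouts; the environment interface, `teacher_action`, and `student` network are hypothetical placeholders, not the paper's implementation:

```python
import torch
import torch.nn as nn

def distill_visuomotor(env, teacher_action, student, episodes=100, epochs=10, lr=3e-4):
    """Distill a state-based RFS teacher into an image-conditioned student policy
    by supervised regression on simulated rollouts."""
    # Roll out the teacher in simulation and record (image, teacher action) pairs.
    dataset = []
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            action = teacher_action(obs["state"])      # label from the fine-tuned RFS policy
            dataset.append((obs["image"], action))     # image tensor paired with teacher action
            obs, _, done, _ = env.step(action)

    # Behavior-clone the student on the collected rollouts.
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for image, action in dataset:
            loss = nn.functional.mse_loss(student(image.unsqueeze(0)), action.unsqueeze(0))
            opt.zero_grad(); loss.backward(); opt.step()
```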
🎥 Simulation Rollout Results
Rollout videos of the RFS policy across simulation tasks: Overall, Grasping, Pick & Place, Pour, Stack, Packing, and Push-to-Grasp.
| Category | Method | Success Rate (mean ± std) |
|---|---|---|
| Base Policy | Flow Matching | 0.251 ± 0.151 |
| Diffusion / Flow RL Finetuning | DPPO | 0.178 ± 0.183 |
| | ReinFlow | 0.409 ± 0.168 |
| Offline-to-Online RL | IQL | 0.488 ± 0.184 |
| | AWAC | 0.355 ± 0.299 |
| | Flow Q-Learning | 0.153 ± 0.202 |
| RL with Demonstrations | RLPD | 0.343 ± 0.327 |
| | IBRL | 0.199 ± 0.200 |
| Residual RL (State-of-the-Art) | Policy Decorator | 0.286 ± 0.194 |
| | ResiP | 0.433 ± 0.203 |
| Strong Baselines | DSRL | 0.483 ± 0.224 |
| Ours | RFS (Ours) | 0.861 ± 0.083 |
📚 Baseline Methods & References
  1. Diffusion Policy Policy Optimization (DPPO)
    Allen Z. Ren, Justin Lidard, Lars L. Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion Policy Policy Optimization. ICLR 2025.

  2. ReinFlow
    Tonghe Zhang, Chao Yu, Sichang Su, and Yu Wang. ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning. NeurIPS 2025.

  3. IQL
    Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline Reinforcement Learning with Implicit Q-Learning. ICLR 2022.

  4. AWAC
    Ashvin Nair, Abhishek Gupta, Murtaza Dalal, and Sergey Levine. AWAC: Accelerating Online Reinforcement Learning with Offline Datasets. arXiv 2021.

  5. Flow Q-Learning (FQL)
    Seohong Park, Qiyang Li, and Sergey Levine. Flow Q-Learning. ICML 2025.

  6. RLPD
    Philip J. Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient Online Reinforcement Learning with Offline Data. ICML 2023.

  7. IBRL
    Hengyuan Hu, Suvir Mirchandani, and Dorsa Sadigh. Imitation Bootstrapped Reinforcement Learning. RSS 2024.

  8. Policy Decorator
    Xiu Yuan, Tongzhou Mu, Stone Tao, Yunhao Fang, Mengke Zhang, and Hao Su. Policy Decorator: Model-Agnostic Online Refinement for Large Policy Models. ICLR 2025.

  9. ResiP
    Lars Ankile, Anthony Simeonov, Idan Shenfeld, Marcel Torne, and Pulkit Agrawal. From Imitation to Refinement – Residual RL for Precise Assembly. ICRA 2025.

  10. DSRL
    Andrew Wagenmaker, Mitsuhiko Nakamoto, Yunchu Zhang, Seohong Park, Waleed Yagoub, Anusha Nagabandi, Abhishek Gupta, and Sergey Levine. Steering Your Diffusion Policy with Latent Space Reinforcement Learning. CoRL 2025.
🤖 Real Robot Results
Rollout videos for each task: 🦾 RFS Rollout and 💡 Zero-shot Sim2Real.

Real-world success rates (%, mean ± std) on the 📦 Pick & Place and 🖐️ Grasp tasks:

| Method | Pick & Place | Grasp |
|---|---|---|
| RFS (Ours) | 90.0 ± 0.0 | 70.0 ± 9.0 |
| DSRL | 80.0 ± 0.0 | 60.0 ± 14.8 |
| Residual RL | 50.0 ± 0.0 | 36.0 ± 4.7 |
| BC | 40.0 ± 0.0 | 23.0 ± 9.0 |
| Co-training | 60.0 ± 0.0 | 46.0 ± 12.0 |
| Zero-shot Sim2Real | 50.0 ± 0.0 | 40.0 ± 9.0 |
🌟 Last but not least
BibTeX
      

@misc{su2026rfsreinforcementlearningresidual,
  title={RFS: Reinforcement Learning with Residual Flow Steering for Dexterous Manipulation},
  author={Entong Su and Tyler Westenbroek and Anusha Nagabandi and Abhishek Gupta},
  year={2026},
  eprint={2602.01789},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2602.01789}
}