TMRL

Diffusion Timestep-Modulated Pre-training Enables Exploration for Efficient Policy Fine-tuning

Matthew M. Hong¹ · Jesse Zhang¹ · Anusha Nagabandi² · Abhishek Gupta¹
¹University of Washington · ²Amazon FAR
Robotics: Science and Systems (RSS) 2026

TL;DR

A unified framework bridging behavior-cloning pre-training and reinforcement-learning fine-tuning.

Our method

Context-Smoothed Pre-training (CSP) applies a diffusion-style forward-noising schedule to policy inputs, producing a continuous spectrum from precise imitation to broad coverage. Timestep-Modulated RL (TMRL) then trains a high-level policy πHL(z, σ | s) to select σ per action chunk.

Results

TMRL outperforms baselines in simulation and fine-tunes π0 to near-perfect success on three real-world tasks: under an hour of training on WidowX, under four hours on Franka.

Motivation

The need for action distribution interpolation.

Robots trained by imitation copy demonstrations narrowly — in sparsely covered contexts the conditional support collapses to near-zero probability on optimal actions, so online rollouts yield no reward signal and RL fine-tuning stalls. We propose injecting forward-diffusion noise into policy inputs during pre-training, so the policy learns the full spectrum of distributions between the conditional p(a | c) and marginal p(a).

On a 2D pointmaze from OGBench, we train two policies — one approximating p(a | c), the other p(a) — where c is the goal position. Both train on pointmaze-large and evaluate on the larger pointmaze-giant environment, with an unseen goal location (agent in yellow, goal in pink).

Panels: p(a | c) (BC, left) · p(a) (CSP, σ = T, right)

On the left, the conditional p(a | c) collapses and drifts the agent away from the goal. On the right, the marginal p(a) covers broadly enough to reach it, but assigns only low likelihood to goal-reaching behaviors.

This motivates a policy that can interpolate between the two — and a mechanism to learn where on that interpolation to operate.

Method

Context smoothing & timestep modulation.

Figure: CSP pre-training and TMRL fine-tuning. Left, CSP (pre-training): one policy pθ is trained across all σ by corrupting the context c with qσ, the forward-diffusion noising kernel; precise imitation at σ = 0, broad coverage at σ = T. Right, TMRL (fine-tuning): the high-level policy πHL(z, σ | s) learns to modulate σ per action chunk to maximize reward.
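To make the pre-training loop concrete, here is a minimal sketch of one CSP update, assuming a variance-preserving Gaussian kernel for qσ and a generic likelihood-based BC policy; `policy.log_prob`, the batch layout, and the uniform σ sampling are illustrative assumptions, not the paper's code.

import torch

def csp_training_step(policy, batch, optimizer, T=1.0):
    """One CSP pre-training update (illustrative sketch).

    Assumes batch["context"] is the context c with shape (B, D),
    batch["state"] is proprioception, batch["actions"] is the action
    chunk, and q_sigma is a variance-preserving Gaussian kernel.
    """
    c, state, actions = batch["context"], batch["state"], batch["actions"]

    # Sample a noise level per example so a single policy is trained
    # across the whole conditional <-> marginal spectrum.
    sigma = torch.rand(c.shape[0], 1, device=c.device) * T
    s = sigma / T

    # Forward-noise the context: sigma = 0 leaves c intact (precise
    # imitation); sigma = T replaces it with pure noise (broad coverage).
    c_noised = torch.sqrt(1.0 - s**2) * c + s * torch.randn_like(c)

    # Standard BC loss, conditioned on the corrupted context and on
    # sigma itself so the policy knows how much to trust c.
    loss = -policy.log_prob(actions, context=c_noised, state=state,
                            sigma=sigma).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()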

Diffusion noise on pointcloud contexts

On the dexterous-grasping suite, the context c is the object pointcloud. As σ rises, qσ progressively corrupts c: the pointcloud slowly loses its original structure, aliasing the out-of-distribution object with nearby in-distribution pointclouds seen during training. The policy can then borrow coherent grasps from related training contexts rather than failing on an unfamiliar shape.
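Concretely, a minimal numpy sketch of the corruption, assuming qσ applies the same variance-preserving Gaussian kernel i.i.d. to each point's xyz coordinates (an illustrative choice; the paper's exact kernel and units may differ):

import numpy as np

def noise_pointcloud(points, sigma, T=1.0, rng=None):
    """Corrupt an (N, 3) pointcloud context with q_sigma.

    sigma = 0 returns the clean cloud; as sigma -> T the object's
    structure washes out, aliasing an OOD shape toward generic
    in-distribution shapes. Kernel choice is an illustrative assumption.
    """
    rng = rng if rng is not None else np.random.default_rng()
    s = sigma / T
    return np.sqrt(1.0 - s**2) * points + s * rng.standard_normal(points.shape)

# The figure's sweep below, on a placeholder cloud: progressively
# heavier corruption at sigma in {0, 0.001, 0.01, 0.1}.
clouds = {sig: noise_pointcloud(np.zeros((1024, 3)), sig)
          for sig in (0.0, 0.001, 0.01, 0.1)}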

Figure: (1) a rollout frame; noising c with qσ yields (2) the context-smoothed pointcloud, shown at σ = 0 (clean), σ = 0.001, σ = 0.01, and σ = 0.1.

Top: TMRL on a held-out drill (context c = pointcloud). Bottom: qσ removes structure as σ rises — aliasing the OOD object back to in-distribution shapes the policy has seen during training.

σ Slider

The conditional ↔ marginal spectrum.

CSP trains one policy across all σ by corrupting the context c (here, the cube goal position) with the noising kernel qσ(c̃ | c). The endpoints anchor the spectrum: σ = 0 recovers p(a | c), σ = T recovers p(a).
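In symbols, here is why the endpoints anchor the spectrum, written for a variance-preserving Gaussian kernel (an illustrative assumption; the paper's exact schedule may differ):

\begin{aligned}
q_\sigma(\tilde{c} \mid c) &= \mathcal{N}\!\Big(\tilde{c};\ \sqrt{1-(\sigma/T)^2}\,c,\ (\sigma/T)^2 I\Big),\\
\sigma = 0:&\quad \tilde{c} = c \;\Rightarrow\; p_\theta(a \mid \tilde{c}, \sigma) \approx p(a \mid c),\\
\sigma = T:&\quad \tilde{c} \sim \mathcal{N}(0, I) \text{ independent of } c
  \;\Rightarrow\; p_\theta(a \mid \tilde{c}, \sigma) \approx \int p(a \mid c)\, p(c)\, \mathrm{d}c = p(a).
\end{aligned}

At σ = T the noised context carries no information about c, so the maximum-likelihood fit averages the demonstrations over all contexts, which is exactly the marginal.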

Slider: σ = 0 (conditional) · σ = T/4 · σ = T/2 · σ = 3T/4 · σ = T (marginal). Moving the slider from σ = 0 toward σ = T shows the policy's rollouts spreading as the context (goal position) becomes more corrupted.

Simulation Results

Coverage and RL sample efficiency.

Action coverage before RL: Success@K

Success@K — the fraction of out-of-distribution states where at least one of K base-policy rollouts succeeds. On two tasks, CSP outperforms PostBC and BC for all K, with the gap widest on cube-single where both baselines remain at zero.
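As a sanity check on the metric, a minimal estimator, assuming a boolean success matrix with one row per held-out out-of-distribution state and one column per base-policy rollout (names illustrative):

import numpy as np

def success_at_k(successes: np.ndarray, k: int) -> float:
    """Success@K from a (num_states, num_rollouts) boolean matrix.

    successes[i, j] is True iff rollout j from OOD state i succeeded.
    Returns the fraction of states where at least one of the first K
    rollouts succeeds.
    """
    return float(successes[:, :k].any(axis=1).mean())

# A broad-coverage policy (CSP) sees this rise with K; a collapsed
# BC policy with near-zero support on optimal actions stays flat.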

Figure: CSP widens action coverage before RL. On cube-single, BC and PostBC stay near zero while CSP rises with K.

RL sample efficiency

We compare against several state-of-the-art baselines. TMRL outperforms or matches all baselines across four simulation tasks.

Figure: TMRL outperforms or matches baselines on four simulation tasks, approaching 100% success on all four; only TMRL reaches non-trivial success on libero-90.

Real-World RL

Context-Smoothed π0 in the real world.

< 1 hr · TMRL reaches near-perfect success in under an hour of real-robot fine-tuning on both WidowX tasks (sausage-in-pot, shrimp-in-drawer); DSRL never converges on either.
Applying CSP to a VLA: noising the VLM embedding context with qσ lets π0 sample across the conditional → marginal spectrum.

We fine-tune a context-smoothed π0 on BridgeData-v2 (WidowX 250) and DROID (Franka).
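A minimal sketch of how the smoothing could attach to the VLA, assuming the context is the VLM embedding that conditions π0's action head; `smooth_vlm_context` and the kernel form are illustrative assumptions, not π0's actual interface:

import torch

@torch.no_grad()
def smooth_vlm_context(vlm_embed: torch.Tensor, sigma: float,
                       T: float = 1.0) -> torch.Tensor:
    """Noise the VLM embedding context with q_sigma before it reaches
    the action head, so a context-smoothed pi0 can sample anywhere on
    the conditional -> marginal spectrum at rollout time.

    vlm_embed: (B, L, D) token embeddings; the kernel is an assumed
    variance-preserving Gaussian, matching the pre-training sketch.
    """
    s = sigma / T
    return (1.0 - s**2) ** 0.5 * vlm_embed + s * torch.randn_like(vlm_embed)

At deployment σ is not fixed: the high-level policy chooses it per action chunk, as the Button-Press section below shows.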

Figure: TMRL reaches near-perfect success on real-robot tasks in under an hour. Base π0 stalls: its action support is too narrow for any rollout to succeed. CSP widens that support, and TMRL leverages it to reach near-perfect success on all three tasks.

Per-task rollouts

Base π0 versus context-smoothed π0, side by side. The learning curve below traces the full RL training run.

Base π0
Context-Smoothed π0 (ours)

Figure: real-world RL learning curve for the selected task (e.g., Button-Press).

TMRL learns to be precise when it matters.

πHL treats the diffusion timestep σ as an adaptive coverage control, choosing it per action chunk. Below, the converged TMRL policy on shrimp-in-drawer keeps σ low through the precision-critical pick and raises it during reaching and placing.

“pick up the shrimp and put it into the white drawer”

Figure: rollout frames with the per-chunk σ readout (e.g., action chunk 1/10, σ = 0.01).
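Putting the hierarchy together, a minimal rollout sketch, assuming πHL emits a latent z and a timestep σ once per chunk and the context-smoothed low-level policy decodes an action chunk under that σ; `high_level.act`, `low_level.decode_chunk`, and the gym-style env interface are all illustrative assumptions:

def tmrl_rollout(env, high_level, low_level, num_chunks=10):
    """Roll out the TMRL hierarchy (illustrative interfaces).

    pi_HL picks (z, sigma) once per action chunk: low sigma through
    precision-critical phases (the pick), higher sigma where broad
    coverage helps (reaching, placing).
    """
    obs = env.reset()
    total_reward = 0.0
    for _ in range(num_chunks):
        z, sigma = high_level.act(obs)  # per-chunk coverage control
        # Decode one action chunk conditioned on the q_sigma-noised
        # context and on sigma itself, then execute it step by step.
        for action in low_level.decode_chunk(obs, latent=z, sigma=sigma):
            obs, reward, done, info = env.step(action)
            total_reward += reward
            if done:
                return total_reward
    return total_reward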

BibTeX

@inproceedings{hong2026tmrl,
  title     = {TMRL: Diffusion Timestep-Modulated Pre-training Enables Exploration for Efficient Policy Fine-tuning},
  author    = {Hong, Matthew M. and Zhang, Jesse and Nagabandi, Anusha and Gupta, Abhishek},
  booktitle = {Robotics: Science and Systems (RSS)},
  year      = {2026}
}