A unified framework bridging behavior-cloning pre-training and reinforcement-learning fine-tuning.
Our method
Context-Smoothed Pre-training (CSP) applies a diffusion-style forward-noising schedule to policy inputs, producing a continuous spectrum from precise imitation to broad coverage. Timestep-Modulated RL (TMRL) then trains a high-level policy πHL(z, σ | s) to select σ per action chunk.
Results
Our method outperforms baselines in simulation and fine-tunes π0 to near-perfect success on three real-world tasks: in under an hour on WidowX, and under four hours on Franka.
Motivation
The need for action distribution interpolation.
Robots trained by imitation copy demonstrations narrowly — in sparsely covered contexts the conditional support collapses to near-zero probability on optimal actions, so online rollouts yield no reward signal and RL fine-tuning stalls. We propose injecting forward-diffusion noise into policy inputs during pre-training, so the policy learns the full spectrum of distributions between the conditional p(a | c) and marginal p(a).
On a 2D pointmaze from OGBench, we train two policies — one approximating p(a | c), the other p(a) — where c is the goal position. Both train on pointmaze-large and evaluate on the larger pointmaze-giant environment, with an unseen goal location (agent in yellow, goal in pink).
Left: p(a | c) (BC). Right: p(a) (CSP, σ = T).
Left, the conditional p(a | c) collapses and drifts the agent away from the goal. Right, the marginal p(a) covers broadly enough to reach it, but assigns goal-reaching behaviors only low likelihood.
This motivates a policy that can interpolate between the two — and a mechanism to learn where on that interpolation to operate.
Method
Context smoothing & timestep modulation.
CSP pre-training and TMRL fine-tuning. Left — CSP (pre-training): one policy pθ trained across all σ by corrupting context c with qσ, the forward-diffusion noising kernel — precise imitation at σ = 0, broad coverage at σ = T. Right — TMRL (fine-tuning): the high-level policy πHL(z, σ | s) learns to modulate σ per action chunk to maximize reward.
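To make the pre-training recipe concrete, here is a minimal sketch of one CSP step, assuming a variance-preserving Gaussian forward kernel for qσ, a σ-conditioned policy network, and a plain regression BC loss; the names (policy, alpha_bar) and the cosine schedule are illustrative, not the released implementation.

import torch

def alpha_bar(sigma):
    # Illustrative variance-preserving schedule: sigma in [0, 1],
    # alpha_bar(0) = 1 (clean context), alpha_bar(1) near 0 (pure noise).
    return torch.cos(0.5 * torch.pi * sigma) ** 2

def csp_pretrain_step(policy, optimizer, states, contexts, actions):
    # Sample a timestep per example so one policy spans the whole spectrum
    # from conditional (sigma = 0) to marginal (sigma = T, here T = 1).
    sigma = torch.rand(contexts.shape[0], device=contexts.device)
    a_bar = alpha_bar(sigma).view(-1, *([1] * (contexts.dim() - 1)))

    # Forward-noising kernel q_sigma corrupts the context, not the action.
    noise = torch.randn_like(contexts)
    noisy_contexts = a_bar.sqrt() * contexts + (1.0 - a_bar).sqrt() * noise

    # Behavior cloning conditioned on the corrupted context and on sigma.
    pred_actions = policy(states, noisy_contexts, sigma)
    loss = torch.nn.functional.mse_loss(pred_actions, actions)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()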
Diffusion noise on pointcloud contexts
On the dexterous-grasping suite, the context c is the object pointcloud. As σ rises, qσ progressively corrupts c: the pointcloud slowly loses its original structure, aliasing the out-of-distribution object with nearby in-distribution pointclouds seen during training. The policy can then borrow coherent grasps from related training contexts rather than failing on an unfamiliar shape.
Figure. Top: TMRL rollout on a held-out drill (context c = pointcloud). Bottom: the context-smoothed pointcloud at σ = 0 (clean), 0.001, 0.01, and 0.1; qσ removes structure as σ rises, aliasing the OOD object back to in-distribution shapes the policy has seen during training.
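A short sketch of the corruption sweep above, reusing the same illustrative variance-preserving kernel but applied to raw point coordinates; the paper's qσ may instead act on a pointcloud encoding, and the file name below is a placeholder.

import numpy as np

def smooth_pointcloud(points, sigma, seed=0):
    # points: (N, 3) object pointcloud; sigma: diffusion timestep in [0, 1].
    # Per-point Gaussian forward kernel q_sigma (illustrative schedule).
    rng = np.random.default_rng(seed)
    a_bar = np.cos(0.5 * np.pi * sigma) ** 2
    noise = rng.standard_normal(points.shape)
    return np.sqrt(a_bar) * points + np.sqrt(1.0 - a_bar) * noise

# Sweep the timesteps shown above on a held-out (OOD) object scan.
pc = np.load("drill_pointcloud.npy")  # placeholder file
sweep = {s: smooth_pointcloud(pc, s) for s in (0.0, 0.001, 0.01, 0.1)}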
The conditional ↔ marginal spectrum.
CSP trains one policy across all σ by corrupting the context c (here, the cube goal position) with qσ(c̃ | c). The endpoints anchor the spectrum: σ = 0 recovers p(a | c), σ = T recovers p(a).
As the context (goal position) becomes more corrupted with increasing σ, the policy's rollouts spread from goal-directed behavior toward broad coverage.
Simulation Results
Coverage and RL sample efficiency.
Action coverage before RL: Success@K
Success@K — the fraction of out-of-distribution states where at least one of K base-policy rollouts succeeds. On two tasks, CSP outperforms PostBC and BC for all K, with the gap widest on cube-single where both baselines remain at zero.
CSP widens action coverage before RL. On cube-single, BC and PostBC stay near zero while CSP rises with K.
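For reference, a minimal sketch of how Success@K could be computed from K independent base-policy rollouts per out-of-distribution start state; the boolean array layout is an assumption.

import numpy as np

def success_at_k(rollout_success, k):
    # rollout_success: (num_states, max_k) booleans, one row per OOD start
    # state, one column per base-policy rollout.
    # Success@K = fraction of states with at least one success in K rollouts.
    return rollout_success[:, :k].any(axis=1).mean()

# Example: 3 OOD states, 4 rollouts each.
outcomes = np.array([[0, 0, 1, 0],
                     [0, 0, 0, 0],
                     [1, 1, 0, 1]], dtype=bool)
print([success_at_k(outcomes, k) for k in (1, 2, 4)])  # approx. [0.33, 0.33, 0.67]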
RL sample efficiency
We compare against several state-of-the-art baselines. TMRL outperforms or matches all baselines across four simulation tasks.
TMRL outperforms or matches baselines on all four simulation tasks, approaching 100% success on each; only TMRL reaches non-trivial success on libero-90.
Real-World RL
Context-Smoothed π0 in the real world.
< 1 hr of real-robot fine-tuning on WidowX: DSRL never converges on either task, while TMRL reaches near-perfect success on both (sausage-in-pot, shrimp-in-drawer).
Applying CSP to a VLA. Noise the VLM embedding context with qσ to allow π0 to sample across the conditional → marginal spectrum.
We fine-tune a context-smoothed π0 on BridgeData-v2 (WidowX 250) and DROID (Franka).
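A hedged sketch of what noising the VLM-embedding context could look like for a π0-style VLA; the wrapper, the vlm_embed / action_expert hooks, and the schedule are assumptions for illustration, not π0's actual interface.

import torch

class ContextSmoothedVLA(torch.nn.Module):
    # Wraps a pretrained VLA: corrupt the VLM embedding (the context c)
    # with q_sigma before the action head decodes an action chunk.
    def __init__(self, vla):
        super().__init__()
        self.vla = vla

    def forward(self, images, instruction, proprio, sigma):
        z = self.vla.vlm_embed(images, instruction)      # context c (assumed hook)
        sigma = torch.as_tensor(sigma, device=z.device, dtype=z.dtype)
        a_bar = torch.cos(0.5 * torch.pi * sigma) ** 2   # illustrative schedule
        z_noisy = a_bar.sqrt() * z + (1 - a_bar).sqrt() * torch.randn_like(z)
        # The action head is also conditioned on sigma during CSP fine-tuning.
        return self.vla.action_expert(z_noisy, proprio, sigma)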
TMRL reaches near-perfect success on real-robot tasks within hours of fine-tuning (under an hour on WidowX, under four on Franka). Base π0 stalls — its action support is too narrow for any rollout to succeed. CSP widens that support; TMRL leverages it to reach near-perfect success on all three tasks.
Per-task rollouts
Base π0 versus context-smoothed π0, side by side. The learning curve below traces the full RL training run.
Panels: Base π0 · Context-Smoothed π0 (ours) · real-world RL learning curve (shown per task, e.g., Button-Press).
TMRL learns to be precise when it matters.
πHL treats the diffusion timestep σ as an adaptive coverage control, choosing it per action chunk. Below, the converged TMRL policy on shrimp-in-drawer learns to keep σ low through the precision-critical pick and higher during reaching and placing.
Instruction: “pick up the shrimp and put it into the white drawer.” The rollout spans 10 action chunks, with the σ chosen by πHL shown for each chunk (e.g., σ = 0.01 on chunk 1).
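A sketch of how the converged TMRL controller could roll out, with πHL choosing a latent z and a timestep σ for each chunk; high_level, low_level, and the gym-style env interface are placeholders.

def tmrl_rollout(env, high_level, low_level, num_chunks=10):
    # high_level: pi_HL(z, sigma | s), picks a latent and a coverage level per
    #             action chunk (low sigma = precise, high sigma = broad).
    # low_level:  CSP policy that decodes an action chunk from (s, z, sigma).
    obs, total_reward = env.reset(), 0.0
    for _ in range(num_chunks):
        z, sigma = high_level(obs)          # e.g. low sigma during the pick
        chunk = low_level(obs, z, sigma)    # action chunk at that coverage level
        for action in chunk:
            obs, reward, done, info = env.step(action)
            total_reward += reward
            if done:
                return total_reward
    return total_reward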
BibTeX
@inproceedings{hong2026tmrl,
title = {TMRL: Diffusion Timestep-Modulated Pre-training Enables Exploration for Efficient Policy Fine-tuning},
author = {Hong, Matthew M. and Zhang, Jesse and Nagabandi, Anusha and Gupta, Abhishek},
booktitle = {Robotics: Science and Systems (RSS)},
year = {2026}
}